CS8803: Introduction to Operating Systems Course Notes: OS mechanisms, abstractions, concepts and essentials
Last Updated: December 20, 2018 by Pepe Sandoval
An OS is a special piece of software that abstracts (hides the details of HW usage and interaction) and arbitrates (manages/oversees HW use) the use of a computer system
The OS manages the HW on behalf of the applications
OS main operations:
OS elements:
- Abstractions: `file`, `socket` and `memory page` (which correspond to HW resources the OS manages) or `process` and `thread` (which correspond to the actual applications the OS executes)
- Mechanisms: `create` and `schedule` (which are used to launch or execute applications) or `open`, `read`, `write` and `allocate` (used to interact with HW abstractions)
Interaction between an application running in user mode and the OS running in kernel mode is done through system calls, to request services from the OS, and signals, to pass OS notifications to the applications
System Call general flow:
System calls can be synchronous (wait until the system call completes) or asynchronous
User to kernel transitions cause overhead due to switches in memory locality and cache, kernel code execution, etc.
A process is an instance of an executing program; it represents the execution state of an active application
An application is stored in some storage as a static entity; when it is loaded into memory and executed it becomes an active entity, a process
Process properties:
A process address space is an OS abstraction used to encapsulate all of the process state, which includes the `stack`, `heap`, `data` (variables) and `text` (code)
- `data` (variables) and `text` (code) are static state (available from the moment the process loads)
- the `stack` and `heap` are dynamic state, since the process creates them during execution
The process address space refers to the virtual address space used by a process to store its state
The memory management HW and the OS components responsible for memory management, like page tables, maintain a mapping between virtual and physical addresses
The OS dynamically decides which portion of which address space will be present where in physical memory; processes may have some portions of their address space not present in main memory but temporarily swapped to disk
The OS makes processes think they have access to all memory, so even if two processes are running at the same time their virtual address space ranges may/will be the same, for example from 0 to 64KB, but mapped to different physical addresses
The Process Control Block (PCB) is a data structure the OS maintains with information (`PC`, CPU registers, `SP`, memory limits and mapping information, scheduling info and priority, etc.) about a process
A process can create child processes, and in general all processes descend from a single root process; some processes are privileged (root) processes. In general this works as follows: once the boot process is done and the OS is loaded, the OS creates a number of initial processes; when a user logs in, a user shell process is created, and from there the user can create other processes, for example when executing commands
A process can create a child process through the `fork` and `exec` mechanisms:
- `fork`: copies the parent PCB into the new child PCB; the child continues execution at the instruction after the `fork`
- `exec`: takes a PCB structure already created and replaces its values for a new program
When you execute a program the OS actually does a `fork` followed by an `exec`: the `fork` is used to create the process PCB and the `exec` replaces the child image with the image of the new program
In the original UNIX-based systems `init` is considered the parent of all processes, so all processes can be traced back to it
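A minimal sketch of this `fork` + `exec` pattern (assumes a POSIX system; the `ls -l` command is an arbitrary example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* copy the parent PCB into a new child PCB */
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {                  /* child: resumes right after fork() */
        execlp("ls", "ls", "-l", (char *)NULL); /* replace the child image */
        perror("execlp");            /* only reached if exec fails */
        exit(EXIT_FAILURE);
    }
    waitpid(pid, NULL, 0);           /* parent waits for the child to finish */
    return 0;
}
```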
The scheduler is an OS component that manages how processes use the CPU resources; it decides which of the ready processes will be run by the CPU and how long it should run for. General scheduler process:
For a process to take advantage of a multi-core system it needs to have multiple execution contexts; these individual execution contexts are called threads
A multi-thread process is a single process with many execution contexts (Threads)
Threads
Threads can speed up execution due to:
In a multi-threaded process the PCB needs to hold separate information about every thread (specifically it needs each of the execution contexts) that is part of that process
Threads share an address space! Because of this, thread communication is implemented via shared variables in the same address space and is more efficient than inter-process communication; and since we don't need to allocate an address space for each thread, threads are more memory efficient as well
Threads can be useful when `# threads > # CPUs`, or even on single-CPU systems if, for example, the idle time of a thread is much greater than the time of the two context switches needed to switch away and back (`t_idle > 2*t_ctx_switch`)
When processes execute they each operate within their own address space, while threads share the same virtual-to-physical address mappings
A common paradigm in multi-threaded applications is to have a parent thread that only creates other threads and waits until they finish
Synchronization mechanisms, Concurrency Control & Coordination
Multithreaded Patterns
Supporting threads at the kernel level means the OS kernel is multi-threaded and performs scheduling and synchronization operations to allow these kernel threads to share resources
Supporting threads at the user level means there is a user-level library that is linked with the application and provides all the management and runtime support for threads; the library provides scheduling and synchronization
For a user-level thread to be executed it first needs to be associated with a kernel-level thread; the OS scheduler then schedules that kernel-level thread onto the CPU, which ends up executing the user-level thread. Association models between kernel and user threads:
The User level thread library decides which user-level thread will be scheduled onto the underlying kernel level threads
The user-level threading library tracks all the user-level threads that make up a single process
The process control block (PCB) that represents the address space needs to keep a relationship between user-level threads and kernel-level threads, because for each process we need to keep track of which kernel-level threads execute on behalf of this process
When the kernel needs to schedule or context switch among kernel-level threads that belong to different processes, it determines that they point to different PCBs and restores the entire PCB state
Adaptive mutexes are mutexes that can become blocking or spinning mutexes depending on the situation, for example spinning for short critical sections and blocking for long critical sections
The information contained in the PCB is divided
`kthread_worker`, defined in `include/linux/kthread.h`, can be considered the kernel thread structure in Linux; it provides a simple interface for creating and stopping kernel threads, and within it the `task_struct` is used to describe the process the kernel thread is running. The system needs at least 20 threads to boot; it could be more, and the value is stored in `max_threads`
PThreads
POSIX (Portable Operating System Interface) is a standard that describes the system call interface operating systems need to support to be compliant with the standard, which intends to increase interoperability among OSs
PThreads == POSIX Threads. Pthreads is the part of the POSIX standard that describes the API an OS needs to support in order to create, use and manage threads; this includes thread management as well as synchronization and coordination constructs like mutexes
A `pthread_t` is joinable by default, which means a parent can create children threads and must `join` them (the parent must wait for them to finish), otherwise it can create "zombie" threads. We can also set an attribute so a `pthread` is detached; in this case the parent creates children threads but these cannot be `join`ed
The PThreads API provides the `pthread_mutex_t` mutex structure that supports lock, unlock and trylock (check the mutex: if free, lock it; if not free, do not block, just report it) operations
The PThreads API also provides the `pthread_cond_t` data structure that represents a condition variable and supports wait, signal and broadcast operations
During init calls for `pthread_t`, `pthread_mutex_t` and `pthread_cond_t` we can provide data structures with different attributes, using `pthread_attr_t`, `pthread_mutexattr_t` and `pthread_condattr_t`, to specify thread and mutex behavior
You must init (using the init struct macros or the init functions) and destroy `pthread_mutex_t` mutexes and `pthread_cond_t` condition variables
`pthread_setconcurrency` sets the concurrency level, in other words it informs the implementation of the application's desired concurrency level; a value of `0` instructs the implementation to manage concurrency as it deems appropriate
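A minimal sketch of the mutex + condition variable pattern described above (hypothetical example; the `ready` flag and thread body are arbitrary):

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER; /* init via macros */
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready = 0;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);        /* wake one waiting thread */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                  /* always re-check the predicate in a loop */
        pthread_cond_wait(&c, &m);  /* atomically releases m while waiting */
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);          /* joinable by default: wait for the thread */
    return 0;
}
```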
Interrupts are asynchronous events generated by external components (like I/O devices, timers, other CPUs); they represent a notification to the CPU that some external event has occurred. The possible interrupts are determined by the system/platform (architecture and HW in general)
Signals are events triggered by the software running on the CPU; signals depend on the OS (or running software in general), and they can be synchronous (occurring in response to a specific action taken by the CPU) or asynchronous
Signal handlers are set via `signal()` and `sigaction()`
- One-shot signals: whether `n` signals are pending or just one, the handler will be executed only once; the signal is then disabled so the handler does not execute again, and if needed it can be re-enabled to execute the handler again once one or multiple signals are detected
- Real-time signals: if `n` signals are received, the handler is called `n` times
Interrupts and Signals:
Masking interrupts can be useful to avoid threads being interrupted on critical sections (which can cause issues like deadlock), but masking can also cause pending interrupts to accumulate while being masked which could cause only one interrupt to be handled
On multicore systems interrupts can be directed to any CPU that has them enabled, but we can also specify that only one CPU will handle interrupts
Two identical platforms will have the same interrupts but different signals can occur depending on the OS that is running (or SW in general that is running)
The abstraction representing an execution context is called a task; in code it is described by the `task_struct` in Linux, and it represents the execution context of a kernel-level thread (KLT)
A single-threaded process will have 1 task and a multithreaded process will have many tasks
The `pid_t pid;` field in this `task_struct` is actually the thread identifier (due to historic reasons), but in a single-threaded process the task/thread ID and the process ID are the same; in a multithreaded process, the process ID is the ID of the first task/thread created when the process was created, which is also stored in the task group ID `tgid` (= process ID)
The `task_struct` has a linked list of the tasks that are part of the process, through the `list_head tasks;` field
To create a new task Linux supports the `clone` operation, which can share everything or only some parts of the parent thread's state depending on the `sharing_flags` argument; `fork` is internally implemented via `clone` (`fork` is just a special case of a `clone` call). The main use of `clone` is to implement threads: multiple threads of control in a program that run concurrently in a shared memory space
The Linux threads implementation is called the Native POSIX Threads Library (NPTL); it supports a 1:1 model, so there is a kernel-level task for each user-level thread. With this model the kernel sees all user-level thread info
A metric is some measurable quantity that we can use to reason about the behavior and performance of a system
Concurrency paradigms:
Asynchronous I/O Operations / Asynchronous System Call / Async Calls
Types of Scheduling
Scheduling metrics
Throughput = jobs_completed / time_to_complete_all_jobs
Avg. Completion Time = sum_of_times_to_complete_each_job / jobs_completed
Avg. Wait Time = (t1_wait_time + t2_wait_time + ... + tn_wait_time) / jobs_completed
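A small worked example of these metrics (not from the notes): two jobs arrive together at `t=0` and run back to back, T1 for 10s and then T2 for 2s:
`Throughput = 2 / 12 = ~0.17 jobs/s`
`Avg. Completion Time = (10 + 12) / 2 = 11s`
`Avg. Wait Time = (0 + 10) / 2 = 5s`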
CPU bound tasks prefer longer timeslices because with this we have less context switch overhead
IO bound tasks prefer short timeslices because then IO operations can be issued earlier, and this gives a more responsive application
The scheduling algorithm/policy determines which task, among the tasks that are ready to run, should be selected, with the intention of maximizing the use of resources (CPU, devices, memory, etc.):
First-Come First-Served (FCFS):
Shortest Job First (SJF):
Priority:
Multi-level Feedback Queue: Uses multiple priority queues. The higher priority tasks (placed on a high priority queue) have the shorter timeslices and lower priority tasks have longer timeslices
The Linux O(1) Scheduler was introduced in kernel 2.5 and is able to perform task management operations (like selecting a task to run, adding a task to be run, etc.) in constant time regardless of the total number of tasks in the system.
Linux O(1) is a preemptive and priority-based scheduler, with `140` priority levels from `0` (highest priority) to `139` (lowest priority)
- Real-time tasks: priorities `0` to `99`
- Timesharing tasks: priorities `100` to `139`; user processes get a default priority of `120` that can be adjusted with a modifier that goes from `-20` to `19` (to result in a priority between `100` and `139`)
Linux O(1) associates different timeslice values with different priority levels and has a feedback loop that adjusts a task's priority level based on how the task behaved in the past
- Tasks that spend more time sleeping are interactive; their priority is boosted (`priority - 5`) to get a longer timeslice
- Tasks that spend less time sleeping are compute-intensive; their priority is lowered (`priority + 5`)
The structure to organize the tasks consists of 2 arrays (Active & Expired); each element of each array points to the first element of a task list corresponding to that priority level
The Linux CFS (Completely Fair Scheduler) was introduced in kernel 2.6.23 to replace the Linux O(1) scheduler due to fairness and jitter issues; it schedules non-real-time tasks, while real-time tasks use a real-time scheduler
Uses a Red-Black Tree structure to organize tasks; tasks are ordered based on the amount of time they have spent on the CPU (`vruntime`): nodes to the left represent tasks that had less time on the CPU (so need to be scheduled sooner) and nodes on the right represent tasks that had more time on the CPU (so don't need to be scheduled as quickly)
CFS always schedules the leftmost node (the task that had the least amount of CPU time)
Periodically the `vruntime` (in nanoseconds, `ns`) of the task running on the CPU is incremented/adjusted by the scheduler; at that point it compares the `vruntime` of the currently executing task with that of the leftmost task in the tree: if `current vruntime < leftmost vruntime` the current task continues running, and if `current vruntime > leftmost vruntime` it is preempted and placed accordingly in the tree. In other words, if the current task has run less than the task in the leftmost node it keeps running; otherwise it is preempted and placed appropriately in the tree, and the leftmost node runs
The periodic update of `vruntime` depends on the priority of the task: `vruntime` increases faster for low priority tasks, which gives them less CPU time, and increases slower for high priority tasks, which gives them more CPU time
CFS uses only one Red-Black Tree structure for all priority levels; selecting a task takes constant time, but adding a task takes logarithmic time in the number of tasks in the system, due to tree insertion
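A toy model of this selection and accounting logic (hypothetical code, not the kernel's: the kernel uses a red-black tree where this sketch uses a linear scan, and derives weights from priorities rather than storing them directly):

```c
/* Pick the task with the smallest vruntime, run it for one tick, and
 * advance its vruntime faster for lower-priority (heavier-weighted) tasks. */
#include <stddef.h>
#include <stdint.h>

struct task {
    uint64_t vruntime; /* ns of weighted CPU time consumed so far */
    unsigned weight;   /* grows as priority drops (toy simplification) */
};

struct task *pick_next(struct task *tasks, size_t n) {
    struct task *next = &tasks[0];
    for (size_t i = 1; i < n; i++)      /* linear scan stands in for the */
        if (tasks[i].vruntime < next->vruntime) /* tree's leftmost node  */
            next = &tasks[i];
    return next;
}

void account_tick(struct task *t, uint64_t tick_ns) {
    t->vruntime += tick_ns * t->weight; /* low priority => faster growth
                                           => scheduled less often */
}
```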
Shared Memory Multiprocessor (SMP) / Symmetric Multiprocessor Architecture
Multicore Architecture
A scheduler sees each core as an entity into which it can schedule an execution context (a thread), so to the OS each core of a multi-core CPU, or the core of a single-core CPU, is seen as a "CPU" where it can schedule some workload
Cache affinity refers to trying to schedule a thread on the same CPU where it executed before, because it is more likely that its cache is hot (has the information needed by the thread); in other words, keep a task on the same CPU as much as possible. This can be achieved by a load balancer that divides tasks among CPUs, with a per-CPU scheduler then repeatedly scheduling those same tasks on its CPU
On Non-Uniform Memory Access (NUMA) systems there can be multiple memory nodes (interconnected with some interconnect technology like QuickPath Interconnect, QPI), and some can be closer (faster to access) to a socket of multiple processors, so an OS that is NUMA-aware will try to schedule tasks on CPUs that are closer to the memory nodes where their state is
Hyperthreading or Simultaneous Multithreading (SMT) refers to hardware support for multiple execution contexts: a CPU with multiple sets of registers, where each set can hold an execution context. Only one thread executes at a time, but thanks to the multiple register sets a context switch between these threads is fast; it is just a switch from using one set of registers to another, so save and restore is not needed as often
SMT systems can be used to hide the latency of memory accesses by mixing CPU-bound and memory-bound tasks; to determine whether a task is CPU- or memory-bound, the OS scheduler can use HW counters (like cache miss counters or a CPI counter) that help estimate what kind of resources a task needs
Memory instructions can take more than a single CPU instruction cycle; an instruction may need to wait at some stage of the processor pipeline, but this means the thread is NOT sleeping, so memory instructions may have a higher CPI (cycles-per-instruction)
Memory is used to capture the state of processing actions, the Memory Management responsibilities/actions:
Processes see virtual addresses that are translated to the physical addresses where the actual data is stored; the range of virtual addresses establishes the amount of virtual memory visible in the system, and this can be much larger than the actual amount of physical memory
To manage memory the OS must be able to:
Because physical memory is smaller than virtual memory, data needed in the virtual address space of a process may not be present in physical memory; it may be in secondary storage, so the OS needs mechanisms to decide how to replace the contents of physical memory. Replacement mechanisms determine which content should be brought in from disk and which content should be stored on disk
`x86_32` uses segmentation + paging; `x86_64` defaults to just paging, but segmentation + paging is also supported (for backward compatibility)
Virtual Address == Logical Address
Virtual address space is divided into fixed-size units called pages; physical memory is divided into page frames of the same size. In terms of allocation the OS needs to map pages to page frames, and in terms of arbitration it uses page tables
The OS memory management component is the one that decides which page frames are replaced, so it must update the page table accordingly
The OS creates a page table for every process that it runs so on a process context switch the OS needs to also switch to the corresponding table
The Virtual Page Number (VPN), which is the first portion of the virtual address, is used to index into the page table, and the rest of the virtual address is the offset; combining this offset with the Physical Frame Number (PFN), which is stored in the page table location indexed by the VPN, results in the physical address
For each virtual address a page table entry is used to determine the actual physical location that corresponds to the virtual address
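A toy single-level sketch of the VPN/offset split just described (hypothetical flat page table; assumes 32-bit virtual addresses and 4KB pages, i.e., a 12-bit offset):

```c
#include <stdint.h>

#define OFFSET_BITS 12u
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

/* page_table[vpn] holds the PFN for that virtual page */
uint32_t translate(const uint32_t *page_table, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;  /* index into the page table  */
    uint32_t offset = vaddr & OFFSET_MASK;   /* byte within the page       */
    uint32_t pfn    = page_table[vpn];       /* PFN from the indexed entry */
    return (pfn << OFFSET_BITS) | offset;    /* resulting physical address */
}
```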
A register points to the page table of the currently running process (`CR3` on `x86`)
Page tables are designed as a hierarchical multi-level structure: the top page table (outer page table) is referred to as the page table directory and doesn't store pointers to pages, it stores pointers to page tables; the internal page tables are the components that actually point to pages, and their entries hold the PFNs and bit flags
The Internal page tables exist only for valid virtual memory regions
Holes in the virtual memory space result in a lack of internal page tables (which we then don't need to store in memory)
Additional layers can be added; as we add more page directories, the internal page tables end up covering smaller regions of the virtual address space but require more accesses (increased translation latency)
The offset field is the actual index into the physical page frame, and the offset size determines the page size
Page sizes: `4KB` (default, 12-bit offset), `2MB` (large, 21-bit offset), `1GB` (huge, 30-bit offset)
`Page table size = (Size of Entry) * (Number of Page Table Entries)` and `Number of Page Table Entries = 2^(# VPN bits)`
E.g., for `32-bit` memory addresses and a page size of 4KB (`Offset = 12-bits`): `Num of Page table entries = 2^(32-12) = 2^20 = 1M` and `Page Table Size = (4 Bytes)*(1M) = 4MB`
Number of accesses: with `n` page table levels, an address translation requires `n+1` memory accesses: `n` accesses to page table entries and 1 access to the actual memory at the correct physical address. To reduce these accesses the MMU integrates a HW cache dedicated to address translations (the TLB); on each address translation the TLB is checked first, and if the address is cached the accesses to the page table levels are not needed
Virtual address space is divided into flexible-sized segments that can be mapped to regions of physical memory and swapped between main physical memory and secondary storage; in terms of allocation the OS keeps this mapping, and in terms of arbitration it uses segment registers to determine whether a valid memory section is being accessed
The virtual address is divided into a Segment Selector and the Offset; the Selector is used to get, from a Descriptor Table, a Segment Descriptor which is then used to access the actual physical address of the segment
The segment is usually a contiguous portion of physical memory and its size is defined by Segment Size = Segment Base + Limit Register
The memory allocator is an essential part of the memory management subsystem of an OS that is in charge of allocating particular portions of the memory to a process
The memory allocator has mechanisms that decide which physical pages will be allocated to a particular virtual memory region
Memory allocators exist at the kernel level and at the user level
The Linux kernel uses the Buddy Allocator and the Slab Allocator: Buddy is used for requests demanding `2^x` chunks, and Slab for requests of sizes with a granularity different than `2^x`
Paging / demand paging refers to the process of saving and restoring pages to/from secondary storage, where a swap partition resides, so in general pages are swapped between memory and the swap partition
Periodically, when the amount of occupied memory reaches a certain threshold, the OS will run a page(-out) daemon that will look for pages to be swapped out or freed
To decide which pages will be swapped out the OS usually uses a history-based prediction algorithm like Least-Recently Used (LRU)
Paging General process flow:
Inter-Process Communication (IPC) mechanisms enable process interaction; they help transfer data between address spaces while maintaining the protection and isolation of processes, and provide a standard and flexible way of sharing data while also considering performance
IPC refers to a set of mechanisms the OS supports to allow communication and coordination among processes
IPC requires synchronization mechanism between processes like mutexes
IPC in general could be of two types, Message-Passing IPC and Shared Memory IPC; for sharing large amounts of data Shared Memory IPC may be better, and for small amounts of data Message-Passing IPC may be more convenient
Shared memory can be accessed by all the processes that share access to that shared memory region, so accesses must be synchronized (sync between processes); this can be done with synchronization mechanisms (like `pthreads`) appropriately set up for processes and not threads, for example setting the right fields in the `pthread_mutexattr_t` and `pthread_condattr_t` structures when creating a mutex or condition variable (using `PTHREAD_PROCESS_SHARED`)
Message-Passing IPC: processes exchange messages, passing them via mailboxes, sockets, pipes, message queues, etc.
The OS provides a communication channel, like a shared buffer, and processes use standard `write`/`send` and `read`/`recv` operations to send/receive messages
The OS is responsible for creating and maintaining the channel of communication (buffer, FIFO queue, etc.) and provides an interface, usually called a port, to the communication operations (processes `write`/`send` to a port and `read`/`recv` from a port)
Because the OS is the one that provides the `send`/`recv` interface for communication, a system call and a data copy happen on each of these operations
Message-passing IPC is simple and easy to use because the channel handles synchronization, but it can be costly due to kernel crossings and data copying (they cause overhead)
Forms of Message-Passing IPC:
- Pipes: part of the `POSIX` API
- Message queues: `SysV` and `POSIX` are APIs that support message queues
- Sockets: the `socket()` call creates a kernel buffer used as the channel, and then `send()` and `recv()` operations are used to pass message buffers through the channel. This associates kernel-level processing: for example, for TCP/IP sockets all the TCP/IP protocol stack data movement (which is kernel-level processing) will be performed when doing a `send()` or `recv()`. Sockets are mostly used for processes on different systems/machines
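A minimal sketch of socket-based message passing within one machine, using a UNIX-domain socket pair (the message text is an arbitrary example):

```c
/* Parent and child exchange a message through a kernel-managed channel:
 * each send()/recv() is a system call that copies data through the kernel. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }
    if (fork() == 0) {                    /* child: uses one end of the channel */
        close(sv[0]);
        send(sv[1], "hello", 6, 0);       /* copy message into the kernel buffer */
        close(sv[1]);
        return 0;
    }
    char buf[32];
    close(sv[1]);                         /* parent: uses the other end */
    recv(sv[0], buf, sizeof(buf), 0);     /* copy message out of the kernel buffer */
    printf("received: %s\n", buf);
    close(sv[0]);
    wait(NULL);
    return 0;
}
```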
Shared Memory IPC: processes share a portion of physical memory, and data that both need to access is placed in this shared memory; it could be in the form of physical memory pages or memory-mapped files (multiple processes read and write the same file)
The OS establishes a shared memory space and maps it into each process address space; processes directly read and write this memory, and the OS is not involved in this communication step
The OS only establishes the shared channel between processes by mapping the same physical pages to the virtual address spaces of the processes; in other words, VAs of Process 1 and VAs of Process 2 are mapped to the same physical addresses
It is efficient because OS only sets shared memory space so system calls are only needed during setup but implementation can be complex due to sync issues and establishment of a communication protocol
The `SysV` API and the `POSIX` Shared Memory API can be used to implement Shared Memory IPC
SysV Shared Memory API operations:
- `shmget(shmid, size, flag)`: allocates a shared memory segment; it needs a unique key (`shmid`, created with the hash `ftok(pathname, proj_id)`) to identify the shared memory
- `shmat(shmid, addr, flag)`: maps a virtual address of a process to the physical address, using the key used with the create operation (the virtual address can be specified using `addr`)
- `shmdt(shmid)`: invalidates an address mapping from a process to the shared memory, i.e., detaches a process from the shared memory
- `shmctl(shmid, cmd, buf)`: actually destroys/releases the shared memory segment when called with `cmd == IPC_RMID`
POSIX Shared Memory API: instead of segments it uses files in the `tmpfs` file system (which looks like a file system but is just data in volatile memory). POSIX operations:
- `shm_open()`: returns a file descriptor to a file in `tmpfs` used for read/write operations to the shared memory
- `mmap()` & `munmap()`: used to map and unmap a virtual address of a process to a physical address
- `shm_close()`: removes/uninitializes the file descriptor so you can no longer access the file, but won't actually delete the shared memory
- `shm_unlink()`: actually destroys the shared memory file
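A minimal sketch of the SysV flow just described (the key path `/tmp` and the 4KB size are arbitrary choices for illustration):

```c
/* Create a SysV shared memory segment, attach it, write to it,
 * then detach and destroy it. Error handling kept minimal. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    key_t key = ftok("/tmp", 'A');                    /* unique key from path + project id */
    int shmid = shmget(key, 4096, IPC_CREAT | 0666);  /* allocate a 4KB segment */
    if (shmid < 0) { perror("shmget"); return 1; }

    char *mem = shmat(shmid, NULL, 0);                /* map segment into our address space */
    if (mem == (void *)-1) { perror("shmat"); return 1; }

    strcpy(mem, "hello from shared memory");          /* plain memory access, no syscalls */

    shmdt(mem);                                       /* detach the mapping */
    shmctl(shmid, IPC_RMID, NULL);                    /* destroy the segment */
    return 0;
}
```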
To guarantee the correctness of synchronization constructs, they need HW support in the form of atomic instructions. An atomic instruction is an instruction that is guaranteed to execute uninterrupted and to provide mutual exclusion, which means only one instruction at a time will be allowed to perform the operation while other concurrent atomic instructions have to wait. `test_and_set` is an example of an atomic instruction
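A minimal spinlock sketch built on an atomic test-and-set, using C11 atomics (an assumption: a C11-capable compiler; this is an illustration, not the course's implementation):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    /* atomically set the flag and get its old value; spin while it was
     * already set, i.e. while another thread still holds the lock */
    while (atomic_flag_test_and_set(&lock))
        ;
}

void spin_unlock(void) {
    atomic_flag_clear(&lock);  /* release: let another spinner acquire it */
}
```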
Atomic operations bypass the cache to avoid synchronization cache-coherency issues, so atomics are issued directly to memory (to the memory controller); this guarantees synchronization but hurts performance due to the memory access, and also triggers coherence traffic (to make sure values in caches get updated)
A semaphore blocks or allows access to a resource, providing exclusion like a mutex, but semaphores are a generalization of mutexes: they are counting semaphores (binary semaphore == mutex)
A semaphore is basically a counter initialized to a max positive count; a thread then waits on the semaphore: if the count is non-zero it decrements the count and proceeds, and if it is zero the thread blocks. When finishing accessing the shared resource the thread posts, which increments the count
Because they are initialized to a max count, semaphores allow multiple threads to access the resource at the same time; the number of threads that can access the resource equals the max count value
```c
#include <semaphore.h>

sem_t sem;
int sem_init(sem_t *sem, int pshared, unsigned int count); /* init the counter */
int sem_wait(sem_t *sem);  /* decrement; blocks while the count is 0 */
int sem_post(sem_t *sem);  /* increment; wakes a waiting thread */
```
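A small usage sketch (hypothetical: at most 3 threads allowed into the critical region at once):

```c
#include <semaphore.h>

static sem_t sem;

void setup(void)  { sem_init(&sem, 0, 3); }  /* pshared=0: threads of one process */

void worker(void) {
    sem_wait(&sem);   /* blocks once 3 threads are already inside */
    /* ... access the shared resource ... */
    sem_post(&sem);   /* leave; wakes one blocked thread, if any */
}
```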
RWlocks (`rwlock_t`), also called shared/exclusive locks, also restrict access to a resource like a semaphore, but additionally let you specify the type of access: you can use `read_lock()` and `read_unlock()` if you plan to access the resource without modifying it, which allows other readers to access it as well, while you use `write_lock()` and `write_unlock()` if you want exclusive access to the resource
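A sketch of the same idea with the POSIX user-level counterpart `pthread_rwlock_t` (the kernel `rwlock_t` API mentioned above is analogous; the counter is an arbitrary example):

```c
#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int counter;

int reader(void) {
    int v;
    pthread_rwlock_rdlock(&rw);  /* shared: many readers may hold it at once */
    v = counter;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(void) {
    pthread_rwlock_wrlock(&rw);  /* exclusive: excludes readers and writers */
    counter++;
    pthread_rwlock_unlock(&rw);
}
```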
Monitors: you specify the shared resource being protected, the entry operations (like read, write, check, etc.) and the possible condition variables that need to be signaled or modified. On entry the monitor locks and performs any checking, and on exit it unlocks, checks and does all the signaling
Serializers make it possible to define priorities and hide the need to explicitly signal and use condition variables
Path Expressions: you specify a regular-expression-like expression that defines the correct sync behavior (e.g., many reads and one write)
Barriers or rendezvous points are sync constructs that block until a number of threads arrive at a certain point defined by the barrier
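A minimal barrier sketch using POSIX `pthread_barrier_t` (an optional POSIX feature, assumed available on the platform):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld: phase 1 done\n", id);
    pthread_barrier_wait(&barrier);  /* block until all NTHREADS arrive */
    printf("thread %ld: phase 2\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```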
Wait-free synchronization (e.g., RCU) refers to no-wait synchronization constructs that bet on the fact that there won't be conflicts due to certain concurrent operations
OS must provide an interface to interact with IO devices and specific components (like device drivers, interrupt handlers) that are responsible for the I/O management with the goal of abstracting the I/O devices details from applications
Any device can be abstracted in terms of an interface and its internals
PCI (Peripheral Component Interconnect) is one of the standard and most common methods for connecting devices to the CPU
Operating systems support devices via device drivers, which are device-specific software components responsible for device access, management and control; in general this refers to the logic that determines how requests can be sent to the device and how responses and events are notified
An OS usually includes a device driver for each different HW device that it supports
Operating systems standardize their interfaces to device drivers by providing a framework, so device manufacturers, who develop the device drivers, can develop a driver within that framework, using the specific interfaces the OS supports
- Block devices (like disks): operate (`read`/`write`) on entire blocks of fixed-size data, and this data is delivered in and out between the device and CPU
- Character devices (like keyboards): `get`/`put` data at a character granularity
The OS usually uses a special file abstraction to represent each of the devices available on the system, so standard operations and mechanisms to interact with files can also work with devices; for example, read, write or other operations on the file representing the device will be handled in some device-specific manner
On UNIX-like systems these representations can be found under the `/dev` directory and are handled by special file systems (`devfs` & `tmpfs`); they are not stored on a real file system
Pseudo/Virtual devices do not represent actual HW but provide special functionality that can be used by the system
CPU-to-device (CPU -> Device) access can be done through Memory-Mapped I/O or the I/O Port Model, while the path from a device to the CPU (Device -> CPU) can take two routes: devices can generate interrupts to signal an event to the CPU, or through polling the CPU reads some of the device's registers
Memory-Mapped I/O: part of the physical memory is dedicated to device interactions; this portion of memory is controlled by a specific type of registers, the Base Address Registers (BARs), which define how much memory, starting at which address, will be used by a particular device. These registers are configured during boot by the BIOS
PCI interconnects devices to the CPU through memory-mapped devices: accesses to device registers appear to the CPU as memory locations at specific physical addresses (`load`/`store` <-> `read`/`write`), so when the CPU reads or writes these locations the interconnect realizes the access should be routed to the appropriate device
I/O Port Model - CPU can also access devices via special in/out instructions, specifying the target device (via I/O port)
We can also access the device directly from the user-level bypassing the OS, in this case we need a user-level driver/library
inodes
A file is a logical storage unit mapped to a physical location
Block devices are typically used for storage, and for storage-related interaction processes use files; the filesystem, which is part of the kernel, knows how to interpret the reads/writes coming from the application and determines where the file is, which particular portion needs to be accessed, etc., then passes the request to a generic block layer which standardizes all types of block devices and eventually calls the specific device driver
Virtual File System (VFS): hides from applications all details of the underlying file system; its data structures are maintained by the OS filesystem component
An `inode` is a data structure that holds a list of all the data blocks of a file besides the file permissions, size, the device it belongs to, etc. (`inodes` must be on persistent storage)
`inodes` are uniquely numbered, and a filename is mapped to an `inode`
To reduce disk accesses, systems rely on a buffer cache in main memory and on prefetching into this cache, so a file is read/written through the cache and periodically flushed to disk (`fsync()`); the OS can also try to reduce disk accesses by doing I/O scheduling, in other words ordering how disk accesses will be scheduled so as to maximize sequential access
`ext2`: Extended Filesystem version 2
The organization of a partition used as an `ext2` Linux file system includes a section of `inodes`: the data structures that hold the list of all the data blocks of a file. In this section there can be from 1 up to a max number of `inodes` (in `ext2` each `inode` is 128 bytes), one per file
Virtualization allows concurrent execution of multiple OSs (and their applications) on the same physical machine, in order to run very diverse workloads or applications on the same physical HW
With virtualization each OS thinks that it owns the HW resources, but in reality it owns only virtual resources; the term Virtual Machine (VM) refers to the combination of OS, applications and virtual resources (the guest) running on a platform with actual physical resources (the host)
To support virtualization a platform needs a virtualization layer that provides the functionality needed to manage the physical HW and guarantees isolation between the VMs running, this is also called Virtual Machine Monitor (VMM) or Hypervisor
Consolidation refers to the ability to run multiple virtual machines on a single platform
Migration refers to the ability to easily clone a VM to another platform, providing greater availability
A VMM or hypervisor is responsible for the management of the physical resources and supports the execution of the VMs, usually integrating a privileged VM that runs a standard OS with full HW privileges; this privileged VM runs all the device drivers, controlling how the devices are used, besides performing other configuration and management tasks. Xen and VMware use this model
The hosted model uses a complete host OS that manages all the HW and integrates a VMM module (hypervisor module) which provides hardware interfaces to the VMs; the VMM module invokes drivers and other components of the host OS to access the HW
This model allows running VMs alongside native applications directly on the host OS
KVM (Kernel-based VM), based on Linux, uses this model; it needs the KVM kernel module and QEMU for HW virtualization
Guest non-privileged instructions are executed directly by the hardware; the hypervisor does not interfere with every single instruction issued by the guest OS, and as long as the guest OS operates within the resources allocated by the hypervisor everything is safe
Privileged instructions are trapped, switching control to the hypervisor (which runs at a privileged level); the hypervisor determines whether the operation is allowed, and if it is legal it emulates the behavior the guest OS was expecting from the HW, while if it is illegal it can terminate the VM, send an error, or take some other action
Binary Translation (full virtualization) refers to the idea of dynamically rewriting a VM binary so it never issues certain privileged instructions; this is done via dynamic binary translation, with the intention of not modifying the VM guest OS code, which means sequences of instructions that are about to be executed are dynamically translated if needed. VMware uses this approach
Paravirtualization refers to the idea of modifying the VM guest OS code so it knows it's running in a virtualized environment, in general modifying system calls into calls to the hypervisor (hypercalls). Xen and Citrix use this approach
For full virtualization it is expected that all guests see contiguous physical memory starting at address `0`; for this, the guest has a page table that maps virtual to physical addresses, and the hypervisor has a table that maps physical to machine addresses
For paravirtualization, since the guest is aware of virtualization, it explicitly registers its page tables with the hypervisor
RPC is a special IPC mechanism that specifies process interaction via a procedure call interface (a standardized interface)
General IPC mechanisms focus on providing the basic capability for moving data among address spaces without defining any protocol or standardized interface
RPC is intended to simplify the development of cross-address-space and cross-machine interactions
RPC offers a high-level interface that captures all aspects of data movement and communication of client-server type interactions, including communication establishment, requests, responses, acks, etc., abstracting all of this from the application that uses RPCs and hiding the complexities of cross-machine interactions
RPCs function under synchronous call semantics, in which a call to a procedure blocks the calling thread until the response arrives
RPC can handle type checking and cross-machine conversions (like Endianness) and support the usage of different transport protocols
RPCs are like function calls that are performed/executed remotely
RPC steps:
RPCs rely on an Interface Definition Language (IDL), which serves as a protocol to define what procedures a server supports and what arguments it needs for the various operations; in summary, an IDL is used to describe the interfaces the server exports (supports). The IDL needs to specify: procedure name, procedure version number, and argument and result types. These are usually written in a standard non-programming language (like XDR `.x` files) and then converted to a specific programming language
An OS that supports RPCs needs access to a registry of all the available services; this registry can be distributed or machine-specific
Pointer arguments in a remote procedure call are a special case: either RPCs do not support pointer arguments, or all the data the pointer points to is serialized and sent as part of the procedure call
RPCs have a special error notification that intends to catch all possible ways in which an RPC can fail
Systems hide the fact that there may be different types of storage devices by using a virtual file system interface; with this they can also hide the fact that storage is maintained on a remote file system accessed over the network
Environments where multiple machines are involved in the delivery of the file system service together form a Distributed File System (DFS)
In general, any filesystem that can be organized as follows is a DFS
In general a DFS allows clients to store locally (cache) parts of the files (blocks) they want to access, and to frequently interact with the server to report changes
Clients access the server over a network via a VFS layer, using the same types of file descriptors used to access local files; the VFS determines whether the file belongs to the local FS or whether it needs to pass control to the NFS client, which interacts via RPCs with the NFS server, which in turn issues an operation to its virtual file system to access the actual file on the server
When opening a file, a handle is generated on the server that identifies the server machine and the local file on the server; this handle is sent to the client, and the NFS client maintains and uses this handle when performing operations
NFSv3 (stateless) and NFSv4 (stateful) are the current NFS versions; they use a session-based approach (update on file open and flush/update on file close) when non-concurrent access is expected, but also include periodic updates (default: 3s for files and 30s for directories), and for concurrent access they support locking of files using a lease-based approach that defines the validity period of the lock
Distributed Shared Memory (DSM) is a service that manages memory across multiple nodes so that applications running on top have the illusion that they are running on a shared memory machine
Each node/system owns some portion of the memory, so it holds some of the state, and allows accesses (reads/writes) from any node; nodes must comply with a consistency protocol to make sure the shared accesses to the state make sense, and accesses in general must be ordered and observed by all the nodes as if all were accessing the same physical memory
DSM performs distributed state management operations; the state stored in memory is shared among different processes running on different machines
DSM is responsible for managing locally stored state and providing access to this local state, while at the same time accessing the state stored by other nodes/machines
DSM permits scaling beyond the memory limits of a single machine
DSM can be implemented fully in HW, relying on commodity interconnect technologies like Remote DMA (RDMA) which can connect machines and provide low latency when accessing remote memories; in this implementation the OS is under the impression that it has a much larger physical memory, so it can establish virtual-to-physical memory mappings that correspond to memory locations in other nodes
DSM can be implemented in SW (at the OS level or application level); in this case the software has to detect which memory accesses are local and which are remote, create and send messages to the appropriate node when necessary, and be involved in the memory sharing and consistency support
DSM, besides reads/writes, introduces sync operations which make sure all the updates that happened prior to the sync become visible; this operation must be performed both by the process that wants to signal (publish) its updates and by the process that wants to guarantee it has the latest updates
A file is a good abstraction of either a sink for, or source of, data, so it is an excellent abstraction of all the devices one might attach to the computer
The standard library of a UNIX platform is generically referred to as `libc`; it provides the interface to common functionality like `open()`, `read()`, `write()` and `printf()`
Device files live under `/dev`. In UNIX-like systems, so-called device nodes have what are termed a major and a minor number; these numbers allow the kernel to associate particular nodes with their underlying driver
pipe
The GNU Compiler Collection (`gcc`) almost completely implements the C99 standard; `gcc` can be directed to adhere strictly to the standard using the `-std=c99` flag, which will warn or create an error when things that are not in the standard are done
The general principle behind all computer operations is that the CPU executes instructions read from memory, which can load/store from memory or operate on values stored in registers (add, subtract, bitwise operations, etc.); executing a single instruction usually consists of a particular cycle of events: fetch, decode, execute and store/write back
The CPU keeps a record of the next instruction to be executed in the instruction pointer, which usually increments sequentially but can be changed by branching instructions
Latency refers to the time the CPU takes to access data in main memory
Instructions in a RISC processor are much simpler, and the logic of ordering instructions can be moved from the processor to the compiler; this means compiler writers need to be smarter to find the best ordering of code for the processor
Computer code generally exhibits two forms of locality: spatial locality (nearby data is likely to be accessed soon) and temporal locality (recently accessed data is likely to be accessed again)
The cache is a very fast copy of the slower main system memory, small chunks of mirrored main memory.
The communication channel between the processor and the peripherals is called a bus.
An interrupt allows the device to literally interrupt the processor to flag some information
The general principle is that the operating system has configured an interrupt descriptor table which pairs each of the possible interrupts with a code address to jump to when the interrupt is received
Most drivers will split up handling of interrupts into bottom and top halves
It is generally the job of the operating system to ensure that, upon entry to the interrupt handler, it saves any state and restores it when returning from the interrupt handler
The most common form of IO is so called memory mapped IO where registers of a device are mapped into memory