Skip to content

Operating system basics and required CPU substructure



The architecture of modern CPUs is not understandable without some understanding of the structure of modern  multitasking operating systems. CPU architecture is evolved in a particular way because complicated and multitasking OS’s had required it so, not the other way around. This was not so with the first microprocessors like 8080/6800/Z80 etc, which were not designed with OS’s in mind. Back then it was the OS writer’s duty to shape his code to fit the microprocessor’s limitations. Hence, with processors like 8086, the limit was OS’s like DOS. Things started to change with 80386, where complicated memory management mechanisms were incorporated into the hardware, and a privilege mechanism was adopted to separate kernel code from user programs. Only after 80386, complicated OS’s like windows or Linux became viable.

In this note, we will survey some of the basic demands of multitasking OS’s made on CPU architecture. Later on, we will see how each of these demands is translated into an architectural detail of modern CPUs.

Our choice is to study modern intel CPU’s. The upside is they are nearly universal. The downside is they carry a lot of historical baggage, hence they are unnecessarily complicated. Indeed, many of the architectural details of Intel processors are not used at all by OS writers. For example, segmentation provided by intel is not used at all. Same with TSS’s and call gates. Only two of the four privilege levels are used. Examples can be multiplied. In these notes, outfashioned features that are not used by modern OS’s will not be mentioned. We will not talk about I/O space and in and out instructions. We will only present a partial and “cleansed” picture of these processors. (Actually, some of these unused features like call gates found favour with virus writers. Some viruses like Stuxnet actively use them).

Our choice for OS is Linux, because it is open source. We will use version 2.6.11.

We assume that you already had a course on OS’s and cmp arch and familiar with the basic concepts.

Basic OS Concepts in a Nutshell

The discussion in this section will be generic, and not limited to Linux kernel or Intel processors. For the sake of simplicity,  we will assume a processor with 8 bit data bus and 16 bit addres bus.

Basic constituents of a Computer system

A computer system is built on five closely related concepts:

  1. Kernel
  2. User programs
  3. I/O devices and their drivers in memory
  4. Files
  5. Filesystems

We will examine each of these concepts below.


A kernel is a program whose main duty is to form a bridge between the user programs and hardware resources. It allocates and deallocates limited hardware resources like disk space, CPU time, memory and IO devices between user programs. It also detects and kills harmful user programs.

Except for a tiny part, which is written in assembly, most kernels are written in C.

User Programs

A user program (equivalently, an application program) is some code which performs a specific task on a computer for a user.

A process is an instance of a user program that is being loaded into computer and executed. Two different instances of the same program can execute as two different processes. If these two instances also share some resources, they are called “threads” of the same process.

Processes do not directly interact with kernel to ask for resources. When they need to interact, such as requesting disk I/O, keyboard I/O, or new memory allocation, they instead notify the kernel, via a specific function call known as a “system call’‘. Each such specific interaction has its own unique system call. For example, malloc() allocates memory, fopen() opens a file etc. Kernel, after receiving the system call, does the necessary work and sends the results back to the calling process.

The only resource which forms an exception to the picture given above is the CPU time. User processes do not receive CPU time via system calls. Instead, kernel stops the currently running user process at regular intervals (called scheduler tick) and starts another user process. The part of the kernel that does this is called “the scheduler”.

In most operating systems, kernel holds a specific data structure for each actively running process which stores all the bookkeeping information about that process..  In Linux, this is a 8K structure called a task_struct. When a new process is created, a task struct is allocated for it. And, when the process is destroyed, its task struct is also destroyed. Task structs of all active processes are kept in a doubly linked queue called runqueue. A variable called runqueue points to the start of this queue.

Kernel itself is not a process, and it does not have an associated task structure. Each running user process sees the kernel as part of itself. Or, more precisely, it sees kernel as something similar to one of its libraries, whose functions can be called via system calls. In contrast, a user process is not aware of the existence of other user processes running on the same computer.

I/O Devices

An I/O device is a piece of hardware that the computer uses to communicate with the outside world (keyboard, monitor, mouse, printer etc) or store data (secondary storage devices like hard disk, tape, usb stick etc). Kernel sees an I/O device as a set of locations in memory. Hence, I/O is usually done as a set of memory writes/reads. Example: The memory area starting from 0xB800 is actually the video memory. Anything written to this address goes to screen.

Files and Filesystems

Files are containers of information on the secondary storage which can be read (into RAM) or written (from RAM) by processes. Files has names, and processes
read or write files by referring them with their names. For example, if we want to read 517th character of the file foo.txt, we can use the following code segment:

char x;
int fp=fopen(“foo.txt”,“read”);

The file foo.txt is stored on a block device. Block devices does not know anything about files or filenames. Instead, block devices are composed of contiguous blocks of size 512 bytes, and each such block has a number which can be used to acces it. When a file is saved on a block device, sufficient number of blocks are allocated to that file, according to that file’s size.

As an example, assume that our file foo.txt is stored on a 1M flash disk. In this disk, there is 2K blocks, numbered from 0 to 2047. If we assume that our file is 2100
bytes long, 5 blocks will be required to store it, and 5*512-2100 bytes will be wasted as a result. Let

  • The first 512 bytes of foo.txt be stored in block 47
  • The second 512 bytes of foo.txt be stored in block 14
  • The third 512 bytes of foo.txt be stored in block 735
  • The fourth 512 bytes of foo.txt be stored in block 338
  • The remaining 52 bytes of foo.txt be stored in block 92

As can be seen from this example, blocks belonging to a file can be spread on the disk in a completely disordered way.

In block devices minimum unit of transaction is a block. We have to read/write a whole block to/from RAM from a block device even if we need only a single character in it. Also, we need to allocate a whle block even when we need to store only a single byte to the block device.

Recall that a user process wants to read 517th byte of foo.txt. But the block device does not know anything about foo.txt. Instead, the kernel must first reserve 512 bytes of RAM, then reads the whole block 14 of the flash disk into that 512 bytes of RAM buffer. And then, it should extract the 5th byte of the 512 bytes that is read, and return it to the user process.

Actually, the real proces is a bit more complicated. Note that in our example there is a mapping between the file name foo.txt and ordered set of block numbers 47, 14, 735, 338, 92. For each filename there must be one such ordered set of block numbers. Generally the first few blocks of each block device, (say, blocks 0-10 in our flash disk,) is reserved to keep this mapping. So when a user process wants to read 517th byte of foo.txt, kernel first searches blocks 0-20 to find which block numbers foo.txt is mapped to. Only after this, it reads the relevant block into memory and extracts the relevant byte to return to the user program..

The metadata at the first few blocks of the disk which maintains the filename/block numbers mapping (and the part of the kernel which manipulates this data) is known as a filesystem. There are many different ways to keep this data, and hence there are many different filesystems. The best known are FAT, used by windows, and ext2/ext3, used by UNIX/Linux.

Basic tasks of the kernel

Kernel has four basic tasks:

  1. Handling system calls.
  2. Handling interrupts.
  3. Handling certain error conditions of user programs, called exceptions
  4. Scheduling

Below, we will investigate the first three in detail. The fourth will be investigated in its own chapter. But let us note in advance that all of these will turn out to be a special case of “handling interrupts”.

System Calls

When user processes request a service from the kernel, they use a system call. Linux has approximately 300 different system calls. Examples are malloc(), getch(), putch(), fopen(), fclose() etc.

Error Conditions (Exceptions)

A second way for kernel-process interaction is exceptions. This interaction is usually (but not always) fatal for the process.

When a process runs, the CPU hardware supervises the execution of each instruction. When an instruction does something illegal, like trying to divide an integer with zero, or trying to execute an invalid opcode, or, more dangerously, trying to read or write over the kernel data, the CPU triggers an “exception“ ie, it suspends the execution of the current process and calls a kernel function which which handles that particular error condition. Such kernel functions are known as exception handlers or exception service routines (ESR’s).

Each processor brand has its own exception list . For example, intel architecture has defined 20 exceptions. Each of these exceptions have an associated number between 0 and 19, which is called an exception vector.  For example, divide by zero error exception has vector 0. Invalid opcode exception has vector 6. Whenever an intel processor catches an error, it classifies this error into one of these 20 error categories. For each exception vector, there is a corresponding ESR loaded in memory which handles that error.  In total, there are 20 ESR’S in memory, each policing one of the 20 exceptions. After the exception is detected and classified, CPU suspends the current program calls the corresponding ESR.

The ESR, which is usually a C program, inquires the situation and decides what to do with the transgressing process. Sometimes it does some changes, and then returns back to the program to continue running from where it left. Most of the time, it kills the offending process and calls the scheduler to schedule a new process…

Note the ”division of labour“ between the kernel and hardware: exceptions are detected and classified by CPU (ie, hardware). Each executed instruction is checked automatically for various errors by the CPU. CPU also classifies the error and calls the correct ESR. But ESR’s corresponding to these exceptions are provided by the kernel (ie, software).

Exceptions lie at the heart of a multitasking kernel. They allow the hardware to detect an erring process quickly and notify the relevant ESR before it corrupts other peoples data, or, worse still, it corrupts kernel data (which results in immediate crash) or disk data (which results in permanent loss of information).

We have said that exceptions are error conditions. Only one exception (exception 14, page fault) does not fit into this classification. Page fault exceptions lie at the heart of the demand paging system, hence they are not error conditions. As we will give a lot of information about this extremely important exception in the coming pages, we will not explain it here.

Hardware Interrupts

I/O devices communicate with kernel via hardware interrupts. Each I/O device is assigned a number which is called an  interrupt vector. In intel architecture, the interrupt vectors are 8-bit numbers which vary between 32-255. For example, timer is assigned the vector 32, keyboard is assigned the vector 33, hard disk is assigned the vector 46 etc. We will assume that these assignments are unique (Sometimes different I/O devices can share a single interrupt vector. But this is a highly undesirable situation). When a hardware device needs to communicate with kernel, it sends an interrupt signal, followed by its interrupt vector to the CPU. CPU receives and checks this interrupt vector and finds out which hardware device needs its services.  All active hardware devices must have a corresponding interrupt service routine (ISR) loaded in memory. As soon as CPU receives the interrupt vector, it knows which device raised the interrupt, and therefore which ISR is to run. CPU suspends the currently executing program and calls the ISR of that particular I/O device.  For example, if we press the keyboard, the processor will receive the interrupt vector 33. It will suspend the current process and calls the
keyboard’s ISR. Keyboard’s ISR will read which key we have pressed.

Each ISR finishes with an IRET instuction, which resumes the current process.

One of the most important interrupts is the timer interrupt, with vector 32, which, among other things, periodicaly evokes the scheduler. Yes, scheduler is basically an ISR.

Revisiting System Calls

Under some conditions, it is possible to evoke hardware interrupts from software via INT k “software interrupt” instruction, where k corresponds to an interrupt vector. For example, the instruction INT 33 will run the ISR corresponding to keyboard.
It is possible to enable or disable software interrupts on a vector by vector basis.
In Linux, only the software interrupt corresponding to vector 128 (ie, INT 128 instruction) is allowed. Any other INT k instructions are ilegal under Linux configuration.

INT 128 instruction is very important, as all system call functions are called by it. In other words, all system call functions are subfunctions of a sigle ISR corresponding to the interrupt vector 128. Hence return from system calls are done in the same way as with return from interrupts, ie, via an IRET instruction.

Interrupt descriptor table (IDT)

A cpu spends most of its time running user programs, and the kernel is only activated when the cpu receives a system call, exception or hardware interrupt. In all these three activations, the mechanism is the same:

  1. CPU suspends the currently running process, checks the interrupt/exception vector at hand,  and calls the relevant ISR/ESR. Recall that a system call is actually an interrupt with an interrupt vector 128.
  2. The associated ISR/ESR quickly does what is required of him.
  3. When finished, it revives the user programs via an IRET instruction.
  4. If this is a timer interrupt, the timer ISR may also switch user processes by manipulating the kernel data. How exactly it does this will be seen later. But when the ISR executes IRET, it returns back and revives a different user program. This is how a context switch is done. Some system calls also works this way, like sleep().
  5. Or if an exception, the ESR may kill the user processs and schedule a new one. In this case also, when the ESR executes IRET, it will return to a differen program. Details will come later.

I hope that at this stage it is apparent that interrupts, exceptions and system calls (which is interrupt 128) are just three faces of the same thing:  They are all interrupts. And all are managed from a table called “interrupt descriptor table”.
In intel, there are 256 rows in this table.

  • Rows 0-19 corresponds to the 20 exceptions, with row i being a function pointer to ESR i.
  • Rows 32-255 corresponds to hardware interrupts, with table element i being a function pointer pointing to ISR i. In particular, row 128 points to the ISR which manages the system calls.
  • Rows 20-31 are not used.

Because of the IDT, it is always possible to locate ISR/ESR’s in memory when one knows the interrupt/excaption vector. But how to locate the IDT itself? In x86, there is a specific register, called IDTR, in CPU, which keeps the address of IDT in CPU. Hence

  • ISR/ESR’s can be placed anywhere in memory, provided that their address in IDT is kept properly,
  • IDT  can be located anywhere in memory, provided that its address in IDTR is kept properly

Now we can describe the the entry/exit to the kernel with more detail..

  1. User process runs
  2. An interrupt/exception/INT 128 instruction is encountered.
  3. The CPU finds out the relevant interrupt/exception vector.  For system calls, this is always 128 (explain +20 here)
  4. CPU adds the 4*(interrupt/exception vector) to IDTR and finds the relevant row of IDT
  5. CPU reads out the address of the relevant ISR/ESR from that row.
  6. CPU suspends the current process and calls that address
  7. ISR/ESR executes.
  8. All ISR/ESR code are terminated by an IRET instruction. This IRET instruction resusciate the suspended current user program.

Note that steps 2-6 is executed by CPU. The only role of the kernel is to set up the relevant data structures correctly during the system initialization, ie, loading ESR/ISR’s into memory and setting up the IDT and IDTR correctly.

Kernel mode and User mode

Most modern CPU’s have two modes of operation:

  • High previlege mode
  • Low privilege mode

When the CPU runs in the high privilege mode, it doesn’t have any constraints on it. In contrast, when it runs in low privilege mode, it runs under many constraints.
For example: In x86, IDTR register is modified with a LIDT (load interrupt description table register) instruction.  LIDT can only be executed when the CPU is  high privilege mode.

In Linux:

  • Low privilege mode is also called the “user mode”. It is exclusively used when running the user code.
  • High privilege mode is also called the “kernel mode”. It is exclusively used when running the kernel code.

How does a CPU goes from kernel mode to user mode and vice versa?

  • A CPU will pass automaticaly from the user to the kernel mode when it receives an hardware interrupt, exception, or encounters an INT 128 instruction (ie, a system call).
  • If CPU receives an interrupt when it is in kernel mode it will remain in the kernel mode.
  • Correspondingly, when the processor executes an IRET instruction, it will exit from the kernel mode and return to user mode.

The reason for dividing the CPU time into two modes is protection is this: User programs are assumed to be malicious, buggy and not to be trusted.  Hence they are only allowed to run under more restricted conditions. Kernel code, on the other hand, is assumed to be bug free. If an analogy is permitted, kernel can be likened to an adult and users can be considered as children. So all matches, knifes, glass must be locked away when children are around. This problem is most severe under the modern multitasking systems: one misbehaving ”kid“ may bring down the entire system.

For example if I am a user in a multi-user system and I want to destroy the system to give vent to my malicious feelings, I’ll just execute a LIDT instruction and load junk into IDTR. Then all the interrupts will go awry and the system will collapse, with possible permanent loss of data. But in a previleged system, LIDT instruction cannot be executed when the CPU is in the user mode. If executed, a general protecton error (exception 13) will result and our program will be killed by the exception 13’s ESR. In windows, the result is the famous blue screen. The end result of this is that a user cannot use LIDTR instruction, and cannot mess up with IDT. Only kernel can use LIDT.

Kernel space and User space

In Linux (and many other modern OS’s) computer virtual memory is divided into two parts: kernel space and user space. Under Linux/Intel,

  • Virtual memory in the address range 0-3G belongs to user space
  • Virtual memory in the address range 3G-4G belongs to kernel space

The terrm “virtual memory” will be explained in the next section.

The data/code in the kernel space can only be used when the CPU is in the kernel mode. If the CPU is in the user mode and

  • tries to ld/st an address that is above 3GB boundary,
  • call a function whose address is above 3G boundary,
  • jump/branch to an address which is above 3G boundary,

a general protection exception (exception 13) will result, with a possible blue screen. On the other hand, when CPU is in the kernel mode, it can reach any address.

Kernel space contains all the code and data structures of the kernel. It exclusively belongs to kernel.

  • All ISR’s
  • All Esr’s
  • All system call routines
  • IDT
  • page tables(see below)
  • runqueue (contains the kernel mode stack and page table of all processes)

are located in kernel space. Naturally, all IDT elements must be larger than 3G.

User space belongs to the user process. It contains user code/data and can be run at both kernel and user modes. (?)

As a result of this arrangement, the kernel space is protected from the user access. The exact mechanism of this protection is implemented by page tables, which will be described in the next section.

A Detailed look at Interrupts

This is a highly sanitized version of intel interrupt processing sequence. I have suppressed every detail relating to TSS.

  1. CPU reads the interrupt vector. It adds the interrupt vector to IDTR to find the address of the required ISR.
  2. If the CPU received the interrupt while it is executing the kernel code (ie, while it is in kernel mode), it continues to remain in the kernel mode. But if it received the interrupt while it is executing the user code (ie, while it is in user mode), CPU passes to the kernel mode.
  3. If the transition from the user mode to the kernel mode occured, the CPU must perform an operation known as “stack switch”.  In intel architecture, each process has two stacks: A user stack, allocated in the user space, and a kernel stack, allocated within the task_struct of each user process, in the kernel space. If the interrupt comes in user mode, the CPU stops using the user mode stack and start using the kernel mode stack. It does this by reading the address of the kernel mode stack from the task_struct and loading it to the SP register.
  4. After the stack switch, CPU pushes the previous contents of the SP into the new kernel mode stack.
  5. pushes flags
  6. pushes pc into the kernel mode stack. Note that PC contains the address of the next instruction to be executed.
  7. Loads the address of the ISR, as calculated in step 1, into PC.  This starts the execution of ISR.

A Detailed look at Exceptions

Exceptions mostly proceed in the same way with interrupts. Again, this is the sanitized version

  1. CPU reads the exception vector. It adds the interrupt vector to IDTR to find the address of the required ESR.
  2. If the exception occured while the CPU is executing the kernel code (ie, while it is in kernel mode), it continues to remain in the kernel mode. But if it received the interrupt while it is executing the user code (ie, while it is in user mode), CPU passes to the kernel mode.
  3. If the transition from the user mode to the kernel mode occured, a stack switch occurs exactly as described above.
  4. After the stack switch, CPU pushes the previous contents of the SP into the new kernel mode stack.
  5. pushes flags
  6. pushes pc into the kernel mode stack, if if the exception is of abort or trap type. In this case, we want to execute the next instruction at the return from the exception. But if the exception is of fault type, we push the address of the current instruction. In this case, at the return from the exception we want to execute the current offending instruction.
  7. If the exception carries a hardware error code, it pushes it to the stack
  8. Loads the address of the ESR, as calculated in step 1, into PC. This starts the execution of ESR.

Return from Interrupts and exceptions

  1. Pop cs and flags from the kernel mode stack. If a hardware error code has been pushed, it must be popped before executing iret.
  2. Check whether the CPL of the handler is equal to the value contained in the two least significant bits of cs (this means the interrupted process was running at the same privilege level as the handler). If so, iret concludes execution; otherwise, go to the next step.
  3. Pop the sp from the stack. This will bring back the user mode stack.

Nested interrupts and Exceptions

Kernel code is assumed to be faultless and immune to exceptions. Therefore, exceptions is only assumed to occur in user mode. As soon as the exception occurs, the system will pass into the ESR (kernel mode), and a second exception becomes impossible until that ESR terminates and the system returns back to the user mode.. Therefore, only one ESR can run at a given time, and these are always triggered from the user mode.

The only exception to the above picture is page fault exception. Page fault exceptions can be triggered from within the system calls, hence, from within the kernel mode. Therefore, we have the picture

  • A single ESR is running on top of the user code
  • A page fault ESR is running on top of a system call on top of the user code.

Interrupts can interrupt almost anything: user code, other interrupts, system calls and exceptions, hence they complicate the picture. The only exception is interrupts with the same interrupt vector. These are “serialized”, ie, if, say, interrupt with interrupt 47 is running, no other interrupt with interrupt vector 47 is allowed until the first one is terminated. As they are kernel code, ISRs will not give rise to exceptions and system calls.

Therefore, at a given time, we have one of the following.

  1. User code is running
  2. An ESR is running on top of the user code.
  3. A system call is running on top of the user code.
  4. Page fault ESR is running on top of a system call on top of the user code.
  5. Arbitrary number of ISR’s is active on top of the user code.
  6. Arbitrary number of ISR’s is active on top of an ESR on top of the user code.
  7. Arbitrary number of ISR’s is active on top of a system call on top of the user code.
  8. Arbitrary number of ISR’s is active on top of the PFE ISR on top of a system call on top of the user code.

For future reference: A context switch is only possible in cases 1-4. In cases 5-8, the process is said to be in “interrupt context”, ie, some other process is using our kernel mode stack. In such cases it is not possible to context switch..

Memory Management

Note: There are many possible strategies in memory management. Two of these are implemented in x86 processors

  • Paging
  • Segmentation

Out of these two, segmentation is the basic one, and paging works on top of segmentation..

While intel architects preferred segmentation over paging, Linux and Windows people begged to differ. They decided that paging is the better of the two, and segmentation is just a nuisance. Hence they made decisions which removed segmentation totally out of picture. Linux memory managemet relies only on paging..

In the following notes, we will only describe paging and ignore segmentation.

Why memory management?

Many problems arise when we try to load a executable into computer memory and run.  Consider the simplest possible self-contained program in some high level language:

Loop: goto Loop

Even though this single-line program is completely useless, it is nevertheless a legal program and it serves to ilustrate many problems that arise in loading a program into computer memory and running it concurrently with other programs. When we compile this program, the following assembly language code wil be generated:

loop jmp loop

When we assemble this code and generate the executable, we have to specify the memory location to which this code wil be loaded.  If we are decided to load this code to memory location F1204B1Ah (which corresponds to assigning F1204B1Ah to symbol “Loop”) our machine code will become:

01 F1 20 4B 1A

(we have assumed that opcode for jmp is 01) This code must be loaded into memory location F1204B1A to work correctly. No other memorylocation wil work.
Here we face our first problem: there are machines with limited memory, and there is a possibility that the memory location F1204B1A does not contain any physical RAM in the machine we want to run this program.

Assuming that we have circumvented this problem, we face an another problem: If we want our code to run in a multi-tasking environment, the memory location F1204B1A may already be occupied by an another program.

It seems that we cannot run our .exe program in all computers, and when we can, we cannot run it concurrently with other programs that are compiled and assembled to use the same memory addresses that our program use.

Virtual memory has been developed to solve these two closely related problems.

Basic Concepts of Virtual Memory

To understand Memory management in modern processors, we have to get accustomed with two terms: physical memory and virtual memory.

When we compile a program, we generally make two assumptions:

  • The compiler imagines that all of the memory space is present and usable. The size of this space depends on the number of address pins of our processor. In intel IA-32 architecture,this space is $2^{32}=4GB$. This imagined 4 GB space is known as the “virtual space”.
  • The compiler also assumes that only our program and the OS is present in our system. The 4GB virtual memory space is shared between the OS and our program. Our program takes the range 0-3G (ie, the user space), and OS would take the range 3G-4G (ie, the kernel space). When the compiler generates machine code for our programs, it never uses addresses in the OS range, ie, above 3G.

Virtual memory is divided into “pages”, and each page has a page number. In intel architecture, these pages are of 4K size. The lowest page have page number 0, etc. The minimum unit we can allocate from virtual memory is a page. Even if we only request 100 bytes, operating system will assign us a page, which is 4K. Note that not all 4K ranges are pages.

Example: Let us say that our CPU has 16 address pins, which makes 64K of virtual memory (we are not discussing intel now).  Assume also that the pages are 4K. This makes 64K/4K=16 pages. Let us assume that the OS we use is of size 10K and compiled for 54-64K range. Then we must assign pages 13, 14, 15 to OS image. Note that 2K is wasted.
Assume we have three programs A, B and C. These programs are compiled fort he following ranges:

  • A: 17-20K
  • B: 12-18K, 30-35K
  • C: 30-40K

Note that there is no obligation for programs to be “single piece”, program B consists of two pieces.

When we assign pages to processes A, B, C, the result will be

  • A: pages 2, 3, 4, 13, 14, 15
  • B: pages 3,4,5,7,8, 13, 14, 15
  • C: pages: 7, 8, 9, 13, 14, 15

This means that when these programs run, they will see the memory as shown in the figure


Note that virtual memory is something that belongs to a program.. Each program has its own view of the virtual memory, and is not aware of other programs’ virtual memory. Its view of virtual memory may even contradict with other programs.. In our example, programs seem to intrude each others’ space at pages 3,4, 7 and 8. But all programs must see the OS consistently at the same location, ie, at the same pages (in our example, pages 13, 14, 15). As we will see later, this will be important during the context switch..

Physical memory

But when we attempt to load our program into a physical computer’s memory, we came face to face again with the two limitations listed above. The computer’s memory is the physical memory, and it may be smaller (but never larger) than the virtual memory, not the virtual memory, and may include “holes”. In 32-bit processors, physical memory has a chance to be equal to virtual memory, which is just 4G. In the more recent 64-bit systems, there is no such chance. Virtual memory is always much, much larger than physical memory.

Physical memory is divided into page frames. Each page frame is 4K long.
Continuing from the previous example, assume that our computer has 2 16K memory chips, 32K of physical memory in total, which makes 10 page frames. Let us asume that the adress decode logic of our main board assigns the physical address range 0K-16K to the first memory chip and the physical address range 32-48K to the second one.  In this case

  • Page frames 0, 1, 2, 3 will be in the first memory chip
  • Page frames 8, 9, 10, 11 will be in the second memory chip
  • Remaining page frames 4, 5, 6, 7, 12, 13, 14, 15 are not assigned to a memory chip and hence do not exist in our system.
  • Note also that physical memory may not be contiguous and there may be “holes” in the available address range. In our system, one such hole exists at the physical address range 16-32K.

The figure below compares virtual and physical memory.

Loading a program compiled for virtual memory into physical memory

Programs are loaded into pysical memory, but each program dreams that it exists alone in its own virtual memory. The adresses they generate to reach data and code, ie, the addresses that are in the instructions ld, st, jmp, jz, call etc and the addresses generated by the program counter are virtual memory adresses. Such addresses must be mapped into physical memory addresses before they are used. This is called memory management.

Each modern CPU which is capable of memory management has a memory management unit inside it. This is a small hardware circuit embedded into the processor. Its main duty is to map pages in the virtual memory to page frames in the physical memory. It does this with the help of a data structure called a “page table”, which resides in memory. Each process has its own page table, which is stored in kernel. There is a pointer from each process’ task_struct to that process’ page table.


In order to understand process switch, which is our next topic, we have to know a little bit about the posix threads..


In Linux, a process switch only occurs by calling the schedule(void) function. Also,

  • The variable prev must point to the task_struct of the process to be switched out.
  • The variable next must point to the task_struct of the process to be switched in.

Every process switch involves two steps

  1. A Page table switch.
  2. A Kernel mode stack switch.

The second step is performed by switch_to(prev, next, last) macro. This macro has three parameters, called prev, next, and lastprev and next is described above. But what is last?

The first thing to observe is that switch_to is always called like switch_to(prev, next, prev). Therefore the parameters prev and last point to the same structure, the process to be switched out.

Secondly, who calls switch_to? Assume process A is switched out and process B is switched in in its place. switch_to() is obviously called while the process A is active. Therefore, as the parameters prev,  next and last are local variables of switch_to(),  they are allocated in process A’s kernel mode stack. Then switch_to() stops the execution of A, and the variables prev and next freezes in the kernel mode stack of A.  Therefore, when frozen, A remembers which process is activated after it.

Assume that at some future time process C is switched off and process A is switched on. When A resumes its execution, it finds in its Kernel Mode stack the prev local variable points to A’s descriptor and the next local variable points to B’s descriptor.  There is no reference to C. But for various reasons that will be explained later, we want a running processes to know which process was active before it. Hence we have to find a mechanism which will commit this information to B. This is achieved with the help pf the 3rd parameter of switch_to(), which is last. last provides a memory location into which the macro writes the descriptor address of process C, and this is done with the help of an assembly language trick.. Best way to understand how this trick works is to go through code:

  1.  eax <= prev, edx<=next
  2. ???????
  3. sp => (prev->thread.esp),  (prev->thread.esp) = > sp
    This is the stack switch. As the address of a process descriptor is closely related to that of the Kernel Mode stack, changing the kernel stack means changing the current process.
  4. Saves the address labeled 1 in prev->thread.eip. When the process being replaced resumes its execution, the process executes the instruction labeled as 1:
  5. On the Kernel Mode stack of next, the macro pushes the next->thread.eip value, which, in most cases, is the address labeled as 1
  6. jumps  to  __switch_to() macro  (explained below)

__switch_to  acts on the prev_p and next_p parameters that denote the
former process and the new process. It receives these parameters eax and edx registers, not from the stack like most functions. To force the function to go to the registers for its parameters, the kernel uses the _ _attribute_ _ and regparm keywords, which are nonstandard extensions of the C language implemented by the gcc compiler. The _ _switch_to( ) function is declared in the include/asm-i386/system.h header file as follows:

__switch_to(struct task_struct *prev_p, struct task_struct *next_p)_attribute_ _(regparm(3))

  1. saves the contents of fpu, mmx and xmm registers.
  2. Executes the smp_processor_id( ) macro to get the index of the local CPU, namely the CPU that executes the code. The macro gets the index from the cpu field of the thread_info structure of the current process and stores it into the cpu local variable.
  3. Loads next_p->thread.esp0 in the esp0 field of the TSS relative to the local CPU. When the CPU goes from user to kernel mode, address of the runnig process’ kernel mode stack will be found here. Therefore, whenever a process is activated, this field must also be updated.
  4. Does something related to private variables of threads.
  5. The __switch_to() C function ends by means of the statement:
    eax <= prev_p   //return value of any C ftn passes through eax

Notice that the value of eax is thus preserved across the invocation of __switch_to(); this is quite important, because the invoking switch_to macro assumes that eax always stores the address of the process descriptor being replaced.

We have entered __switch_to via a jump, but now we are exiting it via a ret. Therefore, __switch_to will pop the top of the stack and will return to 1:

If next_p was never suspended before because it is being executed for the first time, the function finds the starting address of the ret_from_fork( ) function. 


The do_fork( ) function underlies the clone( ), fork( ), and vfork( ) system calls. Its relevant parameters are:

  • clone_flags
  • stack_start: Same as the child_stack parameter of clone()
  • Pointer to the values of the general purpose registers saved into the Kernel Mode
    stack when switching from User Mode to Kernel Mode (galiba kullanılmıyor)
  • parent_tidptr, child_tidptr: Same as the corresponding ptid and ctid parameters of clone()

do_fork() makes use of an auxiliary function called copy_process() to set up the process descriptor and any other kernel data structure required for child’s execution. Here are the main steps performed by do_fork().

In this list, code relating to ptrace or CLONE_STOPPED flag is ignored

  1. Allocates a new PID for the child by looking in the pidmap_array bitmap
  2. Invokes copy_process() to make a copy of the process descriptor. If all needed resources are available, this function returns the address of the task_struct descriptor just created.
  3. invokes the wake_up_new_task() function, which performs the following operations:
    a. Adjusts the scheduling parameters of both the parent and the child
    b. If the child will run on the same CPU as the parent,* and parent and child do not share the same set of page tables (CLONE_VM flag cleared), it then forces the child to run before the parent by inserting it into the parent’s runqueue right before the parent. This simple step yields better performance if the child flushes its address space and executes a new program right after the forking. If we let the parent run first, the Copy On Write mechanism wouldgive rise to a series of unnecessary page duplications.
    c. Otherwise, if the child will not be run on the same CPU as the parent, or if parent and child share the same set of page tables (CLONE_VM flag set), it inserts the child in the last position of the parent’s runqueue.
  4. If the CLONE_VFORK flag is specified, it inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).
  5. Terminates by returning the PID of the child.

copy_process() traces the following steps. Again, all security and consistency checks, anything relating to execution domains and executable formats, thread local storage, all FPU, MMX, SSE/SSE2 stuff along the way are ignored

  1. Invokes dup_task_struct() to get the process descriptor for the child. This function performs the following actions:
    a. Executes the alloc_task_struct() macro to get a process descriptor (task_struct structure) for the new process, and stores its address in the tsk local variable.
    b. Executes the alloc_thread_info macro to get a free memory area to store the thread_info structure and the Kernel Mode stack of the new process, and saves its address in the ti local variable. As explained in the earlier section “Identifying a Process,” the size of this memory area is either 8 KB or 4 KB.
    c. Copies the contents of the current’s process descriptor into the task_struct structure pointed to by tsk, then sets tsk->thread_info to ti.
    d. Copies the contents of the current’s thread_info descriptor into the structure pointed to by ti, then sets ti->task to tsk.
    e. Sets the usage counter of the new process descriptor (tsk->usage) to specify that the process descriptor is in use and that the corresponding process is alive (its state is not EXIT_ZOMBIE or EXIT_DEAD)
    f. Returns the process descriptor pointer of the new process (tsk)
  2. Checks whether the value stored in current->signal->rlim[RLIMIT_NPROC].rlim_cur
    is smaller than or equal to the current number of processes owned by the user. If so, an error code is returned, unless the process has root privileges. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.
  3. Increases the usage counter of the user_struct structure
    tsk->user->_ _count
    and the counter of the processes owned by the user
  4. Checks that the number of processes in the system (stored in the nr_threads variable) does not exceed the value of the max_threads variable. The default value of this variable depends on the amount of RAM in the system. The general rule is that the space taken by all thread_info descriptors and Kernel Mode stacks cannot exceed 1/8 of the physical memory. However, the system administrator may change this value by writing in the /proc/sys/kernel/threads-max file.
  5. Sets a few crucial fields related to the process state:
    a. Initializes the big kernel lock counter tsk->lock_depth to 1 (see the section “The Big Kernel Lock” in Chapter 5).
    b. Initializes the tsk->did_exec field to 0: it counts the number of execve() system calls issued by the process. (bunu niye sayıyor?)
    c. Updates some of the flags included in the tsk->flags field that have been copied from the parent process: first clears the PF_SUPERPRIV flag, which indicates whether the process has used any of its superuser privileges, then sets the PF_FORKNOEXEC flag, which indicates that the child has not yet issued an execve() system call.
  6. Stores the PID of the new process in the tsk->pid field.
  7. If the CLONE_PARENT_SETTID flag in the clone_flags parameter is set, it copies the child’s PID into the User Mode variable addressed by the parent_tidptr parameter.  (?????????)
  8. Initializes the list_head data structures and the spin locks included in the child’s process descriptor, and sets up several other fields related to pending signals, timers, and time statistics.
  9. Invokes copy_semundo(), copy_files( ), copy_fs( ), copy_sighand( ), copy_signal(), copy_mm( ), and copy_namespace( ) to create new data structures and copy into them the values of the corresponding parent process data structures, unless specified differently by the clone_flags parameter.
  10. Invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) system call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 10). However, the function forces the value 0 into the field corresponding to the eax register (this is the child’s return value of the fork() or
    clone() system call). The thread.esp field in the descriptor of the child process is initialized with the base address of the child’s Kernel Mode stack, and the address of an assembly language function (ret_from_fork( )) is stored in the thread.eip field.
  11. Initializes the tsk->exit_signal field with the signal number encoded in the low bits of the clone_flags parameter, unless the CLONE_THREAD flag is set. CLONE_THREAD flag Inserts the child into the same thread group of the parent, and forces the child to share
    the signal descriptor of the parent. This
    case initializes the tsk->exit_signal field to 1. As we’ll see in the section “Process Termination” later in this chapter, only the death of the last member of a thread group (usually, the thread group leader) causes a signal notifying the parent of the thread group leader.
  12. Invokes sched_fork() to scheduler-related setup for the new process. The function also sets the state of the new process to TASK_RUNNING and sets the preempt_count field of the thread_info structure to 1, thus disabling kernel preemption (see the section “Kernel Preemption” in Chapter 5). Moreover, in order to keep process scheduling fair, the function shares the remaining timeslice of the parent between the parent and the child.
  13. Sets the cpu field in the thread_info structure of the new process to the number of the local CPU returned by smp_processor_id(). (????)
  14. Initializes the fields that specify the parenthood relationships. In particular, if CLONE_PARENT or CLONE_THREAD are set, it initializes  tsk->real_parent  and  tsk->parent to the value in current->real_parent; the parent of the child thus appears
    as the parent of the current process. Otherwise, it sets the same fields to current.
  15. Executes the SET_LINKS macro to insert the new process descriptor into the process list. Recall that the process list is a list that links together all existing process descriptors.
  16. Invokes attach_pid( ) to insert the PID of the new process descriptor in the pidhash[PIDTYPE_PID] hash table.
  17. If the child is a thread group leader (flag CLONE_THREAD cleared):
    a. Initializes tsk->tgid to tsk->pid.
    b. Initializes tsk->group_leader to tsk.
    c. Invokes three times attach_pid() to insert the child in the PID hash tables of type PIDTYPE_TGID,  PIDTYPE_PGID, and  PIDTYPE_SID.
  18. Otherwise, if the child belongs to the thread group of its parent (CLONE_THREAD flag set):
    a. Initializes tsk->tgid to tsk->current->tgid.
    b. Initializes tsk->group_leader to the value in current->group_leader.
    c. Invokes attach_pid() to insert the child in the PIDTYPE_TGID hash table (more specifically, in the per-PID list of the current->group_leader process).