User-mode System Calls in xv6

xv6 is a modern recreation of Thompson and Ritchie’s seminal V6 Unix. Created by Frans Kaashoek, Robert Morris, and Russ Cox for MIT’s Operating System Engineering course. Originally written for x86—replacing the primordial PDP-11—the current vintage runs on a multi-core RISC-V processor.

Operating systems construct heaven and earth for user programs, crafting and controlling the reality in which a process lives. Limitations in the operating system therefore directly transfer to limitations on all user programs.

xv6 is a lean (a mere 21 system calls and 11k LoC), well-crafted, and comprehensible system. Yet its design—the Unix interface—reverberates throughout modern systems. It is worthy of careful study.


As one path through the system we’ll take a look at how a simple system call optimisation—avoiding the kernel and performing the work in a user mode library instead—could be implemented in xv6.

The full implementation is available here https://github.com/sameersismail/xv6-usermode.

A Typical System Call

We’ll be implementing getrusage(2), which retrieves a process’s resource usage. Before we implement the optimised version, let’s first walk through what a typical system call implementation would look like in xv6.

To simplify our implementation we won’t be producing the full struct rusage as defined in <sys/resource.h>, but instead just replicating the single field ru_maxrss. This field contains the maximum resident set size, which is the number of virtual memory pages that are currently mapped to a physical frame, in bytes. So our pared down struct usage therefore looks like this.

struct rusage {
  int resident_set_size;
}

In the xv6 kernel we can find this value in kernel/proc.h’s struct proc, which contains the necessary state to execute a process—its pagetable, kernel stack, saved program state, etc.—and, critically for us, it also stores the number of bytes a process uses in p->sz. (The notation p->x will refer to the x field in struct proc.)

struct proc {

  uint64 sz;                   // size of process memory (bytes)
  uint64 kstack;               // virtual address of kernel stack
  pagetable_t pagetable;       // user page table

};

All we need to do now is to deliver this value to user space. The interface that getrusage will expose takes as input a pointer to a struct rusage which the kernel is then responsible for populating. A process would invoke the syscall like so.

{
  struct rusage ru;
  getrusage(&ru);
  ru.resident_set_size;
}

Let’s begin with implementing the kernel side of this system call. We first need to define some ceremonial constants. First a new syscall number SYS_getrusage for our system call in kernel/syscall.h.

#define SYS_link       19
#define SYS_mkdir      20
#define SYS_close      21
#define SYS_getrusage  22

Then in kernel/syscall.c we’ll need to add this SYS_getrusage to the syscalls trap table, so that when getrusage is called it’s directed to the correct kernel function.

 [SYS_link]        sys_link,
 [SYS_mkdir]       sys_mkdir,
 [SYS_close]       sys_close,
 [SYS_getrusage]   sys_getrusage,
};

After that, we can then implement the primary sys_getrusage function in kernel/sysproc.c. The two files kernel/sysproc.c and kernel/sysfile.c contain the interface boundary between user mode and kernel mode—they’re responsible for constructing, from a user program’s perspective, the abstraction of a process and a file.

Recall the interface defined above: a pointer to a struct rusage is passed in, which will be populated with the process’s current memory usage.

uint64 sys_getrusage(void) {
  uint64 ru;
  if (argaddr(0, &ru) < 0) {
    return -1;
  }

  struct proc* p = myproc();
  struct rusage rusage = {
    .resident_set_size = p->sz
  };

  if (copyout(p->pagetable, ru, (char*) &rusage, sizeof(struct rusage)) < 0) {
    return -1;
  }

  return 0;
}

We first retrieve the user space address of the struct rusage with argaddr, which will read the value from the process’s trapframe. We then retrieve the invoking-process’s struct proc and populate an in-kernel struct rusage with p->sz—the size of the process’s memory. After that we copy over the in-kernel struct rusage to the given user space address with copyout. copyout will ensure that ru is a valid virtual address, walk the page table, retrieve the associated physical address, and write out sizeof(struct rusage)-many bytes.


Now, we can’t yet access this function from user mode, in order to do so we need to execute the system call and transfer control from user mode to kernel mode. To implement this transition we first define the function signature of getrusage in user/user.h. This is an implicit contract between the kernel- and user mode.

int getrusage(struct rusage*);

This signature is then implemented by a Perl script user/usys.pl which contains an assembly template and for each system call outputs the generic user-kernel bridge for that system call. getrusage’s looks like this.

.global getrusage
getrusage:
 li a7, SYS_getrusage
 ecall
 ret

Now when some user code invokes getrusage(&ru) the compiler will, in accordance with the RISC-V calling convention, place &ru in the appropriate register (a0) and transfer control to the symbol getrusage. After which we load the system call number SYS_getrusage into the kernel-expected location (a7) and execute ecall.

Which, quoting The RISC-V Instruction Set Manual:

The ECALL instruction is used to make a service request to the execution environment.

What this results in is the RISC-V hardware thread escalating its privilege level and forcing a switch to a kernel-defined trap handler. The kernel then has to store the execution state of the process, switch to the kernel page table, transition to the appropriate kernel stack, execute the specified system call, and transparently return execution to the process (through a reverse of the prior steps—and then to increment the process’s program counter and execute an sret instruction).


All this to read a single int!

Bypassing the Kernel Interface

Rather than proceeding directly through the kernel we can go indirectly into the kernel. Instead of invoking the full system call machinery our new API will simply return a pointer to struct rusage without leaving user mode.

struct rusage* rusage = getrusage();

In user mode, the innards of getrusage look like this. All we’re doing is interpreting some pre-defined virtual address RES_USAGE as a pointer to a struct rusage and returning it. Then we can access rusage->resident_set_size as we please. Which is much simpler than all that prior syscall machinery! Additionally, as the underlying p->sz changes throughout the program’s execution, we won’t need to make the long journey down to the kernel for a measly 32 bits—we can just dereference the same memory location again.

struct rusage* getrusage(void) {
  return (struct rusage*) RES_USAGE;
}

This is implemented as a straightforward application of The Fundamental Theorem of Software Engineering: indirection. We take advantage of virtual memory to, for each process, map RES_USAGE from a read-only page in user space to a kernel-managed writable frame. The kernel then populates this frame with the process-appropriate struct rusage and maintains and reflects the process’s state throughout its lifetime.

To implement this we first need to choose the location of RES_USAGE—where in the virtual address space should it live? A user mode process’s virtual address space looks like this.

A process’s user address space, with its initial stack.
(xv6: a simple, Unix-like teaching operating system)

It begins at 0 and continues to 2^38, as MAXVA is defined by (1 << (9 + 9 + 9 + 12 - 1))—because we’re running the 64-bit Sv39 variant of RISC-V, whose virtual addresses encode the three-level page table structure: 9 bits for each level and 12 offset bits. The stack grows downwards and the heap grows upwards.

There’s plenty of space at the top—below (1) the trampoline, which enables privilege escalation; and (2) the trapframe, which stores the process state for subsequent resumption. So let’s reserve the page directly below TRAPFRAME for RES_USAGE.

#define RES_USAGE (TRAPFRAME - PGSIZE)

This page will store struct rusage, which is mapped to a unique frame for each process. So we’ll need a pointer to this frame in the per-process struct proc located in kernel/proc.h.

  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
  struct rusage* rusage;       // Resource usage
};

Now when a process is created we’ll need to allocate a frame and store a pointer to it in p->rusage. Processes are created via the conventional fork and exec pipeline. Once a fork call enters the kernel, it invokes several methods to duplicate the process—a typical stacktrace is below.

#0  proc_pagetable                  at kernel/proc.c:184
#1  0x0000000080001c80 in allocproc at kernel/proc.c:138
#2  0x0000000080001e20 in fork      at kernel/proc.c:303
#3  0x0000000080002c22 in sys_fork  at kernel/sysproc.c:29
#4  0x0000000080002b96 in syscall   at kernel/syscall.c:140
#5  0x0000000080002880 in usertrap  at kernel/trap.c:67
#6  0x000000000000008a in ??

We first, after the trap, enter kernel/syscall.c and are dispatched to kernel/sys_proc.c:sys_fork which will then call kernel/proc.c:fork. The first function call this method makes is to kernel/proc.c:allocproc—it’s here that the new process’s state is created.

It’s also here where we’ll allocate a new frame for the struct rusage to live in.

  if ((p->rusage = (struct rusage*) kalloc()) == 0) {
    freeproc(p);
    release(&p->lock);
    return 0;
  }

Having allocated a new frame and stuffed it in struct proc doesn’t make it available to user mode. From the perspective of user mode it’s just like any other kernel data structure—non-existent. To make it visible we need to map the previously-defined RES_USAGE virtual address to this particular physical frame.

Once allocproc has allocated the initial frames for a new process (initially just the trapframe, but now also the RES_USAGE frame), it calls into kernel/proc.ch:proc_pagetable, whose responsibility it is to create and setup the process’s pagetable. It’s here where we insert an entry into the pagetable, mapping from RES_USAGE to p->rusage (the address of the physical frame), and set the permission bits to be readable and accessible, but not writable, from user mode (PTE_R | PTE_U).

  if (mappages(pagetable, RES_USAGE, PGSIZE,
              (uint64)(p->rusage), PTE_R | PTE_U) < 0) {
    uvmunmap(pagetable, TRAPFRAME, 1, 0);
    uvmunmap(pagetable, TRAMPOLINE, 1, 0);
    uvmfree(pagetable, 0);
    return 0;
  }

On process exit we’ll need to clean up after ourselves, both the frame and the mapping—this is done in kernel/proc.c:freeproc and kernel/proc.c:proc_freepagetable respectively.

  if (p->rusage) {
    kfree((void*) p->rusage);
    p->rusage = 0;
  }
  uvmunmap(pagetable, TRAMPOLINE, 1, 0);
  uvmunmap(pagetable, TRAPFRAME, 1, 0);
  uvmunmap(pagetable, RES_USAGE, 1, 0);
  uvmfree(pagetable, sz);

Now an expression such as ((struct rusage*) RES_USAGE)->resident_set_size in user mode would succeed, but we’d only get uninitialised junk memory. We need to synchronise the value we’re after, p->sz, with p->rusage->resident_set_size.

A process acquires additional memory via only one path—the sbrk system call. The interface is simple: given a parameter n, either allocate n bytes or deallocate n bytes (when n < 0), and return the beginning of this portion of memory. Memory is allocated in page-aligned chunks of 4096 bytes. A representative kernel backtrace resulting from an sbrk(n) call looks like the following. It bottoms out at kalloc, the kernel memory allocator, which uses a free list to manage frame allocation.

#0  kalloc                          at kernel/kalloc.c:70
#1  0x0000000080001434 in uvmalloc  at kernel/vm.c:231
#2  0x0000000080001dda in growproc  at kernel/proc.c:283
#3  0x0000000080002c92 in sys_sbrk  at kernel/sysproc.c:50
#4  0x0000000080002b96 in syscall   at kernel/syscall.c:140
#5  0x0000000080002880 in usertrap  at kernel/trap.c:67
#6  0x00000000000012ec in ??

The method we’re concerned with is growproc, as it’s here that that uvmalloc or uvmdealloc are used to manage the process’s memory and update p->sz. When sz is updated we reflect this update our struct rusage.

  p->sz = sz;
  p->rusage->resident_set_size = sz;
  return 0;
}

Now p->rusage->resident_set_size will accurately reflect the process’s dynamic memory usage. But, we’re missing one case: memory initialisation. We need to ensure that resident_set_size begins with the correct value.

That is, upon process creation resident_set_size should contain the correct value. We’ve previously covered process duplication (with fork) but not process creation, which is exec’s responsibility. It replaces the duplicated process with a new executable and address space. Once it has constructed a new page table with all of the ELF segments appropriately mapped, and setup the new user stack, it will commit these values to the process. It’s here that we must also update our struct rusage.

  oldpagetable = p->pagetable;
  p->pagetable = pagetable;
  p->sz = sz;
  p->rusage->resident_set_size = sz;
  p->trapframe->epc = elf.entry;  // initial program counter = main
  p->trapframe->sp = sp; // initial stack pointer

Coda

Now that we’ve (1) allocated a new per-process frame, (2) mapped it in each process’s address space at RES_USAGE, and (3) stored and synchronised p->sz, we’re able to access a process’s memory usage without requiring an expensive kernel trap!

int main(int argc, char* argv[]) {
  struct rusage* rusage = getrusage();

  printf("Bytes: %d\n", rusage->resident_set_size);
  // Bytes: 16384

  uint64 top = (uint64) sbrk(8192);
  printf("Bytes: %d\n", rusage->resident_set_size);
  printf("Top: %d\n", top);
  // Bytes: 24576
  // Top:   16384
  
  // Writing to this location will force a page fault.
  rusage->resident_set_size = 42;
  // usertrap(): unexpected scause 0x000000000000000f pid=4

  exit(0);
}

March 2024