User-mode System Calls in xv6
xv6 is a modern recreation of Thompson and Ritchie’s seminal V6 Unix. Created by Frans Kaashoek, Robert Morris, and Russ Cox for MIT’s Operating System Engineering course. Originally written for x86—replacing the primordial PDP-11—the current vintage runs on a multi-core RISC-V processor.
Operating systems construct heaven and earth for user programs, crafting and controlling the reality in which a process lives. Limitations in the operating system therefore directly transfer to limitations on all user programs.
xv6 is a lean (a mere 21 system calls and 11k LoC), well-crafted, and comprehensible system. Yet its design—the Unix interface—reverberates throughout modern systems. It is worthy of careful study.
As one path through the system we’ll take a look at how a simple system call optimisation—avoiding the kernel and performing the work in a user mode library instead—could be implemented in xv6.
The full implementation is available here https://github.com/sameersismail/xv6-usermode.
A Typical System Call
We’ll be implementing
getrusage(2)
, which retrieves a process’s resource usage. Before we implement the
optimised version, let’s first walk through what a typical system call
implementation would look like in xv6.
To simplify our implementation we won’t be producing the full
struct rusage
as defined in
<sys/resource.h>
, but instead just replicating the
single field ru_maxrss
. This field contains the maximum
resident set size, which is the number of virtual memory pages that are
currently mapped to a physical frame, in bytes. So our pared down
struct usage
therefore looks like
this.
struct rusage {
int resident_set_size;
}
In the xv6 kernel we can find this value in kernel/proc.h
’s
struct proc
, which contains the necessary state to execute a process—its
pagetable, kernel stack, saved program state, etc.—and, critically for
us, it also stores the number of bytes a process uses in
p->sz
. (The notation p->x
will refer to
the x
field in struct proc
.)
struct proc {
…; // size of process memory (bytes)
uint64 sz; // virtual address of kernel stack
uint64 kstack; // user page table
pagetable_t pagetable
…};
All we need to do now is to deliver this value to user space. The
interface that getrusage
will expose takes as input a
pointer to a struct rusage
which the kernel is then
responsible for populating. A process would invoke the syscall like so.
{
struct rusage ru;
(&ru);
getrusage.resident_set_size;
ru}
Let’s begin with implementing the kernel side of this system call. We
first need to define some ceremonial constants. First a new syscall
number SYS_getrusage
for our system call in
kernel/syscall.h
.
#define SYS_link 19
#define SYS_mkdir 20
#define SYS_close 21
#define SYS_getrusage 22
Then in
kernel/syscall.c
we’ll need to add this SYS_getrusage
to the
syscalls
trap table, so that when getrusage
is
called it’s directed to the correct kernel function.
[SYS_link] sys_link,
[SYS_mkdir] sys_mkdir,
[SYS_close] sys_close,
[SYS_getrusage] sys_getrusage,
};
After that, we can then implement the primary
sys_getrusage
function in kernel/sysproc.c
. The two files
kernel/sysproc.c
and kernel/sysfile.c
contain
the interface boundary between user mode and kernel mode—they’re
responsible for constructing, from a user program’s perspective, the
abstraction of a process and a file.
Recall the interface defined above: a pointer to a
struct rusage
is passed in, which will be populated with
the process’s current memory usage.
(void) {
uint64 sys_getrusage;
uint64 ruif (argaddr(0, &ru) < 0) {
return -1;
}
struct proc* p = myproc();
struct rusage rusage = {
.resident_set_size = p->sz
};
if (copyout(p->pagetable, ru, (char*) &rusage, sizeof(struct rusage)) < 0) {
return -1;
}
return 0;
}
We first retrieve the user space address of the
struct rusage
with argaddr
, which will read
the value from the process’s trapframe. We then retrieve the
invoking-process’s struct proc
and populate an in-kernel
struct rusage
with p->sz
—the size of the
process’s memory. After that we copy over the in-kernel
struct rusage
to the given user space address with
copyout
. copyout
will ensure that
ru
is a valid virtual address, walk the page table,
retrieve the associated physical address, and write out
sizeof(struct rusage)
-many bytes.
Now, we can’t yet access this function from user mode, in order to do so
we need to execute the system call and transfer control from user mode
to kernel mode. To implement this transition we first define the
function signature of getrusage
in
user/user.h
. This is an implicit contract between the kernel- and user mode.
int getrusage(struct rusage*);
This signature is then implemented by a Perl script
user/usys.pl
which contains an assembly template and for each system call outputs the
generic user-kernel bridge for that system call.
getrusage
’s looks like this.
.global getrusagegetrusage:
, SYS_getrusage
li a7
ecallret
Now when some user code invokes getrusage(&ru)
the
compiler will, in accordance with the RISC-V calling convention, place
&ru
in the appropriate register (a0
) and
transfer control to the symbol getrusage
. After which we
load the system call number SYS_getrusage
into the
kernel-expected location (a7
) and execute
ecall
.
Which, quoting The RISC-V Instruction Set Manual:
The ECALL instruction is used to make a service request to the execution environment.
What this results in is the RISC-V hardware thread escalating its
privilege level and forcing a switch to a kernel-defined trap handler.
The kernel then has to store the execution state of the process, switch
to the kernel page table, transition to the appropriate kernel stack,
execute the specified system call, and transparently return execution to
the process (through a reverse of the prior steps—and then to increment
the process’s program counter and execute an
sret
instruction).
All this to read a single int
!
Bypassing the Kernel Interface
Rather than proceeding directly through the kernel we can go
indirectly into the kernel. Instead of invoking the full system
call machinery our new API will simply return a pointer to
struct rusage
without leaving user mode.
struct rusage* rusage = getrusage();
In user mode, the innards of getrusage
look like
this. All we’re doing is interpreting some pre-defined virtual address
RES_USAGE
as a pointer to a struct rusage
and
returning it. Then we can access
rusage->resident_set_size
as we please. Which is much
simpler than all that prior syscall machinery! Additionally, as the
underlying p->sz
changes throughout the program’s
execution, we won’t need to make the long journey down to the kernel for
a measly 32 bits—we can just dereference the same memory location again.
struct rusage* getrusage(void) {
return (struct rusage*) RES_USAGE;
}
This is implemented as a straightforward application of
The Fundamental Theorem of Software Engineering: indirection. We take advantage of virtual memory to, for each
process, map RES_USAGE
from a read-only page in user space
to a kernel-managed writable frame. The kernel then populates this frame
with the process-appropriate struct rusage
and maintains
and reflects the process’s state throughout its lifetime.
To implement this we first need to choose the location of
RES_USAGE
—where in the virtual address space should it
live? A user mode process’s virtual address space looks like this.
It begins at 0
and continues to 2^38
, as
MAXVA
is
defined by
(1 << (9 + 9 + 9 + 12 - 1))
—because we’re running the
64-bit Sv39 variant of RISC-V, whose virtual addresses encode the
three-level page table structure: 9 bits for each level and 12 offset
bits. The stack grows downwards and the heap grows upwards.
There’s plenty of space at the top—below (1) the trampoline, which
enables privilege escalation; and (2) the trapframe, which stores the
process state for subsequent resumption. So let’s reserve the page
directly below TRAPFRAME
for
RES_USAGE
.
#define RES_USAGE (TRAPFRAME - PGSIZE)
This page will store struct rusage
, which is mapped to a
unique frame for each process. So we’ll need a pointer to this frame in
the per-process struct proc
located in
kernel/proc.h
.
struct inode *cwd; // Current directory
char name[16]; // Process name (debugging)
struct rusage* rusage; // Resource usage
};
Now when a process is created we’ll need to allocate a frame and store a
pointer to it in p->rusage
. Processes are created via
the conventional fork
and exec
pipeline. Once
a fork
call enters the kernel, it invokes several methods
to duplicate the process—a typical stacktrace is below.
0 proc_pagetable at kernel/proc.c:184
#1 0x0000000080001c80 in allocproc at kernel/proc.c:138
#2 0x0000000080001e20 in fork at kernel/proc.c:303
#3 0x0000000080002c22 in sys_fork at kernel/sysproc.c:29
#4 0x0000000080002b96 in syscall at kernel/syscall.c:140
#5 0x0000000080002880 in usertrap at kernel/trap.c:67
#6 0x000000000000008a in ?? #
We first, after the trap, enter kernel/syscall.c
and are
dispatched to kernel/sys_proc.c:sys_fork
which will then
call kernel/proc.c:fork
. The first function call this
method makes is to kernel/proc.c:allocproc
—it’s here that
the new process’s state is created.
It’s also
here
where we’ll allocate a new frame for the struct rusage
to
live in.
if ((p->rusage = (struct rusage*) kalloc()) == 0) {
(p);
freeproc(&p->lock);
releasereturn 0;
}
Having allocated a new frame and stuffed it in
struct proc
doesn’t make it available to user mode. From
the perspective of user mode it’s just like any other kernel data
structure—non-existent. To make it visible we need to map the
previously-defined RES_USAGE
virtual address to this
particular physical frame.
Once allocproc
has allocated the initial frames for a new
process (initially just the trapframe, but now also the
RES_USAGE
frame), it calls into
kernel/proc.ch:proc_pagetable
, whose responsibility it is
to create and setup the process’s pagetable. It’s
here
where we insert an entry into the pagetable, mapping from
RES_USAGE
to p->rusage
(the address of the
physical frame), and set the permission bits to be readable and
accessible, but not writable, from user mode (PTE_R | PTE_U
).
if (mappages(pagetable, RES_USAGE, PGSIZE,
(uint64)(p->rusage), PTE_R | PTE_U) < 0) {
(pagetable, TRAPFRAME, 1, 0);
uvmunmap(pagetable, TRAMPOLINE, 1, 0);
uvmunmap(pagetable, 0);
uvmfreereturn 0;
}
On process exit we’ll need to clean up after ourselves, both the
frame
and the
mapping—this is done in kernel/proc.c:freeproc
and
kernel/proc.c:proc_freepagetable
respectively.
if (p->rusage) {
((void*) p->rusage);
kfree->rusage = 0;
p}
(pagetable, TRAMPOLINE, 1, 0);
uvmunmap(pagetable, TRAPFRAME, 1, 0);
uvmunmap(pagetable, RES_USAGE, 1, 0);
uvmunmap(pagetable, sz); uvmfree
Now an expression such as
((struct rusage*) RES_USAGE)->resident_set_size
in user
mode would succeed, but we’d only get uninitialised junk memory. We need
to synchronise the value we’re after, p->sz
, with
p->rusage->resident_set_size
.
A process acquires additional memory via only one path—the
sbrk
system call. The interface is simple: given a
parameter n
, either allocate n
bytes or
deallocate n
bytes (when n < 0
), and return
the beginning of this portion of memory. Memory is allocated in
page-aligned chunks of 4096 bytes. A representative kernel backtrace
resulting from an sbrk(n)
call looks like the following. It
bottoms out at kalloc
, the kernel memory allocator, which
uses a free list to manage frame allocation.
0 kalloc at kernel/kalloc.c:70
#1 0x0000000080001434 in uvmalloc at kernel/vm.c:231
#2 0x0000000080001dda in growproc at kernel/proc.c:283
#3 0x0000000080002c92 in sys_sbrk at kernel/sysproc.c:50
#4 0x0000000080002b96 in syscall at kernel/syscall.c:140
#5 0x0000000080002880 in usertrap at kernel/trap.c:67
#6 0x00000000000012ec in ?? #
The method we’re concerned with is growproc
, as it’s here
that that uvmalloc
or uvmdealloc
are used to
manage the process’s memory and update p->sz
. When
sz
is updated we
reflect this
update our struct rusage
.
->sz = sz;
p->rusage->resident_set_size = sz;
preturn 0;
}
Now p->rusage->resident_set_size
will accurately
reflect the process’s dynamic memory usage. But, we’re missing one case:
memory initialisation. We need to ensure that
resident_set_size
begins with the correct value.
That is, upon process creation resident_set_size
should
contain the correct value. We’ve previously covered process duplication
(with fork
) but not process creation, which is
exec
’s responsibility. It replaces the duplicated process
with a new executable and address space. Once it has constructed a new
page table with all of the ELF segments appropriately mapped, and setup
the new user stack, it will commit these values to the process. It’s
here
that we must also update our struct rusage
.
= p->pagetable;
oldpagetable ->pagetable = pagetable;
p->sz = sz;
p->rusage->resident_set_size = sz;
p->trapframe->epc = elf.entry; // initial program counter = main
p->trapframe->sp = sp; // initial stack pointer p
Coda
Now that we’ve (1) allocated a new per-process frame, (2) mapped it in
each process’s address space at RES_USAGE
, and (3) stored
and synchronised p->sz
,
we’re able to
access a process’s memory usage without requiring an expensive kernel
trap!
int main(int argc, char* argv[]) {
struct rusage* rusage = getrusage();
("Bytes: %d\n", rusage->resident_set_size);
printf// Bytes: 16384
= (uint64) sbrk(8192);
uint64 top ("Bytes: %d\n", rusage->resident_set_size);
printf("Top: %d\n", top);
printf// Bytes: 24576
// Top: 16384
// Writing to this location will force a page fault.
->resident_set_size = 42;
rusage// usertrap(): unexpected scause 0x000000000000000f pid=4
(0);
exit}
March 2024