Learning goals: study MIT's teaching operating system to understand, at the code level, how an OS is implemented internally, and treat it as the hands-on half of an operating systems course. Note: speaking for myself, I won't pore over every single line of code. What I want is a global view of the OS that opens up the "black box", not the ability to build the box itself. In other words, I approach it as a programmer rather than as a builder from start to finish.
The second argument to open consists of a set of flags expressed in bits that control what open does. The possible values are defined in the file control (fcntl) header: O_RDONLY, O_WRONLY, O_RDWR, O_CREATE, and O_TRUNC, which ask open to open the file for reading, for writing, for reading and writing, to create the file if it does not exist, and to truncate the file to zero length.
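A minimal sketch of these flags in use, written in the style of an xv6 user program (the file name out.txt is made up for illustration; the include paths follow the usual xv6 user-program layout):

#include "kernel/types.h"
#include "kernel/fcntl.h"
#include "user/user.h"

int
main(void)
{
  // create out.txt if needed, discard any old contents, open for writing.
  int fd = open("out.txt", O_CREATE | O_WRONLY | O_TRUNC);
  if(fd < 0)
    exit(1);
  write(fd, "hello\n", 6);
  close(fd);
  exit(0);
}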
==The xv6 file system contains data files (which hold byte arrays) and directories (which hold named references to data files and other directories). The directories form a tree, starting at a special directory called the root.== A path like /a/b/c refers to the file or directory named c inside the directory b, inside the directory a, in the root directory /. Paths that don't begin with / are evaluated relative to the calling process's current directory, which can be changed with the chdir system call. The two open calls below open the same file (assuming all the directories involved exist).
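The referenced pair of calls, following the example in the xv6 book:

chdir("/a");
chdir("b");
open("c", O_RDONLY);

open("/a/b/c", O_RDONLY);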
The fstat system call retrieves information from the inode that a file descriptor refers to. It fills in a struct stat, declared in stat.h:
#define T_DIR     1   // Directory
#define T_FILE    2   // File
#define T_DEVICE  3   // Device

struct stat {
  int dev;     // File system's disk device
  uint ino;    // Inode number
  short type;  // Type of file
  short nlink; // Number of links to file
  uint64 size; // Size of file in bytes
};
The link system call creates another file name that refers to the same inode as an existing file. The fragment below creates two names, a and b, for the same inode.
open("a", O_CREATE | O_WRONLY); link("a", "b");
Reading from or writing to a is the same as reading from or writing to b. Each inode is identified by a unique inode number. After the code sequence above, one can determine that a and b refer to the same underlying content by inspecting the results of fstat: both will return the same inode number (ino), and the nlink count will be 2.
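A small sketch of that check (error handling omitted for brevity):

struct stat st_a, st_b;
int fda = open("a", O_RDONLY);
int fdb = open("b", O_RDONLY);
fstat(fda, &st_a);
fstat(fdb, &st_b);
// expect st_a.ino == st_b.ino, and st_a.nlink == 2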
An inode is the data structure that Linux and other Unix-like operating systems use to store everything about a file except its name and its actual data; it is what connects a file name to the file's actual data.
CPUs provide hardware support for strong isolation. For example, RISC-V has three modes in which the CPU can execute instructions: machine mode, supervisor mode, and user mode. Instructions executing in machine mode have full privilege; a CPU starts in machine mode. Machine mode is mostly intended for configuring the computer. xv6 executes a few instructions in machine mode and then switches to supervisor mode.
# qemu -kernel loads the kernel at 0x80000000
# and causes each CPU to jump there.
# kernel.ld causes the following code to
# be placed at 0x80000000.
.section .text
_entry:
        # set up a stack for C.
        # stack0 is declared in start.c,
        # with a 4096-byte stack per CPU.
        # sp = stack0 + (hartid * 4096)
        # stack0 is the base address of the CPU stacks
        la sp, stack0
        # load the immediate 4096 into register a0
        li a0, 1024*4
        # load the current CPU's hart ID into register a1
        csrr a1, mhartid
        # hart IDs start at 0; each CPU's initial stack is 4096 bytes
        # and grows downward
        addi a1, a1, 1
        # compute this CPU's stack offset
        mul a0, a0, a1
        # compute the top of this CPU's stack (stacks grow downward)
        # and put it in sp
        add sp, sp, a0
        # jump to start() in start.c
        call start
spin:
        j spin
#include"types.h" #include"param.h" #include"memlayout.h" #include"riscv.h" #include"defs.h" voidmain(); voidtimerinit(); // entry.S needs one stack per CPU. // 定义了 entry.S 中的 stack0 ,它要求 16bit 对齐 __attribute__ ((aligned (16))) char stack0[4096 * NCPU]; // scratch area for timer interrupt, one per CPU. // 定义了共享变量,即每个 CPU 的暂存区用于 machine-mode 定时器中断,它是和 timer 驱动之间传递数据用的。 uint64 mscratch0[NCPU * 32]; // assembly code in kernelvec.S for machine-mode timer interrupt. // 声明了 timer 中断处理函数,在接下来的 timer 初始化函数中被用到 externvoidtimervec(); // entry.S jumps here in machine mode on stack0. void start() { // set M Previous Privilege mode to Supervisor, for mret. // 使 CPU 进入 supervisor mode unsignedlong x = r_mstatus(); x &= ~MSTATUS_MPP_MASK; x |= MSTATUS_MPP_S; w_mstatus(x); // set M Exception Program Counter to main, for mret. // requires gcc -mcmodel=medany // 设置了汇编指令 mret 后 PC 指针跳转的函数,也就是 main 函数 w_mepc((uint64)main); // disable paging for now. // 暂时关闭了分页功能,即直接使用物理地址 w_satp(0); // delegate all interrupts and exceptions to supervisor mode. // 将所有中断异常处理设定在给 supervisor mode 下 w_medeleg(0xffff); w_mideleg(0xffff); // External Interupt | Software Interupt | Timer Interupt w_sie(r_sie() | SIE_SEIE | SIE_STIE | SIE_SSIE); // ask for clock interrupts. // 请求时钟中断,也就是 clock 的初始化 timerinit(); // keep each CPU's hartid in its tp register, for cpuid(). // 将 CPU 的 ID 值保存在寄存器 tp 中 int id = r_mhartid(); w_tp(id); // switch to supervisor mode and jump to main(). asmvolatile("mret"); }
// set up to receive timer interrupts in machine mode,
// which arrive at timervec in kernelvec.S,
// which turns them into software interrupts for
// devintr() in trap.c.
void
timerinit()
{
  // each CPU has a separate source of timer interrupts.
  // first read this CPU's hart ID
  int id = r_mhartid();

  // ask the CLINT for a timer interrupt.
  // set the interrupt interval; here about 0.1 seconds
  int interval = 1000000; // cycles; about 1/10th second in qemu.
  *(uint64*)CLINT_MTIMECMP(id) = *(uint64*)CLINT_MTIME + interval;

  // prepare information in scratch[] for timervec.
  // scratch[0..3] : space for timervec to save registers.
  // scratch[4] : address of CLINT MTIMECMP register.
  // scratch[5] : desired interval (in cycles) between timer interrupts.
  uint64 *scratch = &mscratch0[32 * id];
  scratch[4] = CLINT_MTIMECMP(id);
  scratch[5] = interval;
  w_mscratch((uint64)scratch);

  // set the machine-mode trap handler.
  w_mtvec((uint64)timervec);

  // enable machine-mode interrupts.
  w_mstatus(r_mstatus() | MSTATUS_MIE);

  // enable machine-mode timer interrupts.
  w_mie(r_mie() | MIE_MTIE);
}
// qemu -machine virt is set up like this,
// based on qemu's hw/riscv/virt.c:
//
// 00001000 -- boot ROM, provided by qemu
// 02000000 -- CLINT
// 0C000000 -- PLIC
// 10000000 -- uart0
// 10001000 -- virtio disk
// 80000000 -- boot ROM jumps here in machine mode
//             -kernel loads the kernel here
// unused RAM after 80000000.
// the kernel uses physical memory thus:
// 80000000 -- entry.S, then kernel text and data
// end      -- start of kernel page allocation area
// PHYSTOP  -- end of RAM used by the kernel
// qemu puts UART registers here in physical memory.
#define UART0 0x10000000L
#define UART0_IRQ 10
// the kernel expects there to be RAM
// for use by the kernel and user pages
// from physical address 0x80000000 to PHYSTOP.
// physical memory is only 128 MB
#define KERNBASE 0x80000000L
#define PHYSTOP (KERNBASE + 128*1024*1024)
// map the trampoline page to the highest address,
// in both user and kernel space.
// the top page of both user and kernel space is the trampoline page,
// used for switching between user mode and kernel mode
#define TRAMPOLINE (MAXVA - PGSIZE)
// map kernel stacks beneath the trampoline,
// each surrounded by invalid guard pages.
// in kernel memory the trampoline page is followed by the kernel stack
// pages (each process's kernel stack takes two pages, one of which is
// a guard page)
#define KSTACK(p) (TRAMPOLINE - ((p)+1)* 2*PGSIZE)
// User memory layout.
// Address zero first:
//   text
//   original data and bss
//   fixed-size stack
//   expandable heap
//   ...
//   TRAPFRAME (p->trapframe, used by the trampoline)
//   TRAMPOLINE (the same page as in the kernel)
// in user memory, the trampoline page is followed by the trapframe page
#define TRAPFRAME (TRAMPOLINE - PGSIZE)
// machine exception program counter, holds the
// instruction address to which a return from
// exception will go.
static inline void
w_mepc(uint64 x)
{
  asm volatile("csrw mepc, %0" : : "r" (x));
}
// shift a physical address to the right place for a PTE
// (the low 10 bits are filled with flag bits).
#define PA2PTE(pa) ((((uint64)pa) >> 12) << 10)

// PTE to physical address
#define PTE2PA(pte) (((pte) >> 10) << 12)

// index of the physical page containing pa
#define PA2IDX(pa) (((uint64)pa) >> 12)

// the low 10 flag bits of a PTE
#define PTE_FLAGS(pte) ((pte) & 0x3FF)
// extract the three 9-bit page table indices from a virtual address.
#define PXMASK          0x1FF // 9 bits
#define PXSHIFT(level)  (PGSHIFT+(9*(level)))
#define PX(level, va) ((((uint64) (va)) >> PXSHIFT(level)) & PXMASK)
// one beyond the highest possible virtual address.
// MAXVA is actually one bit less than the max allowed by
// Sv39, to avoid having to sign-extend virtual addresses
// that have the high bit set.
#define MAXVA (1L << (9 + 9 + 9 + 12 - 1))
// Per-CPU state.
struct cpu {
  struct proc *proc;      // The process running on this cpu, or null.
  struct context context; // swtch() here to enter scheduler().
  int noff;               // Depth of push_off() nesting.
  int intena;             // Were interrupts enabled before push_off()?
};

extern struct cpu cpus[NCPU];

// per-process data for the trap handling code in trampoline.S.
// sits in a page by itself just under the trampoline page in the
// user page table. not specially mapped in the kernel page table.
// the sscratch register points here.
// uservec in trampoline.S saves user registers in the trapframe,
// then initializes registers from the trapframe's
// kernel_sp, kernel_hartid, kernel_satp, and jumps to kernel_trap.
// usertrapret() and userret in trampoline.S set up
// the trapframe's kernel_*, restore user registers from the
// trapframe, switch to the user page table, and enter user space.
// the trapframe includes callee-saved user registers like s0-s11 because the
// return-to-user path via usertrapret() doesn't return through
// the entire kernel call stack.
struct trapframe {
  /*   0 */ uint64 kernel_satp;   // kernel page table
  /*   8 */ uint64 kernel_sp;     // top of process's kernel stack
  /*  16 */ uint64 kernel_trap;   // usertrap(), the kernel's trap entry point
  /*  24 */ uint64 epc;           // saved user program counter
  /*  32 */ uint64 kernel_hartid; // saved kernel tp (this CPU's hart ID)
  /*  40 */ uint64 ra;
  /*  48 */ uint64 sp;
  /*  56 */ uint64 gp;
  /*  64 */ uint64 tp;
  /*  72 */ uint64 t0;
  /*  80 */ uint64 t1;
  /*  88 */ uint64 t2;
  /*  96 */ uint64 s0;
  /* 104 */ uint64 s1;
  /* 112 */ uint64 a0;
  /* 120 */ uint64 a1;
  /* 128 */ uint64 a2;
  /* 136 */ uint64 a3;
  /* 144 */ uint64 a4;
  /* 152 */ uint64 a5;
  /* 160 */ uint64 a6;
  /* 168 */ uint64 a7;
  /* 176 */ uint64 s2;
  /* 184 */ uint64 s3;
  /* 192 */ uint64 s4;
  /* 200 */ uint64 s5;
  /* 208 */ uint64 s6;
  /* 216 */ uint64 s7;
  /* 224 */ uint64 s8;
  /* 232 */ uint64 s9;
  /* 240 */ uint64 s10;
  /* 248 */ uint64 s11;
  /* 256 */ uint64 t3;
  /* 264 */ uint64 t4;
  /* 272 */ uint64 t5;
  /* 280 */ uint64 t6;
};
// the five process states: unused, sleeping, runnable, running, zombie
enum procstate { UNUSED, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

// Per-process state
struct proc {
  struct spinlock lock;

  // p->lock must be held when using these:
  enum procstate state;        // Process state
  struct proc *parent;         // Parent process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, have been killed
  int xstate;                  // Exit status to be returned to parent's wait
  int pid;                     // Process ID

  // these are private to the process, so p->lock need not be held.
  uint64 kstack;               // Virtual address of kernel stack
  uint64 sz;                   // Size of process memory (bytes)
  pagetable_t pagetable;       // User page table
  struct trapframe *trapframe; // data page for trampoline.S
  struct context context;      // swtch() here to run process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
};
// Set up first user process.
void
userinit(void)
{
  struct proc *p;

  // allocate a slot in the process table to serve as initproc
  p = allocproc();
  initproc = p;

  // allocate one user page and copy init's instructions
  // and data into it.
  // this process runs the initcode.S assembly program, which invokes the
  // exec system call to run /init, re-entering the kernel.
  uvminit(p->pagetable, initcode, sizeof(initcode));
  // the init process is only one page
  p->sz = PGSIZE;
  // prepare for the very first "return" from kernel to user.
  // set up what must be restored when the scheduler thread picks initproc:
  // epc == 0 means execution starts at the beginning of initcode,
  // sp == PGSIZE means initproc's user stack starts at PGSIZE
  p->trapframe->epc = 0;     // user program counter
  p->trapframe->sp = PGSIZE; // user stack pointer
# Initial process that execs /init.
# This code runs in user space.
#include "syscall.h"

# exec(init, argv)
.globl start
start:
        la a0, init
        la a1, argv
        li a7, SYS_exec
        ecall
# for(;;) exit();
exit:
        li a7, SYS_exit
        ecall
        jal exit
    for(;;){
      // this call to wait() returns if the shell exits,
      // or if a parentless process exits.
      // note: when a parent exits, its children are re-parented to init,
      // so wait() returning here does not necessarily mean the shell exited.
      wpid = wait((int *) 0);
      if(wpid == pid){
        // the shell exited; restart it.
        break;
      } else if(wpid < 0){
        // init always has children, so wait() should never return -1
        printf("init: wait returned an error\n");
        exit(1);
      } else {
        // a child that was re-parented to init exited; keep waiting.
        // it was a parentless process; do nothing.
      }
    }
  }
}
To tell the hardware to use a page table, the kernel must write the physical address of the root page-table page into the satp register. Each CPU has its own satp register. A CPU will translate all addresses generated by subsequent instructions using the page table pointed to by its own satp. Each CPU having its own satp is what allows different CPUs to run different processes, each with a private address space described by its own page table.
A note on terminology: physical memory refers to storage cells in DRAM. A byte of physical memory has an address, called a physical address. When an instruction manipulates a virtual address, the paging hardware translates it to a physical address, which is then sent to the DRAM hardware to read or write storage. Unlike physical memory and virtual addresses, virtual memory is not a physical object; it refers to the collection of abstractions and mechanisms the kernel provides to manage physical memory and virtual addresses.
  // map kernel text executable and read-only.
  kvmmap(KERNBASE, KERNBASE, (uint64)etext-KERNBASE, PTE_R | PTE_X);
  // map kernel data and the physical RAM we'll make use of.
  kvmmap((uint64)etext, (uint64)etext, PHYSTOP-(uint64)etext, PTE_R | PTE_W);
  // map the trampoline for trap entry/exit to
  // the highest virtual address in the kernel.
  kvmmap(TRAMPOLINE, (uint64)trampoline, PGSIZE, PTE_R | PTE_X);
}
// Switch h/w page table register to the kernel's page table,
// and enable paging.
void
kvminithart()
{
  // write the physical address of the root page-table page into satp
  w_satp(MAKE_SATP(kernel_pagetable));
  // flush this CPU's TLB
  sfence_vma();
}
// Return the address of the PTE in page table pagetable
// that corresponds to virtual address va. If alloc!=0,
// create any required page-table pages.
//
// The risc-v Sv39 scheme has three levels of page-table
// pages. A page-table page contains 512 64-bit PTEs.
// A 64-bit virtual address is split into five fields:
//   39..63 -- must be zero.
//   30..38 -- 9 bits of level-2 index.
//   21..29 -- 9 bits of level-1 index.
//   12..20 -- 9 bits of level-0 index.
//    0..11 -- 12 bits of byte offset within the page.
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
  // the virtual address must not exceed the maximum
  if(va >= MAXVA)
    panic("walk");

  // descend from the level-2 page table to the level-0 page table
  for(int level = 2; level > 0; level--) {
    pte_t *pte = &pagetable[PX(level, va)];
    if(*pte & PTE_V) {
      pagetable = (pagetable_t)PTE2PA(*pte);
    } else {
      // if alloc != 0, allocate the missing page-table page
      if(!alloc || (pagetable = (pde_t*)kalloc()) == 0)
        return 0;
      memset(pagetable, 0, PGSIZE);
      *pte = PA2PTE(pagetable) | PTE_V;
    }
  }
  return &pagetable[PX(0, va)];
}
// Look up a virtual address, return the physical address,
// or 0 if not mapped.
// Can only be used to look up user pages.
uint64
walkaddr(pagetable_t pagetable, uint64 va)
{
  pte_t *pte;
  uint64 pa;
// add a mapping to the kernel page table.
// only used when booting.
// does not flush TLB or enable paging.
// both the virtual and physical ranges it maps are contiguous; note that
// by this point the physical pages from end to PHYSTOP have already been
// chained into the allocator's free list.
void
kvmmap(uint64 va, uint64 pa, uint64 sz, int perm)
{
  // maps a range of virtual addresses to a range of physical addresses,
  // splitting the range into pages (ignoring the remainder) and mapping
  // one page at a time
  if(mappages(kernel_pagetable, va, sz, pa, perm) != 0)
    panic("kvmmap");
}
// translate a kernel virtual address to
// a physical address. only needed for
// addresses on the stack.
// assumes va is page aligned.
uint64
kvmpa(uint64 va)
{
  uint64 off = va % PGSIZE;
  pte_t *pte;
  uint64 pa;

  pte = walk(kernel_pagetable, va, 0);
  if(pte == 0)
    panic("kvmpa");
  if((*pte & PTE_V) == 0)
    panic("kvmpa");
  pa = PTE2PA(*pte);
  return pa+off;
}
// Create PTEs for virtual addresses starting at va that refer to
// physical addresses starting at pa. va and size might not
// be page-aligned. Returns 0 on success, -1 if walk() couldn't
// allocate a needed page-table page.
int
mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm)
{
  uint64 a, last;
  pte_t *pte;

  a = PGROUNDDOWN(va);               // address of the first page
  last = PGROUNDDOWN(va + size - 1); // address of the last page
  for(;;){
    // use walk to find the address of the last-level PTE for a
    if((pte = walk(pagetable, a, 1)) == 0)
      return -1;
    // a valid PTE here means this would be a remapping
    if(*pte & PTE_V)
      panic("remap");
    // bits 0-9 of the PTE are flags, bits 10-53 the physical page
    // number, bits 54-63 reserved
    *pte = PA2PTE(pa) | perm | PTE_V;
    // stop once the last page has been mapped
    if(a == last)
      break;
    // advance to the next page
    a += PGSIZE;
    pa += PGSIZE;
  }
  return 0;
}
// Remove npages of mappings starting from va. va must be
// page-aligned. The mappings must exist.
// Optionally free the physical memory.
void
uvmunmap(pagetable_t pagetable, uint64 va, uint64 npages, int do_free)
{
  uint64 a;
  pte_t *pte;

  // the virtual address must be page-aligned
  if((va % PGSIZE) != 0)
    panic("uvmunmap: not aligned");

  for(a = va; a < va + npages*PGSIZE; a += PGSIZE){
    // the PTE must exist
    if((pte = walk(pagetable, a, 0)) == 0)
      panic("uvmunmap: walk");
    // the PTE must be valid
    if((*pte & PTE_V) == 0)
      panic("uvmunmap: not mapped");
    // a PTE with only PTE_V set is an intermediate page-table entry,
    // not a leaf
    if(PTE_FLAGS(*pte) == PTE_V)
      panic("uvmunmap: not a leaf");
    // optionally free the physical page
    if(do_free){
      uint64 pa = PTE2PA(*pte);
      kfree((void*)pa);
    }
    // clear the PTE
    *pte = 0;
  }
}
// create an empty user page table.
// returns 0 if out of memory.
pagetable_t
uvmcreate()
{
  pagetable_t pagetable;
  pagetable = (pagetable_t) kalloc();
  if(pagetable == 0)
    return 0;
  // zero the new user page table
  memset(pagetable, 0, PGSIZE);
  return pagetable;
}
// Load the user initcode into address 0 of pagetable,
// for the very first process.
// sz must be less than a page.
// called as uvminit(p->pagetable, initcode, sizeof(initcode));
void
uvminit(pagetable_t pagetable, uchar *src, uint sz)
{
  char *mem;

  // initcode.S must fit in one page
  if(sz >= PGSIZE)
    panic("inituvm: more than a page");
  mem = kalloc();
  // zero the new page
  memset(mem, 0, PGSIZE);
  // map the process's virtual address 0 to the newly allocated physical page
  mappages(pagetable, 0, PGSIZE, (uint64)mem, PTE_W|PTE_R|PTE_X|PTE_U);
  // fill the physical page with the initcode
  memmove(mem, src, sz);
}
// Allocate PTEs and physical memory to grow process from oldsz to
// newsz, which need not be page aligned. Returns new size or 0 on error.
uint64
uvmalloc(pagetable_t pagetable, uint64 oldsz, uint64 newsz)
{
  char *mem;
  uint64 a;

  if(newsz < oldsz)
    return oldsz;

  // the next address that needs a mapping
  oldsz = PGROUNDUP(oldsz);
  for(a = oldsz; a < newsz; a += PGSIZE){
    // allocate one page of physical memory
    mem = kalloc();
    if(mem == 0){
      // allocation failed midway: shrink back to the original size
      uvmdealloc(pagetable, a, oldsz);
      return 0;
    }
    // zero the page's contents
    memset(mem, 0, PGSIZE);
    // map the newly extended virtual address to the new physical page
    if(mappages(pagetable, a, PGSIZE, (uint64)mem, PTE_W|PTE_X|PTE_R|PTE_U) != 0){
      // mapping failed: free this page and shrink back to the original size
      kfree(mem);
      uvmdealloc(pagetable, a, oldsz);
      return 0;
    }
  }
  return newsz;
}
// Deallocate user pages to bring the process size from oldsz to
// newsz. oldsz and newsz need not be page-aligned, nor does newsz
// need to be less than oldsz. oldsz can be larger than the actual
// process size. Returns the new process size.
uint64
uvmdealloc(pagetable_t pagetable, uint64 oldsz, uint64 newsz)
{
  if(newsz >= oldsz)
    return oldsz;

  // unmap the shrunk part of the user page table
  if(PGROUNDUP(newsz) < PGROUNDUP(oldsz)){
    // the number of pages to unmap
    int npages = (PGROUNDUP(oldsz) - PGROUNDUP(newsz)) / PGSIZE;
    // remove npages starting at PGROUNDUP(newsz) and free the
    // underlying physical memory (do_free=1)
    uvmunmap(pagetable, PGROUNDUP(newsz), npages, 1);
  }

  return newsz;
}
// Recursively free page-table pages.
// All leaf mappings must already have been removed.
void
freewalk(pagetable_t pagetable)
{
  // there are 2^9 = 512 PTEs in a page table.
  for(int i = 0; i < 512; i++){
    pte_t pte = pagetable[i];
    if((pte & PTE_V) && (pte & (PTE_R|PTE_W|PTE_X)) == 0){
      // this PTE points to a lower-level page table;
      // recurse into it to free its page-table pages
      uint64 child = PTE2PA(pte);
      freewalk((pagetable_t)child);
      // the lower-level page table is freed, so zero the PTE
      pagetable[i] = 0;
    } else if(pte & PTE_V){
      // all leaf mappings should have been removed before this point
      panic("freewalk: leaf");
    }
  }
  // once every PTE is zero, free this page-table page itself
  kfree((void*)pagetable);
}
// Given a parent process's page table, copy
// its memory into a child's page table.
// Copies both the page table and the
// physical memory.
// returns 0 on success, -1 on failure.
// frees any allocated pages on failure.
// fork uses this to allocate physical memory for the child and
// copy the parent's memory into it
int
uvmcopy(pagetable_t old, pagetable_t new, uint64 sz)
{
  pte_t *pte;
  uint64 pa, i;
  uint flags;
  char *mem;
// mark a PTE invalid for user access.
// used by exec for the user stack guard page.
// the page becomes inaccessible in user mode, so a stack overflow
// faults with an access to privileged memory
void
uvmclear(pagetable_t pagetable, uint64 va)
{
  pte_t *pte;

  pte = walk(pagetable, va, 0);
  if(pte == 0)
    panic("uvmclear");
  *pte &= ~PTE_U;
}
// Copy from kernel to user.
// Copy len bytes from src to virtual address dstva in a given page table.
// Return 0 on success, -1 on error.
int
copyout(pagetable_t pagetable, uint64 dstva, char *src, uint64 len)
{
  uint64 n, va0, pa0;
// Copy from user to kernel.
// Copy len bytes to dst from virtual address srcva in a given page table.
// Return 0 on success, -1 on error.
int
copyin(pagetable_t pagetable, char *dst, uint64 srcva, uint64 len)
{
  uint64 n, va0, pa0;
void
freerange(void *pa_start, void *pa_end)
{
  char *p;
  p = (char*)PGROUNDUP((uint64)pa_start);
  for(; p + PGSIZE <= (char*)pa_end; p += PGSIZE)
    kfree(p);
}
// Free the page of physical memory pointed at by v,
// which normally should have been returned by a
// call to kalloc(). (The exception is when
// initializing the allocator; see kinit above.)
void
kfree(void *pa)
{
  struct run *r;

  if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
    panic("kfree");

  // Fill with junk to catch dangling refs.
  // code that uses memory after freeing it (via a dangling reference)
  // will read junk instead of the old valid contents, and will
  // hopefully crash sooner
  memset(pa, 1, PGSIZE);

  r = (struct run*)pa;

  acquire(&kmem.lock);
  r->next = kmem.freelist;
  kmem.freelist = r;
  release(&kmem.lock);
}
// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
void *
kalloc(void)
{
  struct run *r;

  acquire(&kmem.lock);
  r = kmem.freelist;
  if(r)
    kmem.freelist = r->next;
  release(&kmem.lock);

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  // kalloc removes and returns the first element of the free list
  return (void*)r;
}
  // Allocate two pages at the next page boundary.
  // Use the second as the user stack.
  sz = PGROUNDUP(sz);
  uint64 sz1;
  if((sz1 = uvmalloc(pagetable, sz, sz + 2*PGSIZE)) == 0)
    goto bad;
  sz = sz1;
  // the lower of the two pages is the inaccessible guard page
  uvmclear(pagetable, sz-2*PGSIZE);
  sp = sz;
  stackbase = sp - PGSIZE;
  // arguments to user main(argc, argv)
  // argc is returned via the system call return
  // value, which goes in a0.
  p->trapframe->a1 = sp;
  // Save program name for debugging.
  for(last=s=path; *s; s++)
    if(*s == '/')
      last = s+1;
  safestrcpy(p->name, last, sizeof(p->name));

  // Commit to the user image, i.e. switch to the new process image.
  oldpagetable = p->pagetable;
  p->pagetable = pagetable;
  p->sz = sz;
  p->trapframe->epc = elf.entry; // initial program counter = main
  p->trapframe->sp = sp;         // initial stack pointer
  proc_freepagetable(oldpagetable, oldsz);
return argc; // this ends up in a0, the first argument to main(argc, argv)
Three kinds of events cause the CPU to set aside ordinary instruction execution and force a transfer of control to special code that handles the event. One is a system call, when a user program executes the ecall instruction to ask the kernel to do something for it. Another is an exception: an instruction (user or kernel) does something illegal, such as dividing by zero or using an invalid virtual address. The third is a device interrupt, when a device signals that it needs attention, for example when the disk hardware finishes a read or write request.
        # a0 is no longer valid, since the kernel page
        # table does not specially map p->tf.
        # a0 (p->trapframe) is no longer usable: we have switched to the
        # kernel page table, which has no mapping for the trapframe page
        # jump to usertrap(), which does not return
        jr t0
.globl userret
userret:
        # userret(TRAPFRAME, pagetable) switches from kernel space
        # back to user space
        # a0: TRAPFRAME, in user page table.
        # a1: user page table, for satp.
        # switch to the user page table.
        csrw satp, a1
        sfence.vma zero, zero
        # put the saved user a0 in sscratch, so we
        # can swap it with our a0 (TRAPFRAME) in the last step.
        ld t0, 112(a0)
        csrw sscratch, t0
        # restore user a0, and save TRAPFRAME in sscratch,
        # ready for the next trap
        csrrw a0, sscratch, a0

        # return to user mode and the saved user pc.
        # usertrapret() set up sstatus and sepc.
        sret
// set up to take exceptions and traps while in the kernel,
// i.e. install the kernel-space trap handler entry point.
void
trapinithart(void)
{
  w_stvec((uint64)kernelvec);
}
//
// handle an interrupt, exception, or system call from user space.
// called from trampoline.S
//
void
usertrap(void)
{
  int which_dev = 0;
  // the SPP bit indicates whether the trap came from user or supervisor
  // mode, and controls which mode sret returns to; usertrap only handles
  // interrupts, exceptions, and system calls from user space
  if((r_sstatus() & SSTATUS_SPP) != 0)
    panic("usertrap: not from user mode");
  // send interrupts and exceptions to kerneltrap(),
  // since we're now in the kernel: any further trap
  // will be handled by kernelvec
  w_stvec((uint64)kernelvec);
  // give up the CPU if this is a timer interrupt,
  // so the scheduler can pick another process to run.
  if(which_dev == 2)
    yield();

  usertrapret();
}
//
// return to user space.
// first set the RISC-V control registers up for a future trap
// from user space.
//
void
usertrapret(void)
{
  struct proc *p = myproc();

  // we're about to switch the destination of traps from
  // kerneltrap() to usertrap(), so turn off interrupts until
  // we're back in user space, where usertrap() is correct.
  intr_off();

  // send syscalls, interrupts, and exceptions to trampoline.S,
  // i.e. point stvec at uservec
  w_stvec(TRAMPOLINE + (uservec - trampoline));

  // set up trapframe values that uservec will need when
  // the process next re-enters the kernel.
  p->trapframe->kernel_satp = r_satp();         // kernel page table
  p->trapframe->kernel_sp = p->kstack + PGSIZE; // process's kernel stack
  p->trapframe->kernel_trap = (uint64)usertrap;
  p->trapframe->kernel_hartid = r_tp();         // hartid for cpuid()

  // set up the registers that trampoline.S's sret will use
  // to get to user space.

  // set S Previous Privilege mode to User,
  // so that sret returns to user mode.
  unsigned long x = r_sstatus();
  x &= ~SSTATUS_SPP; // clear SPP to 0 for user mode
  x |= SSTATUS_SPIE; // enable interrupts in user mode
  w_sstatus(x);

  // set S Exception Program Counter to the saved user pc.
  w_sepc(p->trapframe->epc);

  // tell trampoline.S the user page table to switch to.
  uint64 satp = MAKE_SATP(p->pagetable);

  // jump to trampoline.S at the top of memory, which
  // switches to the user page table, restores user registers,
  // and switches to user mode with sret. this calls userret.
  uint64 fn = TRAMPOLINE + (userret - trampoline);
  ((void (*)(uint64,uint64))fn)(TRAPFRAME, satp);
}
// interrupts and exceptions from kernel code go here via kernelvec,
// on whatever the current kernel stack is.
void
kerneltrap()
{
  int which_dev = 0;
  uint64 sepc = r_sepc();
  uint64 sstatus = r_sstatus();
  uint64 scause = r_scause();

  if((sstatus & SSTATUS_SPP) == 0)
    panic("kerneltrap: not from supervisor mode");
  if(intr_get() != 0)
    panic("kerneltrap: interrupts enabled");

  // dispatch to the device-interrupt handler; an unexpected trap
  // from the kernel is fatal
  if((which_dev = devintr()) == 0){
    printf("scause %p\n", scause);
    printf("sepc=%p stval=%p\n", r_sepc(), r_stval());
    panic("kerneltrap");
  }

  // give up the CPU if this is a timer interrupt.
  if(which_dev == 2 && myproc() != 0 && myproc()->state == RUNNING)
    yield();

  // the yield() may have caused some traps to occur,
  // so restore trap registers for use by kernelvec.S's sepc instruction.
  w_sepc(sepc);
  w_sstatus(sstatus);
}
// check if it's an external interrupt or software interrupt,
// and handle it.
// returns 2 if timer interrupt,
// 1 if other device,
// 0 if not recognized.
int
devintr()
{
  uint64 scause = r_scause();

  if((scause & 0x8000000000000000L) &&
     (scause & 0xff) == 9){
    // this is a supervisor external interrupt, via the PLIC.

    // irq indicates which device interrupted.
    int irq = plic_claim();

    if(irq == UART0_IRQ){
      // UART (keyboard/serial) interrupt
      uartintr();
    } else if(irq == VIRTIO0_IRQ){
      // disk interrupt
      virtio_disk_intr();
    } else if(irq){
      printf("unexpected interrupt irq=%d\n", irq);
    }

    // the PLIC allows each device to raise at most one
    // interrupt at a time; tell the PLIC the device is
    // now allowed to interrupt again.
    if(irq)
      plic_complete(irq);

    return 1;
  } else if(scause == 0x8000000000000001L){
    // software interrupt from a machine-mode timer interrupt,
    // forwarded by timervec in kernelvec.S.

    if(cpuid() == 0){
      clockintr();
    }

    // acknowledge the software interrupt by clearing
    // the SSIP bit in sip.
    w_sip(r_sip() & ~2);

    return 2;
  } else {
    return 0;
  }
}
// Fetch the uint64 at addr from the current process.
int
fetchaddr(uint64 addr, uint64 *ip)
{
  struct proc *p = myproc();
  if(addr >= p->sz || addr+sizeof(uint64) > p->sz)
    return -1;
  if(copyin(p->pagetable, (char *)ip, addr, sizeof(*ip)) != 0)
    return -1;
  return 0;
}
// Fetch the nul-terminated string at addr from the current process.
// Returns length of string, not including nul, or -1 for error.
// safely copies the string from the user-supplied address
int
fetchstr(uint64 addr, char *buf, int max)
{
  struct proc *p = myproc();
  int err = copyinstr(p->pagetable, buf, addr, max);
  if(err < 0)
    return err;
  return strlen(buf);
}
// Fetch the nth 32-bit system call argument as an integer.
int
argint(int n, int *ip)
{
  *ip = argraw(n);
  return 0;
}
// Retrieve the nth system call argument as a pointer.
// Doesn't check for legality, since
// copyin/copyout will do that.
int
argaddr(int n, uint64 *ip)
{
  *ip = argraw(n);
  return 0;
}
// Fetch the nth word-sized system call argument as a null-terminated string.
// Copies into buf, at most max.
// Returns string length if OK (including nul), -1 if error.
int
argstr(int n, char *buf, int max)
{
  uint64 addr;
  if(argaddr(n, &addr) < 0)
    return -1;
  return fetchstr(addr, buf, max);
}
==The basic design of COW fork is that the parent and child initially share all physical pages, but with the mappings set read-only. When the child or parent executes a store instruction, the RISC-V CPU raises a page-fault exception. In response, the kernel makes a copy of the page containing the faulting address, maps one copy read/write in the child's address space and the other copy read/write in the parent's address space, and then resumes execution at the faulting instruction. Because the kernel has updated the relevant PTE to allow writes, the faulting instruction now executes without a fault.==
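The stock xv6 kernel does not implement COW fork (it is a lab exercise); the sketch below shows roughly what the store-fault path described above could look like. cowfault and PTE_COW are hypothetical names, and a real version needs per-page reference counts rather than the plain kfree shown here:

// Hypothetical handler, called from usertrap() when r_scause() == 15
// (store page fault). PTE_COW is an assumed software-defined PTE bit
// marking copy-on-write mappings.
int
cowfault(pagetable_t pagetable, uint64 va)
{
  if(va >= MAXVA)
    return -1;
  pte_t *pte = walk(pagetable, va, 0);
  if(pte == 0 || (*pte & PTE_V) == 0 || (*pte & PTE_COW) == 0)
    return -1;

  uint64 old = PTE2PA(*pte);
  char *mem = kalloc();            // the faulting process's private copy
  if(mem == 0)
    return -1;
  memmove(mem, (char*)old, PGSIZE);

  // map the copy writable and drop the COW mark.
  uint flags = (PTE_FLAGS(*pte) | PTE_W) & ~PTE_COW;
  *pte = PA2PTE((uint64)mem) | flags;

  kfree((void*)old);               // real code: decrement a reference count
  sfence_vma();                    // flush the stale translation
  return 0;
}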
//
// send one character to the uart.
// called by printf, and to echo input characters,
// but not from write().
//
void
consputc(int c)
{
  if(c == BACKSPACE){
    // if the user typed backspace, overwrite with a space.
    uartputc_sync('\b');
    uartputc_sync(' ');
    uartputc_sync('\b');
  } else {
    uartputc_sync(c);
  }
}
// To software, the UART hardware looks like a set of memory-mapped
// control registers: some RISC-V physical addresses are wired to the
// UART device, so loads and stores there interact with the device
// hardware rather than RAM.
#define Reg(reg) ((volatile unsigned char *)(UART0 + reg))

#define LSR 5               // line status register
#define LSR_RX_READY (1<<0) // input is waiting to be read from RHR
#define LSR_TX_IDLE (1<<5)  // THR can accept another character to send
// alternate version of uartputc() that doesn't
// use interrupts, for use by kernel printf() and
// to echo characters. it spins waiting for the uart's
// output register to be empty.
void
uartputc_sync(int c)
{
  push_off();

  if(panicked){
    for(;;)
      ;
  }

  // wait for Transmit Holding Empty to be set in LSR.
  while((ReadReg(LSR) & LSR_TX_IDLE) == 0)
    ;
  WriteReg(THR, c);

  pop_off();
}
// check if it's an external interrupt or software interrupt,
// and handle it.
// returns 2 if timer interrupt,
// 1 if other device,
// 0 if not recognized.
int
devintr()
{
  uint64 scause = r_scause();

  if((scause & 0x8000000000000000L) &&
     (scause & 0xff) == 9){
    // this is a supervisor external interrupt, via PLIC.

    // irq indicates which device interrupted.
    int irq = plic_claim();

    if(irq == UART0_IRQ){
      uartintr();
    } else if(irq == VIRTIO0_IRQ){
      virtio_disk_intr();
    } else if(irq){
      printf("unexpected interrupt irq=%d\n", irq);
    }

    // the PLIC allows each device to raise at most one
    // interrupt at a time; tell the PLIC the device is
    // now allowed to interrupt again.
    if(irq)
      plic_complete(irq);

    return 1;
  } else if(scause == 0x8000000000000001L){
    // software interrupt from a machine-mode timer interrupt,
    // forwarded by timervec in kernelvec.S.

    if(cpuid() == 0){
      clockintr();
    }

    // acknowledge the software interrupt by clearing
    // the SSIP bit in sip.
    w_sip(r_sip() & ~2);

    return 2;
  } else {
    return 0;
  }
}
#define LSR 5               // line status register
#define LSR_RX_READY (1<<0) // input is waiting to be read from RHR
#define LSR_TX_IDLE (1<<5)  // THR can accept another character to send
// read one input character from the UART.
// return -1 if none is waiting.
int
uartgetc(void)
{
  if(ReadReg(LSR) & 0x01){
    // input data is ready.
    return ReadReg(RHR);
  } else {
    return -1;
  }
}
// handle a uart interrupt, raised because input has
// arrived, or the uart is ready for more output, or
// both. called from trap.c.
void
uartintr(void)
{
  // read and process incoming characters.
  while(1){
    // read one input character from the UART
    int c = uartgetc();
    if(c == -1)
      break;
    // append the character to cons.buf and wake up
    // any consoleread() that may be sleeping
    consoleintr(c);
  }

  // send buffered characters.
  acquire(&uart_tx_lock);
  uartstart();
  release(&uart_tx_lock);
}
#define INPUT_BUF 128

struct {
  // serializes access to the console hardware, avoiding interleaved output
  struct spinlock lock;

  char buf[INPUT_BUF];
  uint r; // Read index
  uint w; // Write index
  uint e; // Edit index
} cons;
//
// the console input interrupt handler.
// uartintr() calls this for input characters.
// do erase/kill processing, append to cons.buf,
// wake up consoleread() if a whole line has arrived.
//
void
consoleintr(int c)
{
  acquire(&cons.lock);
  switch(c){
  case C('P'): // Print process list.
    procdump();
    break;
  case C('U'): // Kill line.
    while(cons.e != cons.w &&
          cons.buf[(cons.e-1) % INPUT_BUF] != '\n'){
      cons.e--;
      consputc(BACKSPACE);
    }
    break;
  case C('H'): // Backspace
  case '\x7f':
    if(cons.e != cons.w){
      cons.e--;
      consputc(BACKSPACE);
    }
    break;
  default:
    if(c != 0 && cons.e-cons.r < INPUT_BUF){
      c = (c == '\r') ? '\n' : c;

      // echo back to the user's screen.
      consputc(c);

      // store for consumption by consoleread().
      cons.buf[cons.e++ % INPUT_BUF] = c;

      if(c == '\n' || c == C('D') || cons.e == cons.r+INPUT_BUF){
        // wake up consoleread() if a whole line (or end-of-file)
        // has arrived.
        cons.w = cons.e;
        wakeup(&cons.r);
      }
    }
    break;
  }

  release(&cons.lock);
}
// Print a process listing to console. For debugging.
// Runs when user types ^P on console.
// No lock to avoid wedging a stuck machine further.
void
procdump(void)
{
  static char *states[] = {
  [UNUSED]    "unused",
  [SLEEPING]  "sleep ",
  [RUNNABLE]  "runble",
  [RUNNING]   "run   ",
  [ZOMBIE]    "zombie"
  };
  struct proc *p;
  char *state;
//
// user read()s from the console go here.
// copy (up to) a whole input line to dst.
// user_dst indicates whether dst is a user
// or kernel address.
// reads keyboard input and copies it to the destination address
// (in practice the data ends up in the shell's buffer)
//
int
consoleread(int user_dst, uint64 dst, int n)
{
  uint target;
  int c;
  char cbuf;
  target = n;
  acquire(&cons.lock);
  while(n > 0){
    // wait until the interrupt handler has put some
    // input into cons.buffer.
    while(cons.r == cons.w){
      if(myproc()->killed){
        release(&cons.lock);
        return -1;
      }
      // cons.buf is empty: sleep until the interrupt handler
      // has put some input into it
      sleep(&cons.r, &cons.lock);
    }

    c = cons.buf[cons.r++ % INPUT_BUF];

    if(c == C('D')){ // end-of-file
      if(n < target){
        // Save ^D for next time, to make sure
        // caller gets a 0-byte result.
        cons.r--;
      }
      break;
    }

    // copy the input byte to the user-space buffer.
    cbuf = c;
    if(either_copyout(user_dst, dst, &cbuf, 1) == -1)
      break;

    dst++;
    --n;

    if(c == '\n'){
      // a whole line has arrived; return to
      // the user-level read().
      break;
    }
  }
  release(&cons.lock);

  return target - n;
}
// Write to file f.
// addr is a user virtual address.
int
filewrite(struct file *f, uint64 addr, int n)
{
  int r, ret = 0;

  // check the file's write permission
  if(f->writable == 0)
    return -1;

  // dispatch on the file type
  if(f->type == FD_PIPE){
    ret = pipewrite(f->pipe, addr, n);
  } else if(f->type == FD_DEVICE){
    if(f->major < 0 || f->major >= NDEV || !devsw[f->major].write)
      return -1;
    // the "everything is a file" idea, as in Linux: a file write bottoms
    // out in the read/write interface of the underlying device
    ret = devsw[f->major].write(1, addr, n);
  } else if(f->type == FD_INODE){
    // write a few blocks at a time to avoid exceeding
    // the maximum log transaction size, including
    // i-node, indirect block, allocation blocks,
    // and 2 blocks of slop for non-aligned writes.
    // this really belongs lower down, since writei()
    // might be writing a device like the console.
    int max = ((MAXOPBLOCKS-1-1-2) / 2) * BSIZE;
    int i = 0;
    while(i < n){
      int n1 = n - i;
      if(n1 > max)
        n1 = max;

      begin_op();
      ilock(f->ip);
      if ((r = writei(f->ip, 1, addr + i, f->off, n1)) > 0)
        f->off += r;
      iunlock(f->ip);
      end_op();

      if(r < 0)
        break;
      if(r != n1)
        panic("short filewrite");
      i += r;
    }
    ret = (i == n ? n : -1);
  } else {
    panic("filewrite");
  }

  return ret;
}
consolewrite in console.c:
//
// user write()s to the console go here.
// user_src=1 means src is a user-space virtual address; src is the
// address of the character data, and n the number of characters to write.
//
int
consolewrite(int user_src, uint64 src, int n)
{
  int i;

  acquire(&cons.lock);
  for(i = 0; i < n; i++){
    char c;
    if(either_copyin(&c, user_src, src+i, 1) == -1)
      break;
    // append one character to the output buffer
    uartputc(c);
  }
  release(&cons.lock);

  return i;
}
either_copyin in proc.c:
// Copy from either a user address, or kernel address,
// depending on usr_src.
// Returns 0 on success, -1 on error.
int
either_copyin(void *dst, int user_src, uint64 src, uint64 len)
{
  struct proc *p = myproc();
  if(user_src){
    // user-space data must be copied via copyin, using the user page table
    return copyin(p->pagetable, dst, src, len);
  } else {
    // kernel-space data can be copied directly with memmove,
    // since we are already in kernel mode
    memmove(dst, (char*)src, len);
    return 0;
  }
}
// the transmit output buffer.
struct spinlock uart_tx_lock;
#define UART_TX_BUF_SIZE 32
char uart_tx_buf[UART_TX_BUF_SIZE];
int uart_tx_w; // write next to uart_tx_buf[uart_tx_w++]
int uart_tx_r; // read next from uart_tx_buf[uart_tx_r++]
// add a character to the output buffer and tell the
// UART to start sending if it isn't already.
// blocks if the output buffer is full.
// because it may block, it can't be called
// from interrupts; it's only suitable for use
// by write().
void
uartputc(int c)
{
  acquire(&uart_tx_lock);
  if(panicked){
    for(;;)
      ;
  }

  // the outer while loop guards against spurious wakeups
  while(1){
    if(((uart_tx_w + 1) % UART_TX_BUF_SIZE) == uart_tx_r){
      // buffer is full.
      // wait for uartstart() to open up space in the buffer:
      // sleep on uart_tx_r until uartstart wakes us
      sleep(&uart_tx_r, &uart_tx_lock);
    } else {
      // normally there is room, so just append the character
      uart_tx_buf[uart_tx_w] = c;
      uart_tx_w = (uart_tx_w + 1) % UART_TX_BUF_SIZE;
      // data is ready, start sending
      uartstart();
      release(&uart_tx_lock);
      return;
    }
  }
}
// if the UART is idle, and a character is waiting
// in the transmit buffer, send it.
// caller must hold uart_tx_lock.
// called from both the top- and bottom-half.
void
uartstart()
{
  while(1){
    if(uart_tx_w == uart_tx_r){
      // transmit buffer is empty; nothing to do.
      return;
    }

    if((ReadReg(LSR) & LSR_TX_IDLE) == 0){
      // the UART transmit holding register is full,
      // so we cannot give it another byte.
      // it will interrupt when it's ready for a new byte,
      // and uartintr will then call this function again.
      return;
    }

    int c = uart_tx_buf[uart_tx_r];
    uart_tx_r = (uart_tx_r + 1) % UART_TX_BUF_SIZE;

    // maybe uartputc() is waiting for space in the buffer;
    // having consumed a character, wake it up.
    wakeup(&uart_tx_r);

    WriteReg(THR, c);
  }
}
// handle a uart interrupt, raised because input has
// arrived, or the uart is ready for more output, or
// both. called from trap.c.
// uartintr handles not only keyboard input interrupts but also the
// UART hardware's output-complete interrupts.
void
uartintr(void)
{
  // read and process incoming characters.
  while(1){
    // read one input character from the UART
    int c = uartgetc();
    if(c == -1)
      break;
    // append the character to cons.buf and wake up
    // any consoleread() that may be sleeping
    consoleintr(c);
  }

  // send buffered characters.
  acquire(&uart_tx_lock);
  uartstart();
  release(&uart_tx_lock);
}
start.c, which runs in machine mode before main, sets up the machine to receive timer interrupts. Part of the work is programming the CLINT hardware (core-local interruptor) to generate an interrupt after a fixed interval. Another part is setting up a scratch area, analogous to the trapframe, to help the timer interrupt handler save registers and find the addresses of the CLINT registers. Finally, start sets mtvec to timervec and enables timer interrupts.
// set up to receive timer interrupts in machine mode,
// which arrive at timervec in kernelvec.S,
// which turns them into software interrupts for
// devintr() in trap.c.
void
timerinit()
{
  // each CPU has a separate source of timer interrupts.
  // first read this CPU's hart ID
  int id = r_mhartid();

  // ask the CLINT for a timer interrupt.
  // set the interrupt interval; here about 0.1 seconds
  int interval = 1000000; // cycles; about 1/10th second in qemu.
  *(uint64*)CLINT_MTIMECMP(id) = *(uint64*)CLINT_MTIME + interval;

  // prepare information in scratch[] for timervec.
  // scratch[0..3] : space for timervec to save registers.
  // scratch[4] : address of CLINT MTIMECMP register.
  // scratch[5] : desired interval (in cycles) between timer interrupts.
  uint64 *scratch = &mscratch0[32 * id];
  scratch[4] = CLINT_MTIMECMP(id);
  scratch[5] = interval;
  w_mscratch((uint64)scratch);

  // set the machine-mode timer trap handler.
  w_mtvec((uint64)timervec);

  // enable machine-mode interrupts.
  w_mstatus(r_mstatus() | MSTATUS_MIE);

  // enable machine-mode timer interrupts.
  w_mie(r_mie() | MIE_MTIE);
}
// check if it's an external interrupt or software interrupt,
// and handle it.
// returns 2 if timer interrupt,
// 1 if other device,
// 0 if not recognized.
int
devintr()
{
  uint64 scause = r_scause();

  if((scause & 0x8000000000000000L) &&
     (scause & 0xff) == 9){
    // this is a supervisor external interrupt, via PLIC.

    // irq indicates which device interrupted.
    int irq = plic_claim();

    if(irq == UART0_IRQ){
      uartintr();
    } else if(irq == VIRTIO0_IRQ){
      virtio_disk_intr();
    } else if(irq){
      printf("unexpected interrupt irq=%d\n", irq);
    }

    // the PLIC allows each device to raise at most one
    // interrupt at a time; tell the PLIC the device is
    // now allowed to interrupt again.
    if(irq)
      plic_complete(irq);

    return 1;
  } else if(scause == 0x8000000000000001L){
    // software interrupt from a machine-mode timer interrupt,
    // forwarded by timervec in kernelvec.S.

    if(cpuid() == 0){
      clockintr();
    }

    // acknowledge the software interrupt by clearing
    // the SSIP bit in sip.
    w_sip(r_sip() & ~2);

    return 2;
  } else {
    return 0;
  }
}
The machine-mode timer interrupt vector is timervec. It saves a few registers in the scratch area prepared by start, tells the CLINT when to generate the next timer interrupt, asks the RISC-V to raise a software interrupt, restores the registers, and returns. There is no C code in the timer interrupt handler.
struct cpu cpus[NCPU];

struct proc proc[NPROC];

int nextpid = 1;
struct spinlock pid_lock;
// Switch to scheduler. Must hold only p->lock
// and have changed proc->state. Saves and restores
// intena because intena is a property of this
// kernel thread, not this CPU. It should
// be proc->intena and proc->noff, but that would
// break in the few places where a lock is held but
// there's no process.
void
sched(void)
{
  int intena;
  struct proc *p = myproc();
// Give up the CPU for one scheduling round.
void
yield(void)
{
  struct proc *p = myproc();
  acquire(&p->lock);
  p->state = RUNNABLE;
  // p->lock cannot be released here: another CPU's scheduler thread could
  // see the process as RUNNABLE and run it, and two cores executing the
  // same thread on the same stack would corrupt the stack instantly.
  // only the scheduler thread may release p->lock, because by then the
  // thread is running on the scheduler's kernel stack and this thread has
  // fully given up the CPU.
  sched();
  release(&p->lock);
}
// Per-CPU process scheduler.
// Each CPU calls scheduler() after setting itself up.
// Scheduler never returns.  It loops, doing:
//  - choose a process to run.
//  - swtch to start running that process.
//  - eventually that process transfers control
//    via swtch back to the scheduler.
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    // Avoid deadlock by ensuring that devices can interrupt.
    // the CPU core enables device interrupts by setting SSTATUS_SIE
    intr_on();

    int found = 0;
    for(p = proc; p < &proc[NPROC]; p++) {
      acquire(&p->lock);
      if(p->state == RUNNABLE) {
        // Switch to chosen process.  It is the process's job
        // to release its lock and then reacquire it
        // before jumping back to us.
        p->state = RUNNING;
        c->proc = p;
        // switch to the chosen process
        swtch(&c->context, &p->context);

        // Process is done running for now.
        // It should have changed its p->state before coming back.
        c->proc = 0;
        found = 1;
      }
      // this step matters: whenever execution arrives here via
      // sched()-->scheduler()-->swtch(), the previously held p->lock
      // is released. a concrete example: after sleep() reaches this
      // point along that call chain and releases p->lock, wakeup()
      // can acquire p->lock and wake p.
      release(&p->lock);
    }
    if(found == 0) {
      intr_on();
      asm volatile("wfi");
    }
  }
}
==Interrupts are useful when a device needs attention at unpredictable times, and not too often. But interrupts have high CPU overhead. Thus high-speed devices, such as network and disk controllers, use tricks that reduce the need for interrupts. One trick is to raise a single interrupt for a whole batch of incoming or outgoing requests. Another trick is for the driver to disable interrupts entirely and to check the device periodically to see if it needs attention. This technique is called polling. Polling makes sense if the device performs operations very quickly, but it wastes CPU time if the device is mostly idle. Some drivers dynamically switch between polling and interrupts depending on the current device load.==
Most kernels, including xv6, interleave the execution of multiple tasks. Multiprocessor hardware interleaves them physically: computers with multiple CPUs executing independently, such as xv6's RISC-V. These CPUs share physical RAM, and xv6 exploits the sharing to maintain data structures that all CPUs read and write. This sharing raises the possibility of one CPU reading a data structure while another CPU is mid-way through updating it, or even of multiple CPUs updating the same data simultaneously; without careful design, such parallel access is likely to yield incorrect results or broken data structures. Even on a uniprocessor, the kernel may switch the CPU among multiple threads, causing their execution to be interleaved. Finally, a device interrupt handler that modifies the same data as some interruptible code could damage the data if the interrupt occurs at just the wrong time. The word concurrency refers to situations in which multiple instruction streams are interleaved, due to multiprocessor parallelism, thread switching, or interrupts.
==The kernel is full of concurrently-accessed data. For example, two CPUs can call kalloc at the same time, thereby concurrently pushing at the head of the free-memory list. Kernel designers like to allow lots of concurrency, since it can improve performance through parallelism and improve responsiveness. However, as a result kernel designers spend a lot of effort convincing themselves the kernel is correct despite such concurrency.== There are many ways to arrive at correct code, some easier than others. Strategies aimed at correctness under concurrency, and the abstractions that support them, are called concurrency-control techniques.
xv6 uses a number of concurrency-control techniques, depending on the situation, and many more are possible. This chapter focuses on a widely used technique: the lock. Locks provide mutual exclusion, ensuring that only one CPU at a time can hold a given lock. If the programmer associates a lock with each shared data item, and the code always holds the associated lock when using an item, then the item can be used by only one CPU at a time. In this situation, we say the lock protects the data item. Although locks are an easy-to-understand concurrency-control mechanism, their downside is that they can kill performance, because they serialize concurrent operations.
6.1 Race Conditions
As an example of why we need locks, consider two processes calling wait on two different CPUs; wait frees the child's memory. Thus on each CPU the kernel calls kfree to free the child's memory pages. The kernel allocator maintains a linked list: kalloc() pops a page of memory from the free-page list, and kfree() pushes a page onto the free list. For best performance we might hope that the kfrees of the two parent processes would proceed in parallel, with neither having to wait for the other, but given xv6's kfree implementation this would be incorrect.
Figure 6.1 illustrates the setting in more detail: the linked list is in memory shared by the two CPUs, which manipulate it using load and store instructions. (In reality, the processors have caches, but conceptually a multiprocessor system behaves as if there were a single shared memory.) If there were no concurrent requests, you might implement the list push operation as follows:
struct element {
  int data;
  struct element *next;
};

struct element *list = 0;

void
push(int data)
{
  struct element *l;

  l = malloc(sizeof *l);
  l->data = data;
  l->next = list;
  list = l;
}
This implementation is correct if executed in isolation. However, the code is not correct if more than one copy executes concurrently. If two CPUs execute push at the same time, both might execute l->next = list before either executes list = l, which produces an incorrect result: both list elements end up with next set to the former value of list. When the two assignments to list happen, the second overwrites the first, and the element involved in the first assignment is lost.
This lost update is an example of a race condition. A race condition is a situation in which a memory location is accessed concurrently, and at least one of the accesses is a write. A race is often a sign of a bug, either a lost update (if the accesses are writes) or a read of an incompletely-updated data structure. ==The outcome of a race depends on the exact timing of the two CPUs involved and on how their memory operations are ordered by the memory system, which can make race-induced errors difficult to reproduce and debug. For example, adding print statements while debugging push might change the timing of the execution enough to make the race disappear.==
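The usual way to avoid the race is a lock around the update, as the xv6 book goes on to show; listlock is a name introduced here for the sketch:

struct spinlock listlock;

void
push(int data)
{
  struct element *l;

  l = malloc(sizeof *l);
  l->data = data;
  acquire(&listlock);
  // the two statements below now form a critical section:
  // no other CPU can interleave between them.
  l->next = list;
  list = l;
  release(&listlock);
}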
Although correct use of locks can make incorrect code correct, locks limit performance. For example, if two processes call kfree at the same time, the locks serialize the two calls, and we obtain no benefit from running them on different CPUs. We say that multiple processes conflict if they want the same lock at the same time, or that the lock experiences contention. A major challenge in kernel design is avoiding lock contention. xv6 does little of this, but sophisticated kernels organize data structures and algorithms specifically to avoid it. ==For the list example, a kernel could maintain a separate free-page list per CPU, touching another CPU's free list only when its own list is empty and it must steal memory from elsewhere. Other use cases may require more complicated designs.==
void
acquire(struct spinlock *lk) // does not work!
{
  for(;;){
    if(lk->locked == 0){
      lk->locked = 1;
      break;
    }
  }
}
Unfortunately, this implementation does not guarantee mutual exclusion on a multiprocessor. It could happen that two CPUs simultaneously reach the if statement, see that lk->locked is zero, and then both grab the lock by setting lk->locked = 1. Now two different CPUs hold the lock, which violates the mutual-exclusion property. What we need is a way to make the test and the set execute as a single atomic, i.e. indivisible, step.
==Because locks are widely used, multi-core processors usually provide atomic versions of such instructions. On RISC-V this instruction is amoswap r, a. amoswap reads the value at memory address a, writes the contents of register r to that address, and puts the value it read into r; that is, it swaps the contents of the register and the memory address. It executes this sequence atomically, using special hardware to prevent any other CPU from using the memory address between the read and the write.==
// Mutual exclusion spin lock.
struct spinlock {
  uint locked;       // Is the lock held? 0 means free, 1 means held.

  // For debugging:
  char *name;        // Name of lock.
  struct cpu *cpu;   // The cpu holding the lock.
};
// Acquire the lock.
// Loops (spins) until the lock is acquired.
void
acquire(struct spinlock *lk)
{
  // disable interrupts to avoid deadlock; this also records the
  // interrupt state at the start of the outermost critical section
  push_off();

  // a CPU that holds the lock must not acquire it again before
  // releasing it, or it deadlocks
  if(holding(lk))
    panic("acquire");
  // On RISC-V, sync_lock_test_and_set turns into an atomic swap:
  //   a5 = 1
  //   s1 = &lk->locked
  //   amoswap.w.aq a5, a5, (s1)
  // an atomic instruction: spin on the swap
  while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
    ;
  // Tell the C compiler and the processor to not move loads or stores
  // past this point, to ensure that the critical section's memory
  // references happen strictly after the lock is acquired.
  // On RISC-V, this emits a fence instruction.
  // a memory barrier, preventing instruction reordering
  __sync_synchronize();
  // Record info about lock acquisition for holding() and debugging.
  lk->cpu = mycpu();
}
  // Tell the C compiler and the CPU to not move loads or stores
  // past this point, to ensure that all the stores in the critical
  // section are visible to other CPUs before the lock is released,
  // and that loads in the critical section occur strictly before
  // the lock is released.
  // On RISC-V, this emits a fence instruction.
  // a memory barrier, preventing instruction reordering
  __sync_synchronize();
  // Release the lock, equivalent to lk->locked = 0.
  // This code doesn't use a C assignment, since the C standard
  // implies that an assignment might be implemented with
  // multiple store instructions.
  // On RISC-V, sync_lock_release turns into an atomic swap:
  //   s1 = &lk->locked
  //   amoswap.w zero, zero, (s1)
  // an atomic swap with zero
  __sync_lock_release(&lk->locked);

  pop_off();
}
// Check whether this cpu is holding the lock.
// Interrupts must be off.
int
holding(struct spinlock *lk)
{
  int r;
  r = (lk->locked && lk->cpu == mycpu());
  return r;
}
// push_off/pop_off are like intr_off()/intr_on() except that they are
// matched: it takes two pop_off()s to undo two push_off()s. Also, if
// interrupts are initially off, then push_off, pop_off leaves them off.
The rules above say when locks are necessary, but say nothing about when locks are unnecessary, and for efficiency it is important not to lock too much, since locks reduce parallelism. If parallelism isn't important, one could arrange to have only a single thread and not worry about locks. A simple kernel can do this on a multiprocessor by having a single lock that must be acquired on entering the kernel and released on exiting the kernel (though blocking system calls such as pipe reads or wait would pose a problem). Many uniprocessor operating systems have been converted to run on multiprocessors using this approach, sometimes called a big kernel lock, but the approach sacrifices parallelism: only one CPU can execute in the kernel at a time. If the kernel does any heavy computation, it would be more efficient to use a larger set of more fine-grained locks, so that the kernel could execute on multiple CPUs simultaneously.
As an example of coarse-grained locking, xv6's kalloc.c allocator has a single free-page list protected by a single lock. If multiple processes on different CPUs try to allocate pages at the same time, each must wait its turn by spinning in acquire. ==Spinning reduces performance, since it isn't useful work. If contention for the lock wasted a significant fraction of CPU time, performance might be improved by changing the allocator design to have multiple free-page lists, each with its own lock, to allow truly parallel allocation.==
It is natural to think of programs executing in the order in which the source code statements appear. However, many compilers and CPUs execute code out of order to achieve higher performance. If an instruction takes many cycles to complete, a CPU may issue the instruction early so that it can overlap with other instructions and avoid CPU stalls. For example, a CPU may notice that in a serial sequence of instructions A and B are not dependent on each other. The CPU may start instruction B first, either because its inputs are ready before A's inputs, or in order to overlap the execution of A and B. A compiler may perform a similar reordering, emitting the instructions for one statement before those of a statement that precedes it in the source.
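To see why this matters for locks, consider the locked push above. If the compiler or CPU were allowed to move the store to list past release() (an illustrative hazard, not something xv6's fences permit), the critical section would leak out of the lock:

  // what a harmful reordering would look like:
  acquire(&listlock);
  l->next = list;
  release(&listlock); // lock released here...
  list = l;           // ...but the shared list is still being updated.

This is exactly the kind of reordering that the __sync_synchronize() barriers in acquire and release rule out.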
Sometimes the kernel needs to hold a lock for a long time while yielding the CPU; xv6 provides such locks in the form of sleep-locks. acquiresleep yields the CPU while waiting, using techniques that will be explained in Chapter 7. At a high level, a sleep-lock has a locked field protected by a spinlock, and acquiresleep's call to sleep atomically yields the CPU and releases the spinlock. The result is that other threads can execute while acquiresleep waits. Because sleep-locks leave interrupts enabled, they cannot be used in interrupt handlers.
Because acquiresleep may yield the CPU, sleep-locks cannot be used inside spinlock critical sections (though spinlocks can be used inside sleep-lock critical sections). Spin-locks are best suited to short critical sections, since waiting for them wastes CPU time; sleep-locks work well for lengthy operations.
sleeplock.h
// Long-term locks for processes
struct sleeplock {
  uint locked;        // Is the lock held?
  // locked must be protected by a spinlock for correctness
  // under concurrency
  struct spinlock lk; // spinlock protecting this sleep lock

  // For debugging:
  char *name;         // Name of lock.
  int pid;            // Process holding lock
};
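For reference, a sketch of acquiresleep following xv6's sleeplock.c (releasesleep is symmetric: it sets locked back to 0 and calls wakeup):

void
acquiresleep(struct sleeplock *lk)
{
  acquire(&lk->lk);
  while (lk->locked) {
    // sleep() atomically yields the CPU and releases lk->lk,
    // reacquiring it when woken.
    sleep(lk, &lk->lk);
  }
  lk->locked = 1;
  lk->pid = myproc()->pid;
  release(&lk->lk);
}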
int
holdingsleep(struct sleeplock *lk)
{
  int r;

  acquire(&lk->lk);
  r = lk->locked && (lk->pid == myproc()->pid);
  release(&lk->lk);
  return r;
}
Chapter 7: Scheduling
Any operating system is likely to run more processes than the computer has CPUs, so a plan is needed to time-share the CPUs among the processes. Ideally the sharing is transparent to user processes. A common approach is to provide each process with the illusion that it has its own virtual CPU, by multiplexing the processes onto the hardware CPUs. This chapter explains how xv6 achieves this multiplexing.
7.1 Multiplexing
xv6 multiplexes by switching each CPU from one process to another in two situations. First, xv6's sleep and wakeup mechanism switches when a process waits for device or pipe I/O, waits for a child to exit, or waits in the sleep system call. Second, xv6 periodically forces a switch to cope with processes that compute for long periods. This multiplexing creates the illusion that each process has its own CPU, just as xv6 uses the memory allocator and hardware page tables to create the illusion that each process has its own memory.
Implementing multiplexing poses a few challenges. ==First, how does one switch from one process to another? Although the idea of context switching is simple, the implementation is tricky. Second, how can the forced switching be transparent to user processes? xv6 uses the standard technique of driving context switches with timer interrupts. Third, many CPUs may be switching among processes concurrently, and a locking plan is needed to avoid races. Fourth, a process's memory and other resources must be freed when the process exits, but it cannot do all of this itself: for example, it can't free its own kernel stack while still using it. Fifth, each core of a multi-core machine must remember which process it is executing so that system calls affect the correct process's kernel state. Finally, sleep and wakeup allow a process to give up the CPU and sleep waiting for an event, and allow another process to wake the first process up; care is needed to avoid races in which wakeups are lost. xv6 tries to solve these problems as simply as possible, but nevertheless the resulting code is tricky.==
7.2 Context Switching
Figure 7.1 outlines the steps involved in switching from one user process to another: a user-kernel transition (system call or interrupt) to the old process's kernel thread, a context switch to the current CPU's scheduler thread, a context switch to a new process's kernel thread, and a trap return to the user-level process. The xv6 scheduler has a dedicated thread (saved registers and a stack) per CPU, because it is not safe for the scheduler to execute on the old process's kernel stack: some other core might wake the process up and run it, and it would be a disaster to use the same stack on two different cores. In this section we examine the mechanics of switching between a kernel thread and a scheduler thread.
Switching from one thread to another involves saving the old thread's CPU registers and restoring the previously-saved registers of the new thread; the fact that the stack pointer and pc are saved and restored means that the CPU will switch stacks and switch what code it is executing.
The function swtch performs the saves and restores for a kernel thread switch. swtch doesn't directly know about threads; it just saves and restores register sets, called contexts. When it is time for a process to give up the CPU, the process's kernel thread calls swtch to save its own context and return to the scheduler context. Each context is contained in a struct context, itself contained in a process's struct proc or a CPU's struct cpu. swtch takes two arguments: struct context *old and struct context *new. It saves the current registers in old, loads registers from new, and returns.
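For reference, the register set swtch saves and restores, as declared in xv6's proc.h (only ra, sp, and the callee-saved registers need saving: swtch is called as a C function, so the caller-saved registers are already preserved on the stack by the calling convention):

// Saved registers for kernel context switches.
struct context {
  uint64 ra;  // return address: where swtch will "return" to
  uint64 sp;  // stack pointer: which stack the thread runs on

  // callee-saved registers
  uint64 s0;
  uint64 s1;
  uint64 s2;
  uint64 s3;
  uint64 s4;
  uint64 s5;
  uint64 s6;
  uint64 s7;
  uint64 s8;
  uint64 s9;
  uint64 s10;
  uint64 s11;
};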
// Per-CPU process scheduler.
// Each CPU calls scheduler() after setting itself up.
// Scheduler never returns.  It loops, doing:
//  - choose a process to run.
//  - swtch to start running that process.
//  - eventually that process transfers control
//    via swtch back to the scheduler.
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    // Avoid deadlock by ensuring that devices can interrupt.
    // the CPU core enables device interrupts by setting SSTATUS_SIE
    intr_on();

    int found = 0;
    for(p = proc; p < &proc[NPROC]; p++) {
      acquire(&p->lock);
      if(p->state == RUNNABLE) {
        // Switch to chosen process.  It is the process's job
        // to release its lock and then reacquire it
        // before jumping back to us.
        p->state = RUNNING;
        c->proc = p;

        // begin: "A kernel page table per process" lab change
        w_satp(MAKE_SATP(p->kernelpt));
        sfence_vma();
        // end: lab change

        // switch to the chosen process
        swtch(&c->context, &p->context);

        // Process is done running for now.
        // It should have changed its p->state before coming back.
        c->proc = 0;

        // begin: "A kernel page table per process" lab change
        w_satp(MAKE_SATP(kernel_pagetable));
        sfence_vma();
        // end: lab change

        found = 1;
      }
      // this step matters: whenever execution arrives here via
      // sched()-->scheduler()-->swtch(), the previously held p->lock
      // is released. a concrete example: after sleep() reaches this
      // point along that call chain and releases p->lock, wakeup()
      // can acquire p->lock and wake p.
      release(&p->lock);
    }
    if(found == 0) {
      intr_on();
      asm volatile("wfi");
    }
  }
}
==xv6 maintains a struct cpu for each CPU, which records the process currently running on that CPU (if any), the saved registers for the CPU's scheduler thread, and the count of nested spinlocks needed to manage interrupt disabling.== The function mycpu returns a pointer to the current CPU's struct cpu. RISC-V numbers its CPUs, giving each a hartid. xv6 ensures that each CPU's hartid is stored in that CPU's tp register while in the kernel. This allows mycpu to use tp to index an array of cpu structures to find the right one.
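The corresponding helpers, as they appear in xv6's proc.c:

// Must be called with interrupts disabled,
// to prevent a race with the process being
// moved to a different CPU.
int
cpuid()
{
  int id = r_tp(); // tp holds this CPU's hartid
  return id;
}

// Return this CPU's cpu struct.
// Interrupts must be disabled.
struct cpu*
mycpu(void)
{
  int id = cpuid();
  struct cpu *c = &cpus[id];
  return c;
}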
// Atomically release lock and sleep on chan.
// Reacquires lock when awakened.
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = myproc();

  // Must acquire p->lock in order to
  // change p->state and then call sched.
  // Once we hold p->lock, we can be
  // guaranteed that we won't miss any wakeup
  // (wakeup locks p->lock),
  // so it's okay to release lk.
  if(lk != &p->lock){  //DOC: sleeplock0
    acquire(&p->lock); //DOC: sleeplock1
    release(lk);
  }

  // Go to sleep.
  p->chan = chan;
  p->state = SLEEPING;

  // switch to the scheduler thread
  sched();

  // Tidy up.
  p->chan = 0;

  // Reacquire original lock.
  if(lk != &p->lock){
    release(&p->lock);
    acquire(lk);
  }
}
#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char data[PIPESIZE];
  uint nread;    // number of bytes read
  uint nwrite;   // number of bytes written
  int readopen;  // read fd is still open
  int writeopen; // write fd is still open
};
// Wait for a child process to exit and return its pid.
// Return -1 if this process has no children.
int
wait(uint64 addr)
{
  struct proc *np;
  int havekids, pid;
  struct proc *p = myproc();

  // hold p->lock for the whole time to avoid lost
  // wakeups from a child's exit().
  acquire(&p->lock);

  for(;;){
    // Scan through table looking for exited children.
    havekids = 0;
    for(np = proc; np < &proc[NPROC]; np++){
      // this code uses np->parent without holding np->lock.
      // acquiring the lock first would cause a deadlock,
      // since np might be an ancestor, and we already hold p->lock;
      // locks must follow the parent-before-child ordering rule.
      if(np->parent == p){
        // np->parent can't change between the check and the acquire()
        // because only the parent changes it, and we're the parent.
        acquire(&np->lock);
        havekids = 1;
        // a zombie child's memory must be freed
        if(np->state == ZOMBIE){
          // Found one.
          pid = np->pid;
          // copy the child's exit status to address addr
          if(addr != 0 && copyout(p->pagetable, addr, (char *)&np->xstate,
                                  sizeof(np->xstate)) < 0) {
            release(&np->lock);
            release(&p->lock);
            return -1;
          }
          // free the child's memory
          freeproc(np);
          release(&np->lock);
          release(&p->lock);
          // return the child's pid
          return pid;
        }
        release(&np->lock);
      }
    }

    // No point waiting if we don't have any children.
    if(!havekids || p->killed){
      release(&p->lock);
      return -1;
    }

    // Wait for a child to exit; exit() wakes us up.
    sleep(p, &p->lock); //DOC: wait-sleep
  }
}
// Exit the current process.  Does not return.
// An exited process remains in the zombie state
// until its parent calls wait().
void
exit(int status)
{
  struct proc *p = myproc();

  if(p == initproc)
    panic("init exiting");

  // close all open files.
  for(int fd = 0; fd < NOFILE; fd++){
    if(p->ofile[fd]){
      struct file *f = p->ofile[fd];
      fileclose(f);
      p->ofile[fd] = 0;
    }
  }

  // drop the reference to the in-memory inode of the current directory
  begin_op();
  iput(p->cwd);
  end_op();
  p->cwd = 0;

  // we might re-parent a child to init. we can't be precise about
  // waking up init, since we can't acquire its lock once we've
  // acquired any other proc lock. so wake up init whether that's
  // necessary or not. init may miss this wakeup, but that seems
  // harmless.
  acquire(&initproc->lock);
  wakeup1(initproc);
  release(&initproc->lock);

  // grab a copy of p->parent, to ensure that we unlock the same
  // parent we locked. in case our parent gives us away to init while
  // we're waiting for the parent lock. we may then race with an
  // exiting parent, but the result will be a harmless spurious wakeup
  // to a dead or wrong process; proc structs are never re-allocated
  // as anything else.
  acquire(&p->lock);
  struct proc *original_parent = p->parent;
  release(&p->lock);

  // we need the parent's lock in order to wake it up from wait().
  // the parent-then-child rule says we have to lock it first.
  acquire(&original_parent->lock);
  acquire(&p->lock);

  // Give any children to init.
  reparent(p);

  // Parent might be sleeping in wait().
  wakeup1(original_parent);

  p->xstate = status;
  p->state = ZOMBIE;

  release(&original_parent->lock);

  // Jump into the scheduler, never to return.
  sched();
  panic("zombie exit");
}
// Pass p's abandoned children to init.
// Caller must hold p->lock.
void
reparent(struct proc *p)
{
  struct proc *pp;

  for(pp = proc; pp < &proc[NPROC]; pp++){
    // this code uses pp->parent without holding pp->lock.
    // acquiring the lock first could cause a deadlock
    // if pp or a child of pp were also in exit()
    // and about to try to lock p.
    if(pp->parent == p){
      // pp->parent can't change between the check and the acquire()
      // because only the parent changes it, and we're the parent.
      acquire(&pp->lock);
      pp->parent = initproc;
      // we should wake up init here, but that would require
      // initproc->lock, which would be a deadlock, since we hold
      // the lock on one of init's children (pp). this is why
      // exit() always wakes init (before acquiring any locks).
      release(&pp->lock);
    }
  }
}
// Kill the process with the given pid.
// The victim won't exit until it tries to return
// to user space (see usertrap() in trap.c).
int
kill(int pid)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    if(p->pid == pid){
      p->killed = 1;
      if(p->state == SLEEPING){
        // Wake process from sleep().
        p->state = RUNNABLE;
      }
      release(&p->lock);
      return 0;
    }
    release(&p->lock);
  }
  return -1;
}
// Disk layout:
// [ boot block | super block | log | inode blocks |
//                                  free bit map | data blocks ]
//
// mkfs computes the super block and builds an initial file system. The
// super block describes the disk layout:
struct superblock {
  uint magic;      // Must be FSMAGIC
  uint size;       // Size of file system image (blocks)
  uint nblocks;    // Number of data blocks
  uint ninodes;    // Number of inodes
  uint nlog;       // Number of log blocks
  uint logstart;   // Block number of first log block
  uint inodestart; // Block number of first inode block
  uint bmapstart;  // Block number of first free map block
};
struct buf {
  int valid;   // has data been read from disk?
  int disk;    // does disk "own" buf? (contents handed to the disk hardware)
  uint dev;    // device number
  uint blockno;           // block number
  struct sleeplock lock;  // serializes processes' access to this buf
  uint refcnt;            // reference count
  struct buf *prev;       // LRU cache list
  struct buf *next;       // LRU cache list
  uchar data[BSIZE];
};
struct {
  struct spinlock lock;
  struct buf buf[NBUF];

  // Linked list of all buffers, through prev/next.
  // Sorted by how recently the buffer was used.
  // head.next is most recent, head.prev is least.
  // i.e. bcache's replacement policy is LRU
  struct buf head;
} bcache;
// Look through buffer cache for block on device dev.
// If not found, allocate a buffer.
// In either case, return locked buffer.
static struct buf*
bget(uint dev, uint blockno)
{
  struct buf *b;

  acquire(&bcache.lock);

  // Is the block already cached?
  // scan from the front of the list (most recently used first)
  for(b = bcache.head.next; b != &bcache.head; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Not cached.
  // Recycle the least recently used (LRU) unused buffer:
  // scan from the back of the list (the tail is least recently used)
  // until an unused buf is found.
  for(b = bcache.head.prev; b != &bcache.head; b = b->prev){
    if(b->refcnt == 0) {
      b->dev = dev;
      b->blockno = blockno;
      // b->valid = 0 ensures that bread reads the block's data from
      // disk rather than mistakenly using the buffer's previous contents
      b->valid = 0;
      b->refcnt = 1;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }
  panic("bget: no buffers");
}
// Release a locked buffer.
// Move to the head of the most-recently-used list.
void
brelse(struct buf *b)
{
  if(!holdingsleep(&b->lock))
    panic("brelse");

  // done with the buffer, so release its sleep-lock
  releasesleep(&b->lock);

  acquire(&bcache.lock);
  b->refcnt--;
  // if no process is using the buffer, move it to the head of the LRU
  // list (most recently used, so likely to be wanted again soon)
  if (b->refcnt == 0) {
    // no one is waiting for it.
    b->next->prev = b->prev;
    b->prev->next = b->next;
    b->next = bcache.head.next;
    b->prev = &bcache.head;
    bcache.head.next->prev = b;
    bcache.head.next = b;
  }
  release(&bcache.lock);
}
#define MAXOPBLOCKS  10  // max # of blocks any FS op writes
#define LOGSIZE      (MAXOPBLOCKS*3)  // max data blocks in on-disk log
// Contents of the header block, used for both the on-disk header block
// and to keep track in memory of logged block# before commit.
struct logheader {
  int n;
  int block[LOGSIZE];
};

struct log {
  struct spinlock lock;
  int start;
  int size;
  int outstanding; // how many FS sys calls are executing. Each active call
                   // reserves up to MAXOPBLOCKS log blocks; the number of
                   // blocks actually written only becomes known at commit.
  int committing;  // in commit(), please wait.
  int dev;
  struct logheader lh;
};
struct log log;
// Caller has modified b->data and is done with the buffer.
// Record the block number and pin in the cache by increasing refcnt.
// commit()/write_log() will do the disk write.
//
// log_write() replaces bwrite(); a typical use is:
//   bp = bread(...)
//   modify bp->data[]
//   log_write(bp)
//   brelse(bp)
//
// It records the block number in memory, reserving a slot in the
// on-disk log, and increments buffer.refcnt so the buffer cannot be
// reused. Until the commit, the block must stay in the cache: the
// cached copy is the only record of the modification, and it may only
// be written to its home location after the commit. Meanwhile the
// modification must stay visible to other reads.
void
log_write(struct buf *b)
{
  int i;

  // the transaction is writing too much data
  if (log.lh.n >= LOGSIZE || log.lh.n >= log.size - 1)
    panic("too big a transaction");
  // there must be at least one active FS call
  if (log.outstanding < 1)
    panic("log_write outside of trans");

  acquire(&log.lock);
  for (i = 0; i < log.lh.n; i++) {
    // when a block is written several times within one transaction, it
    // keeps the same slot in the log. this optimization is commonly
    // called absorption.
    if (log.lh.block[i] == b->blockno)   // log absorption
      break;
  }
  log.lh.block[i] = b->blockno;
  // the log gained a new block
  if (i == log.lh.n) {  // Add new block to log?
    // pin the buffer (refcnt++) so it cannot be reused: until the
    // commit, the cached copy is the only record of the modification,
    // and only after the commit may it be written to its home location
    bpin(b);
    log.lh.n++;
  }
  release(&log.lock);
}
// called at the end of each FS system call.
// commits if this was the last outstanding operation.
void
end_op(void)
{
  int do_commit = 0;

  acquire(&log.lock);
  // this FS call is finished
  log.outstanding -= 1;
  // a commit cannot be in progress at this point
  if(log.committing)
    panic("log.committing");
  // last outstanding FS call: the transaction may commit
  if(log.outstanding == 0){
    do_commit = 1;
    log.committing = 1;
  } else {
    // begin_op() may be waiting for log space,
    // and decrementing log.outstanding has decreased
    // the amount of reserved space.
    // the transaction cannot commit yet; wake any calls sleeping in
    // begin_op() because the reserved log space was insufficient
    wakeup(&log);
  }
  release(&log.lock);

  if(do_commit){
    // call commit w/o holding locks, since not allowed
    // to sleep with locks.
    // commit the transaction, making the whole FS call atomic
    commit();
    // reset log.committing to 0
    acquire(&log.lock);
    log.committing = 0;
    wakeup(&log);
    release(&log.lock);
  }
}
The four steps of commit
static void
commit()
{
  if (log.lh.n > 0) {
    // write_log() copies each block modified in the transaction from
    // the buffer cache into its slot in the on-disk log.
    write_log();     // Write modified blocks from cache to log
    // write_head() writes the header block to disk: this is the commit
    // point. A crash after the log has been written causes the log to
    // be replayed on reboot.
    write_head();    // Write header to disk -- the real commit
    // install_trans() reads each block from the log and writes it to
    // its proper place in the file system.
    install_trans(); // Now install writes to home locations
    // finally, set the log block count to zero and write it to the log
    // header on disk. this must happen before the next transaction
    // starts, so that a crash cannot cause recovery to replay this
    // header together with the next transaction's log blocks.
    log.lh.n = 0;
    write_head();    // Erase the transaction from the log
  }
}
write_log
// Copy modified blocks from cache to log.
static void
write_log(void)
{
  int tail;

  for (tail = 0; tail < log.lh.n; tail++) {
    struct buf *to = bread(log.dev, log.start+tail+1); // log block
    struct buf *from = bread(log.dev, log.lh.block[tail]); // cache block
    memmove(to->data, from->data, BSIZE);
    bwrite(to);  // write the log
    brelse(from);
    brelse(to);
  }
}
// Write in-memory log header to disk.
// This is the true point at which the
// current transaction commits.
static void
write_head(void)
{
  struct buf *buf = bread(log.dev, log.start);
  struct logheader *hb = (struct logheader *) (buf->data);
  int i;
  hb->n = log.lh.n;
  for (i = 0; i < log.lh.n; i++) {
    hb->block[i] = log.lh.block[i];
  }
  bwrite(buf);
  brelse(buf);
}
install_trans
// Copy committed blocks from log to their home location
static void
install_trans(void)
{
  int tail;

  for (tail = 0; tail < log.lh.n; tail++) {
    struct buf *lbuf = bread(log.dev, log.start+tail+1); // read log block
    struct buf *dbuf = bread(log.dev, log.lh.block[tail]); // read destination
    memmove(dbuf->data, lbuf->data, BSIZE);  // copy block to destination
    bwrite(dbuf);   // write destination block to disk
    bunpin(dbuf);
    brelse(lbuf);
    brelse(dbuf);
  }
}
// File system implementation.  Five layers:
//   + Blocks: allocator for raw disk blocks.
//   + Log: crash recovery for multi-step updates.
//   + Files: inode allocator, reading, writing, metadata.
//   + Directories: inode with special contents (list of other inodes!)
//   + Names: paths like /usr/rtm/xv6/fs.c for convenient naming.
//
// This file contains the low-level file system manipulation
// routines.  The (higher-level) system call implementations
// are in sysfile.c.
// there should be one superblock per disk device, but we run with
// only one device
struct superblock sb;

// Read the super block.
static void
readsb(int dev, struct superblock *sb)
{
  struct buf *bp;

  // block 1 holds the superblock
  bp = bread(dev, 1);
  memmove(sb, bp->data, sizeof(*sb));
  brelse(bp);
}
// Bitmap bits per block
#define BPB           (BSIZE*8)

// Block of free map containing bit for block b
#define BBLOCK(b, sb) ((b)/BPB + sb.bmapstart)
// Allocate a zeroed disk block.
static uint
balloc(uint dev)
{
  int b, bi, m;
  struct buf *bp;

  bp = 0;
  // the outer loop walks over every bitmap block
  for(b = 0; b < sb.size; b += BPB){
    bp = bread(dev, BBLOCK(b, sb));
    // each bit in the bitmap records the state of one block
    for(bi = 0; bi < BPB && b + bi < sb.size; bi++){
      m = 1 << (bi % 8);
      // a 0 bit in the bitmap means the block is free
      if((bp->data[bi/8] & m) == 0){  // Is block free?
        bp->data[bi/8] |= m;  // Mark block in use.
        log_write(bp);
        brelse(bp);
        bzero(dev, b + bi);
        return b + bi;
      }
    }
    brelse(bp);
  }
  panic("balloc: out of blocks");
}
bfree
// Free a disk block.
static void
bfree(int dev, uint b)
{
  struct buf *bp;
  int bi, m;

  bp = bread(dev, BBLOCK(b, sb));
  bi = b % BPB;
  m = 1 << (bi % 8);
  if((bp->data[bi/8] & m) == 0)
    panic("freeing free block");
  bp->data[bi/8] &= ~m;
  log_write(bp);
  brelse(bp);
}
bzero
// Zero a block.
static void
bzero(int dev, int bno)
{
  struct buf *bp;

  bp = bread(dev, bno);
  memset(bp->data, 0, BSIZE);
  log_write(bp);
  brelse(bp);
}
// On-disk inode structure
struct dinode {
  short type;           // File type: free / directory / file / device
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  // number of directory entries referring to this dinode; when nlink is
  // 0 and the in-memory reference count is also 0, the on-disk dinode
  // and its data blocks are freed
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  // addrs records the block numbers of the disk blocks holding the
  // file's content
  uint addrs[NDIRECT+1];   // Data block addresses
};
// in-memory copy of an inode
struct inode {
  uint dev;           // Device number
  uint inum;          // Inode number
  // number of pointers to this inode; when ref drops to zero, the
  // icache may recycle the entry
  int ref;            // Reference count
  // protects everything below here, i.e. gives exclusive access to the
  // inode's other fields (such as the file length) and to the inode's
  // file or directory content blocks
  struct sleeplock lock;
  int valid;          // inode has been read from disk?

  short type;         // copy of disk inode
  short major;
  short minor;
  short nlink;
  uint size;
  uint addrs[NDIRECT+1];
};
// Allocate an inode on device dev.
// Mark it as allocated by giving it type type.
// Returns an unlocked but allocated and referenced inode.
// Allocates a dinode on the device (disk) and marks it with the given type.
struct inode*
ialloc(uint dev, short type)
{
  int inum;
  struct buf *bp;
  struct dinode *dip;

  for(inum = 1; inum < sb.ninodes; inum++){
    // only one process at a time can hold a reference to bp, so ialloc
    // is correct: no other process can simultaneously see the inode as
    // available and claim it.
    // the first iteration needs a disk read; later ones hit the buffer cache
    bp = bread(dev, IBLOCK(inum, sb));
    // the dinode itself
    dip = (struct dinode*)bp->data + inum%IPB;
    // a free dinode: allocate and use it
    if(dip->type == 0){  // a free inode
      memset(dip, 0, sizeof(*dip));
      dip->type = type;
      log_write(bp);   // mark it allocated on the disk
      brelse(bp);
      return iget(dev, inum);
    }
    brelse(bp);
  }
  panic("ialloc: no inodes");
}
// Drop a reference to an in-memory inode.
// If that was the last reference, the inode cache entry can
// be recycled.
// If that was the last reference and the inode has no links
// to it, free the inode (and its content) on disk.
// All calls to iput() must be inside a transaction in
// case it has to free the inode.
void
iput(struct inode *ip)
{
  acquire(&icache.lock);

  // inode has no links and no other references: truncate and free.
  if(ip->ref == 1 && ip->valid && ip->nlink == 0){
    // ip->ref == 1 means no other process can have ip locked,
    // so this acquiresleep() won't block (or deadlock).
    acquiresleep(&ip->lock);

    release(&icache.lock);

    // itrunc truncates the file to zero bytes, freeing the data blocks
    itrunc(ip);
    // set the inode type to 0 (unallocated) and write it back to disk
    ip->type = 0;
    iupdate(ip);
    ip->valid = 0;

    releasesleep(&ip->lock);

    acquire(&icache.lock);
  }

  ip->ref--;
  release(&icache.lock);
}
// Copy a modified in-memory inode to disk.
// Must be called after every change to an ip->xxx field
// that lives on disk, since i-node cache is write-through.
// Caller must hold ip->lock.
// (The inode cache is write-through: after modifying an inode, it must
// be written back to the on-disk dinode.)
void
iupdate(struct inode *ip)
{
  struct buf *bp;
  struct dinode *dip;
  // ......
// Inode content
//
// The content (data) associated with each inode is stored
// in blocks on the disk. The first NDIRECT block numbers
// are listed in ip->addrs[].  The next NINDIRECT blocks are
// listed in block ip->addrs[NDIRECT].
// Return the disk block address of the nth block in inode ip.
// If there is no such block, bmap allocates one.
static uint
bmap(struct inode *ip, uint bn)
{
  uint addr, *a;
  struct buf *bp;
  // ......
// Read data from inode.
// Caller must hold ip->lock.
// If user_dst==1, then dst is a user virtual address;
// otherwise, dst is a kernel address.
int
readi(struct inode *ip, int user_dst, uint64 dst, uint off, uint n)
{
  uint tot, m;
  struct buf *bp;

  // the offset must not lie beyond the file or wrap around
  if(off > ip->size || off + n < off)
    return 0;
  // do not read past the end of the file
  if(off + n > ip->size)
    n = ip->size - off;
  // ......
// Write data to inode.
// Caller must hold ip->lock.
// If user_src==1, then src is a user virtual address;
// otherwise, src is a kernel address.
int
writei(struct inode *ip, int user_src, uint64 src, uint off, uint n)
{
  uint tot, m;
  struct buf *bp;

  // the offset must not lie beyond the file or wrap around
  if(off > ip->size || off + n < off)
    return -1;
  // the write must not grow the file past the maximum file size
  if(off + n > MAXFILE*BSIZE)
    return -1;
  // ......

  if(n > 0){
    if(off > ip->size)
      ip->size = off;
    // write the i-node back to disk even if the size didn't change
    // because the loop above might have called bmap() and added a new
    // block to ip->addrs[].
    iupdate(ip);
  }
  return n;
}
struct stat {
  int dev;     // File system's disk device
  uint ino;    // Inode number
  short type;  // Type of file
  short nlink; // Number of links to file
  uint64 size; // Size of file in bytes
};
stati
// Copy stat information from inode.
// Caller must hold ip->lock.
void
stati(struct inode *ip, struct stat *st)
{
  st->dev = ip->dev;
  st->ino = ip->inum;
  st->type = ip->type;
  st->nlink = ip->nlink;
  st->size = ip->size;
}
// Look for a directory entry in a directory.
// If found, set *poff to byte offset of entry.
struct inode*
dirlookup(struct inode *dp, char *name, uint *poff)
{
  uint off, inum;
  struct dirent de;

  // the inode must be a directory
  if(dp->type != T_DIR)
    panic("dirlookup not DIR");
  // ......
dirlink writes a new directory entry with the given name and inode number into the directory dp. If the name already exists, dirlink returns an error. The main loop reads directory entries looking for an unused one. When it finds one, it breaks out of the loop early, with off set to the offset of the available entry; otherwise the loop ends with off set to dp->size. Either way, dirlink then adds the new entry to the directory at offset off.
// Write a new directory entry (name, inum) into the directory dp.
int
dirlink(struct inode *dp, char *name, uint inum)
{
  int off;
  struct dirent de;
  struct inode *ip;

  // Check that name is not present.
  if((ip = dirlookup(dp, name, 0)) != 0){
    iput(ip);
    return -1;
  }

  // Look for an empty dirent.
  for(off = 0; off < dp->size; off += sizeof(de)){
    if(readi(dp, 0, (uint64)&de, off, sizeof(de)) != sizeof(de))
      panic("dirlink read");
    if(de.inum == 0)
      break;
  }
  // ......
namex first decides where path resolution begins. If the path starts with a slash, resolution begins at the root directory; otherwise it begins at the current directory. It then uses skipelem to walk the path one element at a time. Each iteration of the loop must look up name in the current inode ip. The iteration starts by locking ip and checking that it is a directory; if not, the lookup fails. (Locking ip is necessary not because ip->type might change, but because until ilock runs there is no guarantee that ip->type has been loaded from disk.) If the call is nameiparent and this is the last path element, the loop stops early, per the definition of nameiparent: the final path element has already been copied into name, so namex need only return the unlocked ip. Finally, the loop looks up the path element with dirlookup and prepares for the next iteration by setting ip = next. When the loop runs out of path elements, it returns ip.
// Look up and return the inode for a path name.
// If parent != 0, return the inode for the parent and copy the final
// path element into name, which must have room for DIRSIZ bytes.
// Must be called inside a transaction since it calls iput().
static struct inode*
namex(char *path, int nameiparent, char *name)
{
  struct inode *ip, *next;

  // the path starts with a slash: begin at the root directory
  if(*path == '/')
    ip = iget(ROOTDEV, ROOTINO);
  // otherwise, begin at the current directory
  else
    ip = idup(myproc()->cwd);
  // ......
// Copy the next path element from path into name.
// Return a pointer to the element following the copied one.
// The returned path has no leading slashes,
// so the caller can check *path=='\0' to see if the name is the last one.
// If no name to remove, return 0.
//
// Examples:
//   skipelem("a/bb/c", name) = "bb/c", setting name = "a"
//   skipelem("///a//bb", name) = "bb", setting name = "a"
//   skipelem("a", name) = "", setting name = "a"
//   skipelem("", name) = skipelem("////", name) = 0
//
static char*
skipelem(char *path, char *name)
{
  char *s;
  int len;
  // ......
// Write to file f.
// addr is a user virtual address.
int
filewrite(struct file *f, uint64 addr, int n)
{
  int r, ret = 0;

  // check the write permission on the file
  if(f->writable == 0)
    return -1;

  // dispatch on the file type
  if(f->type == FD_PIPE){
    ret = pipewrite(f->pipe, addr, n);
  } else if(f->type == FD_DEVICE){
    if(f->major < 0 || f->major >= NDEV || !devsw[f->major].write)
      return -1;
    ret = devsw[f->major].write(1, addr, n);
  } else if(f->type == FD_INODE){
    // write a few blocks at a time to avoid exceeding
    // the maximum log transaction size, including
    // i-node, indirect block, allocation blocks,
    // and 2 blocks of slop for non-aligned writes.
    // this really belongs lower down, since writei()
    // might be writing a device like the console.
    int max = ((MAXOPBLOCKS-1-1-2) / 2) * BSIZE;
    int i = 0;
    while(i < n){
      int n1 = n - i;
      if(n1 > max)
        n1 = max;

      begin_op();
      ilock(f->ip);
      if ((r = writei(f->ip, 1, addr + i, f->off, n1)) > 0)
        f->off += r;
      iunlock(f->ip);
      end_op();

      if(r < 0)
        break;
      if(r != n1)
        panic("short filewrite");
      i += r;
    }
    ret = (i == n ? n : -1);
  } else {
    panic("filewrite");
  }

  return ret;
}
// Look in the process table for an UNUSED proc.
// If found, initialize state required to run in the kernel,
// and return with p->lock held.
// If there are no free procs, or a memory allocation fails, return 0.
static struct proc*
allocproc(void)
{
  struct proc *p;
  // ......

  // An empty user page table.
  p->pagetable = proc_pagetable(p);
  if(p->pagetable == 0){
    freeproc(p);
    release(&p->lock);
    return 0;
  }

  // Set up new context to start executing at forkret,
  // which returns to user space.
  // when the kernel scheduler first picks this process, it starts
  // executing at forkret
  memset(&p->context, 0, sizeof(p->context));
  p->context.ra = (uint64)forkret;
  p->context.sp = p->kstack + PGSIZE;

  return p;
}
forkret
// A fork child's very first scheduling by scheduler()
// will swtch to forkret.
void
forkret(void)
{
  static int first = 1;

  // Still holding p->lock from scheduler.
  release(&myproc()->lock);

  if (first) {
    // File system initialization must be run in the context of a
    // regular process (e.g., because it calls sleep), and thus cannot
    // be run from main().
    first = 0;
    fsinit(ROOTDEV);
  }

  usertrapret();
}
xv6 uses the same basic on-disk layout of inodes and directories as early UNIX, and the scheme is still in use today; BSD's UFS/FFS and Linux's ext2/ext3 use essentially the same data structures. The least efficient part of the layout is the directory: every lookup requires a linear scan over all of its disk blocks. This is reasonable when a directory spans only a few blocks, but expensive for directories holding many files. Microsoft Windows's NTFS, macOS's HFS, and Solaris's ZFS implement a directory as an on-disk balanced tree of blocks, which is more complex but guarantees O(log n) directory lookups.
$ git clone git://g.csail.mit.edu/xv6-labs-2020
$ cd xv6-labs-2020
$ git checkout util
$ make qemu
1.2 sleep (easy)
Implement the UNIX program sleep for xv6; your sleep should pause for a user-specified number of ticks. A tick is a notion of time defined by the xv6 kernel, namely the time between two interrupts from the timer chip. Your solution should be in the file user/sleep.c.
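A minimal sketch of user/sleep.c, assuming only the standard xv6 user library (atoi, sleep, fprintf, and exit from user/user.h):

// user/sleep.c -- a minimal sketch; argument checking kept simple.
#include "kernel/types.h"
#include "user/user.h"

int
main(int argc, char *argv[])
{
  if(argc != 2){
    fprintf(2, "usage: sleep <ticks>\n");
    exit(1);
  }
  sleep(atoi(argv[1]));  // the sleep system call; a tick is one timer interrupt
  exit(0);
}

Remember to add the program (e.g. $U/_sleep) to UPROGS in the Makefile so make qemu builds it.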
Write a program that uses UNIX system calls to “ping-pong” a byte between two processes over a pair of pipes, one for each direction. The parent should send a byte to the child; the child should print “<pid>: received ping”, where <pid> is its process ID, write the byte on the pipe to the parent, and exit; the parent should read the byte from the child, print “<pid>: received pong”, and exit. Your solution should be in the file user/pingpong.c.
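A sketch under the same assumptions, using one pipe per direction:

// user/pingpong.c -- a minimal sketch.
#include "kernel/types.h"
#include "user/user.h"

int
main(void)
{
  int p2c[2], c2p[2];   // parent->child and child->parent pipes
  char buf[1];

  pipe(p2c);
  pipe(c2p);

  if(fork() == 0){
    read(p2c[0], buf, 1);                   // wait for the ping
    printf("%d: received ping\n", getpid());
    write(c2p[1], buf, 1);                  // send the pong back
    exit(0);
  }
  write(p2c[1], "x", 1);
  read(c2p[0], buf, 1);
  printf("%d: received pong\n", getpid());
  exit(0);
}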
Write a concurrent version of prime sieve using pipes. This idea is due to Doug McIlroy, inventor of Unix pipes. The picture halfway down this page and the surrounding text explain how to do it. Your solution should be in the file user/primes.c. (A sketch follows the expected output below.)
$ primes
prime 2
prime 3
prime 5
prime 7
prime 11
prime 13
prime 17
prime 19
prime 23
prime 29
prime 31
$
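One possible shape of user/primes.c, as a hedged sketch: each sieve stage reads its prime from the upstream pipe, prints it, forks the next stage, and forwards the survivors. Closing unused descriptors promptly matters, since xv6 has small per-process file-descriptor and process limits.

// user/primes.c -- a sketch of the recursive pipe-per-stage sieve.
#include "kernel/types.h"
#include "user/user.h"

void
sieve(int in)
{
  int prime, n;
  if(read(in, &prime, sizeof(prime)) != sizeof(prime))
    exit(0);                        // upstream closed: no more numbers
  printf("prime %d\n", prime);

  int p[2];
  pipe(p);
  if(fork() == 0){
    close(p[1]);
    close(in);
    sieve(p[0]);                    // next stage of the pipeline
  } else {
    close(p[0]);
    while(read(in, &n, sizeof(n)) == sizeof(n)){
      if(n % prime != 0)
        write(p[1], &n, sizeof(n)); // pass survivors downstream
    }
    close(in);
    close(p[1]);                    // EOF for the next stage
    wait(0);
    exit(0);
  }
}

int
main(void)
{
  int p[2];
  pipe(p);
  if(fork() == 0){
    close(p[1]);
    sieve(p[0]);
  } else {
    close(p[0]);
    for(int i = 2; i <= 35; i++)
      write(p[1], &i, sizeof(i));   // feed 2..35 into the first stage
    close(p[1]);
    wait(0);
  }
  exit(0);
}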
1.5 find (moderate)
Write a simple version of the UNIX find program: find all the files in a directory tree with a specific name. Your solution should be in the file user/find.c.
Write a simple version of the UNIX xargs program: read lines from the standard input and run a command for each line, supplying the line as arguments to the command. Your solution should be in the file user/xargs.c.
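A minimal sketch of user/xargs.c, assuming the whole input line becomes one extra argument (the full lab also splits lines into multiple tokens); MAXARG comes from kernel/param.h:

// user/xargs.c -- one exec per input line, appending the line as the
// last argument. A sketch with simplified line handling.
#include "kernel/types.h"
#include "kernel/param.h"
#include "user/user.h"

int
main(int argc, char *argv[])
{
  char line[512], *args[MAXARG];
  int i, n;

  if(argc < 2){
    fprintf(2, "usage: xargs command ...\n");
    exit(1);
  }
  for(i = 1; i < argc; i++)
    args[i-1] = argv[i];          // the command and its fixed arguments

  while(1){
    // read one line from stdin, byte by byte
    for(n = 0; n < sizeof(line)-1; n++){
      if(read(0, &line[n], 1) != 1 || line[n] == '\n')
        break;
    }
    if(n == 0)
      break;                      // EOF (or an empty line): stop
    line[n] = 0;
    args[argc-1] = line;          // the line becomes the final argument
    args[argc] = 0;
    if(fork() == 0){
      exec(args[0], args);
      fprintf(2, "xargs: exec %s failed\n", args[0]);
      exit(1);
    }
    wait(0);
  }
  exit(0);
}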
In this lab you will add some new system calls to xv6, which will help you understand how they work and will expose you to some of the internals of the xv6 kernel. You will add more system calls in later labs.
Add some new system calls to xv6 to deepen understanding of the xv6 kernel.
2.1 Tracing (moderate)
Environment setup (building the compiler toolchain and QEMU, cloning the repository) is skipped here.
In this assignment you will add a system call tracing feature that may help you when debugging later labs. You’ll create a new trace system call that will control tracing. It should take one argument, an integer “mask”, whose bits specify which system calls to trace. For example, to trace the fork system call, a program calls trace(1 << SYS_fork), where SYS_fork is a syscall number from kernel/syscall.h. You have to modify the xv6 kernel to print out a line when each system call is about to return, if the system call’s number is set in the mask. The line should contain the process id, the name of the system call and the return value; you don’t need to print the system call arguments. The trace system call should enable tracing for the process that calls it and any children that it subsequently forks, but should not affect other processes.
// kernel/proc.h
// Per-process state
struct proc {
  struct spinlock lock;

  // p->lock must be held when using these:
  enum procstate state;        // Process state
  struct proc *parent;         // Parent process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, have been killed
  int xstate;                  // Exit status to be returned to parent's wait
  int pid;                     // Process ID

  // these are private to the process, so p->lock need not be held.
  uint64 kstack;               // Virtual address of kernel stack
  uint64 sz;                   // Size of process memory (bytes)
  pagetable_t pagetable;       // User page table
  struct trapframe *trapframe; // data page for trampoline.S
  struct context context;      // swtch() here to run process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
  uint64 syscall_trace;        // Mask of system calls to trace (newly added field)
};
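With syscall_trace in place, the remaining pieces look roughly like the sketch below. sys_trace and the syscall_names[] table are additions the lab expects you to write yourself (the names and the exact print format here are assumptions); you also need a SYS_trace entry in the syscalls[] dispatch array, and fork() must copy syscall_trace to the child.

// kernel/sysproc.c -- a sketch: store the mask in the calling process.
uint64
sys_trace(void)
{
  int mask;
  if(argint(0, &mask) < 0)
    return -1;
  myproc()->syscall_trace = mask;
  return 0;
}

// kernel/syscall.c -- after the handler returns, print the trace line
// when the bit for this syscall number is set in the mask.
// (syscall_names[] is an assumed table mapping numbers to names.)
void
syscall(void)
{
  int num;
  struct proc *p = myproc();

  num = p->trapframe->a7;
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    p->trapframe->a0 = syscalls[num]();
    if((p->syscall_trace >> num) & 1)
      printf("%d: syscall %s -> %d\n",
             p->pid, syscall_names[num], (int)p->trapframe->a0);
  } else {
    printf("%d %s: unknown sys call %d\n", p->pid, p->name, num);
    p->trapframe->a0 = -1;
  }
}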
In this assignment you will add a system call, sysinfo, that collects information about the running system. The system call takes one argument: a pointer to a struct sysinfo (see kernel/sysinfo.h). The kernel should fill out the fields of this struct: the freemem field should be set to the number of bytes of free memory, and the nproc field should be set to the number of processes whose state is not UNUSED. We provide a test program sysinfotest; you pass this assignment if it prints “sysinfotest: OK”.
// kernel/kalloc.c
// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
void *
kalloc(void)
{
  struct run *r;
  // ......
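For the freemem field, a helper in kernel/kalloc.c can simply walk the free list; the name freebytes below is our own choice, not something xv6 defines. A similar loop over the proc[] table counting entries whose state is not UNUSED yields nproc.

// kernel/kalloc.c -- a sketch: count free memory by walking kmem.freelist.
uint64
freebytes(void)
{
  uint64 n = 0;
  struct run *r;

  acquire(&kmem.lock);
  for(r = kmem.freelist; r; r = r->next)
    n += PGSIZE;               // each free-list node is one free page
  release(&kmem.lock);
  return n;
}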
In this lab you will explore page tables and modify them to simplify the functions that copy data from user space to kernel space.
Explore page tables, and modify them to simplify copying data from user space to kernel space.
3.1 Print a page table (easy)
Define a function called vmprint(). It should take a pagetable_t argument, and print that pagetable in the format described below. Insert if(p->pid==1) vmprint(p->pagetable) in exec.c just before the return argc, to print the first process’s page table. You receive full credit for this assignment if you pass the pte printout test of make grade.
Add a kernel function that prints the given page table in the format below; it is used for debugging in the following two labs:
page table 0x0000000087f6e000
..0: pte 0x0000000021fda801 pa 0x0000000087f6a000
.. ..0: pte 0x0000000021fda401 pa 0x0000000087f69000
.. .. ..0: pte 0x0000000021fdac1f pa 0x0000000087f6b000
.. .. ..1: pte 0x0000000021fda00f pa 0x0000000087f68000
.. .. ..2: pte 0x0000000021fd9c1f pa 0x0000000087f67000
..255: pte 0x0000000021fdb401 pa 0x0000000087f6d000
.. ..511: pte 0x0000000021fdb001 pa 0x0000000087f6c000
.. .. ..510: pte 0x0000000021fdd807 pa 0x0000000087f76000
.. .. ..511: pte 0x0000000020001c0b pa 0x0000000080007000
RISC-V translates virtual addresses through a three-level page table: 9 bits of level-1 index locate the level-2 page table, 9 bits of level-2 index locate the level-3 page table, 9 bits of level-3 index locate the physical page, and the low 12 bits are the page offset (so a page is 4096 bytes).
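A sketch of vmprint() following that structure: recurse whenever a valid PTE has R, W and X all clear, since such a PTE points at a next-level page table rather than a leaf page. (The recursive helper and its name are our own; a declaration of vmprint in kernel/defs.h is also needed.)

// kernel/vm.c -- a sketch of vmprint().
static void
vmprint_rec(pagetable_t pagetable, int level)
{
  for(int i = 0; i < 512; i++){
    pte_t pte = pagetable[i];
    if(pte & PTE_V){
      printf("..");
      for(int j = 1; j < level; j++)
        printf(" ..");
      printf("%d: pte %p pa %p\n", i, pte, PTE2PA(pte));
      // R/W/X all clear: this PTE refers to a lower-level page table
      if((pte & (PTE_R|PTE_W|PTE_X)) == 0)
        vmprint_rec((pagetable_t)PTE2PA(pte), level + 1);
    }
  }
}

void
vmprint(pagetable_t pagetable)
{
  printf("page table %p\n", pagetable);
  vmprint_rec(pagetable, 1);
}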
Your first job is to modify the kernel so that every process uses its own copy of the kernel page table when executing in the kernel. Modify struct proc to maintain a kernel page table for each process, and modify the scheduler to switch kernel page tables when switching processes. For this step, each per-process kernel page table should be identical to the existing global kernel page table. You pass this part of the lab if usertests runs correctly.
// kernel/proc.h
// Per-process state
struct proc {
  struct spinlock lock;

  // p->lock must be held when using these:
  enum procstate state;        // Process state
  struct proc *parent;         // Parent process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, have been killed
  int xstate;                  // Exit status to be returned to parent's wait
  int pid;                     // Process ID

  // these are private to the process, so p->lock need not be held.
  uint64 kstack;               // Virtual address of kernel stack
  uint64 sz;                   // Size of process memory (bytes)
  pagetable_t pagetable;       // User page table
  struct trapframe *trapframe; // data page for trampoline.S
  struct context context;      // swtch() here to run process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
  pagetable_t kernelpgtbl;     // Per-process kernel page table (newly added field)
};
// allocate a physical page for the new process's kernel stack
char *pa = kalloc();
if(pa == 0)
  panic("kalloc");
uint64 va = KSTACK((int)0);
// map the kernel stack at the fixed virtual address
// printf("map krnlstack va: %p to pa: %p\n", va, pa);
kvmmap(p->kernelpgtbl, va, (uint64)pa, PGSIZE, PTE_R | PTE_W);
p->kstack = va; // record the kernel stack's virtual address; it is now
                // always the same, but keeping the field avoids having
                // to change other parts of the xv6 code
////// end of added code //////
// Set up new context to start executing at forkret,
// which returns to user space.
memset(&p->context, 0, sizeof(p->context));
p->context.ra = (uint64)forkret;
p->context.sp = p->kstack + PGSIZE;
// kernel/proc.c
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    // Avoid deadlock by ensuring that devices can interrupt.
    intr_on();

    int found = 0;
    for(p = proc; p < &proc[NPROC]; p++) {
      acquire(&p->lock);
      if(p->state == RUNNABLE) {
        // Switch to chosen process.  It is the process's job
        // to release its lock and then reacquire it
        // before jumping back to us.
        p->state = RUNNING;
        c->proc = p;

        // lab addition: switch to the process's own kernel page table
        w_satp(MAKE_SATP(p->kernelpgtbl));
        sfence_vma(); // flush the TLB

        swtch(&c->context, &p->context);

        // lab addition: back in the scheduler, with no process running,
        // switch back to the global kernel page table
        kvminithart();
        // ......
Replace the body of copyin in kernel/vm.c with a call to copyin_new (defined in kernel/vmcopyin.c); do the same for copyinstr and copyinstr_new. Add mappings for user addresses to each process’s kernel page table so that copyin_new and copyinstr_new work. You pass this assignment if usertests runs correctly and all the make grade tests pass.
The previous exercise already gave every process its own kernel page table. The goal of this exercise is to maintain, inside each process's kernel page table, a copy of that process's user-space mappings, so that the kernel can dereference user pointers (virtual addresses) directly. ==The advantage over the original copyin is that copyin walked the page table in software to obtain the physical address, whereas with the mapping copy kept in the kernel page table, the CPU's hardware address translation does the work, which is more efficient and benefits from TLB acceleration.==
  // PGROUNDUP: prevent re-mapping already mapped pages (eg. when doing growproc)
  for(i = PGROUNDUP(start); i < start + sz; i += PGSIZE){
    if((pte = walk(src, i, 0)) == 0)
      panic("kvmcopymappings: pte should exist");
    if((*pte & PTE_V) == 0)
      panic("kvmcopymappings: page not present");
    pa = PTE2PA(*pte);
    // `& ~PTE_U` clears the user bit, marking the page as a kernel page.
    // this is required: in RISC-V, the kernel cannot directly access
    // pages that have PTE_U set.
    flags = PTE_FLAGS(*pte) & ~PTE_U;
    if(mappages(dst, i, PGSIZE, pa, flags) != 0){
      goto err;
    }
  }

  return 0;

 err:
  // thanks @hdrkna for pointing out a mistake here.
  // original code incorrectly starts unmapping from 0 instead of PGROUNDUP(start)
  uvmunmap(dst, PGROUNDUP(start), (i - PGROUNDUP(start)) / PGSIZE, 0);
  return -1;
}
void
kvminit()
{
  kernel_pagetable = kvminit_newpgtbl();
  // CLINT *is* however required during kernel boot up and
  // we should map it for the global kernel pagetable
  kvmmap(kernel_pagetable, CLINT, CLINT, 0x10000, PTE_R | PTE_W);
}
// allocate one user page and copy init's instructions
// and data into it.
uvminit(p->pagetable, initcode, sizeof(initcode));
p->sz = PGSIZE;
kvmcopymappings(p->pagetable, p->kernelpgtbl, 0, p->sz); // mirror the program's
                                                         // memory mappings into the
                                                         // process's kernel page table
// ......
}
This lab explores how system calls are implemented using traps. You will first do a warm-up exercises with stacks and then you will implement an example of user-level trap handling.
It will be important to understand a bit of RISC-V assembly, which you were exposed to in 6.004. There is a file user/call.c in your xv6 repo. make fs.img compiles it and also produces a readable assembly version of the program in user/call.asm.
Read the code in call.asm for the functions g, f, and main. The instruction manual for RISC-V is on the reference page. Here are some questions that you should answer (store the answers in a file answers-traps.txt)
Q: Which registers contain arguments to functions? For example, which register holds 13 in main's call to printf? A: a0-a7; a2;
Q: Where is the call to function f in the assembly code for main? Where is the call to g? (Hint: the compiler may inline functions.) A: There is none. g(x) is inlined within f(x) and f(x) is further inlined into main()
Q: At what address is the function printf located? A: 0x0000000000000628, main calls it with pc-relative addressing.
Q: What value is in the register ra just after the jalr to printf in main? A: 0x0000000000000038, next line of assembly right after the jalr
Q: Run the following code.
unsigned int i = 0x00646c72;
printf("H%x Wo%s", 57616, &i);
What is the output? If the RISC-V were instead big-endian what would you set i to in order to yield the same output? Would you need to change 57616 to a different value? A: "He110 World"; i would need to be 0x726c6400; no, 57616 prints as e110 in hex regardless of endianness.
Q: In the following code, what is going to be printed after 'y='? (note: the answer is not a specific value.) Why does this happen?
printf("x=%d y=%d", 3);
A: A value that depends on whatever happens to be in the registers right before the call, because printf tries to read more arguments than were supplied. The second argument, 3, is passed in a1, and the register that would hold the third argument, a2, is never set before the call, so it contains leftover garbage.
For debugging it is often useful to have a backtrace: a list of the function calls on the stack above the point at which the error occurred.
Implement a backtrace() function in kernel/printf.c. Insert a call to this function in sys_sleep, and then run bttest, which calls sys_sleep. Your output should be as follows:
After bttest, exit qemu. In your terminal, run addr2line -e kernel/kernel (or riscv64-unknown-elf-addr2line -e kernel/kernel) and cut-and-paste the addresses from the backtrace; the addresses may differ slightly, but addr2line maps each one to the source file and line where the corresponding call was made.
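A sketch of the function, relying on the RISC-V calling convention that the saved return address sits at fp-8 and the saved previous frame pointer at fp-16; r_fp() is a helper the lab has you add to kernel/riscv.h:

// kernel/riscv.h -- added helper: read the current frame pointer (s0).
static inline uint64
r_fp()
{
  uint64 x;
  asm volatile("mv %0, s0" : "=r" (x) );
  return x;
}

// kernel/printf.c -- a sketch of backtrace(). The kernel stack occupies
// a single page, so the walk stops when fp crosses the page boundary.
void
backtrace(void)
{
  uint64 fp = r_fp();
  uint64 top = PGROUNDUP(fp);

  printf("backtrace:\n");
  while(fp < top){
    printf("%p\n", *(uint64*)(fp - 8));  // return address of this frame
    fp = *(uint64*)(fp - 16);            // saved frame pointer of the caller
  }
}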
In this exercise you’ll add a feature to xv6 that periodically alerts a process as it uses CPU time. This might be useful for compute-bound processes that want to limit how much CPU time they chew up, or for processes that want to compute but also want to take some periodic action. More generally, you’ll be implementing a primitive form of user-level interrupt/fault handlers; you could use something similar to handle page faults in the application, for example. Your solution is correct if it passes alarmtest and usertests.
struct proc {
  // ......
  int alarm_interval;                // Alarm interval (0 for disabled)
  void (*alarm_handler)();           // Alarm handler
  int alarm_ticks;                   // How many ticks left before the next alarm goes off
  struct trapframe *alarm_trapframe; // A copy of the trapframe taken right before running alarm_handler
  int alarm_goingoff;                // Is an alarm currently going off and hasn't yet returned? (prevents re-entrance of alarm_handler)
};
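With those fields, the heart of the feature is a few lines in usertrap()'s timer-interrupt branch. A sketch (field names as above; in xv6, which_dev == 2 indicates a timer interrupt):

// kernel/trap.c, inside usertrap() -- a sketch of the tick handling.
if(which_dev == 2 && p->alarm_interval > 0){
  if(--p->alarm_ticks <= 0 && !p->alarm_goingoff){
    p->alarm_ticks = p->alarm_interval;           // re-arm for the next period
    p->alarm_goingoff = 1;                        // block re-entrance
    *p->alarm_trapframe = *p->trapframe;          // snapshot the user registers
    p->trapframe->epc = (uint64)p->alarm_handler; // divert the user PC to the handler
  }
}

sys_sigreturn() undoes the diversion: it copies alarm_trapframe back into trapframe and clears alarm_goingoff, so the interrupted code resumes with all registers intact.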
One of the many neat tricks an O/S can play with page table hardware is lazy allocation of user-space heap memory. Xv6 applications ask the kernel for heap memory using the sbrk() system call. In the kernel we’ve given you, sbrk() allocates physical memory and maps it into the process’s virtual address space. It can take a long time for a kernel to allocate and map memory for a large request. Consider, for example, that a gigabyte consists of 262,144 4096-byte pages; that’s a huge number of allocations even if each is individually cheap. In addition, some programs allocate more memory than they actually use (e.g., to implement sparse arrays), or allocate memory well in advance of use. To allow sbrk() to complete more quickly in these cases, sophisticated kernels allocate user memory lazily. That is, sbrk() doesn’t allocate physical memory, but just remembers which user addresses are allocated and marks those addresses as invalid in the user page table. When the process first tries to use any given page of lazily-allocated memory, the CPU generates a page fault, which the kernel handles by allocating physical memory, zeroing it, and mapping it. You’ll add this lazy allocation feature to xv6 in this lab.
// touch a lazy-allocated page so it's mapped to an actual physical page.
void
uvmlazytouch(uint64 va)
{
  struct proc *p = myproc();
  char *mem = kalloc();
  if(mem == 0) {
    // failed to allocate physical memory
    printf("lazy alloc: out of memory\n");
    p->killed = 1;
  } else {
    memset(mem, 0, PGSIZE);
    if(mappages(p->pagetable, PGROUNDDOWN(va), PGSIZE, (uint64)mem, PTE_W|PTE_X|PTE_R|PTE_U) != 0){
      printf("lazy alloc: failed to map page\n");
      kfree(mem);
      p->killed = 1;
    }
  }
  // printf("lazy alloc: %p, p->sz: %p\n", PGROUNDDOWN(va), p->sz);
}
// whether a page was lazily allocated and needs to be touched before use.
int
uvmshouldtouch(uint64 va)
{
  pte_t *pte;
  struct proc *p = myproc();

  return va < p->sz                      // within the process's memory size
    && PGROUNDDOWN(va) != r_sp()         // not the stack guard page (it shouldn't be mapped)
    && (((pte = walk(p->pagetable, va, 0)) == 0)
        || ((*pte & PTE_V) == 0));       // page table entry does not exist
}
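The two helpers then plug into usertrap() roughly as follows (a sketch; on RISC-V, scause 13 is a load page fault and 15 a store page fault):

// kernel/trap.c, inside usertrap() -- a sketch of the page-fault branch.
uint64 va = r_stval();   // the faulting virtual address
if((r_scause() == 13 || r_scause() == 15) && uvmshouldtouch(va)){
  uvmlazytouch(va);      // allocate, zero and map the page, then return to retry
} else {
  // ... existing paths: system call (scause 8), device interrupt,
  // or unexpected scause (kill the process) ...
}

sys_sbrk() correspondingly stops calling growproc() and only adjusts p->sz; functions that walk user memory directly (uvmunmap, copyin/copyout paths, etc.) must also tolerate not-yet-mapped pages.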
COW fork() creates just a pagetable for the child, with PTEs for user memory pointing to the parent’s physical pages. COW fork() marks all the user PTEs in both parent and child as not writable. When either process tries to write one of these COW pages, the CPU will force a page fault. The kernel page-fault handler detects this case, allocates a page of physical memory for the faulting process, copies the original page into the new page, and modifies the relevant PTE in the faulting process to refer to the new page, this time with the PTE marked writeable. When the page fault handler returns, the user process will be able to write its copy of the page.
COW fork() makes freeing of the physical pages that implement user memory a little trickier. A given physical page may be referred to by multiple processes’ page tables, and should be freed only when the last reference disappears.
struct spinlock pgreflock; // guards pageref[], preventing race conditions that could leak memory
int pageref[PGREF_MAX_ENTRIES]; // reference count for every physical page between KERNBASE and PHYSTOP

// note: reference counts are incremented on fork, not on mapping. this
// means that multiple mappings of the same physical page within a single
// process are only counted as one reference. this shouldn't be a problem,
// though, as there's no way for a user program to map a physical page
// twice within its address space in xv6.

// look up the reference count by physical address
#define PA2PGREF(p) pageref[PA2PGREF_ID((uint64)(p))]
  acquire(&kmem.lock);
  r = kmem.freelist;
  if(r)
    kmem.freelist = r->next;
  release(&kmem.lock);

  if(r){
    memset((char*)r, 5, PGSIZE); // fill with junk
    // a newly allocated physical page starts with a reference count of 1
    // (no lock needed here: no other CPU can hold a reference yet)
    PA2PGREF(r) = 1;
  }
  return (void*)r;
}
// Decrease reference to the page by one if it's more than one, then
// allocate a new physical page and copy the page into it.
// (Effectively turning one reference into one copy.)
//
// Do nothing and simply return pa when reference count is already
// less than or equal to 1.
void *
kcopy_n_deref(void *pa)
{
  acquire(&pgreflock);

  // this check is essential: if, say, the child has already called exec
  // and released its pages, a later write by the parent still triggers
  // COW, but since no one else references pa, the parent can simply
  // reuse the page in place and fix the PTE permissions, with no new
  // allocation and no copy
  if(PA2PGREF(pa) <= 1) { // only one reference: no copy needed
    release(&pgreflock);
    return pa;
  }

  // allocate a new page and copy the old page's data into it
  uint64 newpa = (uint64)kalloc();
  if(newpa == 0) {
    release(&pgreflock);
    return 0; // out of memory
  }
  memmove((void*)newpa, (void*)pa, PGSIZE);

  // drop one reference to the old page
  PA2PGREF(pa)--;

  release(&pgreflock);
  return (void*)newpa;
}
This lab will familiarize you with multithreading. You will implement switching between threads in a user-level threads package, use multiple threads to speed up a program, and implement a barrier.
Implement a user-level thread library; try using threads to speed up a program; and implement a synchronization barrier.
7.1 Uthread: switching between threads (moderate)
Complete uthread.c to implement user-level threads.
Compared with threads in a modern operating system, the "threads" here are closer to the coroutines found in some languages. They are implemented entirely in user space, all of them run on a single CPU, and there is no timer interrupt to force scheduling: each thread function must voluntarily yield the CPU at suitable points. A thread implemented this way is not transparent to the thread function, which makes it closer to a coroutine than to an OS thread.
The added code sets ra in the saved context to the address of the thread function, so that the first time the thread is scheduled, the ret at the end of thread_switch jumps to the thread function and starts executing it. Setting sp gives the thread its own private stack, i.e. an independent flow of execution. A sketch follows.
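A sketch of thread_create() under that design, assuming a struct context with ra, sp and the callee-saved registers has been added to struct thread (mirroring kernel/proc.h), and that thread_switch saves and restores exactly those registers:

// user/uthread.c -- a sketch; all_thread, MAX_THREAD, STACK_SIZE and the
// FREE/RUNNABLE states come from the lab skeleton, struct context is the
// assumed addition.
void
thread_create(void (*func)())
{
  struct thread *t;

  for (t = all_thread; t < all_thread + MAX_THREAD; t++) {
    if (t->state == FREE) break;
  }
  t->state = RUNNABLE;

  // the first switch to this thread restores these registers and `ret`s:
  memset(&t->context, 0, sizeof(t->context));
  t->context.ra = (uint64)func;                  // start executing at func
  t->context.sp = (uint64)t->stack + STACK_SIZE; // private stack, grows downward
}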
7.2 Using threads (moderate)
Analyze and fix data loss caused by a race condition in a sample hash-table workload.
Why are there missing keys with 2 threads, but not with 1 thread? Identify a sequence of events with 2 threads that can lead to a key being missing. Submit your sequence with a short explanation in answers-thread.txt
The goal of this lab is to give each CPU its own freelist, so that CPUs allocating physical pages concurrently no longer contend with each other, improving parallelism.
However, when one CPU's freelist runs out of free pages, it still needs to "steal" pages from another CPU's freelist, so a CPU's freelist is not touched only by its own CPU; it may also be accessed by other CPUs during stealing. Each per-CPU freelist therefore still needs its own lock. But running out of pages locally is relatively rare, so overall performance still beats the single big kmem lock; in the best case, with no cross-CPU stealing, these per-CPU locks see no contention at all. A sketch follows.
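A sketch of the allocation side: kmem becomes an array with one freelist and lock per CPU (kinit() must initialize all NCPU locks, and kfree() correspondingly returns a page to the current CPU's list):

// kernel/kalloc.c -- per-CPU freelists with stealing, a sketch.
struct {
  struct spinlock lock;
  struct run *freelist;
} kmem[NCPU];

void *
kalloc(void)
{
  struct run *r;

  push_off();               // cpuid() is only stable with interrupts off
  int id = cpuid();
  pop_off();

  acquire(&kmem[id].lock);
  r = kmem[id].freelist;
  if(r)
    kmem[id].freelist = r->next;
  release(&kmem[id].lock);

  if(!r){
    // out of pages locally: steal one from another CPU's freelist
    for(int i = 0; i < NCPU; i++){
      if(i == id)
        continue;
      acquire(&kmem[i].lock);
      r = kmem[i].freelist;
      if(r)
        kmem[i].freelist = r->next;
      release(&kmem[i].lock);
      if(r)
        break;
    }
  }

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  return (void*)r;
}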
// note: NDIRECT=12
// On-disk inode structure
struct dinode {
  short type;           // File type
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  uint addrs[NDIRECT+1];   // Data block addresses
};
The goal of this lab is to enlarge the maximum supported file size by adding a doubly-indirect index block to the mixed index scheme.
This lab is fairly easy; its main prerequisite is an understanding of the file system, so make sure the file system chapters of the xv6 book are well understood.
// On-disk inode structure
struct dinode {
  short type;           // File type
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  uint addrs[NDIRECT+2];   // Data block addresses (NDIRECT+1 -> NDIRECT+2)
};
// kernel/file.h
// in-memory copy of an inode
struct inode {
  uint dev;           // Device number
  uint inum;          // Inode number
  int ref;            // Reference count
  struct sleeplock lock; // protects everything below here
  int valid;          // inode has been read from disk?

  short type;         // copy of disk inode
  short major;
  short minor;
  short nlink;
  uint size;
  uint addrs[NDIRECT+2]; // NDIRECT+1 -> NDIRECT+2
};
// Return the disk block address of the nth block in inode ip.
// If there is no such block, bmap allocates one.
static uint
bmap(struct inode *ip, uint bn)
{
  uint addr, *a;
  struct buf *bp;
  // ......
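The new case added inside bmap(), as a sketch: it reuses the locals above, follows the existing singly-indirect case (which has already subtracted NDIRECT from bn), and assumes NDIRECT has been reduced to 11, as the lab instructs, so addrs[] keeps 13 entries (11 direct + 1 singly-indirect + 1 doubly-indirect). MAXFILE correspondingly becomes NDIRECT + NINDIRECT + NINDIRECT*NINDIRECT, and itrunc() must free the extra level too.

  // kernel/fs.c, inside bmap(), after the singly-indirect case -- a sketch.
  bn -= NINDIRECT;

  if(bn < NINDIRECT * NINDIRECT){
    // Load the doubly-indirect block, allocating if necessary.
    if((addr = ip->addrs[NDIRECT+1]) == 0)
      ip->addrs[NDIRECT+1] = addr = balloc(ip->dev);
    bp = bread(ip->dev, addr);
    a = (uint*)bp->data;
    // First level: which singly-indirect block within it.
    if((addr = a[bn/NINDIRECT]) == 0){
      a[bn/NINDIRECT] = addr = balloc(ip->dev);
      log_write(bp);
    }
    brelse(bp);
    bp = bread(ip->dev, addr);
    a = (uint*)bp->data;
    // Second level: which data block.
    if((addr = a[bn%NINDIRECT]) == 0){
      a[bn%NINDIRECT] = addr = balloc(ip->dev);
      log_write(bp);
    }
    brelse(bp);
    return addr;
  }

  panic("bmap: out of range");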
void
writeback(struct vma* v, uint64 addr, int n)
{
  if(!(v->permission & PTE_W) || (v->flags & MAP_PRIVATE)) // no need to write back
    return;

  if((addr % PGSIZE) != 0)
    panic("unmap: not aligned");

  printf("starting writeback: %p %d\n", addr, n);

  struct file* f = v->file;

  // write a few blocks at a time, like filewrite(), to stay within the
  // maximum log transaction size
  int max = ((MAXOPBLOCKS-1-1-2) / 2) * BSIZE;
  int i = 0;
  while(i < n){
    int n1 = n - i;
    if(n1 > max)
      n1 = max;

    begin_op();
    ilock(f->ip);
    printf("%p %d %d\n", addr + i, v->off + addr + i - v->start, n1);
    int r = writei(f->ip, 1, addr + i, v->off + addr + i - v->start, n1);
    iunlock(f->ip);
    end_op();
    i += r;
  }
}