vmalloc_sync_all Sunday, 29 January 2012  
Having "understood" my nested page fault issues, I have been trying to finalise the code changes. However, any attempt to do so leads me to a lot of pain.

$ dtrace -n fbt::page_fault:

is a dangerous thing to do - we intercept the page fault handler. But the page fault handler can be called if a D script tries to access an unmapped page. We could deter users from putting a probe on page_fault, but that seems a real shame - that's a very useful and interesting function to probe.

This works brilliantly on x86/64 systems but fails abysmally on i386. I have chased the problem down to kernel page tables and user process page tables disagreeing about what is "visible", and to the way the kernel does "lazy page table population". Given that, it's very difficult to stop a page fault occurring, for instance, in the breakpoint handler.

(We hit the page_fault function, which generates a breakpoint trap to execute the probe, but whilst processing the breakpoint trap, we induce a page_fault trap: BOOM!)

I've experimented with various mechanisms to avoid these lazy page faults. There's a function in the kernel, vmalloc_sync_all(), which ensures all page tables are in sync with the kernel - so that minor page faults cannot happen. If I ensure this is called during the driver load, then the problem of a nested page fault appears to go away.

(This is a better job than the code I wrote which does something similar but only for specific locations in dtrace itself; vmalloc_sync_all is a generalised function to sync all page tables of all processes).

So, I will need to recode and remove the cruft from my work-in-progress dtrace.

(I am trying to track down if vmalloc_sync_all is called by the x86/64 kernel - but not the i386 one; it would certainly explain why I see such a difference in behavior when tracing the page_fault code).

More later this week if I can successfully resolve this issue, once and for all.

Posted at 22:15:36 by fox | Permalink
  Impossible progress Wednesday, 25 January 2012  
Nigel was asking me today why I was bothering to spend so much time on a bug which is uninteresting. And if this issue happens on i386, why don't we see it on x64?

Let's catch up: dtrace, when reloaded on an i386 system, can panic or hang the system. This doesn't happen on x64.

As much as I like to dismiss i386 as yesterday's technology, it demonstrates that *something* is wrong. Ignoring this warning sign is perilous. Before going into the subject in detail: why would it only fail after a driver load? Why not half way through debugging your production system?

The underlying scenario may be rare but without the deep understanding, such a tool can never promote itself in the reliability stakes.

Ok, so let's deep dive. Every process has a page table - which describes what the process can see. In Linux, process #0 (the 'swapper') also has a page table, but it's a "master page table". It describes what the kernel can see.

A process is most of the time dealing with its own address space, but on a system call or interrupt, we are dealing with the kernel. The CPU contains the circuitry to allow the kernel space to be visible when the interrupts or system calls happen.

But, how does the kernel map and the per-process map keep in sync? When you load a device driver (or even plug in a USB drive, for instance), the kernel will allocate space for the code and data. This belongs to the swapper/kernel. If whilst your process is executing, the USB drive generates an interrupt, leading to the USB driver executing, it will do so in the context of your process page table. You cannot see this (normally). But those pages are *not* in your page table.

So, as the CPU tries to jump or access this memory, a page fault will be generated. The page fault handler *IS* in your page table (as is the whole monolithic kernel). The page fault handler will realise the page fault happened in kernel space, and will notice that the swapper page table and your process page table do not agree. It will copy the offending page table entry from kernel(swapper) to your process. And the system will continue - as if "by magic". (Function vmalloc_fault() is the one that does this magic).
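To make the lazy fix-up concrete, here is a tiny user-space model of it. This is only a sketch under my own assumptions: toy_pgd, toy_vmalloc_fault and KERNEL_SLOTS are made-up names for illustration, not kernel APIs, and the real vmalloc_fault() works on real page directory entries rather than an array of numbers.

```c
#include <stdbool.h>
#include <stddef.h>

#define KERNEL_SLOTS 8  /* hypothetical: number of kernel-space entries we model */

/* A toy page directory: one entry per kernel slot, 0 means "not present". */
struct toy_pgd {
    unsigned long entry[KERNEL_SLOTS];
};

/*
 * Model of the lazy fix-up, called when a process faults on kernel slot
 * `slot`. Returns true if the fault was repaired by copying the entry from
 * the master (swapper) table, false if the master has no mapping either,
 * i.e. the access really was bad.
 */
bool toy_vmalloc_fault(struct toy_pgd *proc, const struct toy_pgd *master,
                       size_t slot)
{
    if (slot >= KERNEL_SLOTS)
        return false;                 /* outside the modelled kernel range */
    if (master->entry[slot] == 0)
        return false;                 /* not mapped anywhere: genuine fault */
    proc->entry[slot] = master->entry[slot];  /* copy the missing entry */
    return true;
}
```

The point of the model: nothing is copied until a process actually trips over the missing entry, which is exactly why the handler doing the copying must itself already be mapped.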

Linux/Dtrace is special

Linux/dtrace is very special compared to all other implementations of dtrace and all other drivers on Linux: it is not only dynamically loaded, but it contains a page fault handler. (Why? Because when you do silly things in your D scripts, dtrace wants to prevent you from panicking the kernel; it has to intercept invalid page faults caused by D scripts; it doesn't care about normal page faults, and leaves the kernel to do its stuff).

If the page fault handler is not in the user page table (why should it be? after a module load, it won't be), then we are in dangerous territory. You cannot simply ignore a "page fault" - you *must* process it. So, here's the scenario: when dtrace is loaded, it only exists in kernel page tables - not in any process's page table. Under normal use of dtrace, invoking probes or syscalls, the act of these probes firing would cause a page fault to ensure the dtrace code is mapped into the page table of the process.

What is happening...

After dtrace is loaded, we have two scenarios to consider: system processes (especially kernel threads, irqbalance, etc) and user procs. The system processes run in kernel space and have the page fault handler mapped. (In theory these system procs shouldnt have page faults, but they might do). The user procs have no knowledge of dtrace, and as they page fault, the CPU will try to invoke the page fault handler which is not mapped into the user proc page table. This causes another fault and we eventually have stack overflow, page table corruption and a double fault.

The solution

The solution is to ensure that every process has the page fault handler mapped as the module is loaded. I've written/borrowed code to walk the process table, and ensure the page fault handler is properly "faulted" into the per-process page table.

My first experiments were a failure: even the tiniest of coding blips will show up as a crash/hang/panic. After validating the code very carefully: it appears to work.


When a process forks, the new process gets a copy of the same page tables as its parent. So, if a process has the page fault handler mapped, so will its child. I.e. we just need to "seed" every process on the module load, and we are done.
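The "seed once, then let fork do the rest" idea can be sketched in user space as below. All of the names (toy_mm, toy_seed_all, toy_fork, KSLOTS) are hypothetical and exist only for this illustration; the real code walks the kernel process list and touches real page tables.

```c
#include <stddef.h>

#define KSLOTS 8  /* hypothetical: number of kernel-space entries we model */

/* Toy per-process view of the kernel part of the page table. */
struct toy_mm {
    unsigned long kentry[KSLOTS];
};

/*
 * Eagerly copy one kernel entry from the master table into every process.
 * Returns how many process tables actually had to be updated.
 */
size_t toy_seed_all(struct toy_mm *procs, size_t nprocs,
                    const struct toy_mm *master, size_t slot)
{
    size_t updated = 0;

    if (slot >= KSLOTS)
        return 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i].kentry[slot] != master->kentry[slot]) {
            procs[i].kentry[slot] = master->kentry[slot];
            updated++;
        }
    }
    return updated;
}

/* Model of fork: the child starts with a copy of the parent's entries. */
struct toy_mm toy_fork(const struct toy_mm *parent)
{
    return *parent;
}
```

After one pass of toy_seed_all() every existing process has the entry, and any child produced by toy_fork() inherits it, so no further passes are needed in this model.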

Why doesn't this happen on x64?

I believe the reason is: probability. It *can* happen but either has not been observed, or was assumed to be a different bug. I haven't directly measured this (yet) on x64, but all it requires is that the page where the dtrace page fault handler is loaded into memory be mapped into every process's page table *by accident*. This might happen due to bootup modprobes and other things, or it could be caused by the layout of the page table directory structure making it likely that dtrace lands on a previously mapped page. (Maybe even the layout of the ELF format module file might "help"). But on a large memory system, that luck might not be sufficient, and it is likely the same bug would crash the system - at the least opportune time.

What next?

Next up is to tidy up the horror of code bodges I have in my VM and push back to the master dtrace source; see if I can prove it could be a problem in x64, and ensure the new code is x64 palatable. (The Linux kernel, in arch/x86/mm/fault.c has two implementations of vmalloc_fault - one for i386 and one for x64, so I cannot assume the i386 "fix" will work for x64).

Posted at 20:25:47 by fox | Permalink
  Free advertising! Get CRiSP Now! Thursday, 19 January 2012  
I rarely do what I need to do...push CRiSP into your face. You are all so busy grabbing copies of dtrace, I can't spoil your party :-)

After reading about SOPA/PIPA over the last few days and the current fate of megaupload.com, I thought I would do a search for "crisp editor crack".

Quite surprised to see how many sites are hosting the software - even out of date versions. I'm actually quite proud that it should turn up on every warez site I have never heard of.

I'm looking at one site, where the file size is posted as 58.6MB. Given that my downloads (admittedly, compressed) are less than 9MB, that's pretty impressive. I wonder what they put into the download.

It's possible some of these are genuine outlets for the software (no genuine outlet would accept payment, so I am fairly certain this is fake).

I am looking at a URL which looks really nice - nice layout, lots of relevant decoration, and not very good details. I won't post the site link, except to say that crisp appears in the domain name.

It's possible this is just a DNS grab and the page is almost certainly automatically generated.

It's actually impressive the amount of effort people have put into auto generating fake sites, and CRiSP is in their targets. Thanks! I am impressed.

In case you are interested, just google "crisp editor crack" to get a feel. The "lifted" text similarities are interesting.

I suspect what you see out there is the union of two things: catalogs of software - maybe from websites or old shareware type listings, along with web site generators and automation by cheap labour to flood sites with everything other than the thing they are purporting to sell. I would expect all of these to be virus/trojan carrying candidates. One site I am looking at shows a reasonable filesize, but I can't be bothered to download and verify the version to see if it's real. (The one I am looking at actually looks really genuine, carrying my partner's web logo).

So, if you want to get the best programmers editor ever invented, and want to change your life, then buy CRiSP N*O*W*! :-)

(I'll update dtrace over the weekend...I know what the problem is, and the solution is slightly elusive).

Ok, boredom set in ... let's google search "dtrace crack" - not so many links to choose from. One is a very interesting link, regarding an E-Book by Jon Haslam. (Apologies to Jon - I don't know him). Sitting on that page are a lot of links to Playboy/Penthouse forums.

So, even dtrace gets the "Web" treatment.

Posted at 22:58:04 by fox | Permalink
  Update on the impossible Tuesday, 17 January 2012  
I've been walking through the scenario of why some kernel addresses are not visible to some processes. Think of a block of memory allocated by the kernel for internal use, but triggering a page fault, e.g. because the page is swapped out, or hasn't been touched yet by user space.

When the page fault handler is invoked, the address of the buffer exists only in some process page tables.

Turns out this is a (nice!) clever trick of the kernel. When things are allocated in the kernel and should be visible to all processes, e.g. a driver module or other buffer, then when the page fault kicks in, a check is made to see if the page is valid in the "master page table". Process #0, created on kernel boot up, houses the master table. The currently running process may not have the mapping in place, and instead of paying a large cost to update all processes' page tables to represent these kernel pages, the page fault handler will update the local process page table when the fault occurs.

This explains why some processes can see the page in question, and, others can not.

Bear in mind we are putting in place a page-fault interrupt handler. This *must*, repeat *MUST* be visible at the time of a page fault, else we get a cascade of nested page-faults because the handler isn't mapped in the process page tables.

So, we need to arrange this to be true. At the moment, the options include: (a) see if anything in the kernel allows us to propagate the page-table mapping across all procs (nobody else, other than possibly a Virtualisation guest, such as Xen/VMWare/VirtualBox, is likely to do this), (b) do the hard work myself, (c) move the interrupt handler into an existing page of mapped memory [hard], or (d) don't patch the IDT, but patch the existing page fault handler [not sure if this just puts off the problem].

Let me scrape some cobwebs off my brain...

Posted at 22:24:14 by fox | Permalink
  Naughty naughty bug. Monday, 16 January 2012  
I believe I have finally found / confirmed the root cause of the "impossible" bug.

Lets go on a journey....

The i386 virtual memory architecture relies on page tables. Each process has a (complex) array of descriptions for each page. Each page in the 4GB address space either has an entry, or is missing an entry. (A process does not need the full 4MB to describe the address space if it is not using every page of the 4GB space; 4MB is what is needed to describe each and every page of a full 4GB address space).
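The 4MB figure is just arithmetic, and it is easy to check. The sketch below assumes classic non-PAE i386 paging (4KB pages, 4-byte entries, 1024 entries per table); the function and macro names are my own for illustration. It also shows how a 32-bit virtual address splits into a directory index, a table index and a page offset, using the kernel address 0xc10277f0 as the sample.

```c
#include <stdint.h>

#define PAGE_SHIFT  12u     /* 4KB pages */
#define PTE_SIZE    4u      /* bytes per page table entry (non-PAE) */
#define PT_ENTRIES  1024u   /* entries per page table / page directory */

/* Bytes of page table entries needed to describe `bytes` of address space. */
uint64_t pte_bytes_for(uint64_t bytes)
{
    return (bytes >> PAGE_SHIFT) * PTE_SIZE;
}

/* Top 10 bits: which page directory entry. */
uint32_t pgd_index(uint32_t va)
{
    return va >> 22;
}

/* Middle 10 bits: which entry within that page table. */
uint32_t pte_index(uint32_t va)
{
    return (va >> PAGE_SHIFT) & (PT_ENTRIES - 1);
}

/* Low 12 bits: byte offset within the 4KB page. */
uint32_t page_offset(uint32_t va)
{
    return va & ((1u << PAGE_SHIFT) - 1);
}
```

For the full 4GB space, pte_bytes_for() gives 4MB: 1M pages times 4 bytes each. For 0xc10277f0 the split is directory entry 772, table entry 39, offset 0x7f0. A PAE kernel uses a different layout, so treat these numbers as specific to the non-PAE assumption.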

Now, typically in that 4GB range is user-space (typically everything below 3GB) and the kernel (everything in the last 1GB, i.e. addresses from 0xc0000000 upwards). [Details differ from release to release of the kernel].

Now the kernel can see all the user space pages, but typically, the user space process *cannot* see the kernel pages. (Would be nice if you could but that would be a security hole). Physical RAM is mapped so that the kernel can see every page of memory, but the kernel pages are marked so you can not (in user space) "see them". (With root access and access to /dev/kmem, you can poke around, but that's not normal behaviour).

Now, lets consider what happens when a (module) device driver is loaded. The kernel locates some free memory and loads the image into memory. The kernel does a lot of housekeeping to link the module into various lists and expose the /proc, /dev and other entries.

Here it gets interesting...

The driver is loaded into memory - the kernel knows about the memory before the driver is loaded - it's the physical RAM in your box. But maybe/maybe not - the unallocated "free" memory in the kernel is not really addressable - certainly not by user space, and possibly, not even by the kernel - trying to access free memory would indicate a rogue pointer or array-out-of-bounds exception. When the kernel needs a free page, it can ask the kernel allocator for it.

So, this means, if you were to examine the page table for each process in the system, these free pages are effectively "not there" and this can help detect rogue pointers and bugs in the kernel.

As the driver is loaded, the pages are flipped to "being there" and visible. Eg the code for the driver has to be visible to the rest of the kernel, because you are going to do an open/read/write/close, for example.

Now, from user space - it doesn't know or care about the physical memory for a driver. You cannot just blindly execute a subroutine in a driver - you can only get to it by executing a system call, which takes us from user space to kernel space, and, once in kernel space, we can see the code + data for the driver.

The Bombshell

Now consider a driver which embeds its own interrupt routine. When an interrupt fires, we normally switch to supervisor mode, and the page where the interrupt routine resides, is visible and executable.

I have been trying to track down a kernel blow up with dtrace, when it's loaded one or more times and a page fault fires. (Only observed in the i386 kernel, not seen it in the x64 kernel).

When the user space fires a page fault, we switch to supervisor mode and run the page fault handler. The first bit of this is in the dtrace driver. If we decide this is not interesting, we jump to the existing kernel handler.

Half way through the kernel handler, it decides to take a context switch. (I don't know why - maybe it's just being polite, and giving other high priority tasks a chance to run). As we load the %CR3 register (which points to the page table directory for the new process), we suddenly lose visibility of the dtrace driver. It is no longer visible, in the context of that process, *EVEN FROM SUPERVISOR MODE*.

That new process which just got the CPU takes a page-fault and *BANG*! GAME OVER!

The page fault handler is no longer visible. In fact, trying to take the interrupt fires a page fault exception, which in turn fires a page fault exception. The stack overflows and the CPU merrily trundles along overwriting the entirety of memory until it shoots both its feet off. (E.g., it starts to overwrite the page table itself or some other important structure). I strongly suspect that using the *page table* as a *stack* is what is causing the CPU to triple fault, and VMWare and VirtualBox to report that an unexpected unrecoverable event has happened and shut down the VM.

Eh? Whaddya say?

The evidence suggests that when a driver is loaded into kernel memory, ONLY SOME PROCESSES HAVE IT MAPPED INTO THEIR PAGE TABLE.

I did an experiment: I wrote some kernel code to let me probe each process on the system, to see if that process could see, in kernel space, a specific address. I tried a kernel address and that was fine (e.g. sys_open). I tried the dtrace interrupt (dtrace_page_fault), and it wasn't. I loaded a random other driver, and confirmed the same.

So, let's revisit. When a driver is loaded into kernel memory - it is touch and go as to whether the driver should exist in the page tables of all processes in the system. Loading a driver could cause a lot of page table updates, as each process's page table would need to be updated to reflect the mappings. Or, instead the kernel might decide it's not worth the bother: user space cannot access system space, except via a trap into the kernel via a syscall or interrupt.

So, why do half of the procs in the system have the driver loaded and the others do not?

Here's my guess: when a driver is loaded into memory at least one page table needs to be modified. This is a special page table which belongs to process zero (the swapper process). [A data structure called the swapper_pg_dir holds the kernel page table]. Under normal circumstances, every time a new process is created, it is a fork/clone of an existing process, so that new process gets a copy of the kernel page tables.

But loading a driver means we cause a "warp" effect - the kernel gets the new mappings, but some or all of the user procs do not.

The solution

Is this a bug? Is this a misinterpretation by me? It feels like a bug. Maybe the dtrace driver is miscompiled and I haven't put the interrupt codes into the right ELF section (so I will go and check).

If it's not a compile/declaration problem, then either I need to update every process's page table to see the driver pages, or find a way to ensure that a kmalloc()ed page is visible by all processes.

The evidence

Here's some evidence to support my findings. I invoke a kernel function to probe three addresses: static kernel function, dtrace page fault handler, "other" loadable module:

5.568007271 #0 1490:1294 d87fa000
5.568007271 #0 1490:     lookup1: c182009c 1
5.568007271 #0 1490:     lookup2: ebc236d0 1
5.568007271 #0 1490:     lookup3: 00000000 3
5.568007271 #0 1490:1489 d86ef000
5.568007271 #0 1490:     lookup1: c182009c 1
5.568007271 #0 1490:     lookup2: ebc236d0 1
5.568007271 #0 1490:     lookup3: eb0c52b4 1

"1490" is the PID of the process invoking the tracing. The first entry is for pid 1294. This PID can see the kernel function and the dtrace function, but not the "other" driver. Pid 1489 can see all three addresses I specified. Theres no real logic to why pid 1294 cannot see the new driver.

root      1294     1  0 21:38 ?        00:00:00 /usr/sbin/console-kit-daemon --no-daemon
fox       1489  1289  0 21:38 ?        00:00:03 sshd: fox@pts/1

Here's the kernel code, in case anyone is interested, which dumps the output:

void xx_procs(void)
{       struct task_struct *t;
        // hack to call a GPL function in the kernel
        pte_t *(*lookup_address)(unsigned long, unsigned int *) = 0xc10277f0;
        int level = -1;
        pte_t *p;

        printk("process list:\n");
        p = lookup_address(0xc10277f0, &level);
        printk(" lookup: %p %d\n", p, level);
        for_each_process(t) {
                struct mm_struct *mm;
                struct vm_area_struct *vma;

                printk("%d %p\n", t->pid, t->mm ? t->mm->pgd : NULL);
                if ((mm = t->mm) == NULL)
                        continue;
                // lookup_address1() is the same as the kernels
                // lookup_address() - but private copy to allow a
                // procs mm_struct to be passed so we can probe another
                // processes page table.

                // random kernel address
                p = lookup_address1(mm, 0xc10277f0, &level);
                printk(" lookup1: %p %d\n", p, level);
                p = lookup_address1(mm, dtrace_page_fault, &level);
                printk(" lookup2: %p %d\n", p, level);
                // random "other" module address, gained from /proc/modules
                p = lookup_address1(mm, 0xeecad000, &level);
                printk(" lookup3: %p %d\n", p, level);
        }
}

Posted at 22:14:08 by fox | Permalink
  Dtrace progress on the impossible Sunday, 15 January 2012  
Since using VMWare and the gdb debugger to reliably debug this issue, I have an update.

It turns out that as soon as the page-fault interrupt handler is enabled, when we take the first page fault interrupt, we pass it over to the kernel default handler. On return from the kernel, the page where the dtrace handler is located is no longer mapped in the page tables (for some reason).

On the next interrupt, the CPU jumps to a non-existent page, resulting in a nested page fault interrupt. This continues for a few thousand iterations, until the kernel stack blasts through something, leading to a double fault.

Interesting that the kernel stack contains thousands of copies of the same data (pushing of the page fault code, CS:IP and flags registers).
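Those repeated copies are the exception frames the CPU pushes. As a rough sketch (my own struct and function names, and assuming the fault is taken without a privilege change, so no user SS:ESP is pushed), each nested page fault costs four 32-bit words:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * What the CPU leaves on the stack for a page fault taken while already in
 * ring 0, in order of increasing address. The CPU pushes EFLAGS first and
 * the error code last, so the error code ends up at the lowest address.
 */
struct toy_fault_frame {
    uint32_t error_code;
    uint32_t eip;
    uint32_t cs;      /* 16-bit selector, stored in a 32-bit slot */
    uint32_t eflags;
};

/* How many back-to-back nested faults fit into `stack_bytes` of stack. */
size_t toy_nested_faults_that_fit(size_t stack_bytes)
{
    return stack_bytes / sizeof(struct toy_fault_frame);
}
```

With 16 bytes per frame, a 4K stack holds only 256 of them. The CPU has no idea where the stack is supposed to end, so the pushes simply carry on into whatever memory lies below it.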

gdb under VMWare player lets me set hardware breakpoints, so I can single step the kernel page fault handler. I've just had my first attempt, but unfortunately, maybe because of how long I took, the page fault handler decided to reschedule a different process to run, so I lost control.

It's truly great single stepping whilst gdb is showing me the line of code we are on.

Let's see if I can find what caused the mapping to disappear.

More later.

Posted at 10:02:54 by fox | Permalink
  VMWare Player .. nice Saturday, 14 January 2012  
I used to use VMWare server - quite a few years ago. I really liked it. But I got fed up with the server product lagging kernel development. New Linux kernels would come out and either I, or someone else, would have to figure out the compile-time changes for the drivers.

VMWare created the Player product - which could be used for running VMs but not creating them. I gave up with VMWare workstation.

That was a few years ago.

Today, I installed VMWare Player 4.0 - and it works as expected. It's nice to come to it after a few years of non-use. It seems highly reliable and works.

So, I transferred my Ubuntu 11.10 i386 VM from VirtualBox to VMWare Player. A bit of googling showed me how to convert the disk image.

The VM came up fine. I had a few issues with the network card. My kernel didn't have the driver - I had disabled as much as possible when creating a custom kernel, so spent a little while trying to find the PCNET32 driver. That resolved, I could remote login.

So - now time to try dtrace: pretty much the same results as VirtualBox. Some small difference - occasionally when VB would hit the double fault issue, the screen would frantically scroll for 5-10s until the VM gave up and hung (hard). Doing the same on Player - it will scroll continuously. There appears to be an emulation difference, which, by the looks of it, is that VB isn't playing as well in the face of double faults.

All this would be boring. But googling for 'vmware debugger' led me to a page which shows how to enable local gdb debugging of the VM guest. Using TCP rather than the silly serial port emulation of classic kgdb setups in the kernel.

So, I tried it. The first thing to note is: it just worked. gdb got a breakpoint in the native_safe_halt function in the kernel. What's interesting is that when a breakpoint is hit, you get a big "Pause" bitmap slap in the middle of the video console. If you hit ^C in the gdb to regain control, or we hit a panic, the pause bitmap becomes visible and you know you have stopped the VM.

The gdb debugger seems more resilient to debugging the doublefault.

Here's a fragment of the stack trace from gdb:

#253 0xc1004adb in show_registers (regs=0xc1732a6c)
    at arch/x86/kernel/dumpstack_32.c:106
#254 0xc1460113 in __die (str=0xc15865f6 "general protection fault",
    regs=0xc1732a6c, err=0) at arch/x86/kernel/dumpstack.c:275
#255 0xc10058c2 in die (str=0xc15865f6 "general protection fault",
    regs=0xc1732a6c, err=0) at arch/x86/kernel/dumpstack.c:308
#256 0xc145f836 in do_general_protection (regs=0xc1732a6c, error_code=0)
    at arch/x86/kernel/traps.c:402
#258 0xc1048c04 in vprintk (
    fmt=0xc158f150 "<0>PANIC: double fault, gdt at %08lx [%d bytes]\n",
    args=0xc1732b38 "") at kernel/printk.c:827
#259 0xc1456956 in printk (
    fmt=0xc158f150 "<0>PANIC: double fault, gdt at %08lx [%d bytes]\n")
    at kernel/printk.c:750
#260 0xc1021bee in doublefault_fn () at arch/x86/kernel/doublefault_32.c:26
#261 0x00000000 in ?? ()

So - it's good that I am seeing the same thing with VMware, and I have a new "debug" route to try and diagnose.

Just need to figure out who is at fault. It has to be "me". I hope it's not the Virtual emulation (VB or Player). I hope it's not a CPU bug. I hope it's not the Linux kernel.

Posted at 18:04:34 by fox | Permalink
  The "Great Bug" of 2011/2012 Saturday, 14 January 2012  
In my recent blog postings, I wrote about the "most difficult bug in the world" to resolve. On i386, loading dtrace and patching the page fault interrupt vector would panic/hang/double fault the kernel.

I have spent the last few weeks trying absolutely everything conceivable, to no avail. When faced with such a bug - one has to try everything, but, in the corner of your mind, you know that eventually, it may not be the place you are looking at - which is why it can be so elusive.

Let's recall the issue: when loading the driver, it works - the driver loads and full functionality is available. If we remove the driver, the system still works. If we reload the driver again, we get panics, where even the kernel stack dumper panics the kernel. Evidence shows corrupt page tables or interrupt stacks, and, using the VirtualBox debugger, we can probe (a little) to see what is happening in the VM.

Removing the providers in dtrace, removing calls to activate timers and just about removing anything that does real work, shows the problem to remain. The interrupt patching code was modified to be more careful about how it does the job. The key difference between "it works very well" and it "reliably crashes" is patching the page-fault vector (0x0E).

Even if the page-fault handler in dtrace is modified to be a jump to the original vector - no touching of registers, the problem persists.

I have modified dtrace to avoid touching the page fault vector, and instead, allow me to on-demand update the vector to the new or old interrupt location with an "echo" statement to the /proc/dtrace/trace device.

This is useful - because it helps to isolate all the things that happen when a driver is loaded, from the fault at hand.

Still: its erratic. I have times where I can reload the driver and patch the vector zillions of times, and others where, on the 2nd time, we go bang.

I've been studying in more detail the TSS register in the cpu and better understanding how Linux and x86 in general handles nested interrupts, and handling multiple interrupt stacks. I have been using kgdb and the kdebug debugger in the kernel, along with the VirtualBox debugger.

Working Backwards

Nothing works deterministically: it either works brilliantly or fails dismally.

Let's take a trip to a different place: let's work backwards. A "double-fault" interrupt is caused because an interrupt has effectively raised a general-protection fault (invalid segment selector, invalid address, page-fault, etc). So how can this happen? Well, a likely scenario is that the segment registers (%GS, %FS, %DS, %ES) are wrong.

There are really two scenarios for a page fault: it either happens in user space or kernel space. A user page fault might happen because reference to an mmapped area has yet to page-fault-in the page just touched. Another case for user page faults is the stack - as the level of nesting in an application increases, lower pages in the stack may need to be allocated/mapped.

User page faults are typically *rare*, especially on a small system because the working set can be mapped into the address space on startup.

Kernel faults are most common (are they?!), e.g. read() into a large buffer which has been mmap()ed could cause this.

I may look at which faults really are most common. (Nothing in the kernel tracks the types of faults [I think]). The difference is the segment registers: when a user app faults, the DS/ES/CS registers will point to user space, so the interrupt routine needs to modify these to point to the kernel address space. (Otherwise, things like referring to a .data or .bss object, in C, will generate a fault). If we trap from the kernel, then these registers are already the correct values.

[There's more complication here - the Linux interrupt vectors handle nested interrupts, double faults, NMI and other things, but let's keep it simple for now]

Now, the dtrace page fault handler keeps a count of how many interrupts it sees (/proc/dtrace/stats). So, when it works, we can see it working reliably. (The figures are actually lower than I expect, but that may be the kernel doing a good job of typically preloading new processes to minimize page faults, so I actually have to work hard to generate a high rate of page faults).

So, when we have a double fault - we know an interrupt routine had a problem. The trouble is, in this case, we don't know which interrupt routine caused the original violation, because the double-fault handler generates a new fault. (I think the i386 kernel code is broken - it is walking a stack and the cpu registers are not consistent; I see streams of stack dumps where each stack dump is causing another, nested stack dump, and it all scrolls off the screen. I have used the VirtualBox debugger to see the true stack, and this is only partially helpful).

When a nested fault occurs, the CPU will switch from the offending stack to a new stack, set up just for this purpose. This is a brilliant feature but its causing me a problem as I am having trouble finding the original offending stack.

Let's patch the kernel

Ok, so we know something is wrong. Maybe we can detect the issue before it happens and get a deterministic panic. I modified the code in the kernel - just before we dispatch to a new process, to validate the page-table, and also, to monitor the low water mark of the kernel stack. Suggestions on google are that the symptoms I am seeing are due to stack overflow. Neither of the mods I have made have helped. Typically, the kernel will use about 2K of the 4K stack space available to it - it rarely gets close to 3K, so I don't believe we are overflowing the stack and corrupting key data structures.
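One common way to measure a stack low water mark is to paint the stack with a known byte and later count how much of the paint survives. The sketch below shows the idea in plain user-space C; it is not the kernel patch itself, and the names (toy_stack_paint, toy_stack_used, STACK_FILL) are invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define STACK_FILL 0xA5  /* arbitrary paint byte, unlikely to appear in long runs */

/* Fill the whole stack area with the paint byte before it is used. */
void toy_stack_paint(unsigned char *stack, size_t size)
{
    memset(stack, STACK_FILL, size);
}

/*
 * The stack grows down from stack + size, so the untouched part is the run
 * of paint bytes starting at the lowest address. Returns the number of
 * bytes that have been used at some point (the high usage mark).
 */
size_t toy_stack_used(const unsigned char *stack, size_t size)
{
    size_t untouched = 0;

    while (untouched < size && stack[untouched] == STACK_FILL)
        untouched++;
    return size - untouched;
}
```

The estimate can be slightly low if the deepest bytes written happen to equal the paint byte, but it is good enough to tell "about 2K of 4K" from "ran off the end".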

Let's give up?

I refuse to give up on this. I am mentally walking thru kernel code and scenarios, trying to conjure up the "it doesn't happen often" case, to detect what can be happening.

Today, I was wondering whether giving the VM a nice round amount of memory or an odd amount could have an impact. I typically give my VM about 730MB of memory. I tried giving it exactly 256MB and it worked perfectly! Brilliant! Rebooted and tried again at 256MB, and it failed.

There's almost a flavor to the underlying problem. Often, I am getting a scenario where it works for extended periods of time: I cannot crash the machine. Other times, it crashes exactly on the second load (sometimes, even the first load, although this is rare).

It's almost as if the problem is to do with exactly what is allocated in memory. I tried a test by filling memory with a large file (full of zeros), and checking to see if the file was mutating. (Maybe the interrupt routine was firing with incorrect DS/ES registers and attempts to increment a counter were randomly patching a random page in memory - that would exactly explain the kind of problem I am seeing).
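The check itself is trivial: scan the zero-filled data and report the first byte that is no longer zero. A minimal sketch of that scan over an in-memory buffer is below (toy_first_nonzero is a made-up name; my actual test read a file, which would mean calling something like this on each chunk read back).

```c
#include <stddef.h>

/*
 * Scan a buffer that is supposed to be all zeros.
 * Returns the offset of the first non-zero byte, or (size_t)-1 if the
 * buffer is still clean.
 */
size_t toy_first_nonzero(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0)
            return i;
    }
    return (size_t)-1;
}
```

A hit would give the offset of the stray write, which in turn hints at which page was scribbled on. No hit proves little, of course: the stray write may simply have landed somewhere else.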

So, so far: nothing. No deterministic testing is locating the root cause of this fault. (I am also wondering if VirtualBox is broken - I have seen the many bug reports and complaints about VirtualBox on the web, but I have no evidence that the bugs are my problem. I must get qemu or VMware up and running to do side-by-side comparisons).

Posted at 09:55:17 by fox | Permalink
  More on the impossible Sunday, 08 January 2012  
I wrote last time about the worst bug to try and diagnose, namely one where we lose the page table or GDT or both. Doing so can result in a double or triple fault, and no way to figure out where you came from.

In user space, a jump to a virtual method (which is in essence a call to a function via a level of indirection) can leave the PC set to zero, which is similarly annoying in that you do not know where you came from. But the fact the program counter is zero tells you that you went through a null pointer.

So, where am I in solving the "impossible"? Not much further forward. I have been using the VirtualBox debugger - but it is very broken and pathetic. You cannot set breakpoints or hardware breakpoints if using the VT-x/AMD-V virtualisation acceleration. If you turn these off, you can, except the semantics of the breakpoints break the guest operating system. In addition, writing to guest CPU registers is not implemented.

I took a quick look at qemu, but I didn't like what I found -- I prefer a CLI in general, but for VM guests, I prefer a GUI to get me comfortable. The GUIs on Linux are very ugly and amateurish, which didn't instill confidence in me. (I know, this is unfair of me). I may try again in the future.

I went back to kgdb - at least this let me set breakpoints and hardware breakpoints, which is useful. But the process of using kgdb is very clunky - the guest and remote debugger get out of sync on the comms protocol. In any case, hitting the bug I am interested in didn't help much with kgdb. I could break in the doublefault_fn function, but I couldn't really figure out where we had come from.

I modified dtrace to allow access to the GDT and IDT via /proc/dtrace/gdt and /proc/dtrace/idt. (Not really needed, but useful for validating that these data structures are correct).

What I am finding is that on a double fault, there is a suggestion that the original offending kernel stack for a process has been set to all zeroes. When the kernel tries to dereference an argument on the stack, or return from the offending function, it generates a GPF, which in turn generates a double-fault. (I'm not totally sure of this - a GPF wouldn't normally generate a double-fault, unless the GDT, page table or IDT were screwed up).

Let's just revisit what I am doing: having cut down dtrace to a minimalist shell, we can override entry IDT[14], which is the page-fault vector entry. If we put in the actual value which is there already, everything is fine.

If we modify the entry to point to our interrupt routine, and make our interrupt routine simply jump to the original kernel routine, at some time after this change (could be instantaneous to a minute later), we crash the kernel. It feels like a few pages of the kernel got overwritten, e.g. memset(random-ptr, 0, PAGE_SIZE). But tracking this down is nearly impossible.
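To make concrete what overriding IDT[14] touches: on i386 each IDT entry is an 8-byte gate descriptor holding the handler address split into two 16-bit halves, plus a code segment selector and a type/attribute byte. The sketch below models that layout in Python. The handler addresses and the selector value are made-up numbers for illustration, and the toy list stands in for the real table, which the driver changes from C.

```python
import struct

GATE = struct.Struct("<HHBBH")  # offset_lo, selector, zero, type_attr, offset_hi
PAGE_FAULT_VECTOR = 14


def pack_gate(handler, selector, type_attr=0x8E):
    """Build an 8-byte gate; 0x8E = present, DPL 0, 32-bit interrupt gate."""
    return GATE.pack(handler & 0xFFFF, selector, 0, type_attr, handler >> 16)


def gate_handler(raw):
    """Recover the 32-bit handler address from an 8-byte gate."""
    lo, _sel, _zero, _attr, hi = GATE.unpack(raw)
    return (hi << 16) | lo


# A toy IDT of 256 empty gates.
idt = [bytes(8)] * 256

KERNEL_CS = 0x60                 # hypothetical selector value
kernel_handler = 0xC15A1234      # hypothetical address of the kernel's handler
our_handler = 0xF8A00040         # hypothetical address inside the driver

idt[PAGE_FAULT_VECTOR] = pack_gate(kernel_handler, KERNEL_CS)
saved = idt[PAGE_FAULT_VECTOR]   # keep the original so it can be restored

idt[PAGE_FAULT_VECTOR] = pack_gate(our_handler, KERNEL_CS)
print(hex(gate_handler(idt[PAGE_FAULT_VECTOR])))

idt[PAGE_FAULT_VECTOR] = saved   # unload: put the original entry back
print(hex(gate_handler(idt[PAGE_FAULT_VECTOR])))
```

Only the two offset halves differ between the original gate and the replacement, which is part of what makes the crash so puzzling: the change itself is tiny.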

I have been adding debug code to the kernel source to try and do extra validation (e.g. in the scheduler, just before the context switch occurs), but this hasn't proven fruitful so far.

It's almost like looking for a rootkit in the kernel - I almost wonder if the kernel has some tamper-resistant code in there (it does, but not like this).

I need to somehow checksum the entirety of RAM and look for something unexpected to happen, but doing this isn't viable. RAM and processes are changing all the time. Process creation complicates things - every fork() generates a new process with a new kernel stack. I would need to keep walking every process's kernel stack to detect corruption, before we switch to the process.
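The per-stack check could look roughly like this. It is a user-space Python model only: the class and method names are hypothetical, the stacks are plain byte arrays, and CRC32 stands in for whatever checksum a kernel patch would really use. A checksum is recorded when a process is switched out and verified just before it is switched back in.

```python
import zlib

STACK_SIZE = 4096


class StackAuditor:
    """Remember a checksum per sleeping process; verify before it runs."""

    def __init__(self):
        self.sums = {}

    def switched_out(self, pid, stack):
        # Record the state of the stack at the moment the process stops.
        self.sums[pid] = zlib.crc32(stack)

    def about_to_run(self, pid, stack):
        """Return True if the stack is unchanged since it was switched out."""
        expected = self.sums.pop(pid, None)
        return expected is None or expected == zlib.crc32(stack)


auditor = StackAuditor()
stacks = {pid: bytearray(b"\x11" * STACK_SIZE) for pid in (100, 101)}

for pid, stack in stacks.items():
    auditor.switched_out(pid, stack)

# Simulate the suspected bug: something zeroes one sleeping stack.
stacks[101][:] = bytes(STACK_SIZE)

for pid, stack in stacks.items():
    ok = auditor.about_to_run(pid, stack)
    print(pid, "ok" if ok else "CORRUPT")
```

In a real kernel a sleeping task's stack can be written legitimately (for example, on-stack wait structures updated by whoever wakes the task), so a check like this may raise false alarms and would need to allow for that.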

I am running on a single-CPU guest, to avoid the complexity of multi-CPU operations. What I cannot determine is whether something is being corrupted by virtue of writing to the IDT, or a long time after.

Alas, Google searching hasn't been helpful - the symptom and problem are very unusual (I am not writing a rootkit, although dtrace looks an awful lot like a rootkit in terms of what it does), and I am not booting up a new operating system. Nobody describes the scenario of modifying an in-use IDT and the things that can go wrong. (I did find two links, quoted a couple of posts ago).

Next is to try disabling dtrace's timer code - maybe that is causing non-deterministic behavior.

Posted at 21:26:57 by fox | Permalink
  Debugging with VirtualBox Wednesday, 04 January 2012  
Earlier, I wrote about the worst type of bug in the world - one where we smash the internal CPU registers so badly, that nothing recovers - no interrupts, no double/triple faults.

I've been experimenting with the VirtualBox debugger, and it's very nice, albeit a little basic. Anyone interested in playing with this will need to read the manual.

But here's an illustration of a CPU-smashing bug.



If I run the following command, I can get a complete dump of all registers in the CPU in the VM guest:

$ VBoxManage debugvm  Ubuntu-11.10-i386 getregisters all | tee /tmp/reg
cpu0.rax               = 0x0000000000000000
cpu0.rcx               = 0x0000000000000000
cpu0.rdx               = 0x0000000000000000
cpu0.rbx               = 0x00000000c1644000
cpu0.rsp               = 0x00000000c1645f80
cpu0.rbp               = 0x00000000c1645f98
cpu0.rsi               = 0x00000000c1698fb8
cpu0.rdi               = 0x000000004fcb43de
cpu0.r8                = 0x0000000000000000

Now, I save this to a file, and then cause the guest to crash. We dump the registers again and now we can diff the results. We expect to see lots of differences, but here are some of the key elements:

> cpu0.gs                = 0x00e0
> cpu0.gs_attr           = 0x00004091
> cpu0.gs_base           = 0x00000000ecc05c00
> cpu0.gs_lim            = 0x00000018
> cpu0.gs                = 0x0000
> cpu0.gs_attr           = 0x00010000
> cpu0.gs_base           = 0x0000000000000000
> cpu0.gs_lim            = 0x00000000

Note that the GS register is smashed in the diff (the second set of values is from the crashed guest). There's no base address for the segment, so any code trying to use GS will cause a double/triple fault. That's not good for the kernel.

> cpu0.cr2               = 0x00000000b78a0000
> cpu0.cr3               = 0x000000002a945000
> cpu0.cr2               = 0x00000000c1647040
> cpu0.cr3               = 0x0000000001748000
> cpu0.tsc               = 0x89fd0226
> cpu0.tsc               = 0x02307c70
> cpu0.msr_gs_base       = 0x00000000ecc05c00
> cpu0.msr_gs_base       = 0x0000000000000000

Register CR3 is the page table base address. In the crashed machine, CR3 looks "wrong". And the interactive VirtualBox debugger won't get very far with this wrong value, as it needs the page tables to map virtual addresses to physical ones.

Likewise, the msr_gs_base (which is an internal register which holds the place where the GS register is taken from, on a kernel switch) seems corrupt.

This is why my guest is a smashed VM.
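Diffing two full dumps by eye gets tedious, so a small filter helps. The sketch below is one way to do it in Python, assuming each line of the dump has the `name = value` shape shown above; the function names are hypothetical, and the sample values are copied from the dumps in this post. The TSC is ignored because it always differs.

```python
def parse_dump(text):
    """Parse `name = value` lines from a getregisters dump into a dict."""
    regs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            regs[name.strip()] = int(value.strip(), 16)
        except ValueError:
            continue  # skip entries that are not a single hex number
    return regs


def changed(before, after, ignore=("cpu0.tsc",)):
    """Yield (name, old, new) for registers whose value differs."""
    for name in sorted(before.keys() & after.keys()):
        if name not in ignore and before[name] != after[name]:
            yield name, before[name], after[name]


# Values taken from the dumps shown above.
healthy = parse_dump("""\
cpu0.gs                = 0x00e0
cpu0.gs_base           = 0x00000000ecc05c00
cpu0.cr3               = 0x000000002a945000
cpu0.tsc               = 0x89fd0226
""")
crashed = parse_dump("""\
cpu0.gs                = 0x0000
cpu0.gs_base           = 0x0000000000000000
cpu0.cr3               = 0x0000000001748000
cpu0.tsc               = 0x02307c70
""")

for name, old, new in changed(healthy, crashed):
    print(f"{name:<14} {old:#018x} -> {new:#018x}")
```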

But, alas, I don't know what's causing this.

Still investigating....

Posted at 22:15:41 by fox | Permalink
  Is that the worst you can do ? Wednesday, 04 January 2012  
What's the worst thing you can do to a CPU whilst executing code?

How about a buffer overflow .. overwriting beyond the end of a buffer. Very soon a segmentation violation (or GPF) will happen, and the application will terminate, or try to recover.

How about inside the kernel? Well, pretty much the same thing.

The x86 architecture is well thought out. When some form of memory access goes awry, an interrupt is generated (technically a 'fault' or 'trap'), and the kernel will attempt to recover from this.

The act of taking an interrupt involves pushing the current program counter on the stack, and jumping to a predefined location.

Great. So - whether a GPF occurs in user space or kernel space, something will happen. This is either recoverable, or a panic/blue-screen can happen if the kernel doesn't know what to do.

The predefined location is set up in a table called the IDT (Interrupt Descriptor Table).

If the interrupt to handle a GPF takes a fault itself, the system will generate a double-fault. Double-faults are very rare. (Page faults, by contrast, are very common, and occur under normal circumstances with memory-mapped/anonymous memory, as pages are faulted into existence).

A double fault typically indicates a flaw in a driver and can be caused by using an invalid pointer or a stack exception in an existing interrupt.

A triple fault is what can happen if a double fault generates an exception. This would indicate the double-fault handling code hit an unexpected condition. On the Intel/AMD architectures, a triple fault will typically reset and reboot the CPU.
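The escalation described above can be written down as a tiny model. This is a deliberately simplified Python sketch with hypothetical names: it treats every failure to run a handler as escalating one step, whereas the real CPU rules depend on which classes of exception are involved.

```python
def deliver(fault, handler_ok):
    """Follow one exception through the escalation chain.

    `handler_ok` maps a fault name to True if its handler can run
    to completion, False (or missing) if invoking it raises another
    exception.
    """
    chain = ["double fault", "triple fault"]
    path = [fault]
    current = fault
    for escalation in chain:
        if handler_ok.get(current, False):
            return path + ["handled"]
        path.append(escalation)
        current = escalation
    # There is no handler for a triple fault: the CPU resets.
    return path + ["CPU reset"]


print(deliver("GPF", {"GPF": True}))
print(deliver("GPF", {"GPF": False, "double fault": True}))
print(deliver("GPF", {}))
```

The last call models a corrupt IDT or page table: no handler can be reached at all, so the chain runs straight through to a reset with nothing left behind to inspect.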

Normally, the kernel and CPU operate together on some very key data structures. We mentioned the IDT, above. There's also the GDT - which describes how segments of memory map to real memory. And then there's the LDT - which is a per-process view of memory. Corrupting any of these can lead to double/triple fault behavior.

But there's another data structure: the page table directory. If the page table is corrupted then all bets are off. The page table can be used to indicate what blocks of memory are present/not-present in the system and is the mechanism for virtual memory support. If an application's page table were corrupt, then the application would generate a page fault interrupt and the kernel would quickly shut down the offending process.

But what if the kernel version of the page table were corrupt? On an interrupt, the CPU wouldn't be able to access the code to execute the interrupt handler, which in turn would lead to a double fault, and thence to a triple fault.

All of this is well documented on the web.

But I am having a hard time with dtrace on i386 architectures. After loading dtrace, and then removing it from the system, on a subsequent reload of the driver, the system crashes/hangs. Most of the time there is no output on the console; when there is output on the console, it's confused and corrupted. This indicates that one of the key data structures in the kernel is corrupt (IDT, page tables or GDT).

And, because of this, it is nearly impossible to debug. Nothing in the kernel can help debug this scenario - we cannot print or signal what has happened or where we were prior to the crash.

At the moment I am using the VirtualBox debugger to poke around after a crash, but the debugger won't let me examine memory, precisely because the page table is corrupt (or the CR3 register is corrupt, but I cannot tell the difference; CR3 is the register which points to the start of the page table).
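Why a bad CR3 blinds the debugger can be seen from how an address gets translated. The sketch below assumes classic two-level i386 paging with 4K pages only (no PAE and no 4MB pages, which a real kernel may well use); the function names and the toy memory contents are hypothetical. The example address is the CR2 value from the crashed dump in the previous post.

```python
def split_virtual(addr):
    """Split a 32-bit virtual address for classic two-level paging."""
    return {
        "directory_index": (addr >> 22) & 0x3FF,  # top 10 bits
        "table_index": (addr >> 12) & 0x3FF,      # next 10 bits
        "offset": addr & 0xFFF,                   # low 12 bits
    }


def walk(read_phys32, cr3, addr):
    """Translate `addr`, or return None if an entry is not present.

    `read_phys32(paddr)` is a caller-supplied function that returns the
    32-bit word at a physical address.
    """
    parts = split_virtual(addr)
    pde = read_phys32((cr3 & 0xFFFFF000) + 4 * parts["directory_index"])
    if not pde & 1:                     # bit 0 = present
        return None
    pte = read_phys32((pde & 0xFFFFF000) + 4 * parts["table_index"])
    if not pte & 1:
        return None
    return (pte & 0xFFFFF000) | parts["offset"]


# Toy physical memory: one directory entry and one table entry.
memory = {
    0x1000 + 4 * 0x305: 0x2000 | 1,   # directory at 0x1000 -> table at 0x2000
    0x2000 + 4 * 0x247: 0x5000 | 1,   # table entry -> page at 0x5000
}


def read(paddr):
    return memory.get(paddr, 0)


print(split_virtual(0xC1647040))
print(walk(read, 0x1000, 0xC1647040))   # good CR3: translation succeeds
print(walk(read, 0x0, 0xC1647040))      # wrong CR3: nothing can be resolved
```

Every lookup starts from CR3, so if that one value is wrong (or the directory it points at has been overwritten), no virtual address can be resolved, and from the outside the two cases look the same.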

So, this is the worst bug to resolve - no kernel debugger, printk statement or anything else in the kernel will help find the cause of the strange hang. (Strangely, this problem does not exist in the 64-bit kernel).

Posted at 20:50:58 by fox | Permalink