linux fork and the x86-64 issue Monday, 30 March 2009  
After a week of grafting away at various assembler constructs to implement a wrapper around the fork/clone syscall, I have hit a "Wow!" moment - the way this works ... won't.

Implementing a wrapper around a normal piece of C code is easy - just create a new piece of code which has the same calling sequence, no side effects, and returns the output from the function.
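
To make that concrete, here is a minimal sketch of the shape of such a wrapper (the names are illustrative, not the real driver code):

/* A wrapper with the same calling sequence as its target: our
 * bookkeeping goes either side of the call, and the result is
 * passed back untouched. */
long real_function(long a, long b);

long wrapped_function(long a, long b)
{
    long rc;

    /* entry probe would fire here */
    rc = real_function(a, b);   /* identical arguments, same order */
    /* return probe would fire here */
    return rc;
}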

On x86-64, fork() is "strange": a child is created as a clone of the parent (complete with the wrapping), but then goes off into the big wide world all on its own, without coming back to Daddy (the wrapper). So the half-clothed child ends up back in user space with a corrupted stack, and then core dumps.

The parent is fine - it's just these wayward kids which are a problem.

So, now that I understand it better, I am off to find a way to ensnare the child process and see if I can get out of these forking issues.


Posted at 21:56:58 by Paul Fox | Permalink
  forkin hell Saturday, 28 March 2009  
dtrace for x86-64 is still held up on the clone/fork syscall. Most syscalls in Unix are read-only from user space into kernel space. A few syscalls modify the incoming arguments - but fork (which is built on top of clone()) is different because of various stack/register manipulations.
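
For an ordinary syscall, the interposition is mechanical. Here is a sketch of the general scheme, assuming sys_call_table has already been located and made writable - the names are illustrative, not the actual driver code:

#include <linux/kernel.h>
#include <linux/unistd.h>

typedef asmlinkage long (*chdir_fn)(const char __user *);

static void **sys_call_table_ptr;    /* located elsewhere */
static chdir_fn orig_chdir;

static asmlinkage long my_chdir(const char __user *path)
{
    long rc;

    /* syscall entry probe fires here */
    rc = orig_chdir(path);
    /* syscall return probe fires here */
    return rc;
}

static void hook_chdir(void)
{
    orig_chdir = (chdir_fn) sys_call_table_ptr[__NR_chdir];
    sys_call_table_ptr[__NR_chdir] = (void *) my_chdir;
}

clone/fork cannot be handled like this, because the kernel's entry glue juggles the stack and registers around the C function - and that is exactly what the assembler glue has to reproduce.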

On x86-64, syscalls are executed via the SYSCALL instruction - a special optimisation on later x86 cpus compared to INT 0x80 - and the kernel glue does interesting things to implement the semantics of the syscall, especially fork/clone.

This is causing me problems, since my assembler glue doesn't quite match the Linux code (despite poring over it in excruciating detail for the last week - obviously not enough).

I can get fork() to nearly work - but that is nowhere near sufficient. Only perfection is acceptable.

Interestingly, fbt calls on the fork code work nicely, e.g.

$ dtrace -n fbt::sys_clone:
will work. (I could cheat and emulate the syscall, but that's cheating on a bad scale).

The assembler glue needs to work.

An interesting item in this arena is that the extra cpu registers in the 64-bit chip cause more work for the kernel, as they need to be saved, but this, in general, is offset by the power of 64-bit addressing and computation, so most people see it as a win. (I've benchmarked my key application - CRiSP - as 10-20% faster in 64-bit mode compared to the same code compiled 32-bit.)

More, hopefully, before the one-year birthday is up!


Posted at 22:19:46 by Paul Fox | Permalink
  dtrace release 20090321 Saturday, 21 March 2009  
I have just put out the latest dtrace release, which should work for all syscalls on 32 and 64 bit kernels. It won't work for all fbt functions on 64-bit kernels, but I need to work thru where the breakage is.

There's one nasty in this code - it makes use of pgd_offset_k() in the kernel, but the web implies this is an unsupported function, about to disappear, and so, depending on the way the kernel was built, it may cause problems (inability to load). It works fine on Ubuntu 8.10 with the stock kernels, but I haven't managed to uncover what the replacement idiom is.
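
For reference, this is roughly how the driver uses it - a sketch of the page-table walk, with helper names as found in recent 2.6 kernels:

#include <asm/pgtable.h>

/* Walk the kernel page tables to find the entry mapping addr.
 * pgd_offset_k() is the problem function: if it goes away, this
 * walk needs a new starting point. */
static pte_t *find_pte(unsigned long addr)
{
    pgd_t *pgd = pgd_offset_k(addr);
    pud_t *pud;
    pmd_t *pmd;

    if (pgd_none(*pgd))
        return NULL;
    pud = pud_offset(pgd, addr);
    if (pud_none(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd))
        return NULL;
    if (pmd_large(*pmd))        /* 2M page: the walk stops here */
        return (pte_t *) pmd;
    return pte_offset_kernel(pmd, addr);
}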


Posted at 23:35:54 by Paul Fox | Permalink
  success - sys_call_table in *Linux* can be modified Saturday, 21 March 2009  
I have spent two weeks trying to figure out how to modify the sys_call_table on x86-64 - everything I tried turned out not to work. The Linux kernel developers have stopped you from finding or writing to the sys_call_table, and I am happy with that. We are guests of the kernel.

But their goal can never fully work - once we are inside the kernel, the best they can do is make life hard. dtrace is almost a rootkit inside the kernel, but it's a friendly one - providing facilities to aid the people who support or own the target system. dtrace is a monitoring facility.

The Linux kernel functions - which are subject to change from one kernel to another - simply forbid a read-only page from being made read-writable if it's in the ".rodata" section.

Despite directly updating the page table entries to bypass this, I failed continuously. I ended up reading this: http://www.intel.com/design/processor/applnots/317080.pdf to find the sentence which told me what I was doing wrong.
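
The short version, as I understand the application note: changing the page table entry is not enough - the stale read-only translation must also be flushed from the TLB, or the CPU carries on using the cached one. A sketch of the kind of fix, using a find_pte()-style walker like the one sketched in the entry above:

#include <asm/pgtable.h>
#include <asm/tlbflush.h>

static void make_writable(unsigned long addr)
{
    pte_t *pte = find_pte(addr);    /* hypothetical walker */

    if (pte) {
        set_pte(pte, pte_mkwrite(*pte));    /* set _PAGE_RW again   */
        __flush_tlb_all();                  /* drop the stale entry */
    }
}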

The experiments of the last two weeks have resulted in a bit of a mess in the dtrace driver, so now I can clean that up, and should be in business.

Anyone who really cares about how this is done can grab the code and find the relevant sections - and, even better, criticise/improve the code.

One thing that stands out about my journey through the kernel is that dtrace may not work for a paravirtualised kernel, or UML linux, but that's relatively minor - if people want to run dtrace on these kernels, we can work out what needs to be fixed.


Posted at 18:52:15 by Paul Fox | Permalink
  rodata and the sys_call_table in the Linux kernel Monday, 16 March 2009  
The problems I have with 64-bit dtrace are down to the fact - and I welcome this - that Linux bundles lots of data into a read-only section. This is good practice: it helps to detect bugs, deters virus writers (not really), and makes sense.

They are good and clever too - the Linux functions for modifying page table attributes preclude you from undoing this, because they check what you are trying to do and turn off the _PAGE_RW bit of the page table entry.
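
The relevant check looks something like this - a condensed paraphrase of static_protections() in arch/x86/mm/pageattr.c (details vary between kernel versions):

/* If the target page lies inside .rodata, the kernel silently
 * strips the write bit from whatever protections were requested. */
static pgprot_t static_protections(pgprot_t prot, unsigned long address)
{
    if (within(address, (unsigned long) __start_rodata,
                        (unsigned long) __end_rodata))
        pgprot_val(prot) &= ~_PAGE_RW;
    return prot;
}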

Again, I applaud this - it makes the challenge more interesting. There are a few places on the web and in newsgroups which talk about this and how to undo it, e.g. by rebuilding the kernel. (The magic code is in the functions mark_rodata_ro() and static_protections() in pageattr.c.) Again, this shows a good plan, and one can see the evolution in the kernels.

Of course, that makes the 'game' more challenging, and is forcing me to learn and understand more about why dtrace works or doesn't. The earlier Ubuntu release must have predated these kernel changes, and now we just need to find a way around it.

Remember, we are running in kernel space, so anything goes (it's akin to running in MSDOS real mode - we can do what we like). We are not malicious, nor virus writers; we simply want a tool which plays nicely, in a black-box fashion, so that the kernel can be probed, even in the face of systemtap or kprobes.

I have tried numerous experiments - all failing, but that's because I am trying the hard way - no debugger. The issue is that in modifying the page table entry to allow read-writability, the kernel hangs. It looks like a kernel "bug" - a page fault on a read-only page causes an infinite trap loop - we keep returning to the faulting instruction.

More when I get a solution. For now - dtrace either works for you, or it doesn't (because the kernel is a later one).


Posted at 23:19:19 by Paul Fox | Permalink
  dtrace progress for linux 64-bit .. or lack thereof Sunday, 15 March 2009  
Very strange week. Having spent a lot of effort on the 32-bit version of dtrace and fixed the many silly things in the port, the 64-bit version is proving problematic.

The 64-bit version works fine on Ubuntu 7, but on Ubuntu 8 (and Fedora Core 8), with a variety of kernels, it hangs.

It's not clear what or why - whether it's kernel specific, GCC specific or whatever. I've taken to doing some radical surgery just to get something to print out and point at the problem area. I've redone the assembler code to tidy up potential compiler dependencies.

The same code still works on Ubuntu 7, but on Ubuntu 8, it hangs. I've been poring thru the Linux kernel code, and have some ideas of what could be doing this (e.g. the 2M page size for protected kernel data may be a part of it).

Another strange thing on the 2.6.27-7-generic kernel is that even if I get it to not hang the system, it runs out of memory with 512MB of space. (I've installed Ubuntu 8 inside VMware, so at least I am seeing the same inside as outside of VMware, and don't have to keep crashing my main machine.)

The strange thing is that it hangs on dtrace_casptr - which translates into a cmpxchg() function call (no need for inline assembler, since Linux headers provide this from the C level).
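
For the record, the translation is roughly this - a sketch rather than the exact driver source:

#include <asm/system.h>    /* pulls in cmpxchg() on 2.6 kernels */

/* Atomic compare-and-swap of a pointer: if *target == cmp, store
 * new there; either way, return the value that was found. */
void *dtrace_casptr(void *target, void *cmp, void *new)
{
    return cmpxchg((void **) target, cmp, new);
}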

I will keep digging, hopefully something will point to the bad coding issue here.


Posted at 20:12:21 by Paul Fox | Permalink
  64bit troubles Saturday, 14 March 2009  
The 64-bit dtrace doesn't work on a real CPU. It works inside vmware... but first, apologies to vmware. I have enjoyed it since its inception, and have been perplexed by why the code works inside it but not outside.

What is annoying is that most of the development of dtrace occurred in 64-bit mode - I thought porting to 32-bit would be simple, and was oblivious to many issues on 32-bit. So, over the last few months, I have been perfecting 32-bit dtrace.

Now, I go back to 64-bit dtrace, and it works inside vmware but not outside.

Looks like something strange happens in the kernel - the kernel supports large pages -- 2M and 1G page sizes. The system call table is sitting inside a 2M page (on x86-64, a 2M page is mapped directly by the third of the four page-table levels).

So, the hacks I did for 32-bit (not strictly hacks) may or may not work for 64-bit Linux - because I was taking a shortcut, and it turns out bits of the kernel are protected by a 2M page which is marked read-only.

Now I know what/where the issue is, I can go and see if I can fix it.

Annoying, because I have had to repeatedly crash my master machine to test this out, but am hopefully close.

More when I have a fix, but, for now, don't use the 64-bit dtrace. The 32-bit dtrace seems to be fine tho.


Posted at 22:59:09 by Paul Fox | Permalink
  64bit dtrace -- still not quite there. Thursday, 12 March 2009  
I am a little sick of vmware - my workhorse. It fails to emulate a real cpu closely enough: when I try native dtrace/64, the machine locks up. I thought it was a gcc issue - that's partially true, but I need to debug these issues. (It loads, and dtrace -l works, but any probe placements die in dtrace_casptr; if I comment that out, I don't panic my machine.)

I am trying out the latest release of VirtualBox to see if I am better placed to virtualise... let's see what happens.


Posted at 23:35:18 by Paul Fox | Permalink
  64-bit linux dtrace working (I hope) Wednesday, 11 March 2009  
I've fixed the compiler-dependent assembler stuff in dtrace_asm.c, so it appears to work in my vmware session. I am about to test on my non-vmware 64-bit hardware, so give it a try... but be careful!

Posted at 23:31:38 by Paul Fox | Permalink
  i give up..fedora..goodbye Tuesday, 10 March 2009  
I liked Fedora - I liked the desktop.

I absolutely loathe and detest YUM - the package updater which is broken beyond belief. So I am downloading Ubuntu 8.10/64 to give me a decent platform.

Nothing has installed properly from Fedora. Even downloading the FC10 release was going to take hours, but Ubuntu is a single CD which can bootstrap me, and then I can start using apt-get -- which is what I have been using on my two laptops and the vmware sessions to help debug dtrace.

I hate to give up on fedora - diversity is what counts, for checking out configs and other software development tools - but it hurts that my main machine no longer even has GCC on it. I removed the package, hoping to be able to rebootstrap myself, but yum refuses to play. Did I say I hate it?

Pirut - and python. <laughs /> No thanks

I want my software to have as few dependencies as possible, not take 10+ minutes to tell me that everything is broken.

I need to get my gcc working again so I can prove that dtrace works for 64-bit, but I fear I may be in a walled garden where "it just works on Ubuntu, and nothing else". I know what I did wrong in the __asm__ sections of dtrace - I just need a 64-bit machine and compiler to validate my changes.

So, tonight's episode is loading ubuntu on top of fedora, and seeing how much I learn to cry :-)

For now, don't touch 64-bit dtrace until I next write up that all is safe. You will very likely panic your system. (32-bit should be good.)


Posted at 21:30:39 by Paul Fox | Permalink
  64-bit oops.... Monday, 09 March 2009  
The 64-bit dtrace is broken - or rather, whether it crashes your kernel or not may depend on the version of gcc you use. The __asm__ constructs don't handle the frame pointer which is set up.
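
The general shape of the problem: asm that reaches for values at fixed stack offsets only works if gcc built the frame you expected. Letting the compiler do the plumbing through constraints sidesteps it - an illustrative sketch, not the driver code:

/* Fragile: "movq 16(%rbp), %rax" assumes a conventional %rbp frame,
 * which gcc may not have built (-fomit-frame-pointer, -O2, ...).
 * Robust: let gcc choose the registers via constraints. */
static inline unsigned long read_flags(void)
{
    unsigned long flags;

    __asm__ __volatile__("pushfq ; popq %0"
                         : "=r" (flags)
                         : /* no inputs */
                         : "memory");
    return flags;
}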

I have fixed (some of) them in my release, but am having difficulty upgrading my Redhat FC8 real machine with vmware to validate my testing.

My yum/pirut package updater is broken, and building gcc from source requires things which won't build properly, despite upgrading my binutils.

Oh well, I need to upgrade stuff, as this is an old system now, and I hate my mainline system not having the tools to get the job done. (If need be, I'll revamp with an FC10 at the weekend.)


Posted at 23:13:54 by Paul Fox | Permalink
  that's better...but strange Sunday, 08 March 2009  
dtrace -n fbt:::
dtrace: description 'fbt:::' matched 25753 probes
...

This is on a real CPU - seems to work now.

There have been lots of issues and discoveries in getting here over the last day or so.

Firstly, vmware seems to fail me: fbt::: would work fine on my vmware session, but fail on the real cpu. Some of this may be due to the kernel or compiler (vmware running 2.6.28.5, real cpu on 2.6.28.2), but I don't think it's kernel related - I don't believe that many differences have happened in the areas I am looking at.

It could be compiler and/or kernel options: GCC and Linux attempt to inline lots of stuff.

Here is what I found: despite initially thinking I should have 42,000+ probes, it's down to 25,753, as above. Some of this is because many functions are in non-".text" sections. The kernel makes liberal use of attributes to put functions in the following sections:

.init
.init.text
.fini.text

and others. So I added code to restrict probes to functions in the ".text" section only. This means we may miss the chance to probe some module exit code, but this isn't really much of a loss.
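
The filter is as simple as it sounds - a sketch of the check, with approximate names:

/* Only instrument functions living in ".text"; code in .init.text
 * and friends may be discarded after loading, leaving a probe
 * pointing at freed or reused memory. */
static int fbt_section_ok(const char *secname)
{
    return strcmp(secname, ".text") == 0;
}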

Additionally, because .init sections can be jettisoned after module loading, I found numerous cases where two or more functions sat at the same address. This is bad: if FBT latches on to this, we destroy the hash table for FBT and cannot patch/unpatch when dtrace is executed. I added code to detect and disallow two or more probes at the same address. (This isn't an issue for the vmlinux kernel, but it is for modules.)
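
The duplicate check is modelled on fbt's probe hash - identifier names as in the Solaris fbt provider, so treat this as a sketch:

/* Before creating a probe, see whether an earlier symbol already
 * claimed this address; if so, skip it rather than corrupt the
 * patchpoint hash with two entries for one instruction. */
static int fbt_addr_taken(uint8_t *instr)
{
    fbt_probe_t *p;

    for (p = fbt_probetab[FBT_ADDR2NDX(instr)]; p != NULL;
        p = p->fbtp_hashnext) {
        if (p->fbtp_patchpoint == instr)
            return 1;
    }
    return 0;
}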

Another sanity check is for functions on the notifier chains used by dtrace itself. A probe on one of those would cause infinite trapping/recursion/reboot.

Finally, the __mutex_ functions are put in the toxic.c file: dtrace itself needs mutex_lock/mutex_unlock to work, and those call the __mutex_ functions, so probing them would recurse. We could inline our own implementation of mutex_lock/mutex_unlock, but we can live without this for now.
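
A sketch of what toxic.c amounts to (the real list is longer):

/* Functions dtrace itself leans on: planting a breakpoint in any
 * of these means the probe fires while we are handling the probe. */
static const char * const toxic_funcs[] = {
    "__mutex_lock_slowpath",
    "__mutex_unlock_slowpath",
    NULL,
};

static int is_toxic(const char *name)
{
    int i;

    for (i = 0; toxic_funcs[i]; i++) {
        if (strcmp(name, toxic_funcs[i]) == 0)
            return 1;
    }
    return 0;
}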

I had to reboot my notebook numerous times to get here, and this seems to be stable. I will run it more and see what happens. There's still a chance that slowpaths or exceptional paths thru some critical functions could cause a kernel crash, so be careful.

TODO items: add more instruction emulations, revisit 64-bit support, and fix the process monitoring code (dtrace -c)


Posted at 11:54:33 by Paul Fox | Permalink
  Ooops Saturday, 07 March 2009  
I just wrote that 'dtrace -n fbt::*:' works. I tried it on my other (real cpu) machine, and it didn't. It definitely works for many/most functions, but some will cause problems - panics or reboots. (This is a machine showing 43,000+ available fbt probes.)

More when I think this is resolved.


Posted at 09:21:26 by Paul Fox | Permalink
  26,000+ probes Saturday, 07 March 2009  
I finally managed to allow:
$ dtrace -n fbt::*:

to work without crashing the kernel. Of the many functions in the kernel, some cannot be probed (not too many), since dtrace relies on them. Some, like do_int3(), are the glue which handles a breakpoint/trace trap.

Others were causing problems because my dtrace, unlike Solaris', calls kmem_alloc() from probe context. I have put in a temporary workaround for this by doing a static allocation in par_setup_thread(). The problem is that in Solaris, we have callbacks for process creation, and/or the proc structure contains extra fields for dtrace. In Linux dtrace, we don't touch the kernel code, so we need to shadow each process created (or probed), and this requires a little extra work.
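
The shadow scheme in outline - hypothetical names, with the real code living in par_setup_thread():

/* We cannot add dtrace fields to the kernel's task_struct, so keep
 * a parallel "shadow" structure per traced task, keyed by the task
 * pointer. */
typedef struct par_alloc {
    void             *pa_task;    /* the kernel task we shadow */
    struct par_alloc *pa_next;
    /* ... dtrace per-thread state follows ... */
} par_alloc_t;

static par_alloc_t *par_list;

static par_alloc_t *par_find(void *task)
{
    par_alloc_t *p;

    for (p = par_list; p != NULL; p = p->pa_next) {
        if (p->pa_task == task)
            return p;
    }
    /* Probe context must not kmem_alloc(): the caller falls back
     * to a statically allocated entry instead. */
    return NULL;
}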

This works now. I haven't heavily torture-tested it with the final pieces in place, but now I can.

There are still some instruction emulations not implemented yet; these form a very small part of the kernel, but a few thousand functions are unprobeable until I implement those instructions.

I'll look to adding more emulations, and then switch back to 64-bit dtrace as that too is missing quite a few emulations.

Again, I implore the community to give this a try. I know of at least one documented user successfully seeing dtrace actually work, and I am hoping that from here on in, it is reliable/stable. (Famous last words... I still consider this an alpha or pre-alpha release.)


Posted at 09:04:17 by Paul Fox | Permalink
  timer probing issue fixed Tuesday, 03 March 2009  
$ dtrace -n fbt::nr_active:
dtrace: description 'fbt::nr_active:' matched 2 probes

CPU     ID                    FUNCTION:NAME
  0   1005                  nr_active:entry
  0   1006                 nr_active:return
...

This was previously locking the machine up, but is now fixed. In dtrace_probe(), we call dtrace_gethrtime() which eventually calls the code to read the system clock. This is protected by a lock to avoid fluctuation as we read the multiple words of a clock.

We inline some of the code, to bypass the kernel lock (but we will suffer very occasional clock drift until this is fixed).
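
The inlined read is essentially the kernel's seqlock idiom minus the lock: retry until the writer's sequence counter is stable. A sketch, with hypothetical variable names:

/* Read a multi-word clock without taking the kernel's lock.  If a
 * timer interrupt updates the clock mid-read, the sequence number
 * is odd or has moved on, and we retry. */
extern volatile unsigned int clock_seq;                  /* hypothetical */
extern volatile unsigned long clock_secs, clock_nsecs;   /* hypothetical */

static unsigned long long clock_read(void)
{
    unsigned int seq;
    unsigned long long secs, nsecs;

    do {
        seq   = clock_seq;    /* bumped twice per writer update */
        secs  = clock_secs;
        nsecs = clock_nsecs;
    } while ((seq & 1) || seq != clock_seq);

    return secs * 1000000000ULL + nsecs;
}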

Now I can proceed getting to the point where:

$ dtrace -n fbt:::
works.

Stay tuned.


Posted at 15:54:14 by Paul Fox | Permalink
  one down.... Tuesday, 03 March 2009  
The problem of FBT probes not running on a real cpu was simply the same old problem I had for syscalls: the kernel text is write-protected on an i386 kernel. So we just make the kernel writeable as we set the probes, and now it works perfectly (well, as well as the vmware version does).
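
One way to do this on i386 - not necessarily exactly what the driver does - is the classic CR0.WP toggle: with the WP bit clear, supervisor-mode writes ignore read-only pages. A sketch:

#include <linux/kernel.h>
#include <asm/processor.h>

static void patch_text(unsigned char *addr, unsigned char opcode)
{
    unsigned long cr0, flags;

    local_irq_save(flags);          /* nothing else runs meanwhile  */
    cr0 = read_cr0();
    write_cr0(cr0 & ~X86_CR0_WP);   /* kernel writes now ignore R/O */
    *addr = opcode;                 /* e.g. plant the 0xCC int3     */
    write_cr0(cr0);                 /* restore write protection     */
    local_irq_restore(flags);
}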

This still leaves one major bad-ism -- some FBT probes crash the kernel - hard. I thought it was due to the derivatives of timer interrupts, but that's not true; I can place probes on some timer interrupt code and it works, but others do not. E.g. nr_active() is a crashing probe, but the callers of this function do not cause a problem.

Let's see what a bit of hard thinking brings. (Some bugs seem only to be worked out by thinking about the code paths the kernel, cpu and dtrace take; sometimes, no amount of printing will help, especially when we go blind in some areas and just hard-lock the kernel.)


Posted at 10:37:15 by Paul Fox | Permalink
  problems..problems Monday, 02 March 2009  
As I try to ramp up the kernel probes (as distinct from the module probes), I have found an issue: it looks like any probe called from timer interrupt context will hose the kernel.

Why?

The most likely reason is the short-cut code necessary to handle the INT3 traps. This can be seen by planting a probe on nr_active(), which is called from the timer interrupt to recompute the loadavg. Hopefully I can work out how to exit the probe interrupt handler properly, and this will avoid the need to blacklist lots of kernel functions.
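
For context, the INT3 hook itself rides the kernel's die-notifier chain, something like the sketch below (fbt_is_patchpoint() is a hypothetical helper; the real handler must also emulate the displaced instruction):

#include <linux/kdebug.h>
#include <linux/notifier.h>

static int dtrace_int3_handler(struct notifier_block *nb,
    unsigned long val, void *data)
{
    struct die_args *args = data;

    if (val != DIE_INT3 || !fbt_is_patchpoint(args->regs->ip - 1))
        return NOTIFY_DONE;    /* not ours: kprobes et al see it next */

    /* Fire the probe, emulate the instruction we overwrote, fix up
     * args->regs->ip, then swallow the trap.  Getting this exit
     * path right - from timer interrupt context - is the hard part. */
    return NOTIFY_STOP;
}

static struct notifier_block dtrace_int3_nb = {
    .notifier_call = dtrace_int3_handler,
};
/* register_die_notifier(&dtrace_int3_nb) at module load time */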

I am also still seeing differences between a real CPU and VMware. On a real cpu, if we place lots of probes, we stand a chance of getting a timer interrupt whilst laying out the probes, which causes a kernel GPF. It's possible this is related to the above - i.e. not doing the correct job in the first place - so hopefully one will give insight into the other.

Syscalls work, and many, many fbt kernel/module probes work - even under heavy load - but some don't. So don't try and lay out everything just yet...


Posted at 22:58:51 by Paul Fox | Permalink
  dtrace - the next phase Sunday, 01 March 2009  
Now that dtrace is working nicely - various stress tests are working well, e.g. "dtrace -n fbt:::" - it's time to go one step further.

Here is what todays release can do:

/home/fox/src/dtrace@vmub32: load
Syncing...
Loading: build/driver/dtracedrv.ko
Preparing symbols...
Probes available: 23075

Note that figure - we just moved from 5000+ probes covering the modules to including everything in the kernel.

Here's an example:

/home/fox/src/dtrace@vmub32: more /tmp/probes.current
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4        fbt            kernel                                   entry
    5        fbt            kernel                                   entry
    6        fbt            kernel                                   entry
    7        fbt            kernel                                   entry
    8        fbt            kernel                                   entry
    9        fbt            kernel                                   entry
   10        fbt            kernel                         init_post entry
   11        fbt            kernel                     name_to_dev_t entry
   12        fbt            kernel                     name_to_dev_t return
   13        fbt            kernel                   calibrate_delay entry
   14        fbt            kernel                   calibrate_delay return
   15        fbt            kernel                    dump_task_regs entry
   16        fbt            kernel                    dump_task_regs return
   17        fbt            kernel               select_idle_routine entry
...
/home/fox/src/dtrace@vmub32: dtrace -n fbt:kernel:generic_file_mmap:
dtrace: description 'fbt:kernel:generic_file_mmap:' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0   3222          generic_file_mmap:entry
  0   3223         generic_file_mmap:return
  0   3222          generic_file_mmap:entry
  0   3223         generic_file_mmap:return
  0   3222          generic_file_mmap:entry

Don't try and do:

$ dtrace -n fbt:::

as I now need to strip from the kernel symbol table the functions we rely on, which would cause re-entrancy problems (my system rebooted itself when I did that!).


Posted at 21:01:05 by Paul Fox | Permalink
  probes...and more probes Sunday, 01 March 2009  
My Ubuntu vmware session seems to have about 5000 probes available in fbt. I upgraded to the 2.6.28.5 kernel last night to try and figure out the crash on the real non-vmware machine (see prior blog entry).

My real Ubuntu has 16000+ probes available.

But the real machine seems to be unstable at more than 8000 probes. The error strikes if you ctrl-c dtrace - a GPF occurs somewhere during the close code.

On my vmware machine, I have modprobe'd all available drivers, and can get to nearly 14000 probes (this is a 2.3GHz dual core cpu vs 1.2GHz for the real machine).

Looks like something in locking or interrupt disabling may be causing the instability.

The good thing is it does work well, but just not reliably enough for a real machine where you care about crashing or locking up the driver.

More when I know what the deal is here.

/home/fox/src/dtrace@vmub32: dtrace -n fbt:::
dtrace: description 'fbt:::' matched 13752 probes
CPU     ID                    FUNCTION:NAME
  0  11858                   epcapoll:entry
  0  11859                  epcapoll:return
  0   8587               ia_led_timer:entry
  0  12474               ipmi_timeout:entry
  0  12475              ipmi_timeout:return
  0  11858                   epcapoll:entry
  ...


Posted at 12:59:35 by Paul Fox | Permalink
  idiot. me. Sunday, 01 March 2009  
dtrace was looking so good, yet it actually failed on a real cpu again.

Spent a bit of time tracing this down to my bad changes to the interrupt enable/disable code not working well with gcc.

This is fixed, and so dtrace should work much better now.

Still more to do to add probes, but at least the basics should work well again.


Posted at 00:31:05 by Paul Fox | Permalink