linux fork and the x86-64 issue | Monday, 30 March 2009 |
Implementing a wrapper around a normal piece of C code is easy - just create a new piece of code which has the same calling sequence, no side effects, and returns the output from the function.
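To make that concrete, here is a minimal sketch of the idea (the function names are illustrative, not from the dtrace source):

extern int real_function(int a, int b);   /* hypothetical function being wrapped */

int
wrapped_function(int a, int b)
{
        /* entry-side instrumentation would go here */
        int ret = real_function(a, b);
        /* return-side instrumentation would go here */
        return ret;
}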
On x86-64, fork() is "strange": a child is created as a clone of the parent (complete with the wrapping), but then goes off into the big wide world all on its own, without coming back to Daddy (the wrapper). So the half-clothed child ends up back in user space with a corrupted stack, and then core dumps.
The parent is fine - it's just these wayward kids which are a problem.
So, now that I understand it better, I am off to find a way to ensnare the child proc and see if I can get out of these forking issues.
forkin hell | Saturday, 28 March 2009 |
On x86-64, syscalls are executed via the SYSCALL instruction, which is a special optimisation on later x86 CPUs compared to INT 0x80, and the kernel glue does interesting things to implement the semantics of the syscall - especially fork/clone.
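For background, invoking a syscall directly from user space on x86-64 looks roughly like this (a minimal sketch; getpid is syscall number 39 on x86-64, and SYSCALL clobbers rcx and r11):

#include <stdio.h>

int
main(void)
{
        long ret;

        /* invoke getpid via the SYSCALL instruction rather than the
           legacy INT 0x80 path */
        __asm__ volatile("syscall"
            : "=a" (ret)
            : "a" (39L)          /* __NR_getpid on x86-64 */
            : "rcx", "r11", "memory");
        printf("pid = %ld\n", ret);
        return 0;
}

The kernel-side glue is what has to restore the registers and stack correctly on the way back out - that is the part the dtrace wrapping has to match.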
This is causing me problems since my assembler glue doesn't quite match the Linux code (despite poring over it in excruciating detail for the last week - obviously not enough).
I can get fork() to nearly work - but that is nowhere near sufficient. Only perfection is acceptable.
Interestingly, fbt calls on the fork code work nicely, e.g.
$ dtrace -n fbt::sys_clone:
will work. (I could cheat and emulate the syscall, but that's cheating on a bad scale.)
The assembler glue needs to work.
An interesting item in this arena is that the extra CPU registers in the 64-bit chip cause more work for the kernel, as they need to be saved, but this is, in general, offset by the power of 64-bit addressing and computation, so most people see it as a win. (I've benchmarked my key application - CRiSP - as 10-20% faster in 64-bit mode compared to the same binary compiled in 32-bit).
More, hopefully before the one-year birthday is up!
dtrace release 20090321 | Saturday, 21 March 2009 |
There's one nasty in this code - it makes use of pgd_offset_k() in the kernel, but the web implies this is an unsupported function, about to disappear, and so, depending on the way the kernel was built, it may cause problems (inability to load). It works fine on Ubuntu 8.10 with the stock kernels, but I haven't managed to uncover what the replacement idiom is.
success - sys_call_table in *Linux* can be modified | Saturday, 21 March 2009 |
But their goal can never work - once we are inside the kernel, the best they can do is make life hard. dtrace is almost a rootkit inside the kernel, but it's a friendly one - providing facilities to aid the people who support or own the target system. dtrace is a monitoring facility.
The Linux kernel functions - which are subject to change from one kernel to another - simply forbid a read-only page from being made read-writable if it's in the ".rodata" section.
Despite directly updating the page table entries to bypass this, I failed, continuously. I ended up reading this: http://www.intel.com/design/processor/applnots/317080.pdf to find the sentence which told me what I was doing wrong.
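For the curious, here is a minimal sketch of the general approach, assuming a 2.6.2x-era x86-64 kernel: walk the kernel page tables to the entry covering the target address, set _PAGE_RW, and flush the TLB so the stale read-only mapping is not reused. The helper name is made up and the real driver code may differ (in particular, 2M large-page mappings are handled at the pmd level, 4K pages at the pte level):

#include <linux/module.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/* Hypothetical helper: make the kernel page containing 'addr' writable. */
static int
make_page_writable(unsigned long addr)
{
        pgd_t *pgd = pgd_offset_k(addr);
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        if (pgd_none(*pgd))
                return -1;
        pud = pud_offset(pgd, addr);
        if (pud_none(*pud))
                return -1;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd))
                return -1;
        if (pmd_large(*pmd)) {
                /* 2M page: set the writable bit on the pmd itself */
                set_pmd(pmd, __pmd(pmd_val(*pmd) | _PAGE_RW));
        } else {
                pte = pte_offset_kernel(pmd, addr);
                set_pte(pte, pte_mkwrite(*pte));
        }
        __flush_tlb_all();      /* drop any stale read-only TLB entries */
        return 0;
}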
The experiments of the last two weeks have resulted in a bit of a mess in the dtrace driver, so now I can clean that up, and should be in business.
Anyone who really cares about how this is done can grab the code and find the relevant sections, and, even better, criticise/improve the code.
One thing that stands out from my journey through the kernel is that dtrace may not work for a paravirtualised kernel, or UML Linux, but that's relatively minor - if people want to run dtrace on these kernels, we can work out what needs to be fixed.
rodata and the sys_call_table in the Linux kernel | Monday, 16 March 2009 |
They are good and clever too - the Linux functions for modifying page table attributes preclude you from undoing this, because they check what you are trying to do and turn off the _PAGE_RW bit of the page table entry.
Again, I applaud this - it makes the challenge more interesting. There are a few places on the web and in newsgroups which talk about this and how to undo it, e.g. by rebuilding the kernel. (The magic code is in mark_rodata_ro() and static_protection() in pageattr.c.) Again, this shows a good plan, and one can see the evolution in the kernels.
Of course, that makes the 'game' more challenging, and is forcing me to learn more and understand more about why dtrace works/doesn't work. The earlier Ubuntu release must have predated these kernel changes, and now we just need to find a way around it.
Remember, we are running in kernel space, so anything goes (it's akin to running in MSDOS real mode - we can do what we like). We are not malicious or virus writers; we simply want a tool which plays nicely and in a black-box fashion, so that the kernel can be probed, even in the face of systemtap or kprobes.
I have tried numerous experiments - all failing, but that's because I am trying the hard way - no debugger. The issue is that in modifying the page table entry to allow read-writability, the kernel hangs. It looks like a kernel "bug" - a page fault on a read-only page causes an infinite trap loop - we keep returning to the faulting instruction.
More when I get a solution. For now, dtrace either works for you, or it doesn't (because the kernel is a later one).
dtrace progress for linux 64-bit .. or lack thereof | Sunday, 15 March 2009 |
The 64-bit version works fine on Ubuntu 7, but on Ubuntu 8 (and Fedora Core 8), with a variety of kernels, it hangs.
It's not clear what or why - whether it's kernel-specific, GCC-specific or something else. I've taken to doing some radical surgery just to get something to print out and point at the problem area. I've redone the assembler code to tidy up potential compiler dependencies.
The same code still works on Ubuntu 7, but on Ubuntu 8 it hangs. I've been poring through the Linux kernel code, and have some ideas of what could be doing this (e.g. the 2M page size for protected kernel data may be a part of it).
Another strange thing on the 2.6.27-7-generic kernel is that even if I get it to not hang the system, it runs out of memory with 512MB of space. (I've installed Ubuntu 8 inside VMware, so at least I am seeing the same inside as outside of VMware, and don't have to keep crashing my main machine).
The strange thing is that it hangs on dtrace_casptr - which translates into a cmpxchg() call (no need for inline assembler, since the Linux headers provide this at the C level).
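For reference, a rough illustration of what dtrace_casptr boils down to (a sketch, not the actual driver source), using the kernel's cmpxchg() macro:

#include <asm/system.h>         /* pulls in cmpxchg() on 2.6.2x kernels */

/* Compare-and-swap a pointer: if *target == cmp, store new there.
 * Returns the previous value of *target so the caller can tell
 * whether the swap took place. */
void *
dtrace_casptr(void *target, void *cmp, void *new)
{
        return (void *)cmpxchg((void **)target, cmp, new);
}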
I will keep digging, hopefully something will point to the bad coding issue here.
64bit troubles | Saturday, 14 March 2009 |
What is annoying is that most of the development of dtrace occurred in 64-bit mode, thinking that porting to 32-bit would be simple, and I was oblivious to many issues on 32-bit. So, over the last few months, I have been perfecting 32-bit dtrace.
Now I go back to 64-bit dtrace, and it works inside VMware but not outside.
Looks like something strange happens in the kernel - the kernel supports large pages -- 2M and 1G page sizes. The system call table is sitting inside a 2M page group (x86-64 uses a four-level page table, dropping to three levels of translation where 2M pages are used).
So, the hacks I did for 32-bit (not strictly hacks) may or may not work for 64-bit Linux - because I was taking a shortcut, and it turns out bits of the kernel are protected by a 2M page which is marked read-only.
Now that I know what and where the issue is, I can go and see if I can fix it.
Annoying, because I have had to repeatedly crash my master machine to test this out, but am hopefully close.
More when I have a fix, but, for now, don't use the 64-bit dtrace. The 32-bit dtrace seems to be fine though.
64bit dtrace -- still not quite there. | Thursday, 12 March 2009 |
I am trying out the latest release of VirtualBox to see if I am better placed to virtualise... let's see what happens.
64-bit linux dtrace working (I hope) | Wednesday, 11 March 2009 |
i give up..fedora..goodbye | Tuesday, 10 March 2009 |
I absolutely loathe and detest YUM - the package updater which is broken beyond belief. So I am downloading Ubuntu 8.10/64 to give me a decent platform.
Nothing has installed properly from Fedora. Even downloading the FC10 release was going to take hours, but Ubuntu is a single CD which can bootstrap me, and then I can start using apt-get -- which is what I have been using on my two laptops and the vmware sessions to help debug dtrace.
I hate to give up on Fedora - diversity is what counts, to check out configs and other software development tools - but it hurts that my main machine no longer even has GCC on it. I removed the package hoping to be able to re-bootstrap myself, but yum refuses to play. Did I say I hate it?
Pirut - and python. <laughs /> No thanks
I want my software to have as few dependencies as possible, not take 10+ minutes to tell me that everything is broken.
I need to get my gcc working again so I can prove that dtrace works for 64-bit, but I fear I may be in a walled garden where "it just works on Ubuntu, and nothing else". I know what I did wrong in the __asm__ sections of dtrace - I just need a 64-bit machine and compiler to validate my changes.
So, tonight's episode is loading Ubuntu on top of Fedora and seeing how much I learn to cry :-)
For now, don't touch 64-bit dtrace until I next write up that all is safe. You will very likely panic your system. (32-bit should be good.)
64-bit oops.... | Monday, 09 March 2009 |
I have fixed (some of) them in my release, but am having difficulty upgrading my Redhat FC8 real machine with vmware to validate my testing.
My yum/pirut package updater is broken, and building gcc from source requires things which won't build properly, despite upgrading my binutils.
Oh well, need to upgrade stuff as this is an old system now and I hate my main line system not having the tools to get the job done. (If need be I'll revamp with an FC10 at the weekend).
thats better...but strange | Sunday, 08 March 2009 |
$ dtrace -n fbt:::
dtrace: description 'fbt:::' matched 25753 probes
...
This is on a real CPU - seems to work now.
There have been lots of issues and discoveries in getting here over the last day or so.
Firstly, VMware seems to fail me: fbt::: would work fine in my VMware session, but fail on the real CPU. Some of this may be due to the kernel or compiler (VMware running 2.6.28.5, the real CPU on 2.6.28.2), but I don't think it's kernel related - I don't believe many differences have happened in the areas I am looking at.
It could be compiler and/or kernel options: GCC and Linux attempt to inline lots of stuff.
Here is what I found: despite initially thinking I should have 42,000+ probes, it's down to 25,753, as above. Some of this is because many functions are in non-".text" sections. The kernel makes liberal use of attributes to put functions in the following sections:
.init
.init.text
.fini.text
and others. So I added code to restrict probing to functions in the ".text" section. This means we may miss the chance to probe some module exit code, but this isn't really much of a loss.
Additionally, because .init sections can be jettisoned after module loading, I found numerous cases where two or more functions sat at the same address. This is bad: if FBT latches on to this, we destroy the hash table for FBT and cannot patch/unpatch when dtrace is executed. I added code to detect/disallow two or more probes at the same address. (This isn't an issue for the vmlinux kernel, but is for modules.)
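As a hypothetical illustration of that check (not the actual fbt provider code), a small hash table keyed by address lets us skip a symbol when another function has already claimed the same address:

#include <linux/slab.h>

#define FBT_HASH_SIZE 4096

struct fbt_addr {
        unsigned long addr;
        struct fbt_addr *next;
};

static struct fbt_addr *fbt_hash[FBT_HASH_SIZE];

/* Returns 1 if the address was already claimed, 0 if it is newly claimed. */
static int
fbt_addr_claim(unsigned long addr)
{
        unsigned int h = (addr >> 4) % FBT_HASH_SIZE;
        struct fbt_addr *p;

        for (p = fbt_hash[h]; p != NULL; p = p->next)
                if (p->addr == addr)
                        return 1;
        p = kmalloc(sizeof(*p), GFP_KERNEL);
        if (p == NULL)
                return 1;       /* out of memory: err on the side of skipping */
        p->addr = addr;
        p->next = fbt_hash[h];
        fbt_hash[h] = p;
        return 0;
}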
Another sanity check is for functions on the notifier chains used by dtrace itself. A probe on one of those would cause infinite trapping/recursion/reboot.
Finally, the __mutex_ functions are put in the toxic.c file, since mutex_lock/mutex_unlock call them and dtrace itself needs those to work. We could inline our own implementation of mutex_lock/mutex_unlock, but we can live without this for now.
I had to reboot my notebook numerous times to get here, and this now seems to be stable. I will run it more and see what happens. There's still a chance that slow paths or exceptional paths through some critical functions could cause a kernel crash, so be careful.
TODO items: add more instruction emulations, revisit 64-bit support, and fix the process monitoring code (dtrace -c)
Ooops | Saturday, 07 March 2009 |
More when I think this is resolved.
26,000+ probes | Saturday, 07 March 2009 |
The goal has been to get:
$ dtrace -n fbt::*:
to work without crashing the kernel. Of the many functions in the kernel, some cannot be probed (not too many), since dtrace relies on them. Some, like do_int3(), are part of the assembler glue that handles a breakpoint/trace trap.
Others were causing problems because my dtrace, unlike Solaris', calls kmem_alloc() from probe context. I have put in a temporary workaround for this by doing a static allocation in par_setup_thread(). The problem is that in Solaris there is a callback for process creation, and/or the proc structure contains extra fields for dtrace. In Linux dtrace, we don't touch the kernel code, so we need to shadow each process created (or probed), and this requires a little extra work.
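Here is a rough sketch of the static-allocation idea, with made-up names (the real par_setup_thread() and shadow structures differ): a fixed pool of per-process shadow entries is handed out without calling an allocator in probe context. Locking is omitted for brevity.

#define SHADOW_MAX 1024         /* arbitrary pool size for illustration */

typedef struct shadow_proc {
        void *task;             /* the task_struct being shadowed */
        /* ... per-process dtrace state would live here ... */
} shadow_proc_t;

static shadow_proc_t shadow_pool[SHADOW_MAX];
static int shadow_used;

/* Find or create the shadow entry for a task, without allocating memory. */
static shadow_proc_t *
shadow_lookup(void *task)
{
        int i;

        for (i = 0; i < shadow_used; i++)
                if (shadow_pool[i].task == task)
                        return &shadow_pool[i];
        if (shadow_used >= SHADOW_MAX)
                return NULL;    /* pool exhausted - caller must cope */
        shadow_pool[shadow_used].task = task;
        return &shadow_pool[shadow_used++];
}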
This works now. I haven't heavily torture-tested it with the final pieces in place, but now I can.
There are still some instruction emulations not implemented yet, but these form a very small part of the kernel; a few thousand functions are unprobeable until I implement those instructions.
I'll look to adding more emulations, and then switch back to 64-bit dtrace as that too is missing quite a few emulations.
Again, I implore the community to give this a try. I know of at least one documented user successfully seeing dtrace actually work, and I am hoping that from here on in, it is reliable/stable. (Famous last words... I still consider this an alpha or pre-alpha release.)
timer probing issue fixed | Tuesday, 03 March 2009 |
$ dtrace -n fbt::nr_active:
dtrace: description 'fbt::nr_active:' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0   1005                  nr_active:entry
  0   1006                 nr_active:return
...
This was previously locking the machine up, but is now fixed. In dtrace_probe(), we call dtrace_gethrtime(), which eventually calls the code to read the system clock. This is protected by a lock to avoid fluctuation as we read the multiple words of the clock.
We inline some of that code to bypass the kernel lock (but we will suffer very occasional clock drift until this is fixed).
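The shape of the problem, sketched with a hypothetical clock variable (the kernel's real timekeeping code and the driver's inlined copy differ): the locked path retries under a seqlock while the timer code updates the words; the probe-context bypass just snapshots the words and tolerates an occasional torn read, which shows up as the minor clock drift mentioned above.

#include <linux/time.h>
#include <linux/seqlock.h>

/* Hypothetical multi-word clock value published by the timer code. */
static DEFINE_SEQLOCK(clock_lock);
static struct timespec clock_now;

/* Normal, locked read: retry while the writer is mid-update. */
static unsigned long long
gethrtime_locked(void)
{
        struct timespec ts;
        unsigned seq;

        do {
                seq = read_seqbegin(&clock_lock);
                ts = clock_now;
        } while (read_seqretry(&clock_lock, seq));
        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Probe-context bypass: no lock, so a torn read is possible. */
static unsigned long long
gethrtime_unlocked(void)
{
        struct timespec ts = clock_now; /* snapshot without the lock */

        return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}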
Now I can proceed getting to the point where:
$ dtrace -n fbt:::
works.
Stay tuned.
one down.... | Tuesday, 03 March 2009 |
This still leaves one major bad-ism -- some FBT probes crash the kernel - hard. I thought it was due to the derivatives of timer interrupts, but that's not true; I can place probes on some timer interrupt code, and this works, but others do not. E.g. nr_active() is a crashing probe, but the callers of this function do not cause a problem.
Let's see what a bit of thinking brings. (Some bugs can only be worked out by thinking about the code paths the kernel, CPU and dtrace take; sometimes no amount of printing will help, especially when we go blind in some areas and just hard-lock the kernel.)
problems..problems | Monday, 02 March 2009 |
Why?
The most likely reason is the short-cut code necessary to handle the INT3 traps. This can be seen by planting a probe on nr_active(), which is called from the timer interrupt to recompute the load average. Hopefully I can work out how to exit the probe interrupt handler properly, and this will avoid the need to blacklist lots of kernel functions.
I am also still seeing differences between a real CPU and VMware. On a real CPU, if we place lots of probes, we stand a chance of getting a timer interrupt whilst laying out the probes, which causes a kernel GPF. It's possible this is related to the above - i.e. not doing the correct job in the first place - so hopefully one will give insight into the other.
Syscalls work, and many, many fbt kernel/module probes work - even under heavy load - but some don't, so don't try to lay out everything just yet...
dtrace - the next phase | Sunday, 01 March 2009 |
Here is what todays release can do:
/home/fox/src/dtrace@vmub32: load
Syncing...
Loading: build/driver/dtracedrv.ko
Preparing symbols...
Probes available: 23075
Note that figure - we just moved from 5000+ probes in the modules, to now include everything in the kernel.
Here's an example:
/home/fox/src/dtrace@vmub32: more /tmp/probes.current
   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
    4        fbt            kernel                                   entry
    5        fbt            kernel                                   entry
    6        fbt            kernel                                   entry
    7        fbt            kernel                                   entry
    8        fbt            kernel                                   entry
    9        fbt            kernel                                   entry
   10        fbt            kernel                         init_post entry
   11        fbt            kernel                     name_to_dev_t entry
   12        fbt            kernel                     name_to_dev_t return
   13        fbt            kernel                   calibrate_delay entry
   14        fbt            kernel                   calibrate_delay return
   15        fbt            kernel                    dump_task_regs entry
   16        fbt            kernel                    dump_task_regs return
   17        fbt            kernel               select_idle_routine entry
...
/home/fox/src/dtrace@vmub32: dtrace -n fbt:kernel:generic_file_mmap:
dtrace: description 'fbt:kernel:generic_file_mmap:' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0   3222            generic_file_mmap:entry
  0   3223           generic_file_mmap:return
  0   3222            generic_file_mmap:entry
  0   3223           generic_file_mmap:return
  0   3222            generic_file_mmap:entry
Don't try to do:
$ dtrace -n fbt:::
as I now need to strip from the kernel symbol table the functions we rely on, which would cause re-entrancy problems (my system rebooted itself when I did that!).
probes...and more probes | Sunday, 01 March 2009 |
My real Ubuntu has 16000+ probes available.
But the real machine seems to be unstable at more than 8000+ probes. The error strikes if you ctrl-c dtrace - a GPF occurs somewhere during the close code.
On my vmware machine, I have modprobe'd all available drivers, and can get to nearly 14000 probes (this is a 2.3GHz dual core cpu vs 1.2GHz for the real machine).
Looks like something in locking or interrupt disabling may be causing the instability.
The good thing is it does work well, but just not reliably enough for a real machine where you care about crashing or locking up the driver.
More when I know what the deal is here.
/home/fox/src/dtrace@vmub32: dtrace -n fbt:::
dtrace: description 'fbt:::' matched 13752 probes
CPU     ID                    FUNCTION:NAME
  0  11858                   epcapoll:entry
  0  11859                  epcapoll:return
  0   8587               ia_led_timer:entry
  0  12474               ipmi_timeout:entry
  0  12475              ipmi_timeout:return
  0  11858                   epcapoll:entry
...
idiot. me. | Sunday, 01 March 2009 |
Spent a bit of time tracing this down to my bad changes to the interrupt enable/disable code, which were not working well with gcc.
This is fixed, and so dtrace should work much better now.
Still more to do to add probes, but at least the basics should work well again.