toxic ranges Wednesday, 29 April 2009  
A toxic range is an area we cannot place a probe, such as the internals of dtrace itself, since this can cause a recursive probe issue and crash the kernel.

With the switch to single-stepping in the kernel, this is a problem, because now we have to be more careful about reachability. For instance, calling printk() from inside the trap handler in dtrace means we must mark as toxic every function that printk() calls.

This is not nice. It potentially wipes out a lot of useful low level probes (spin locks, mutexes, interrupt disablers, etc).

With instruction emulation, this wasn't so bad - we didn't rely on recursive trap handlers and it worked.

kprobes works by marking unprobable functions with a GCC function attribute (the __kprobes macro), and storing the info in a special ELF section of the kernel. I was going to rely on this to help me out, but (a) that's cheating, and (b) it means I may inherit more toxicity than we actually need.

"printk" as an example is only a problem if trap debugging is left turned on.

I have solved this problem in a novel way. I hereby condemn my solution as GPL/CDDL compatible.

In the event we have a nested single-step trap, we auto remove the nested trap probe point. The kernel remains fully functional - we just disable those problem areas for probing. (Something is written to the kernel logs, e.g. dmesg and /var/log/messages to help analyse this). So, initially the kernel gets into a tailspin and then corrects itself - with no hand tuning or per-kernel worries.
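A rough user-space sketch of that idea, not the driver code itself. The names here (`struct probe`, `trap_handler`, `step_done`, `in_single_step`) are invented for illustration, and a real implementation would keep the in-progress state per CPU and restore the original instruction bytes when a probe is disabled.

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe record: one per patched instruction. */
struct probe {
    const char *func;   /* name of the probed function */
    bool enabled;       /* false once the probe has been auto-removed */
};

/* Non-NULL while we are single-stepping on behalf of a probe. */
static struct probe *in_single_step;

/*
 * Called when a probe's breakpoint fires. Returns true if the probe
 * fired normally, false if it was nested (and is now disabled) or
 * had already been disabled.
 */
bool trap_handler(struct probe *p)
{
    if (!p->enabled)
        return false;
    if (in_single_step != NULL) {
        /* Nested trap: this function is reachable from our own
         * handler, so treat it as toxic and stop probing it. */
        p->enabled = false;
        fprintf(stderr, "auto-disabling probe on %s\n", p->func);
        return false;
    }
    in_single_step = p;
    /* ... record the probe, then single-step the original instruction ... */
    return true;
}

/* Called when the single-step trap for the current probe completes. */
void step_done(void)
{
    in_single_step = NULL;
}
```

The first nested hit costs one log line; after that the offending probe stays off, which is the "tailspin then corrects itself" behaviour described above.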

$ dtrace -n fbt:::
now works real well - I've tested on 2.6.29/64bit. Next to test AS4 (the golden nirvana), and then to 32bit.

Assuming all goes well, I will put out a release.

At a later date, I hope to blog about potential new providers we can write - it will be nice to get out of the 'get it working mode' into 'adding value mode'.

Posted at 22:54:27 by Paul Fox | Permalink
  I think I found you Mr. RBP Monday, 27 April 2009  
I wrote in the previous few blogs about issues on AS4 and having trouble finding the RBP register. I was right - sort of. The act of pushing a register involves touching parts of the stack which haven't been allocated yet. I knew this because I solved the problem for 32-bit kernels by jumping out of the kernel direct to user space and bypassing the last part of the kernel entry/exit points. Once we are in a trap, the word at %ESP isn't available to us because we just took an interrupt and our return address (or EFLAGS) is sitting there.

I don't know why this ever worked on 64-bit kernels - or why Solaris works - it shouldn't. (Well, normally there's a trap-type field which we overwrite and we get away with it; not for some traps, alas).

I converted dtrace to use single-stepping (TF mask in the CPU flags register), and this works a treat. This means the 32 and 64 bit cpu emulators can be jettisoned and the horror assembly in cpu_32bit.c can evaporate - making us a much cleaner 'citizen'.
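For reference, a minimal sketch of what arming a single step amounts to. The trap flag (TF) is bit 8 of the x86 flags register; the struct and function names below are made up for illustration and are not the ones in the driver. A real handler would also have to remember whether TF was already set by someone else and deal with the interrupt flag.

```c
#include <stdint.h>

#define X86_EFLAGS_TF 0x100UL   /* trap flag: bit 8 of EFLAGS/RFLAGS */

/* Hypothetical cut-down register frame; the real struct pt_regs has more. */
struct regs_sketch {
    uint64_t ip;
    uint64_t flags;
};

/* Arrange for exactly one instruction at `insn` to run before a debug trap. */
void arm_single_step(struct regs_sketch *r, uint64_t insn)
{
    r->ip = insn;
    r->flags |= X86_EFLAGS_TF;
}

/* In the debug-trap handler: stop stepping and resume at `next`. */
void disarm_single_step(struct regs_sketch *r, uint64_t next)
{
    r->flags &= ~X86_EFLAGS_TF;
    r->ip = next;
}
```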

I am just finalising the quality of experience for the various kernels, so may be a little while doing this - may have a new release if it hasn't regressed on Ubuntu kernels tonight, but otherwise may need to delay until later in the week.

Posted at 21:06:55 by Paul Fox | Permalink
  Where in the world is RSP ? Sunday, 26 April 2009  
Following on from my previous blog - where is the RSP register?!

With FBT, we are in the kernel, we hit a breakpoint (INT3) trap. So that's a trap-within-a-trap.

In 2.6.9, the original syscall that got us into the kernel didn't save a few registers (RBP), but does save RSP in the task structure (%gs:0x18). But on a nested trap, we cannot overwrite the %gs:0x18 pointer.

So, when the INT3 callback is called we don't have a linkage to RSP. My previous blog wondered about where RBP is hanging out. But I think the real problem is: where is the RSP register from the interrupted stack frame?

It's intriguing what a mess 2.6.9 got into; later kernels resolved it by ensuring all callbacks had a fully populated pt_regs structure.

Posted at 22:19:30 by Paul Fox | Permalink
  The Story of PUSH %RBP Sunday, 26 April 2009  
Over the last two weeks, I have been trying to get FBT to work with 2.6.9 (AS4) kernels. It's interesting doing so - what appears to work in later kernels, doesn't in 2.6.9.

I have some experimental code which works - but not reliably.

One issue is the PUSH %RBP instruction. If I do this:

$ dtrace -n fbt::sys_chdir:entry
the kernel will crash. (fbt::sys_chdir:return doesn't crash - in fact fbt:::return works fine for all functions!)

I've been scratching my head and trying things out, and I think I understand the story now.

When a system call is executed, the user mode app switches to kernel mode and *nearly* all registers are saved. Not %RBP. RBP is usually the frame pointer - saved as the first instruction in every routine.

In the 2.6.9 kernel, a system call (SYSCALL instruction) doesn't save all registers (but does in later kernels, which is why I didn't see this in the 2.6.19+ kernels).

So, from the entry point at system_call:, we save registers (except RBP and a few others) and dispatch to the first instruction of the syscall handler, which kindly does:

PUSH %RBP

for us, which is great.

Problem is, we modified that PUSH instruction into an INT3 (bkpt trap). So now, the distance from the original userland syscall is further - we now take an INT3 trap (which does save all the registers), then dive into C code to handle the notifier chain for INT3 handlers (including us, kprobes, and whoever else cares).

But now, our handler is being passed a struct pt_regs which is effectively two hops away from userland.

I don't think this matters, except, remember the original syscall didn't save %RBP? Well, that means it won't bother restoring it either. Which means that somewhere between SYSCALL -> INT3 handler, we lose/corrupt the RBP register. When dtrace cpu_64bit.c gets to emulate the PUSH RBP instruction, we are using a bogus value.
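To make the failure concrete, here is a tiny sketch of what emulating the instruction involves. It is an illustration only, with a made-up `struct frame_sketch` and a stack modelled as an array of 64-bit slots; it is not the code in cpu_64bit.c. The point is that the value written to the stack comes straight from the saved register frame, so if the saved RBP is stale, the emulation faithfully pushes garbage.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical cut-down trap frame. */
struct frame_sketch {
    uint64_t ip;   /* after an INT3 trap this already points past the 1-byte opcode */
    size_t   sp;   /* index of the top-of-stack slot in stack[] */
    uint64_t bp;   /* saved frame pointer: must be the interrupted code's RBP */
};

/* Emulate "push %rbp" (a one-byte instruction, opcode 0x55). */
void emulate_push_rbp(struct frame_sketch *f, uint64_t *stack)
{
    f->sp -= 1;             /* stack grows down: the real thing does rsp -= 8 */
    stack[f->sp] = f->bp;   /* store whatever the frame claims RBP was */
    /* ip is left alone: INT3 and push %rbp are both one byte long,
     * so the saved ip already addresses the next instruction. */
}
```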

Now, this is a problem, because we don't know how we got to where we are -- using FBT on a syscall function is nice and easy and really shows the problem. Doing so from an inner C function in the kernel won't necessarily show the problem.

So, the illusion of not perturbing the C/assembler virtual machine is shattered.

What to do?

Well, now I understand it, I can hopefully fix it. kprobes doesn't worry about this (seemingly) since it uses single-stepping in the kernel to avoid this, but it too must be careful not to affect what the kernel thinks is going on. Because it steps the PUSH RBP, it will do it in the context of the caller (where RBP will be its original value).

This may mean I too need to migrate to single step rather than instruction emulation. (If I do this, the code will actually be shorter/simpler than with emulation, and I wouldn't need to worry about all the instructions we have yet to implement - which means more probes against more functions).

It's interesting how much work for a legacy operating system there is, but more important is the education that we are doing everything for a reason.

More when I have news...

Posted at 20:29:16 by Paul Fox | Permalink
  to single step or not to single step... Saturday, 25 April 2009  
For AS4 with the 2.6.9 kernel, fbt works for :::return probes, but not for :::entry. Very strange as there is little difference here except for the specific instruction we hit.

FBT works in the Sun code by patching the target entry and exit points of a function with breakpoint traps. When the breakpoint is hit, they emulate the overwritten instruction.

In Linux/kprobes, they single step the overwritten instruction, just like a normal debugger. When I originally looked at kprobes I thought it was too complex and would be forever trying to debug strange scenarios. (FYI, I have written Z80, 80186 and 80386 debuggers before, so I shouldn't be averse to doing this). Anyway, I continued with the Sun approach.

On Linux, things are much more complex than Sun, because the instruction sequences emitted by GCC are much more varied, and the handful of use cases Sun have in dtrace were no good. (Look at fbt_linux.c, cpu_32bit.c and cpu_64bit.c).

With a handful of lines of change to cpu_64bit.c, I am experimenting with CPU single step tracing to see if this is easier and smaller.

One potential issue here is deciding what to single step. If we single step the original patched location, we need to remove the INT3 breakpoint trap, which can lead (on an SMP system) to races where other CPUs may skip the trap. I think kprobes works by creating an instruction buffer and single stepping that. Maybe that's a better approach.
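A sketch of the instruction-buffer idea, with invented names (`struct patch_sketch`, `patch_install`, `patch_remove`) and an ordinary byte array standing in for kernel text. The original bytes are copied to a private buffer before the first byte is overwritten with INT3 (0xCC), so the breakpoint can stay in place while the copy is stepped. This glosses over real problems, in particular that instructions with relative operands need fixing up when run from a different address.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define INT3_OPCODE 0xCC
#define MAX_INSN    15      /* longest legal x86 instruction, in bytes */

struct patch_sketch {
    uint8_t *addr;                /* patched location */
    uint8_t  step_buf[MAX_INSN];  /* private copy of the original instruction */
    size_t   len;                 /* length of that instruction */
};

/* Save the instruction at addr, then plant a breakpoint on its first byte. */
int patch_install(struct patch_sketch *p, uint8_t *addr, size_t insn_len)
{
    if (insn_len == 0 || insn_len > MAX_INSN)
        return -1;
    p->addr = addr;
    p->len = insn_len;
    memcpy(p->step_buf, addr, insn_len);   /* copy before overwriting */
    addr[0] = INT3_OPCODE;
    return 0;
}

/* Put the original first byte back (probe disabled). */
void patch_remove(struct patch_sketch *p)
{
    p->addr[0] = p->step_buf[0];
}
```

With this layout the patched location never has to be restored while a probe is live, which is what avoids the SMP race mentioned above.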

At the moment, on AS4, it seems like the :::entry is failing for some non-dtrace reason (i.e. my understanding of what the stack looks like is probably lacking). I will continue to dig.

BTW, for those following the blog, as far as I am concerned, both 32 and 64 bit kernels are supported and working - it's purely the legacy 2.6.9 kernel (64-bit) which I am getting to work at present.

There's still functionality lacking, but the key ones -- syscall::: and fbt::: seem to work.

BTW#2: scripts/ is a new script to provide a simple getting started way to use dtrace, so you don't have to cobble your own D scripts together. I hope to evolve it to embed all my (and your) knowledge into it so that people don't have the learning pains I have had with dtrace.

More .. when there's something to report.

Posted at 15:39:17 by Paul Fox | Permalink
  update Tuesday, 21 April 2009  
Whilst my connection is down, and before getting dirty with FBT, made some minor changes to the blogger software - mainly to fix the archives/ links which were double-barrelled, and to ensure the random pictures are on each page.

In case anyone is interested (I know you are not !), the pictures are random bad images from various walking trips. I am not a photographer and couldn't take a good photo if my life depended on it, but pictures which look "pretty" can add a little bit of life to the web site.

Posted at 17:23:28 by Paul Fox | Permalink
  2.6.9 syscalls working now Tuesday, 21 April 2009  
As I write this, my internet connection is down. Hopefully back in a short while. Happens very occasionally with VirginMedia (previously NTL), and usually when the weather changes (probably affecting the cabling).

Spent the last few days getting the 2.6.9 syscalls working, since the assembler code is subtly different in later kernels, and the way I had done this in systrace.c was a little too delicate. The current way is better - using pattern matching to find the code and avoid duplicating it unnecessarily, along with a little error checking to avoid falling over on the wrong kernels. (I haven't validated 32-bit 2.6.9 and won't bother unless there's a call to support that).
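As an illustration of the general technique (the actual patterns used in systrace.c are not shown here, so everything below is an assumption), a byte-pattern search with wildcards looks something like this. `ANY` and `find_pattern` are invented names. Returning -1 when nothing matches is what allows the caller to refuse to run on an unrecognised kernel rather than patch the wrong place.

```c
#include <stdint.h>
#include <stddef.h>

#define ANY 0x100   /* wildcard: matches any byte (outside the 0..0xFF range) */

/* Return the offset of the first match of pat[] in code[], or -1 if none. */
long find_pattern(const uint8_t *code, size_t code_len,
                  const uint16_t *pat, size_t pat_len)
{
    if (pat_len == 0 || pat_len > code_len)
        return -1;
    for (size_t i = 0; i + pat_len <= code_len; i++) {
        size_t j;
        for (j = 0; j < pat_len; j++) {
            if (pat[j] != ANY && pat[j] != code[i + j])
                break;          /* mismatch at this offset */
        }
        if (j == pat_len)
            return (long)i;     /* every pattern byte matched */
    }
    return -1;
}
```

For example, the pattern `{ 0xe8, ANY, ANY, ANY, ANY, 0xc3 }` would match a near CALL with any 32-bit displacement followed immediately by RET.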

I've verified the 64 + 32 bit builds work across the many kernels I have (I don't have all ready at hand, and it's a hog to build every 2.6 kernel if no one is really using those kernels anymore).

If the current code won't work, I may have to go further and generate code dynamically at runtime, since the glue of assembler to C that the kernel uses makes it difficult to ensure that things will work under all kernels, compilers and flag optimisations.

Next up is to validate the issues with FBT on 2.6.9 since I know that can crash the kernel.

Annoyingly, on the older kernel, it can take 1 minute of kernel cpu time to load the dtrace driver (or rather, for the initial pass at mapping kernel addresses to traceable functions in fbt_linux.c). It takes less than 2s on the later kernels. I assume the later kernels optimised the symtab handling functions.

Also, very annoyingly, under 2.6.9, the kernel clock is almost stopped. A google reference implicates issues with host clock and guest clocks using differing mechanisms or HZ values, which is fair enough. I may need to play with the grub boot options to fix this - it's more an annoyance than anything. (I wonder if the extra cpu time is clock related; can't think why, but then, one never knows when it comes to computers).

Posted at 16:20:08 by Paul Fox | Permalink
  execve() syscall for 2.6.9 kernel Sunday, 19 April 2009  
I have a version of dtrace which works for execve() on 2.6.9 (64-bit) kernel, but it's kinda ugly, and it breaks the code in systrace.c. I've put up a private build (dtrace-tmp.tar.bz2) which people should ignore. These private builds are works-in-progress for my own benefit, and should await a proper dated release.

Why is it ugly? Well, some syscalls, like execve(), want a copy of the "struct pt_regs" on the stack as an arg (not a pointer, but the actual invoking struct). This is different from the later kernels where pt_regs *is* a pointer.

This shouldn't be a problem, but the C language (even with asm()) makes it difficult to get the stack in the right place, reasonably portably.

The big area of difficulty is knowing what happened from 2.6.9 to 2.6.27 or so - at some point the calling sequence (and syscall assembler wrapper) changed, and I can only really test/validate with the kernels to hand.

I'll look at my code more to determine how to normalise/factorise it so that I don't have to have lots of code for each kernel release.

(I haven't looked at 32-bit 2.6.9 kernels yet).

Strange how some things are cleaner in 32-bit kernels, and others cleaner in 64-bit ones.

Posted at 22:25:35 by Paul Fox | Permalink
  dtrace progress for 2.6.9 (AS4 kernel) Thursday, 16 April 2009  
Although AS4 is an old release of the system, it's instructive to build for this platform. This was always the driving force for dtrace for linux.

Taking the more-or-less working dtrace for 2.6.27 Ubuntu and trying it on this old kernel led to a lot of head scratching, kernel hangs, and even exposed horrible bugs in VMware Server (1.0.8).

First, VMware's bugs: occasionally, when reverting a snapshot, I would find one or more of my virtual disk slices as owned by 'root', and not me. Bizarre, but I have to chown them to allow my reverts to work.

Also, occasionally the kernel would power off the VM and even give rise to situations where I cannot revert a snapshot.

Worse and strange. Having done a revert, I would occasionally find my ssh login session, complete with CRiSP edit session "unreverted", i.e. I could carry on editing despite having reverted a snapshot. This is really horrible and exposes potential issues with VMware.

I don't mind - VMware has boosted my productivity enormously and avoided many long-winded reboots, and I still love it.

Back to dtrace on 2.6.9: 2.6.9 is a strange world. Some kernel calls are missing, and the kernel is more delicate when the API contracts are broken, leading to panics and other strangeness. This is good - it's helping to refine the source code, helping me track some memleaks and device leaks when loading/unloading the driver, and generally giving the code a good cook-in (or kick-in).

I have some issues to resolve, e.g. FBT not quite there (hope to fix that this evening), and the odd syscalls (clone/fork/execve/etc) have different calling arrangements compared to 2.6.2*, but then, that's my fault because of the way I had to do this, but at least I know what's involved. Plus timers need to be made to work (no hrtimers in 2.6.9).

So, for those of you who have tried/failed on 2.6.10+ kernels, these issues may explain the peculiarities.

I'm not tracking the major technical changes from one release to another - only my memory serves to help remind me that things have changed, but there are too many kernels to keep track of, and it is easier to work with extremes - very old and very new kernels - to avoid bad coding or lack of portability.

More later.

Posted at 21:00:01 by Paul Fox | Permalink
  fbt now fixed Monday, 13 April 2009  
$ dtrace -n fbt:::

now works, after guaranteeing I am at the head of the INT3 notifier chain. The kernel API won't let us do this, but hand-manipulating the list means we are at the front. (New 20090413 release for download available).

Just don't turn on driver printk() tracing if you are looking at fbt as you could hit the same issue. May need to provide a mechanism to semi-toxic the functions we rely on to debug stuff in the future.

Made a submission to for dtrace, to see if we can get more interest and pick up on the project.

Posted at 11:03:51 by Paul Fox | Permalink
  FBT and the double fault Monday, 13 April 2009  
I wrote recently about some issues if you enable all fbt probes at once:
$ dtrace -n fbt:::

This appears to work on 32-bit but on 64-bit we hit an issue - most likely due to reentrancy. The notifier chain we sit on for the INT3 traps is also used by kprobes (and some other kernel code). This means when a trap occurs, we all get called, with a chance to handle it if it is ours, or pass it on if it is not.

The problem here is "who gets called first?". kprobes wants to go first, but that's a problem since if we are tracing a function kprobes uses, then we get infinite recursion.

Actually, we don't get infinite recursion, because there is a lock in the notifier chain code, and it blocks on itself: result == hung kernel.

Ideally, we should go first, and I am experimenting with that approach. We shouldn't be calling anything from probe context which in turn is being probed, else we have the reentrancy issue.
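The ordering question can be sketched with a toy chain in user space. This mirrors the general shape of a priority-ordered callback list but is not the kernel's notifier code; `struct handler_sketch`, `chain_register`, `chain_call`, the `SK_*` return codes and the two example handlers are all invented for illustration. Handlers with a higher priority value are placed nearer the head, and the walk stops at the first handler that claims the trap.

```c
#include <stddef.h>

#define SK_PASS    0   /* not ours: keep walking the chain */
#define SK_HANDLED 1   /* ours: stop here */

struct handler_sketch {
    int (*fn)(unsigned long addr);
    int priority;                     /* higher runs earlier */
    struct handler_sketch *next;
};

/* Insert h so that the list stays sorted by descending priority. */
void chain_register(struct handler_sketch **head, struct handler_sketch *h)
{
    while (*head != NULL && (*head)->priority >= h->priority)
        head = &(*head)->next;
    h->next = *head;
    *head = h;
}

/* Offer the trap at addr to each handler in turn. */
int chain_call(struct handler_sketch *head, unsigned long addr)
{
    for (; head != NULL; head = head->next) {
        if (head->fn(addr) == SK_HANDLED)
            return SK_HANDLED;
    }
    return SK_PASS;
}

/* Example handlers; calls[] counts how often each one was entered. */
int calls[2];

int our_handler(unsigned long addr)
{
    calls[0]++;
    return addr == 0x1000 ? SK_HANDLED : SK_PASS;
}

int other_handler(unsigned long addr)
{
    calls[1]++;
    return addr == 0x2000 ? SK_HANDLED : SK_PASS;
}
```

If our handler sits at the head and claims its own breakpoints, the other subscribers never run for those traps, so it no longer matters whether their code happens to be probed.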

One has to be careful marking certain functions as toxic, e.g. printk() (which I use for debugging) since that would preclude putting useful probes on these kinds of functions.

The alternate solution I am trying is to take over the INT3 interrupt vector and avoid some kernel code aspects. This is potentially problematic in getting the interrupt code to work so it plays nicely with other citizens (and maybe subject to kernel-isms too).

Stay tuned.

(At the same time as this, I am getting the freetype code in CRiSP to work - it works, but it's not visible in the setup menus; I'll detail this more later).

Posted at 09:44:11 by Paul Fox | Permalink
  CRiSP v9.4.1 Sunday, 12 April 2009  
I'm going to release a new version of CRiSP this weekend, and update the version number. (People will need to get a new license - it's been 18+ months since 9.3 came out).

Although the changes aren't major at this point, it's a good time to cut over to a new version. The major feature for this release will be FreeType font support for the edit window.

Not sure why people want this - as FreeType/TrueType font display involves blurring to create a clearer (?!) image. It does look nice at large/huge fonts.

I need to work on the other GUI controls to let them support the technology too.

Given how diverse Linux platforms are in terms of shared libs and kernels, I have had to use runtime dynamic linking to avoid a startup dependency on something not installed on your system - e.g. "ldd crisp" won't show the dependency on and friends, but it's there. (As of now, it's an environment variable to turn it on, but I am about to change that).

Posted at 08:59:01 by Paul Fox | Permalink
  iopl / Gnome desktop fixed Tuesday, 07 April 2009  
After locating the difference in systrace.c, the gnome desktop started flawlessly whilst running scripts/syscalls3 (which polls all syscalls, and dumps out the stats every 3 seconds). Nicely, performance was almost unnoticeably different from an undtraced startup.

Need to do more testing and repeat the exercise with all fbt probes in place.

(This is 64-bit only; need to also repeat for 32-bit Ubuntu).

Posted at 23:58:17 by Paul Fox | Permalink
  iopl kernel segmentation violation Tuesday, 07 April 2009  
Got it - this seems to happen when iopl() is called from a thread in a process, not in a standalone non-threaded app. I have modified the test (utils/iopl.c) to reproduce this.

Next, is to fix it !

Posted at 23:45:09 by Paul Fox | Permalink
  x86-64 running 32-bit binaries Tuesday, 07 April 2009  
Looks like this doesn't work, or rather, syscall tracing doesn't see them, because there are two syscall tables in the kernel, and currently it's not intercepting the second one.

Should be easy enough.
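A sketch of what intercepting both tables would involve, using plain arrays of function pointers in place of the kernel's tables. All names here (`syscall_fn`, `struct hook_sketch`, `hook_install`, `hook_remove`, and the stand-in functions) are hypothetical. One detail worth keeping in mind is that the same call generally has a different number in the 32-bit and 64-bit tables, so each table needs its own index.

```c
#include <stddef.h>

typedef long (*syscall_fn)(long arg);

struct hook_sketch {
    syscall_fn *slot;      /* table entry we replaced */
    syscall_fn  original;  /* what was there before */
};

/* Replace entry nr of table with wrapper, remembering the old value. */
void hook_install(struct hook_sketch *h, syscall_fn *table,
                  size_t nr, syscall_fn wrapper)
{
    h->slot = &table[nr];
    h->original = table[nr];
    table[nr] = wrapper;
}

/* Restore the entry saved by hook_install(). */
void hook_remove(struct hook_sketch *h)
{
    *h->slot = h->original;
}

/* Stand-ins so the sketch can be exercised in user space. */
int traced_calls;

long real_call(long arg)
{
    return arg;
}

long traced_call(long arg)
{
    traced_calls++;          /* a probe would fire here */
    return real_call(arg);   /* then the original runs unchanged */
}
```

Installing one hook per table with the same wrapper means a 32-bit binary and a 64-bit binary both end up in the same probe code.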

The issue I am debugging is why:

$ sudo gdm
fails if we trace the iopl() syscall - individual calls to iopl() work, but when called from the Xorg server, it causes a kernel fault. This is a nuisance, since iopl() takes a single argument, but otherwise is not that interesting. systrace.c has special assembler code because of the SYSCALL wrapper in the kernel, but other than that, it's nothing special - either it should fail always, or not at all.

But that is the nature of debugging - the unexpected happens, until you understand it, and then it's blindingly obvious.

Posted at 23:14:20 by Paul Fox | Permalink
  more 64-bit syscall issues Sunday, 05 April 2009  
Found that we have issues with sigreturn() and execve() (the exec family), so these may fail strangely (they don't seem to panic the machine, just core dump the caller).

Similar issue to the stuff I just fixed for fork and friends, so hopefully this will only take a short while to fix.

Posted at 21:30:48 by Paul Fox | Permalink
  New release of dtrace Sunday, 05 April 2009  
I've hopefully fixed the 64-bit syscalls issues now, as I mentioned in the prior blog. There were some issues with execvp and friends, but looks like I forgot to do something in systrace.c, and that's now fixed.

So - we should be at parity for 32 + 64 bit dtrace.

What of the future? I need to track down an FBT issue on 64-bit, since tracing all functions seems to crash the kernel (should be easy to fix...will try later on).

I thought I might mention the impending IBM/Sun takeover - if IBM takes over Sun, then what happens to dtrace? No idea, but am hoping that since IBM is very GPL friendly, the license can change, and if that happens, then this port can become a GPL licensed derivative.

BTW, had a scare this morning on turning on the machine: my 750GB root partition had filled up. I knew I had about 300GB free, but I found a "wget -r" of some Ubuntu kernels was running infinitely and had taken a week to download a ton of rubbish (git-links). I quickly killed the rubbish files, and have a bunch of kernels for compiling against now (fixed one portability issue by doing 'make kernels').

BTW#2, I set up a twitter account - I don't know if I will use it much - I may automate changes to the source to feed twitter, but it's worth having just to learn a little bit more about this Interweb thing people keep talking about.


It's just about 1 year old now !

Posted at 20:50:39 by Paul Fox | Permalink
  forking crack Saturday, 04 April 2009  
Ok, I am on the home straight now. After staring at the assembler hooks for SYSCALL and the path for the fork() code, as it turns into a kernel subroutine call to sys_clone() [fork() on linux is a call to the clone() syscall], we are just about there.

After a lot of false starts, and self-confusion, I have the fork() syscalls (i.e. clone) working, without crashing the system. The bottom line here is that for a few syscalls, the way the stack is set up is different from all other calls. I believe the issue here is the complexity of handling the SYSCALL/SYSRET CPU instructions, for which there are various internet references describing the difficulty, since SYSCALL jumps to kernel mode with no stack saving and no register saving.

Because of this, the kernel goes through hoops to set up the pt_regs struct on the stack, so that the syscalls can return.

But fork() is complicated because we give birth to a new child -- and that child doesn't return the way the parent does - it is simply created and put on the scheduling queue.

I'm amazed that this ever worked on i386, but, reading the documentation on the net shows a lot of permutations for SYSCALL and SYSENTER on 32 + 64 bit chips, along with 32-bit kernels on 64-bit chips and with variations and bugs on AMD and Intel !

Next up is to tidy the code, and handle the handful of calls which are similar to fork (clone, fork, vfork, sigaltstack, and iopl).

Look out for a release this weekend.

Posted at 00:29:09 by Paul Fox | Permalink