I don't want to release just yet.... Sunday, 31 May 2009  
Whilst enabling all valid opcodes for FBT tracing is great, it does show up stability issues which I had been able to ignore before.

There are certain code paths which can cause probe traps from within the probe handler, and I get differing results from my different kernels. The best kernels are those that crash on me - allowing me to see an issue, rather than hiding these latent instabilities.

"fbt:::" works, but theres more to it than this, as many probes never fire, or a probe may cause an issue when another probe fires. (The SYSCALL instruction is a prime case - it causes the kernel to enter at 'system_call', but at this time, the stack and state of the kernel is not consistent, and we cannot (yet) probe on that, so I have added it to the toxic list for now).

There are many entries on the toxic list which can be removed, but I am working through the failure scenarios I can see at present.

I've put in an interrupt handler for interrupt #13 (GPF), since when we do things wrong in dtrace, it's good to get a chance to shut ourselves up and avoid an infinite cascade of console messages resulting in a total panic. (If we get a GPF caused by us, we disable all probes to give me a chance to debug what is going on; in theory this should never happen, but it will do for now whilst I iron out some of the thorny issues.)
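To give a flavour of the idea, the handler boils down to something like this (a sketch only - the helper names here are made up for illustration, not functions in the driver):

#include <linux/ptrace.h>

/* Hypothetical helpers, just for this example. */
extern int  dtrace_fault_is_ours(struct pt_regs *regs);
extern void dtrace_disable_all_probes(void);

/* Called from the patched vector-13 (GPF) entry point.  Returns 1 if
   we consumed the trap, 0 to hand it back to the kernel's handler.  */
int dtrace_gpf_handler(struct pt_regs *regs)
{
	if (dtrace_fault_is_ours(regs)) {
		dtrace_disable_all_probes();	/* stop the cascade  */
		return 1;			/* shut ourselves up */
	}
	return 0;				/* not our problem   */
}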

Keep watching the ftp site - I will upload when I feel happy.


Posted at 19:24:49 by Paul Fox | Permalink
  dtrace progress Friday, 29 May 2009  
The 32-bit dtrace in the kernel is looking good now - the function disassembler now allows all instructions in the first slot of a function, and correctly single-steps them. (Not strictly true - many instructions don't or won't occur in this slot, not in normal C code or even assembler, such as JNZ, for instance.) But for my kernel (one of my kernels) the number of probes has jumped from around 23000+ to 24000+ (which illustrates how rare many instructions are, such as a CALL or LOOP or REP instruction in the first slot).

I need to validate the 64b code - I have added a new file (and removed the older cpu_32bit.c and cpu_64bit.c) to store this emulation.

Interestingly, in looking at kprobes in the kernel for hints about what I was doing wrong, I am seeing that I now do much more - i.e. I believe kprobes cannot handle many functions or instructions in the kernel, so they can borrow from this code if they like, or use it to educate themselves on what's missing.

I am happy to share the code and ideas, because this way dtrace and its competitors can improve, and people who do not contribute to the code get something for nothing - better quality tools. I don't look at dtrace as the competitor to annihilate all others. I am pleased with the quality and thought process Sun put into the construction, and the raw dtrace engine has given me almost zero issues.

Sun had an easy starting point - tight kernel integration and a dumb C compiler that doesn't show up the issues that inlined assembler gives to us Linux people. Maybe they can learn something too.

Someone asked me the other day about the performance differential for Linux dtrace vs Sun/Apple. My response then, as now, is that there is near zero difference. Just because Linux dtrace is not part of the kernel but an add-in module doesn't deter dtrace from doing things. (Ok, SDT is going to be a challenge, but much less so, I hope, than getting to where we are.)

Also, I was asked about the 'alpha' nature of dtrace. I either write optimistic blog entries ("It works!") or short cryptic ones ("No it doesn't!") - depending on how positive I feel.

When dtrace works for more than a few minutes without crashing a kernel, that's good. But we don't have a scheme to handle coverage (maybe I will add one, so we can tell, of the 25000+ probes, which ones fired and which didn't). Certainly, 20-30+% of the kernel is executing all the time, but the rest may depend on circumstance, e.g. whether a CD is in the drive, or a TCP packet is dropped, or a user space app core dumps, and so on. And even if a probe fires, it may be handled incorrectly.

The truth is, if it stands up to scrutiny from people using it, and I am not writing about "Oops!" moments, then we are making progress.

Next on my todo fix list is Ctrl-C to dtrace. When I run:

$ dtrace -n fbt:::

it works fine. But Ctrl-C-ing it causes a *long* delay - sometimes 5-10s before the shell comes back. *Sometimes* a kernel GPF is raised, indicating that as the probes are being removed, an interrupt or something is creeping in, firing a probe about to be destroyed, and possibly hanging or causing disruption. For small numbers of probes, the window of opportunity is tiny. But for all probes, it's big enough to be a real problem, when the whole point is to ensure you cannot panic a running production system.

Hopefully this will be easy to fix.


Posted at 21:22:55 by Paul Fox | Permalink
  Nuts - I am wrong Wednesday, 27 May 2009  
I wrote in the last two articles that Sun's disassembler is wrong in not handling prefix instructions properly, but that is rubbish on my part. It does handle them, and I confused myself because of the changes to fbt_linux.c I am presently working on.

Apologies to Sun - and I am glad that I can trust their code!

Now to find why a few F0-series instructions aren't stepped properly...


Posted at 21:00:32 by Paul Fox | Permalink
  REPZ/REPNZ prefix Wednesday, 27 May 2009  
I just wrote about the semantics of the LOCK instruction being a prefix or not, and I now have proof that REPZ/REPNZ should be treated similarly.

Here's a small dump from the dmesg output after loading dtrace when it can process the F0 series of opcodes (F0..FF).

[52358.586721] fbt:F instr ptype_seq_stop:c02c5ae0 size=1 f3 c3 8d b4 26
[52358.589013] fbt:F instr neigh_stat_seq_stop:c02ccb60 size=1 f3 c3 8d b4 26
[52358.601019] fbt:F instr seq_stop:c02e3f50 size=1 f3 c3 8d b4 26
[52358.601484] fbt:F instr seq_stop:c02e41d0 size=1 f3 c3 8d b4 26
[52358.601760] fbt:F instr rt_cpu_seq_stop:c02e49f0 size=1 f3 c3 8d b4 26
[52358.601828] fbt:F instr ipv4_rt_blackhole_update_pmtu:c02e4ae0 size=1 f3 c3 8d b4 26
[52358.615720] fbt:F instr icmp_address:c030e3b0 size=1 f3 c3 8d b4 26
[52358.615743] fbt:F instr icmp_discard:c030e3c0 size=1 f3 c3 8d b4 26
[52358.623705] fbt:F instr xfrm_link_failure:c0322210 size=1 f3 c3 8d b4 26
[52358.639776] fbt:F instr __read_lock_failed:c0347a90 size=1 f0 ff 00 f3 90
[52358.641537] fbt:F instr kprobe_seq_stop:c034a9a0 size=1 f3 c3 8d b4 26
Note the size=1, which shows that Sun's disassembler has mistreated the f3 prefix (REPZ).

Posted at 20:32:41 by Paul Fox | Permalink
  LOCK: Prefix or instruction byte? Wednesday, 27 May 2009  
I think I found a problem with Sun's instruction disassembler. The disassembler is needed to work out how big an instruction is, and is used by FBT to work out entry and return instructions.

For opcode 0xF0 (LOCK prefix), the disassembler says we have an instruction length of 1 byte, i.e. it treats this as standalone, and not as a prefix.

If we plant an FBT probe on this instruction (and there are a few in the kernel), then when we single-step we will step the LOCK all on its own and not the following instruction, leading more than likely to a kernel crash or bad semantics.

I am amending the disassembler to detect this, and treat LOCK properly as a prefix.

I found this whilst working through all x86 instruction bytes, so we can enable the entire lot, and not special-case "known scenarios".

The other prefixes (such as REP/REPNZ, etc.) should likely be treated similarly, but I haven't found an example of a function where these appear in the first instruction.
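For illustration, the shape of the fix is roughly this (a sketch, not Sun's dis code - dis_instr_size() stands in for the real disassembler sizing call, and on x86-64 the REX bytes 0x40-0x4f would want the same treatment):

/* Hypothetical stand-in for the real disassembler sizing routine. */
extern int dis_instr_size(unsigned char *pc);

/* Prefix bytes belong to the instruction that follows them. */
static int is_prefix_byte(unsigned char b)
{
	switch (b) {
	case 0xf0:				/* LOCK                */
	case 0xf2: case 0xf3:			/* REPNZ / REPZ        */
	case 0x26: case 0x2e: case 0x36:	/* segment overrides   */
	case 0x3e: case 0x64: case 0x65:
	case 0x66:				/* operand size        */
	case 0x67:				/* address size        */
		return 1;
	}
	return 0;
}

/* Size of the instruction including any leading prefixes, so that the
   single-step buffer gets the whole thing. */
static int fbt_instr_size(unsigned char *pc)
{
	int	n = 0;

	while (is_prefix_byte(pc[n]))
		n++;
	return n + dis_instr_size(pc + n);
}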


Posted at 20:26:24 by Paul Fox | Permalink
  mcount and gcc -pg Saturday, 23 May 2009  
Found it. Some kernels are compiled with profiling turned on (-pg), which means even our driver has this enabled. I can't find an easy way to turn it off without interposing my own gcc wrapper, so the easiest thing is to define our own 'mcount' subroutine which does nothing.
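Something along these lines (a sketch of the idea, not necessarily the exact stub in the driver; a bare assembler stub is safest, since -pg cannot then instrument the stub itself):

/* Satisfy the compiler-generated 'call mcount' in every -pg compiled
   function of the driver, without touching the kernel's ftrace code. */
asm(
"	.globl mcount\n"
"mcount:\n"
"	ret\n"
);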

This means we won't call into the kernel's mcount, and we can now safely do:

$ dtrace -n fbt::mcount:
and see all the calls to it.

So, we are safe again - we can probe all functions.

I am seeing a funny in Ubuntu 8.10/64, whereby if I probe too many functions, I get a kernel trace in /var/log/messages like the one below. It looks like, as we enable all probes, we fire before we are really ready, and subsequently don't fire any probes at all. Reloading the driver fixes this, but I don't yet understand how/why this happens well enough to diagnose it:

[74745.164017] Call Trace:
[74745.164017]  [] warn_on_slowpath+0x64/0x90
[74745.164017]  [] ? dtrace_int3_handler+0x1f5/0x2f0 [dtracedrv]
[74745.164017]  [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017]  [] ? warn_on_slowpath+0x0/0x90
[74745.164017]  [] smp_call_function_mask+0x22c/0x240
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] ? do_flush_tlb_all+0x0/0x70
[74745.164017]  [] smp_call_function+0x20/0x30
[74745.164017]  [] on_each_cpu+0x24/0x50
[74745.164017]  [] flush_tlb_all+0x1c/0x20
[74745.164017]  [] unmap_kernel_range+0x2cd/0x2e0
[74745.164017]  [] remove_vm_area+0x84/0xa0
[74745.164017]  [] ? remove_vm_area+0x0/0xa0
[74745.164017]  [] __vunmap+0x55/0x120
[74745.164017]  [] ? __vunmap+0x0/0x120
[74745.164017]  [] vfree+0x2a/0x30
[74745.164017]  [] kmem_free+0x50/0x70 [dtracedrv]
[74745.164017]  [] dtrace_ecb_create_enable+0x16d/0x20b0 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x9/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_probe+0xa2/0x100 [dtracedrv]
[74745.164017]  [] dtrace_match+0x1f5/0x2e0 [dtracedrv]
[74745.164017]  [] ? dtrace_ecb_create_enable+0x0/0x20b0 [dtracedrv]
[74745.164017]  [] ? error_exit+0x0/0x70
[74745.164017]  [] ? kfree+0x21/0x100
[74745.164017]  [] dtrace_probe_enable+0xb7/0x190 [dtracedrv]
[74745.164017]  [] ? dtrace_match_string+0x0/0x50 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017]  [] dtrace_enabling_match+0x9e/0x200 [dtracedrv]
[74745.164017]  [] dtrace_ioctl+0x214e/0x23f0 [dtracedrv]
[74745.164017]  [] ? __mod_zone_page_state+0x9/0x70
[74745.164017]  [] ? __rmqueue_smallest+0x11c/0x1b0
[74745.164017]  [] ? ext3_get_branch+0x21/0x140 [ext3]
[74745.164017]  [] ? put_page+0x20/0x110
[74745.164017]  [] ? prep_new_page+0x103/0x180
[74745.164017]  [] ? buffered_rmqueue+0x1b2/0x2a0
[74745.164017]  [] ? get_page_from_freelist+0x2a6/0x380
[74745.164017]  [] ? find_get_page+0x23/0xb0
[74745.164017]  [] ? find_lock_page+0x37/0x80
[74745.164017]  [] ? mark_page_accessed+0xe/0x70
[74745.164017]  [] ? filemap_fault+0x1a3/0x430
[74745.164017]  [] ? __wake_up_bit+0xd/0x40
[74745.164017]  [] ? page_waitqueue+0xa/0x90
[74745.164017]  [] ? unlock_page+0x32/0x40
[74745.164017]  [] ? __do_fault+0x134/0x440
[74745.164017]  [] ? __inc_zone_page_state+0x2a/0x30
[74745.164017]  [] ? handle_mm_fault+0x1ee/0x470
[74745.164017]  [] ? __up_read+0x8f/0xb0
[74745.164017]  [] ? up_read+0xe/0x10
[74745.164017]  [] ? do_page_fault+0x372/0x750
[74745.164017]  [] dtracedrv_ioctl+0x2d/0x50 [dtracedrv]
[74745.164017]  [] vfs_ioctl+0x85/0xb0
[74745.164017]  [] do_vfs_ioctl+0x283/0x2f0
[74745.164017]  [] sys_ioctl+0xa1/0xb0
[74745.164017]  [] system_call_fastpath+0x16/0x1b
[74745.164017]


Posted at 23:21:39 by Paul Fox | Permalink
  working again Saturday, 23 May 2009  
I put out a new release earlier which seems to work across the various platforms and kernels.

I had disabled many opcodes, and have been adding them back in. RIP-relative addressing (x86-64) is used in the kernel, and I have been getting that to work (again!).

I've hit a hopefully minor issue with 'mcount' - which is in the 2.6.27+ kernels (the ftrace facility). It starts with a RIP-relative instruction, e.g.

mcount:
       cmpq $ftrace_stub, ftrace_graph_return
       jnz ftrace_graph_caller
       cmpq $ftrace_graph_entry_stub, ftrace_graph_entry
       jnz ftrace_graph_caller       
but I don't believe it's the single-stepping over that initial CMPQ which is causing the issue - more likely it is whoever is calling it, e.g. the interrupt handlers themselves. Hopefully I will get to fix this shortly, as that would open up the possibility of enabling any instruction at the start of a function in fbt_linux.c.

I've also fixed a couple of CRiSP bugs/issues today, and I have one more before I put out an update for that.


Posted at 22:24:55 by Paul Fox | Permalink
  some progress Saturday, 23 May 2009  

I *think* I have just put out a stable release -- it works on 32b+64b kernels. Some silly re-entrancy issues were not being handled properly. So I need to test it more, but full fbt tracing seems to be working.

What does this mean? Well, if this stands the test of running on my various kernels and my real non-VMware hardware, I need to start moving along.

Next up is maybe to look at userland tracing, or at kernel stack trace operations, since that's a mess - kernels may be compiled with or without frame pointers, and if you don't have frame pointers, then a stack trace can only ever be a guess of where you are. (The kernel uses '?' to indicate a stack trace entry is not necessarily valid - it walks the stack, word by word, to see if anything looks like a kernel text address.)

I really need to get some CRiSP fixes done this weekend, along with a new driver to provide low-overhead TCP port to PID enumeration. I may go on to describe that in more detail later.


Posted at 09:58:37 by Paul Fox | Permalink
  some progress Wednesday, 20 May 2009  
I think my recent instability is being caused by a sillyism in code which I have not released yet (to do with a nested interrupt trap).

I rewrote the trap handlers to clean them up and put in a more powerful state machine, and it's dtracing beautifully on the 64b kernel (I need to revalidate the 32b and other kernels).

I am moving the debug output in /dev/dtrace to /proc/dtrace:

/home/fox/src/dtrace@vmubuntu: ls /proc/dtrace
total 0
0 ./  0 ../  0 debug  0 security  0 stats  0 trace
The key one is /proc/dtrace/trace - I am trying to move away from printk for kernel debugging, and to use an internal printf-to-a-buffer mechanism (like FreeBSD), because debugging the trap handlers is painful if, by virtue of invoking printk, we invoke a recursive fault.

So, /proc/dtrace/trace is like /proc/kmsg - a private internal memory buffer to log trace info.
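The mechanism is roughly like this (a simplified sketch, not the actual driver code - the real thing needs locking or per-cpu care):

#include <linux/kernel.h>
#include <stdarg.h>

#define TBUF_SIZE	(64 * 1024)
static char		tbuf[TBUF_SIZE];	/* circular trace buffer */
static unsigned long	tbuf_pos;

/* printk replacement: format into the private buffer and never touch
   the console.  /proc/dtrace/trace just dumps tbuf from a safe
   context. */
void dtrace_printf(const char *fmt, ...)
{
	char	tmp[256];
	va_list	ap;
	int	len, i;

	va_start(ap, fmt);
	len = vsnprintf(tmp, sizeof tmp, fmt, ap);
	va_end(ap);

	if (len > (int) sizeof tmp)
		len = sizeof tmp;
	for (i = 0; i < len; i++)
		tbuf[tbuf_pos++ % TBUF_SIZE] = tmp[i];
}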

More in a while when I feel happier.


Posted at 22:22:34 by Paul Fox | Permalink
  reliability issues Tuesday, 19 May 2009  
It's strange. For most of the last year the 64-bit kernel has 'just worked', even where it shouldn't have, and the nice interrupt handling of the kernel shielded me from issues.

Fixing the 32-bit kernel issues, and redoing the INT3 handling via raw interrupt patching, has caused the 64-bit kernel to become unreliable. (Unreliable means that within a few seconds of fbt::: probing, we crash the kernel.)

I *think* at this point it's due to a page fault firing whilst handling the breakpoint and single-step traps.

I am therefore revamping this core code to allow a nested page fault, and tidying up code which had started to become a bit messy.

Hopefully I will have an answer in the next day or so...


Posted at 22:18:20 by Paul Fox | Permalink
  64-bit issues Sunday, 17 May 2009  
At present, dtrace cannot trace around the irq_return (IRETQ instruction) in the kernel. I am attempting to fix this, so for now, fbt::: will hang or panic the kernel.

The IRETQ (return from interrupt) can be returning to user mode or to kernel mode, but the interrupt handler in my code doesn't/didn't handle this.

More news in a while when I have a fix. (And I need to reverify 32-bit as well).


Posted at 21:08:10 by Paul Fox | Permalink
  modrm - RIP relative addressing Saturday, 16 May 2009  
Oops. I forgot something. 64-bit instructions can use position-independent code, via the %RIP register (the PC). When we single-step such an instruction, we need to ensure we handle the relocation inside the single-step buffer.

Looking at kprobes reminded me of this, so we need to do that. (I could do with collecting stats on how often such an instruction appears as the first instruction of a function; I could be worrying unnecessarily.) I am seeing some 64-bit instability (especially as I bring more probes into scope by enabling more instruction opcodes).
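For reference, the fix-up needed is roughly the following (a sketch, not the driver code - it assumes the disassembler has already located the ModRM byte and the 32-bit displacement, and that the step buffer lives within +/-2GB of the original code):

#include <stdint.h>
#include <string.h>

/* Re-bias the disp32 of a RIP-relative operand after the instruction
   has been copied from 'orig' into the single-step buffer at 'copy'. */
static void fixup_riprel(uint8_t *copy, const uint8_t *orig, int insn_len,
			 int modrm_off, int disp_off)
{
	uint8_t		modrm = copy[modrm_off];
	int32_t		disp;
	intptr_t	delta;

	/* mod == 00 and r/m == 101 means disp32(%rip) in 64-bit mode. */
	if ((modrm & 0xc0) != 0x00 || (modrm & 0x07) != 0x05)
		return;

	memcpy(&disp, copy + disp_off, sizeof disp);

	/* The displacement is relative to the *next* instruction, so
	   shift it by the distance the code was moved. */
	delta = (intptr_t) (orig + insn_len) - (intptr_t) (copy + insn_len);
	disp += (int32_t) delta;

	memcpy(copy + disp_off, &disp, sizeof disp);
}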

Let me see if this improves things....


Posted at 17:34:03 by Paul Fox | Permalink
  a problem... Friday, 15 May 2009  
There's a problem with dtrace which appears if you do this:

$ dtrace -n fbt:::

immediately after loading the driver. If you give the driver a chance to 'breathe', and do this:

$ dtrace -n fbt::sys_chdir:
Ctrl-C it, and then redo the big test, then all seems well. I suspect a first-time page fault or something is triggering this, but I am still diagnosing.

Posted at 23:04:04 by Paul Fox | Permalink
  Security policy for dtrace now working Friday, 15 May 2009  
I've written a security policy mechanism for dtrace which supports the documented mechanisms of Solaris dtrace. Since things work differently under Linux, this is achieved by loading the security rules into the kernel just after driver load time, via reads/writes to /dev/dtrace.

Here's an example:

/home/fox/src/dtrace@vmubuntu: cat /dev/dtrace
here=0
cpuid=1
all  priv_kernel
uid  200 priv_proc
gid  201 priv_kernel
all  priv_owner
(The first two entries are just debug.) The other entries form a table which is scanned. The first entry - "all priv_kernel" - enables everyone to do everything. This would not normally be put into a system, unless you want to allow anyone to do anything.

The next entry shows that user id 200 has the priv_proc (dtrace_proc) privilege, which allows that user to monitor only their own processes.

Enumerating uids may be too much, so you can also specify a gid, as in the third entry.

So, the format is:

[uid | gid] <nnn> [priv_user | priv_kernel | priv_proc | priv_owner]
or
all [priv_user | priv_kernel | priv_proc | priv_owner]
Multiple privs can be specified on the same line, e.g.
uid 200 priv_proc priv_owner
(I am not sure this is necessary since the perms form a hierarchy, each one embracing more privilege than the prior, but I am mirroring the Solaris model.)

By default, /etc/dtrace.conf is read and loaded into the kernel. You can cat /dev/dtrace to verify that what it holds is what you think it holds.
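To make the scanning behaviour concrete, the lookup is roughly like this (an illustrative sketch - the names and constants are made up here, not lifted from dtrace_linux.c):

enum { DSEC_UID, DSEC_GID, DSEC_ALL };	/* made-up tags for the example */

struct dsec_rule {
	int	d_kind;		/* DSEC_UID, DSEC_GID or DSEC_ALL     */
	int	d_id;		/* uid or gid; ignored for DSEC_ALL   */
	int	d_privs;	/* DTRACE_PRIV_KERNEL | _PROC | ...   */
};

/* Linear scan, first match wins - so an "all ..." entry acts as the
   catch-all when placed last. */
static int dsec_lookup(struct dsec_rule *tbl, int n, int uid, int gid)
{
	int	i;

	for (i = 0; i < n; i++) {
		if (tbl[i].d_kind == DSEC_ALL ||
		    (tbl[i].d_kind == DSEC_UID && tbl[i].d_id == uid) ||
		    (tbl[i].d_kind == DSEC_GID && tbl[i].d_id == gid))
			return tbl[i].d_privs;
	}
	return 0;	/* no match: no dtrace privileges */
}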


Posted at 22:15:34 by Paul Fox | Permalink
  security time Wednesday, 13 May 2009  
Now that things seem to have stabilised in dtrace, I am looking at the security model. (I don't claim dtrace is perfect - I have seen some stability issues, such as a kernel panic when putting a real CPU to sleep (32-bit kernel), even when nothing was being traced.)

However, a defined security model is needed - mostly based on the Solaris equivalent, but we don't do things that way on Linux. Here is a comment added to dtrace_linux.h defining the mechanism.

If people object or have extra wish-list items, feel free to suggest them.

/**********************************************************************/
/*   The following implements the security model for dtrace. We have  */
/*   a list of items which are used to match process attributes.      */
/*                                                                    */
/*   We  want  to be generic, and allow the user to customise to the  */
/*   local   security   regime.   We  allow  specific  ids  to  have  */
/*   priviledges and also to do the same against group ids.           */
/*                                                                    */
/*   For  each security item, we can assign a distinct set of dtrace  */
/*   priviledges   (set   of   flags).   These   are  based  on  the  */
/*   DTRACE_PRIV_xxxx  definitions.  On  Solaris, these would be set  */
/*   via  policy  files accessed as the driver is loaded. For Linux,  */
/*   we  try  to generalise the mechanism to provide finer levels of  */
/*   granularity  and  allow  group  ids  or  groups  of ids to have  */
/*   similar settings.                                                */
/*                                                                    */
/*   The  load.pl script will load up the security settings from the  */
/*   file  /etc/dtrace.conf  (if  available). See the etc/ directory  */
/*   for an example config file.                                      */
/*                                                                    */
/*   The format of a command line is as follows:                      */
/*                                                                    */
/*   clear                                                            */
/*   uid 1 priv_user priv_kernel                                      */
/*   gid 23 priv_proc                                                 */
/*   all priv_owner                                                   */
/*                                                                    */
/*   Multiple  priviledge  flags  can be set for an id. The array of  */
/*   descriptors   is  searched  linearly,  so  you  can  specify  a  */
/*   fallback, as in the example above.                               */
/*                                                                    */
/*   There  is  a limit to the max number of descriptors. This could  */
/*   change  to  be  based  on  a dynamic array rather than a static  */
/*   array,  but  it  is  expected  that a typical scenario will use  */
/*   group  mappings,  and  at  most, a handful. (Current array size  */
/*   limited to 64).                                                  */
/**********************************************************************/
BTW, if people wonder how those comments are neatly boxed, it is because I use the CRiSP editor to do my work, and a single highlight followed by Ctrl-F reformats in that mode. (It handles re-formatting also, so it's not tedious to make changes to the comments.) In addition, spell checking inside the comment let me notice that I have misspelled "privilege" in the above comment (but it is fixed in the source).


Posted at 23:07:44 by Paul Fox | Permalink
  fixed predicates Monday, 11 May 2009  
Thanks to mauritz.sundell for pointing out that predicates were broken. I had seen evidence of this myself, where some entry probes didn't have matching return probes, and a simple test, using:
dd if=/dev/zero of=/dev/null
showed this up with a simple dtrace call like:
$ dtrace -n 'syscall::read: /pid == 7331/ {printf("pid=%d", pid);}'

The process shadowing code in dtrace_linux.c (par_alloc) had been set to only monitor a single process during earlier debug sessions. I unset that and fixed some other sillinesses, and now it will create a new proc structure as it sees them. (There's still no garbage collection for the structs, and the data structure is a horrid linked list, so this can cause some slowdowns, as each probe will go on a search for the process to see if we have seen it before.)

That code needs a hash table or, at least, to move the newly found proc to the front of the chain, or something, but that's relatively minor.
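The move-to-front tweak would look something like this (a sketch - the field names are invented, and the real code needs locking around the chain):

#include <stddef.h>

struct par_proc {
	struct par_proc	*p_next;
	void		*p_task;	/* key: the task/proc pointer */
};

static struct par_proc *par_head;

/* Find an entry and promote it to the head of the chain, so the next
   probe from the same process finds it immediately. */
static struct par_proc *par_find(void *task)
{
	struct par_proc	*p, *prev = NULL;

	for (p = par_head; p; prev = p, p = p->p_next) {
		if (p->p_task != task)
			continue;
		if (prev) {
			prev->p_next = p->p_next;
			p->p_next = par_head;
			par_head = p;
		}
		return p;
	}
	return NULL;
}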

Of course, there are likely to be more bugs, and I hope my recent fix here won't destabilise things.

I put in more debug trace hooks (which are visible in /dev/dtrace), but will likely remove them or move them around, so don't depend on them for anything consistent.


Posted at 23:25:47 by Paul Fox | Permalink
  And so the theory comes to pass... Monday, 11 May 2009  
Hopefully I have fixed the 32-bit instability issue. I modified dtrace_invop to let the underlying callbacks (notably fbt_invop) know that we shouldn't execute a probe if called from a re-entrant INT3 handler. This seems to fix the problem.
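The guard is conceptually something like this (a sketch only - the function names and signatures here are invented, and the real code would use a proper per-cpu variable):

#include <linux/ptrace.h>
#include <linux/smp.h>
#include <linux/threads.h>

/* Hypothetical callbacks, just for this example. */
extern int dtrace_invop_handle(struct pt_regs *regs);
extern int dtrace_invop_no_probe(struct pt_regs *regs);

static int dtrace_in_probe[NR_CPUS];

/* Called from the INT3 path.  If this CPU is already inside a probe,
   ask the callback (e.g. fbt_invop) to emulate the instruction but
   not to call dtrace_probe() again. */
int dtrace_int3_dispatch(struct pt_regs *regs)
{
	int	cpu = smp_processor_id();
	int	rc;

	if (dtrace_in_probe[cpu])
		return dtrace_invop_no_probe(regs);

	dtrace_in_probe[cpu] = 1;
	rc = dtrace_invop_handle(regs);
	dtrace_in_probe[cpu] = 0;
	return rc;
}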

If, for any reason, we hit a probe whilst processing a probe, all bets are off. This is most likely because of the way dtrace folds around the kernel, rather than being a part of the kernel source, but it's frustrating to have a hung kernel or the CPU taking a triple fault (which simply causes a reboot) with no way to diagnose it.

I note that in the FreeBSD version of dtrace, they have an in-memory version of printf() to log trace info, which can then be seen in a stable context. I was going to do that, but haven't gotten around to it. (dtrace_linux.c sort of has that, but only a minimal "this is what happened when I was blind/deaf" version.)

What it uncovers is that occasionally we get a page fault whilst processing the probe, which was causing the recursion error.

Not sure why -- Linux allows kernel page faults -- but I thought they wouldn't/couldn't happen inside a probe, since the kernel should be entirely memory resident (but maybe I am misreading).

Ideally, if this is the only reentrancy issue, then I need to fix/allow for a page fault to be handled, because this is a very valuable probe. We may or may not be interested in kernel page faults, but we definitely want user page faults (I am not sure we can get one of these in an INT3 probe handler, since the probe will have fired from kernel space, and even if another cpu is doing user-land stuff, that other cpu will see the fault).

It leads to potentially interesting scenarios, such as a probe firing BECAUSE of a userland page fault, which in turn causes the kernel page fault. Only time and a clear head will tell.

Someone has kindly reported a bug where calls to the read() syscall don't always have matching entry/return probes. Every read should return, unless some form of exception happens.

So, let's see if this is easy to fix, and then I can go back to the security wrapper.

By the way, "cat /dev/dtrace" will give some stats - not many, but a key one is "dcnt", which tells you how many times dtrace_probe() has been called. Useful when trying to see if "anything is happening". There's a simple script -- utils/dmon -- which monitors this every second (so I can tell if my kernel has hung or is just 'busy').


Posted at 20:51:50 by Paul Fox | Permalink
  mutexes and semaphores Sunday, 10 May 2009  
I've updated the code to use semaphores instead of mutexes as previously noted. Fortunately, this was pretty easy with a bit of work in the linux_types.h include file.
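For the curious, the shape of the change in linux_types.h is roughly this (a sketch, not the exact header - the real Solaris mutex_init() takes more arguments, and Linux already owns the name mutex_init, hence the different name here):

#include <linux/semaphore.h>	/* <asm/semaphore.h> on older kernels */

/* Map the Solaris kmutex API onto Linux semaphores. */
typedef struct semaphore kmutex_t;

#define dmutex_init(mp)		sema_init(mp, 1)
#define mutex_enter(mp)		down(mp)
#define mutex_exit(mp)		up(mp)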

To be honest, I am seeing the same behaviour - 64-bit with 2 CPUs seems to work fine, but 32-bit with 2 CPUs seems to be flaky.

Next... off to look at the places where I *don't* think the problem is, because, chances are, that's where the problem is. (My normal way of working: when poking a piece of code and trying everything, chances are high the problem is not in the area you thought it was.)

The main difference between 32- and 64-bit dtrace is really in the interrupt handler. Most other code differences are due to the size of registers or odd portability issues in the kernel...


Posted at 22:01:35 by Paul Fox | Permalink
  That Coyote moment.... Sunday, 10 May 2009  
In Road Runner vs. Coyote, we always have that scene where Coyote runs over the edge of a cliff, and not until he looks down does he realise the dire situation he is in - and then *KAPOW*!

That's just happened to me with mutexes. I had transliterated Solaris kmutex and mutex types into Linux mutexes, but this is illegal. A Linux kernel mutex is tied to process context - it can block, and the calling process sleeps, allowing another process to run.

The documentation specifically states you cannot call the mutex API from an interrupt.

Guess what my code does?! ... *KAPOW*

I believe I should be using semaphores and not mutexes, so I am in the process of converting over to see what happens.

I believe this is part of the puzzle of the SMP kernels, where under load the kernel will panic, hang or triple-fault the CPU.

More in a few days when I feel happier.

Meep-meep.


Posted at 20:59:35 by Paul Fox | Permalink
  glibc: I take some of that back Saturday, 09 May 2009  
Hmm... my last blog entry complained about glibc. That's unfair - for this particular problem.

This is GCC's fault - it compiles in a stack protection mechanism by default, and it looks like it uses the %GS register as part of this mechanism.

Here's an example:

int xreadlink(const char *path, char *buf, int size);	/* declared for the example; defined elsewhere */

int main()
{
	char	buf[10];
	xreadlink("/tmp/x", buf, sizeof buf);
}
and the assembler generated by compiling:
$ gcc -m32 -S x.c
$ cat x.s
        .file   "x.c"
        .section        .rodata
.LC0:
        .string "/tmp/x"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $36, %esp
        movl    %gs:20, %eax  // HERE
        movl    %eax, -8(%ebp)
        xorl    %eax, %eax
        movl    $10, 8(%esp)
        leal    -18(%ebp), %eax
        movl    %eax, 4(%esp)
        movl    $.LC0, (%esp)
        call    xreadlink
        movl    -8(%ebp), %edx
        xorl    %gs:20, %edx
        je      .L3
        call    __stack_chk_fail
.L3:
        addl    $36, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.3.2-1ubuntu12) 4.3.2"
        .section        .note.GNU-stack,"",@progbits

This means a binary compiled with this gcc won't run on an AS4 (glibc 2.3) system.

The solution is to turn off this 'feature' (which I thought I had done).
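If I remember correctly, the option in question is the stack protector, so something like this should do it (to be confirmed against the man page):

$ gcc -m32 -fno-stack-protector x.c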

Now off to recheck the gcc man page...


Posted at 23:25:55 by Paul Fox | Permalink
  glibc is brain dead Saturday, 09 May 2009  
I really cannot believe what has happened to glibc - it is a real shame. Sun and AT&T had it right from the start - backwards compatibility. No binary should be compiled with a dependency on a particular operating system release. That's why we have shared libraries - the application ensures compliance with an ABI.

But glibc thinks differently. glibc thinks it can break applications and cause a nightmare for application maintainers.

We're talking CRiSP here, and not dtrace.

What has gotten my goat ?

"readlink()" - a relatively rarely used syscall, but yet, glibc developers decided that when you call readlink, it should "optimise" the call and use the %GS register (used for threads - not sure what else.. yet), rather than simple do a function call, exactly as you would expect.

Why do I care? Because glibc stops you creating a binary on one release of Linux and having it run on any other (upwards or downwards).

Recently, I wrote a tool to remove the stupidity that glibc has when it attacks your object and executable files. It works nicely. An application like CRiSP, compiled on, say, glibc 2.7, can run on 2.3 as well.

Unless we call readlink(), in which case we get a segmentation violation.

Now that I know what the dastardly /usr/include files are up to, I shall put in a wrapper, and allow my terminal emulator (fcterm) and CRiSP to be portable.

There... that's off my chest.


Meanwhile...back in dtrace land...

There appears to be a bug in multi-CPU environments, where cross-CPU function calling is broken - leading to crashes or hangs. The function we call isn't safe from an interrupt routine, so I need to do some engineering work to allow the inner parts of dtrace_probe to work without the crash.

Not sure why I don't see this on the 64-bit platform, but you never know what bad things I am doing, which is why we all need to test and prove the validity of the dtrace releases.


Posted at 23:04:02 by Paul Fox | Permalink
  32-bit problems .. again Saturday, 09 May 2009  
I put out a new release today, but 32-bit dtrace seems to have a problem on multi-CPU boxes. It looks ok on a single CPU.

It is interesting that I have been running VMware Ubuntu 32-bit on a single virtual CPU and didn't detect issues, but now that I have it set to 2 CPUs, they show up (just like on my real machines).

Hope to fix very soon....


Posted at 18:33:48 by Paul Fox | Permalink
  The SDT Provider Friday, 08 May 2009  
The SDT provider, which presently does pretty much nothing in Linux dtrace, is potentially one of the more interesting providers to get working.

In reading the recent FreeBSD 7.2 release notes, and looking at other people's dtrace blogs, mention is made of the various providers which hide in this single/simple source file.

SDT is a provider which lets you compile calls to dtrace into the kernel. Probes at certain points compile to a NOP unless the probe is enabled.

Remember, we are not going to break the GPL terms or the CDDL licensing terms, and, anyway, we are not a part of the kernel source tree, and patching kernel source code is too tedious and error prone. (See previous posting on issues with simply validating compiles).

I have a new, novel idea to resolve this, free of GPL and patent issues.

SDT is mostly a set of probes on key parts of the various drivers which increment statistics. For example, the NFS driver has a lot of stats exposed via the "nfsstat" tool.

Patching the source code is painful, but that is only one way to get the desired effect.

Whilst working on FBT, an idea struck me, and I think we can use this in SDT. FBT disassembles the kernel, looking for function entry and exit points. What if we looked for something else? How about all "INC" instructions, or "ADD" instructions?

That would make for a very boring set of probes - probes based on instruction types!

But, say we looked at the relocation info whilst we disassembled - we could intercept the memory reads/writes to certain areas of memory - specifically, anywhere that falls inside a struct full of counters. Now we could instrument every/any struct reference in the kernel (easier if we talk about a single static struct, not dynamically allocated structs, such as the process tree). We can see these symbols in /proc/kallsyms, and we can create a table of ones we want to intercept, and create on-the-fly providers.

That is my plan - let's see what falls out.

Ideally we could do it with any/every struct like this in the kernel - maybe a piece of Perl code to find them, and some other glue (we may not have kernel source, but we will have the headers).

Anyway, I need to finish the cross-compile stuff (the 2.6.22 kernel is on my bug report list), and then write the security wrapper for dtrace, so you can lock down or liberate the use of dtrace in commercial environments.

Then, we play...


Posted at 21:58:26 by Paul Fox | Permalink
  cross-compiling Friday, 08 May 2009  
With so many issues to worry about, one is validating compilation on 32- and 64-bit kernels. There are a huge number of kernels, and I was hoping that with 2.6.2[6789], 2.6.30 and 2.6.9, that would be it.

Alas, I get build reports of issues on 2.6.19 and 2.6.22 and probably every other release out there. C'est la vie.

Although the included makefile has a target to build against all the kernels in /lib/modules, I have extended it to do cross-compilation for 32-bit versions of your 64-bit kernels. This is mainly for my benefit - the output of this cross-built driver is almost certainly useless, but at least I get to see some of the compile errors and config issues.

Unlike most systems and tools, dtrace can be a better product because it supports the old systems which, by their very nature, will never get "fixed" - a later kernel is available, but that doesn't help you if you are stuck with an ancient distro.


Posted at 21:55:13 by Paul Fox | Permalink
  Success! dtrace_gethrtime - you naughty boy. Wednesday, 06 May 2009  
I spent the last few days trying to figure out why the AS4 (2.6.9) kernel was so close, but so far. "dtrace -n fbt:::" (probe all probe-able functions in the kernel) would crash.

I tried my normal debug tricks, e.g. a divide-by-two search, to find the problematic functions. When this failed, I had to re-engineer the interrupt hooks for INT1+INT3, so we attach directly to the vectors. The beneficial side effect here is that we are more decoupled from the kernel and can trace int1/int3 traps from userland or kprobes.

But this just failed.

Annoying!

Very annoying!

Annoying because it's so difficult to debug - no way to print out where we got to. This required a lot of "thought experiments".

I knew it must be something like "a probe inside an interrupt routine", and/or "a probe interrupt invoking a recursive INT3 probepoint", but some test cases showed it not to be so simple. It might run for a few seconds or crash straightaway.

Next up was to deliberately force kernel panics to prove which bits of code were reached.

None of this was really conclusive, except to tell me "something is wrong".

I even read up on the Intel Architecture manual to refresh my mind on the finer points of kernel traps. (I now know that when a trap is taken, the caller's SS:ESP is pushed onto the stack, but only on the transition from user -> kernel. A kernel trap whilst in kernel mode does not do this. It all makes perfect sense if you do a "thought experiment".)

Anyhow, I hit on the perfect test case theory: let's assume it's something in dtrace_probe() which is causing the issue. My first reaction was, well:

$ nm -u build/driver/dtracedrv.ko
and stare at the list of functions which are undefined (linked in at module load time). There are quite a few, but most of the hairy and simple ones (like strchr, memchr, memcmp, sprintf) are not called (or are they?) during probe time, only during module startup.

Ok, next step: let's short-circuit dtrace_probe(). Success! No kernel hang. Making it return without doing anything proves the interrupt vector assignments work, and normal kernel interrupts and workloads are unaffected.

I added a counter so I could see we got there (cat /dev/dtrace).

Next, I started letting more and more of the code run.

Then I hit on the call to dtrace_gethrtime(). This is an interesting function. dtrace_probe() keeps track of the time since the userland program was started. If the kernel is looking unresponsive, e.g. due to a bug in dtrace or, more likely, overly complex probing, then it will disassociate the D program and you get a message from dtrace saying that the system appears to be unresponsive: an air bag! A very nice design point.

So, now I knew it was dtrace_gethrtime(). Why? Well, how do we get accurate (nanosecond level) times out of the kernel? Well, in userland we typically use the time() system call or gettimeofday(), and other variants. Internally, the kernel is driven by timer interrupts, and increments an internal counter.

This internal counter is then exposed to other parts of the system via a lock (mutex-like) mechanism. Since time is recorded in nanosecond units, a 32-bit entity will overflow in about 4 seconds, so a 64-bit quantity is needed (typically seconds + nanoseconds). But there is no atomic 64-bit increment or atomic read: you might read the seconds just as the nanoseconds roll over. From an application point of view, the clock may be seen to go backwards (3.97, 3.98, 3.99, 4.99, 4.00, 4.01).
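For reference, the usual way round the torn-read problem is a sequence-counter retry loop, roughly as below (this is the general technique, not the code currently in the driver, and real code also needs memory barriers around the reads):

static volatile unsigned int	hr_seq;		/* writer bumps before and after */
static volatile unsigned long	hr_sec, hr_nsec;

unsigned long long read_hrtime(void)
{
	unsigned int		seq;
	unsigned long long	t;

	do {
		seq = hr_seq;		/* odd value => update in flight */
		t = (unsigned long long) hr_sec * 1000000000ULL + hr_nsec;
	} while ((seq & 1) || seq != hr_seq);

	return t;
}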

This is no issue for the later 2.6.20+ kernels - we have code to read the counter directly (but I am missing the dtrace-inspired code to handle correct atomic reads; that's for another day). Anyhow, on 2.6.9, the interface/naming conventions are different.

I cheated. I call do_gettimeofday(), which is effectively the userland system call interface. Well... it worked, the day I implemented it! But it's horrible - we cannot call this from an interrupt routine: if the clock is ticking and we hit a probe (a very high probability if we plant probes on every function in the kernel), then we deadlock inside the interrupt routine.

So, that explains why we became unresponsive - no panic. Nothing. Niente.

After reviewing the dtrace_gethrtime() code, and do_gettimeofday in the earlier kernel, there's a simple workaround.

And voila:

/home/fox/src/dtrace@vmas4: dtrace -n fbt::: | head -50
dtrace: description 'fbt:::' matched 15754 probes
CPU     ID                    FUNCTION:NAME
  0  12452                   schedule:entry
  0    793           recalc_task_prio:entry
  0    794          recalc_task_prio:return
  0     33                __switch_to:entry
  0    878          remove_wait_queue:entry
  0    879         remove_wait_queue:return
  0  14313        e1000_watchdog_task:entry
  0  14442         e1000_update_stats:entry
  0  14491         e1000_read_phy_reg:entry
  0  14337    e1000_swfw_sync_acquire:entry
  0  14335 e1000_get_hw_eeprom_semaphore:entry
  0  14336 e1000_get_hw_eeprom_semaphore:return
  0  14339    e1000_swfw_sync_release:entry
  0  14492        e1000_read_phy_reg:return
  0  14491         e1000_read_phy_reg:entry
  0  14337    e1000_swfw_sync_acquire:entry
  0  14335 e1000_get_hw_eeprom_semaphore:entry
  0  14336 e1000_get_hw_eeprom_semaphore:return
  0  14339    e1000_swfw_sync_release:entry
  0  14492        e1000_read_phy_reg:return
  0    403  smp_local_timer_interrupt:entry
  0    950               profile_tick:entry
  0    404 smp_local_timer_interrupt:return
  0    125                     do_IRQ:entry
  0    117           handle_IRQ_event:entry
  0    189            timer_interrupt:entry
  0   1122       update_process_times:entry
  ....

and this seems to work nicely.

For those of you with eager eyes, look at the functions being called. (I am ssh'ed into the machine, which is why the ethernet driver is busy.)

I just got a kernel hang when piping the dtrace output into 'head', so I will fix that, and hope I will do a new release tonight...


Posted at 20:31:04 by Paul Fox | Permalink
  nearly there... Monday, 04 May 2009  
I'm still fixing some regressions in the 32-bit and 64-bit kernels. The 2.6.9 kernel has proven problematic because, quite simply, the later kernels are so good, they are doing stuff for me I never realised they were doing.

The current problem *looks like* double-fault handling. In the context of an INT3 interrupt, we may hit another one, and the kernel panics with a double-fault Oops message.

Later kernels seem to perform some magic with TSS states for processes and the kernel, which protects against this. Again, this is difficult to debug, and I am looking to understand/fix the latest issues.

More in a few days.


Posted at 22:27:15 by Paul Fox | Permalink
  why did it work before? Sunday, 03 May 2009  
I finally found why 64-bit dtrace worked, using instruction emulation, when it never should have done, and why, on a 2.6.9 kernel, it didn't.

It turns out the later kernels are nice and clever. When a kernel-mode interrupt/exception occurs, a new stack is given to the interrupt. This new stack does not abut the caller's area, so when the instruction emulator wrote below %RSP, we never clobbered anything.

Kernels from 2.6.28 upwards work really nicely, but 2.6.9 lacks this facility for nested kernel traps.

Of course, we need it on the earlier kernels even if they don't support it, but we can emulate that, now that we know what is going on.

The magic is in entry_32.S and entry_64.S of the kernel, where the stack frame is judiciously managed on entry/exit.

Let's see if I can be as good a citizen...


Posted at 23:32:25 by Paul Fox | Permalink
  more problems... Sunday, 03 May 2009  
The new int1/int3 direct trap handlers nearly work, but not quite... something is causing a panic or kernel GPFs sporadically, and until I fix that, the new release is no better than the old one.

I *think* this is due to random interrupts - either interrupts during an INT3 handler or vice versa. What's strange is that it does work for a bit, but then crashes.

I think this may be due to stack switching - my int handler doesn't do anything special for the stack switch (or for touching the %GS register), and maybe we end up with a miscommunication with the kernel - it assumes we haven't switched to a debug stack, but in fact we have... sort of.

It is difficult to debug because there's not much to print, and with the kernel crashing, it is difficult to pick up the issue.

I am currently debugging on a 32-bit 2.6.27 kernel (a lot easier than the 64-bit 2.6.9 kernel, which just goes silent when things die).

Hope to get this fixed very shortly...


Posted at 18:39:59 by Paul Fox | Permalink
  depression... or why is it so difficult?! Saturday, 02 May 2009  
I got home last night to get cracking on fixing the issues, and was making good progress. I decided to switch to direct INT3 vector interception, since relying on the kernel notifier call chain means we hit lots of issues on older kernels, with non-reentrant trap handlers, and having to use a binary search to find the functions for which "fbt:::" hangs/crashes the kernel.

It took minutes to get this working. Then... nothing.

I kept staring at the code, wondering why it didn't work. (Side note: I was doing this on a 2.6.27 kernel; luckily the panics in the trap handler were recoverable, so I didn't need to reboot or restore my snapshots; I *do* like that in the later kernels - they try really hard to help you with bad drivers. Nice.)

Anyway, having gone a few hours with no forward progress, I stumbled on it today (after a good night's sleep):

	call dtrace_int3_handler
	cmp 1,%rax
	je int3_handled
Call the function, check the return code. Couldn't be easier.

It took me a long time to realise what is wrong in those lines of code above. (See the bottom of this update for the answer!)

Once I fixed that, it worked - enabling all probes on 2.6.27 was fine. I checked for userland INT3 traps, i.e. a debugger's normal workload, and made a fix to avoid this interception for userland - we can just let the kernel handle userland so that all registers, contexts, and stacks are correct. For kernel trapping, we can short-cut some of the complexity in the kernel, with no loss/effect on anyone else.

Next up is to address a compile issue in patching the debug interrupt gate on 2.6.9, since the structures changed a bit. Then I need to do the same assembler for 32-bit CPUs, and this phase should be complete.


Answer: "1" means address 0x0000000000000001, not constant 1. It needs to be "$1" in the compare instruction.

Posted at 13:47:00 by Paul Fox | Permalink
  The INT1/3 traps, re-entrancy, and reachability Friday, 01 May 2009  
I am late in putting out a new release, but I am being pedantic and checking that I am happy before people crash their kernels on known issues.

The key difference I note between 2.6.9 and much later kernels is re-entrancy. Think about an INT3 trap - we are in the kernel, we step on a breakpoint instruction, and trap into the kernel.

Now, whilst processing that INT3 trap, we can step on another probed function - the path to the int3 handler in dtrace and the path out of that handler can double- and triple-trap. As I mentioned in a prior article, this is handled by detecting the fact that we hit a critical-path probe, and auto-disabling it.

Now, back to 2.6.9 and AS4. The kernel interrupt code doesn't seem to understand double/triple nested faults, and screws up the stack, causing a crash. This means I am having to validate toxic.c to ensure these unanticipated functions/probes are on the blacklist.

I don't like this - so this is an intermediate solution: ensure that

$ dtrace -n fbt:::
is stable for the kernels I have, and then fix this properly.

My next solution is to overwrite the int3/int1 trap handler vectors so we take control, save all registers, do our stuff, and get out quickly.

This means we can put probes on the real int1/int3 handlers, along with the die_chain notifier list, and all the kprobes functions.

The only downside is a little more assembler in dtrace to save the stack and "do the right thing", but it means we are more kernel-independent.
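The vector patching itself is not a lot of code - conceptually something like this 64-bit sketch (illustrative only; the real thing must disable interrupts while editing the gate, cope with the 32-bit layouts, and restore the old vector on unload):

struct gate64 {				/* 64-bit interrupt gate */
	unsigned short	offset_low;
	unsigned short	selector;
	unsigned char	ist;
	unsigned char	type_attr;
	unsigned short	offset_mid;
	unsigned int	offset_high;
	unsigned int	reserved;
} __attribute__((packed));

struct idt_ptr {
	unsigned short	limit;
	unsigned long	base;
} __attribute__((packed));

static unsigned long old_int3_handler;

static void hook_int3(void *new_handler)
{
	struct idt_ptr	idtr;
	struct gate64	*g;
	unsigned long	addr = (unsigned long) new_handler;

	asm volatile("sidt %0" : "=m" (idtr));
	g = (struct gate64 *) idtr.base + 3;	/* vector 3 == INT3 */

	old_int3_handler = ((unsigned long) g->offset_high << 32) |
			   ((unsigned long) g->offset_mid  << 16) |
			    g->offset_low;

	g->offset_low  = addr & 0xffff;
	g->offset_mid  = (addr >> 16) & 0xffff;
	g->offset_high = addr >> 32;
}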

Hope to release over the weekend.


Posted at 19:10:40 by Paul Fox | Permalink