dtrace bug of the day Saturday, 28 February 2009  
dtrace is working well, but just found a serious bug which can hang your system. On running dtrace with just a few probes on my real (non-vmware) system with 1GB RAM, from inside Gnome desktop, I get this kernel message:
[ 4336.530339] vmap allocation failed: use vmalloc= to increase size.
[ 4336.640674] vmap allocation failed: use vmalloc= to increase size.
[ 4336.758243] vmap allocation failed: use vmalloc= to increase size.
[ 4336.843121] BUG: unable to handle kernel paging request at 0020027e
[ 4336.843139] IP: [] 0xe354db6d
[ 4336.843151] *pde = 00000000
[ 4336.843158] Oops: 0002 [#1] SMP

At this point, dtrace cannot be killed, although the rest of the system is fine (I am typing this logged into to another machine). Looks like I need to research memory allocation more on Linux.

In a previous blog I had mentioned how use of kmalloc() would cause frantic heavy I/O as the freemem got fragmented. Use of vmalloc() solved that question, but we may be sensitive to whether how much contiguous kernel RAM is available.

Posted at 08:45:35 by Paul Fox | Permalink
  dtrace torture Friday, 27 February 2009  
I havent released anything this week (until now), because I have been seeking why my methodical FBT debugging was causing random lockups.

There are missing function probes in FBT, but the 5000+ available in my vmware Ubuntu (as seen via 'dtrace -l'), would cause the machine to panic or hang. I tried various techniques to track down which function was causing the problem, but no specific one was.

This work made much more difficult since when we die in the interrupt handler for INT3, we have very little ways to communicate to the outside world (without putting in more effort; it can be done).

After a lot of experimentation I tracked it down to badness on my behalf in dtrace_interrupt_enable()/dtrace_interrupt_disable().

When dtrace_probe() is called, it can happen from interrupt context - this is how FBT works, by planting INT3 instructions at the probe points. What was happening was nested interrupts - very rare - but within seconds to minutes of heavy probing.

After studying the trivial routines in dtrace_asm.c (which I had already had to rewrite once), I realised the silliness. The silliness is that these two routines allow nested disable/enable - but my implementation didnt do that, and so we prematurely enabled interrupts inside dtrace_probe() and the INT3 trap handler.

A slight complication is that even tho these lines are a few lines of trivial assembler, they depend on the GCC calling conventions as specified by the kernel build parameters. Fortunately, GCC's __asm__() allows us to not have to worry where calling parameters are (in the stack or in a register).

What does this mean?

The following example illustrates how to profile the functions in the kernel. Here, I did a kernel build ( The kernel had been built already, so the make just spent a bit of time working out that nothing needed rebuilding (but it did do a relink).

/home/fox/src/dtrace@vmub32: dtrace -n 'fbt:::entry{ @[probefunc] = count(); }'
dtrace: description 'fbt:::entry' matched 2593 probes
  ext3_reserve_inode_write                                     288067
  __ext3_get_inode_loc                                         313572
  walk_page_buffers                                            368757
  __ext3_journal_stop                                          384168
  ext3_journal_start_sb                                        384168
  journal_start                                                384168
  journal_stop                                                 384168
  __ext3_journal_get_write_access                              470892
  journal_get_write_access                                     470892
  __ext3_journal_dirty_metadata                                557924
  journal_dirty_metadata                                       557924
  do_get_write_access                                          559167
  journal_cancel_revoke                                        559270
  ext3_htree_store_dirent                                      614816
  call_filldir                                                 617133
  ext3fs_dirhash                                               630767
  str2hashbuf                                                  655621
  journal_add_journal_head                                     747785
  ext3_check_dir_entry                                         754274
  journal_put_journal_head                                     786996

Posted at 20:19:13 by Paul Fox | Permalink
  dtrace torture Wednesday, 25 February 2009  
I made good progress up until a few days ago. The FBT code probes the kernel for functions you can trace. You can trace as much or as little as you like. The dtrace probe syntax is great.

Inside fbt, we disassemble each function to find the instructions to put breakpoints on. Because GCC emits different code to Solaris CC, we have to do more work. Each possible instruction in a probe target must be emulated. Solaris worries about 5-6 instructions, but we must worry about nearly the entire x86 instruction set. (Its fun, and I am sad, so its no big deal).

cpu_emulate.c has this code in.

What I am doing is picking off the majority and working through them. Many instructions arent mapped (yet) - which is fine - they just become non-probable functions. (BTW FBT doesnt handle non-module code, ie. the kernel itself, yet, because I didnt know what I was doing at the time, and greatful I never wrote the code, since I have to solve the "how to trace printk, whilst using printk inside dtrace to debug itself!").

Anyway, i took the output of 'dtrace -l' and made a script to try probing each function individually, to find where dtrace is broken.

This worked well - you can really torture Linux with dtrace.

However, *something* is wrong. I enable 1300+ probes, and it can run for ages, and die (no panic, no message - just a hang) randomly. I tried bisecting the probe list to find the culprit, but the problem goes away.

My fear is that either we have something like a double interrupt - a rare and non-deterministic event, or the end-of-interrupt code is missing something (it is missing something, but I havent proved to myself if it matters, as garnered by looking at Linux's normal end of interrupt code where it needs to know if its returning to kernel mode or user mode, and restore IRQ triggers).

Alas, because I am debugging blind, its not helpful.

I need kgdb to help, but after my last kernel build with kgdb, Ubuntu refused to recognize the root filesystem.

I *do feel* its random interrupt 'edginess' rather than bad code in cpu_emulate.c. If it is bad code, it should be deterministic and I can track down which probe is the faulty one.

In any case, for mild dtrace attacks, it works. Just use it on a kernel you care about :-)

Posted at 21:46:22 by Paul Fox | Permalink
  fixed syscall tracing Sunday, 22 February 2009  
I mentioned in the previous note that syscall tracing could do strange things (core dump apps) if we tried to trace them all, showing something was wrong for at least one call.

I tracked it down to sys_clone - not copying enough of the stack. This works now, and you can do this:

 dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
and other variants, showing the execname and/or pid, etc.

The beauty of this one liner is that you can see the whole system in terms of who is doing what, and with a few small variations, you can get more info. The output is exactly as you would want to home in on a candidate process, e.g. sorted by frequency.

This is exactly what dtrace is all about.

Next up is to track down more fbt issues.

Posted at 13:11:17 by Paul Fox | Permalink
  fixes for syscall tracing Sunday, 22 February 2009  
Found out that not all syscalls were safe, e.g. sigreturn() needs to patch the return register set. The call to the underlying syscall wasnt faking the stack properly, and this is fixed (for i386).

This now works (nearl):

/home/fox/src/dtrace@vmub32: dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 321 probes

clone 2 munmap 2 pipe 2 setpgid 2 socketcall 2 sigreturn 3 fstat64 4 waitpid 4 stat64 5 mmap2 6 writev 8 close 11 _newselect 15 open 18 gettimeofday 22 poll 27 rt_sigaction 54 rt_sigprocmask 74 ioctl 81 write 141 read 172 time 1414 futex 1424 clock_gettime 7047

I say 'nearly', because whilst it is running, some apps were core dumping, so obviously its not 100% correct, but at least the kernel didnt panic or complain about nastiness

Posted at 11:33:34 by Paul Fox | Permalink
  memory allocation issue resolved (hopefully) Saturday, 21 February 2009  
Found that I should call vmalloc() instead of kmalloc() for big allocations. Ive set it up to use that if we ask for more than 100K of memory, and that looks much better.

Lets see how it goes.

Posted at 21:40:37 by Paul Fox | Permalink
  linux kernel memory allocation Saturday, 21 February 2009  
I reported the other day that after running dtrace on my 2GB RAM machine, the box started paging like crazy - so much so, I had to reboot.

I tried again today and I fear Linux kernel memory allocation is broken, again. dtrace uses the kmalloc() calls to allocate memory. It allocates quite a lot of memory in tiny pieces for things like the FBT and other probes, but when you run dtrace for real, it allocates some big buffers to store the probe-time records. dtrace must preallocate this space because its not safe to do at probe time.

dtrace seems to try around 2-4MB per buffer, but needs one per CPU. kmalloc nearly has a heart attack when it sees this - it causes pain to the memory fragmentation, and I suspect it then goes crazy trying to swap everything out in an attempt to 'defrag' memory and avoid problems with memory fragmentation.

All malloc() libraries have this problem sooner or later.

It makes me wonder if Sun tuned dtrace for big RAM machines, e.g. 16+GB RAM. (I need to look at the default).

I have temporarily modified the memory allocator to avoid allocating anything bigger than 260K, and this works well - no frantic paging (even after dtrace is removed from the kernel).

The next step is to look at the buffer allocation code and see if we can allocate a (scattered) array of pages, which would avoid any such fragmentation problems.

Theres still more work to do in FBT instruction sequences, am slowly working my way through the 32-bit list (we only trap about 4000 of 40,000 available functions at present). I noticed the 64bit kernel has a similar issue - namely, that GCC generates more distinct assembler, and on my ubuntu, where there are around 60,000+ functions available, we only allow trapping on about 2,000. So, hopefully this will be easy to address, simply going through piece by piece.

(FYI, tests/mkprobeall is useful to iterate over every probe to validate safety).

Posted at 19:16:28 by Paul Fox | Permalink
  dtrace and mail Friday, 20 February 2009  
Sorry, but if people want to correspond with me, its going to be plain text or html or both, but not mime-only encoding. My emailer is CRiSP - its homebrew and is virus resistant because I wont run anything that will autoexecute foreign code or plugins - Windows or Linux. I dont want to spend my life patching other peoples security holes.

I will likely attract more spam than I do already because I freely publicise my name, email and web site, and thats fine - I can learn something from the spam, and I can recognize it without opening files.

Sorry to be a pedant !

Posted at 21:13:37 by Paul Fox | Permalink
  dtrace progress Thursday, 19 February 2009  
Things are looking good in dtrace - it seems to work as I try out more test cases. However, there are quirks and some missing stuff, which is slowly being worked on.

The quirks are anoying - maybe not repeatable. E.g. I crashed my laptop tonight; after a reboot dtrace and the same test was fine. (Maybe kmalloc memory pressure causes an issue?) The heavy paging I experienced last night hasnt happened. (Maybe a fight with vmware?)

Added a special case handler for "SUB $12,ESP".

The goal is just about everything in /proc/kallsyms is suitable for FBT probing. (49869 on Fedora Core 8; 58699 in my Ubuntu laptop). Presently only 3000+ entries are probable because many are disallowed until safe instruction emulations are in place.

So, we are looking at quality and polish, and the user community trying it out, feeding back.

Reliability is an issue: if I am doing anything bad, I will be caught out, and only after a lot of hard work to find out where that problem area lays.

I left dtrace doing something boring (intercepting mkdir() syscalls all night). It was fine.

But its torture and stress testing.

I need at some point to go back to 64-bit testing/debugging. At present I am just doing 32-bit testing.

Posted at 23:34:31 by Paul Fox | Permalink
  dtrace .. more fixes .. strange VM behavior Wednesday, 18 February 2009  
What I thought I had fixed yesterday, I hadnt..quite.

The issue about /bin/ld and binutils is true - it didnt like what dtrace generates for a USDT application, but I only applied a fix for 32-bit ELF files, and not also for 64-bit ones.

This is fixed, and hopefully "make all" should now run to completion.

I loaded dtrace into a non-VMware machine - my main 2GB RAM laptop (32bit Ubuntu). It worked well - I could trace failed open() system calls (scripts/nonexist.d), but what is strange is that when Ctrl-C-ing dtrace, the kernel went funny forcibly evicting every page from the RAM cache.

Maybe I am using a bogus kmalloc/kfree syscall which is triggering that. (My vmware sessions tend to only have 256MB of RAM, so its not so noticable that this may have occurred).

Certainly makes machine unusable til the page flushing has completed.

Posted at 23:03:25 by Paul Fox | Permalink
  dtrace build failures Tuesday, 17 February 2009  
Ive been compiling dtrace on various platforms, but on AS4 and some user reported platforms there are build errors.

An annoying error for me to track down is why the USDT userland application fails to build. If you do another make, progress is made and the driver is then built.

There seems to be an issue with an invalid ELF file being generated. Hopefully I can track it down tonight -- it could be a problem with versions of the development kit (/bin/ld, gcc, libelf etc) or maybe a sillyism in the porting of the code.

To those that are trying - the usdt piece doesnt matter, so you can skip over it (e.g. 'make -i'), but hopefully I will get this done.

I am also adding a bug report script to help on some of these - not sure if its that useful, but it will avoid delays to get consistent info from people about platforms and bit-ness.

Posted at 20:17:43 by Paul Fox | Permalink
  more dtrace asm Monday, 16 February 2009  
There are now 3 cpu instructions emulated in assembler. This means we have access to hundreds or maybe thousands of probes. Three is a relative number - there are lots of instructions possible in the first part of a function using FBT, but this is a good start/cross-section.

I have a little script in the tests directory:

$ cat tests/mkprobeall
# Simple script to try out each probe in turn so we can 
# know where we may die
grep fbt.*entry /tmp/probes.current | 
while read id provider module func startstop
	echo $func $startstop
	dtrace -n "fbt::$func:entry  { @[probefunc] = count(); }
		profile:::tick-1sec { exit(0); }
which is useful to try each function in turn to see where we crash, so I can start doing more work on the others.

If I am lucky, in the future I may be able to create decent regressions on these instruction sequences. (The latest - MOV $nnn,%ESP - needs some fixing since it may not cope with a small stack frame - the registers may overwrite themselves, but my initial test was a $0x48 size frame, so this works -- ext3_free_inode()).

To date, I am only doing i386 rather than x86-64, since this is where the most interest is.

Posted at 23:22:30 by Paul Fox | Permalink
  more dtrace and fbt Sunday, 15 February 2009  
Just in case...I wrote that we had reached achievement yesterday with FBT. Its not complete - it will still cause a GPF, but the major aspect of how it works is resolved.

In Linux and GCC, we have a more difficult life than Solaris with their C compiler. This means more instruction sequences need to be patched and emulated. It works for ext3_mkdir:entry, and I am just getting ext3_mkdir:return to work, and then I can hopefully blitz lots more functions.

Interestingly Solaris uses the LOCK prefix to intercept instructions for the i386, but we are going to use INT3, since LOCK is very dangerous, since we tripped up on the following instruction byte. I dont know why they did it that way, but I kind of get the feeling that Solaris dtrace just-about works, and a change in their compiler technology could cause a fair bit of work to get it working again. (I havent verified what Apple do, since they use GCC but they are restricting themselves to 64-bit only kernels so they may have gotten away with it too).

Again, will update here if I feel the dtrace is usable on an i386 for FBT.

Posted at 10:56:22 by Paul Fox | Permalink
  dtrace and fbt and i386 Saturday, 14 February 2009  
Gaaaak! Took a week, but it appears to work now - or at least the first part works. Heres the code that I spent all week poring over:
                        "mov %0, %%esp\n"

"pop %%ebx\n" "pop %%ecx\n" "pop %%edx\n" "pop %%esi\n" "pop %%edi\n" "pop %%ebp\n" "pop %%eax\n" "pop %%ds\n"

"pop %%es\n" "pop %%fs\n"

"push %%eax\n" "mov 8(%%esp), %%eax\n" // EIP "inc %%eax\n" "mov %%eax,4(%%esp)\n" "mov 12(%%esp),%%eax\n" // CS "mov %%eax,8(%%esp)\n" "mov 16(%%esp),%%eax\n" // Flags "mov %%eax,12(%%esp)\n" "mov %%ebp,16(%%esp)\n" // emulated push EBP "pop %%eax\n" "iret\n" : : "a" (regs) );

Ive put a comment in the code (cpu_emulate.c) to describe whats going on.

This has been interesting and frustrating. Half of the code is lifted from opensolaris - or - heavily borrowed. I had written most of the above in plain C, and did all sorts of silly things and experiments to figure out where I was going wrong and none of them worked. Almost certainly, I was nearly close at all times, but this stuff requires more accuracy than normal C code, since we cannot single step the assembler.

Whats almost hilariously funny is that I dont use kgdb. Mainly because I dont have time to set it up - but I should. I relented. I grabbed the latest linux kernel -, and built it for kgdb use, but my kernel failed to boot. So I quickly gave up.

Exactly at that point, I spotted the fatal flaw in my emulation code, and doing a FBT on ext3_mkdir works - when a new directory is created, I can see the calls fire, and I *DONT* panic the kernel.

Now, hopefully I can clean up FBT for i386 and reverify 64bit kernel works.

This is it: almost functionally complete. I have more tidyups to do and I need to ensure we cannot probe a dtrace function, and then we are done, with cosmetics left to do.

Posted at 23:50:52 by Paul Fox | Permalink
  dtrace stuff Thursday, 12 February 2009  
My inbox is starting to bloat with dtrace inquiries. I will try to respond when I can to some people, but I may need to setup a mailing list or something to try and offset the issues/support/query load.

Someone has set up a dtrace git repository based on my nightly tarballs - this is great stuff. Thanks Pete.


This could be great news - people contributing in areas I dont have time for at the moment. (dtrace is a part time hobby and its interfering with other things I need to get on with).

Just some background here: people should download and build dtrace. The more kernels and distros the better. Then changes can be made so it can compile better everywhere. I have a few kernels but not every and not every distro, so the makefile will need some fine tuning if necessary to fix these issues.

At some point in the future I may release dtrace v1.00, but until then, treat all my hype as exactly that. When I say it 'works' it means it works better than it did yesterday for *me*. It may not work for you - or it may break - or it may cause a plague of penguins to visit you.

I remain optimistic (else I would give up). You can remain watching and monitoring and/or helping out, but I need to get the featureset complete first before I can think of what comes next.

As always, if its quiet, I am fighting a battle with bugs.

Posted at 22:04:40 by Paul Fox | Permalink
  dtrace and the curse of the interrupt Wednesday, 11 February 2009  
I was making good progress over the weekend but slowed down this week: 15-20 lines of assembler are the root cause - the backend of a fbt interrupt when a trace trap is hit.

The problem is that to emulate the intercepted instruction we cannot 'return' from the C code back to the caller because we need to mangle the stack.

On Solaris, code on the backend of an interrupt does this, but we arent patching Linux remember, so we have to pretend we are running under MS-DOS, do the nasty manipulations ourself and never return, but emulate what would have happened if we had returned (pop the regs, reti).

The problem is that diagnosing what I am doing wrong is tricky (no more printk()'s to trace what happened!). I could call in the troops, e.g. kgdb, but thats too easy :-)

Nope, trial and error so I understand backwards whats going on. (I am surprised the 64bit dtrace works at all in this area, because it shouldnt; I'll diagnose why it works or why it might fail when I have finished the 32bit version).

Mostly in the kernel we deal with a struct pt_regs structure but this structure lies about whats really happening on an invalid opcode trap (and many other traps).

Fortunately, VMwares 'revert to snapshot' is a real time saver, about 3 seconds between a kernel panic and back in action to try the next test is a real timer saver.

More when I know more!

Posted at 21:34:18 by Paul Fox | Permalink
  Dtrace - fbt under i386 + Purify Sunday, 08 February 2009  
FBT for 32-bit mode kernel wont work at present. It works under 64-bit kernels, but the key issue is that the existing Sun supplied dtrace code doesnt understand GCC or Linux.

A normal C function has an entry and exit prologue (eg PUSHL %EBP). This may be true for the sun compiler, and no __inline assembler code, but not true for Linux. Many Linux functions in the kernel do not start with that prologue. In addition, use of --regparms=3 when compiling the kernel means theres more permutations.

When FBT patches a functions entry prologue, it notes what the opcode underneath the patch was and then emulates it (since we cannot easily single step in kernel mode).

Suns dtrace supports 5 scenarios -- common to 32/64 bit, but a real kernel has tens of permutations.

I am modifying the code to support the required permutations.

I mention Purify in the title. Why?

Many many years ago (about 10 years) ago, I started writing a Purify emulator (prior to valgrind being available). To avoid the key 1992 patent that Purify uses, I wrote a i386 CPU emulator. It nearly worked - I gave up, because debugging the exceptions was proving time consuming - but I did manage to run a crisp binary under it at one point (albeit at a 200x performance hit, this on a Pentium-233 processor if my memory holds true).

But, that old code is useful because I can lift some of the instruction emulations out for dtrace, so hopefully it wont take too long to get probes in.

In the prior releases of dtrace, it showed many probes under i386, but it shouldnt have: if you enabled FBT tracing, you would panic the kernel, because I allowed thru unsafe translations. This is fixed - it will only allow safe translations thru, but I have to treat each new instruction as a new case, to be emulated, so, as these get done, 'dtrace -l' will show many more fbt probes.

Posted at 11:29:53 by Paul Fox | Permalink
  Dtrace progress 20090207 Saturday, 07 February 2009  
Been working on the FBT provider issue: it doesnt work, which is not surprising. The fbt code was one of the first bits of code to get compiled, and 'work' but I had to quickly move on to other bits of dtrace to flesh out the workings.

Now that I am returning to it, life is much easier. The existing driver, along with kernel source, and OpenSolaris code is like having a google-map to wade around, comparing/understanding pieces of code.

Today I understood how fbt works - when a trap is set in kernel space, it works by planting a breakpoint in the kernel code. The trap interrupt happens, but rather than what would normally happen in user space, where we unpatch the code, single step, then plant the breakpoint back in, instead we have to emulate the patched over instruction, which is not too difficult since theres only a handful of potential instructions we can patch (push %ebp,%esp; leave/ret, nop).

Interestingly, this code is not in the Dtrace code, but in the interrupt trap handler.

Linux - especially the later kernels - provides good notifier call chain mechanisms to intercept the traps; I dont have to patch the kernel code - I can just exist as a 'nice' citizen.

So - this bit nearly works - hopefully get this to work later.

A problem is that dtrace makes it easy to place a probe on every function in the kernel in one line. I need to detect which external dependencies we have (like printk()) and avoid us trying to call a patched function from inside dtrace, else we will have a recursion issue or a double fault trap and panic the kernel. On the other hand, the ability to patch each function is good - because its easy to prove the exercise is finished.

Posted at 18:57:52 by Paul Fox | Permalink
  Dtrace progress 20090205 Friday, 06 February 2009  
I've put out a new release of dtrace. Progress has been pretty good this week, but thats *my* opinion. A certain dtrace author[1] has had troubles with the release, and I need to resolve that.

I've revamped the directory layout slightly to allow a single src tree to build against all valid kernels on the machine (as found in /lib/modules, but this can be overridden to use any tree you have on the system).

I had spent some effort to see if I can compile for RedHat AS4 (2.6.9 kernel), and a few others too, and the kernel differences, although annoying, meant I needed to clean up. This includes things like the pt_regs register layouts, which is now set up in <sys/privregs.h> to handle 64+32 and differing kernel releases.

Builds now go to the build-2.X.Y directory with build/ being a symlink to the default running kernel.

The bottom line is that I can at least validate compilation issues across kernels before putting out a release, even if I am not yet doing any sanity checking across all kernels.

A number of issues were fixed in USDT (I have a crisp binary with a USDT probe in, which is useful, since it was firing kernel issues when I had forgotten the probes were even present). Also, the issues with a real 32-bit cpu are fixed.

I'm now looking at FBT - which fails because its relying on something which I havent yet plugged together - am just researching if this is an interrupt trap or bad parsing and placement of the patch instruction. (Kernel panics with an unimplemented instruction if you use fbt).

[1] The dtrace author will remain nameless, unless he's happy for me to use his name.

Posted at 22:54:07 by Paul Fox | Permalink
  dtrace progress 32-bit cpus Tuesday, 03 February 2009  
Weird ! Dtrace works differently in a VMware session vs on a real cpu. I always thought it would be difficult to tell if you were in a VM, but dtrace found a crink.

Linux protects the syscall table (read-only). I put in code to let us patch it so that syscall tracing would work. This worked fine under a VMware guest OS, but on a real x86 CPU, the kernel would trap the fault.

After a lot of printk()ing and resolving it to very basic function calls, I ended up with a solution which is good - direct manipulation of the page table entries (followed by a TLB cache flush!)

32-bit dtrace can now trace calls again, and it will work.

Of course, theres more bugs for me to fix and some other unfinished business buts its rapidly taking final shape.

New release today with these fixes in.

Posted at 13:44:22 by Paul Fox | Permalink