dtrace bug of the day | Saturday, 28 February 2009 |
[ 4336.530339] vmap allocation failed: use vmalloc=to increase size. [ 4336.640674] vmap allocation failed: use vmalloc= to increase size. [ 4336.758243] vmap allocation failed: use vmalloc= to increase size. [ 4336.843121] BUG: unable to handle kernel paging request at 0020027e [ 4336.843139] IP: [ ] 0xe354db6d [ 4336.843151] *pde = 00000000 [ 4336.843158] Oops: 0002 [#1] SMP
At this point, dtrace cannot be killed, although the rest of the system is fine (I am typing this logged into to another machine). Looks like I need to research memory allocation more on Linux.
In a previous blog I had mentioned how use of kmalloc() would cause frantic heavy I/O as the freemem got fragmented. Use of vmalloc() solved that question, but we may be sensitive to whether how much contiguous kernel RAM is available.
dtrace torture | Friday, 27 February 2009 |
There are missing function probes in FBT, but the 5000+ available in my vmware Ubuntu (as seen via 'dtrace -l'), would cause the machine to panic or hang. I tried various techniques to track down which function was causing the problem, but no specific one was.
This work made much more difficult since when we die in the interrupt handler for INT3, we have very little ways to communicate to the outside world (without putting in more effort; it can be done).
After a lot of experimentation I tracked it down to badness on my behalf in dtrace_interrupt_enable()/dtrace_interrupt_disable().
When dtrace_probe() is called, it can happen from interrupt context - this is how FBT works, by planting INT3 instructions at the probe points. What was happening was nested interrupts - very rare - but within seconds to minutes of heavy probing.
After studying the trivial routines in dtrace_asm.c (which I had already had to rewrite once), I realised the silliness. The silliness is that these two routines allow nested disable/enable - but my implementation didnt do that, and so we prematurely enabled interrupts inside dtrace_probe() and the INT3 trap handler.
A slight complication is that even tho these lines are a few lines of trivial assembler, they depend on the GCC calling conventions as specified by the kernel build parameters. Fortunately, GCC's __asm__() allows us to not have to worry where calling parameters are (in the stack or in a register).
What does this mean?
The following example illustrates how to profile the functions in the kernel. Here, I did a kernel build (2.6.28.5). The kernel had been built already, so the make just spent a bit of time working out that nothing needed rebuilding (but it did do a relink).
/home/fox/src/dtrace@vmub32: dtrace -n 'fbt:::entry{ @[probefunc] = count(); }' dtrace: description 'fbt:::entry' matched 2593 probes .... ext3_reserve_inode_write 288067 __ext3_get_inode_loc 313572 walk_page_buffers 368757 __ext3_journal_stop 384168 ext3_journal_start_sb 384168 journal_start 384168 journal_stop 384168 __ext3_journal_get_write_access 470892 journal_get_write_access 470892 __ext3_journal_dirty_metadata 557924 journal_dirty_metadata 557924 do_get_write_access 559167 journal_cancel_revoke 559270 ext3_htree_store_dirent 614816 call_filldir 617133 ext3fs_dirhash 630767 str2hashbuf 655621 journal_add_journal_head 747785 ext3_check_dir_entry 754274 journal_put_journal_head 786996
dtrace torture | Wednesday, 25 February 2009 |
Inside fbt, we disassemble each function to find the instructions to put breakpoints on. Because GCC emits different code to Solaris CC, we have to do more work. Each possible instruction in a probe target must be emulated. Solaris worries about 5-6 instructions, but we must worry about nearly the entire x86 instruction set. (Its fun, and I am sad, so its no big deal).
cpu_emulate.c has this code in.
What I am doing is picking off the majority and working through them. Many instructions arent mapped (yet) - which is fine - they just become non-probable functions. (BTW FBT doesnt handle non-module code, ie. the kernel itself, yet, because I didnt know what I was doing at the time, and greatful I never wrote the code, since I have to solve the "how to trace printk, whilst using printk inside dtrace to debug itself!").
Anyway, i took the output of 'dtrace -l' and made a script to try probing each function individually, to find where dtrace is broken.
This worked well - you can really torture Linux with dtrace.
However, *something* is wrong. I enable 1300+ probes, and it can run for ages, and die (no panic, no message - just a hang) randomly. I tried bisecting the probe list to find the culprit, but the problem goes away.
My fear is that either we have something like a double interrupt - a rare and non-deterministic event, or the end-of-interrupt code is missing something (it is missing something, but I havent proved to myself if it matters, as garnered by looking at Linux's normal end of interrupt code where it needs to know if its returning to kernel mode or user mode, and restore IRQ triggers).
Alas, because I am debugging blind, its not helpful.
I need kgdb to help, but after my last kernel build with kgdb, Ubuntu refused to recognize the root filesystem.
I *do feel* its random interrupt 'edginess' rather than bad code in cpu_emulate.c. If it is bad code, it should be deterministic and I can track down which probe is the faulty one.
In any case, for mild dtrace attacks, it works. Just use it on a kernel you care about :-)
fixed syscall tracing | Sunday, 22 February 2009 |
I tracked it down to sys_clone - not copying enough of the stack. This works now, and you can do this:
dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'and other variants, showing the execname and/or pid, etc.
The beauty of this one liner is that you can see the whole system in terms of who is doing what, and with a few small variations, you can get more info. The output is exactly as you would want to home in on a candidate process, e.g. sorted by frequency.
This is exactly what dtrace is all about.
Next up is to track down more fbt issues.
fixes for syscall tracing | Sunday, 22 February 2009 |
This now works (nearl):
/home/fox/src/dtrace@vmub32: dtrace -n 'syscall:::entry { @num[probefunc] = count(); }' dtrace: description 'syscall:::entry ' matched 321 probes
clone 2 munmap 2 pipe 2 setpgid 2 socketcall 2 sigreturn 3 fstat64 4 waitpid 4 stat64 5 mmap2 6 writev 8 close 11 _newselect 15 open 18 gettimeofday 22 poll 27 rt_sigaction 54 rt_sigprocmask 74 ioctl 81 write 141 read 172 time 1414 futex 1424 clock_gettime 7047
I say 'nearly', because whilst it is running, some apps were core dumping, so obviously its not 100% correct, but at least the kernel didnt panic or complain about nastiness
memory allocation issue resolved (hopefully) | Saturday, 21 February 2009 |
Lets see how it goes.
linux kernel memory allocation | Saturday, 21 February 2009 |
I tried again today and I fear Linux kernel memory allocation is broken, again. dtrace uses the kmalloc() calls to allocate memory. It allocates quite a lot of memory in tiny pieces for things like the FBT and other probes, but when you run dtrace for real, it allocates some big buffers to store the probe-time records. dtrace must preallocate this space because its not safe to do at probe time.
dtrace seems to try around 2-4MB per buffer, but needs one per CPU. kmalloc nearly has a heart attack when it sees this - it causes pain to the memory fragmentation, and I suspect it then goes crazy trying to swap everything out in an attempt to 'defrag' memory and avoid problems with memory fragmentation.
All malloc() libraries have this problem sooner or later.
It makes me wonder if Sun tuned dtrace for big RAM machines, e.g. 16+GB RAM. (I need to look at the default).
I have temporarily modified the memory allocator to avoid allocating anything bigger than 260K, and this works well - no frantic paging (even after dtrace is removed from the kernel).
The next step is to look at the buffer allocation code and see if we can allocate a (scattered) array of pages, which would avoid any such fragmentation problems.
Theres still more work to do in FBT instruction sequences, am slowly working my way through the 32-bit list (we only trap about 4000 of 40,000 available functions at present). I noticed the 64bit kernel has a similar issue - namely, that GCC generates more distinct assembler, and on my ubuntu, where there are around 60,000+ functions available, we only allow trapping on about 2,000. So, hopefully this will be easy to address, simply going through piece by piece.
(FYI, tests/mkprobeall is useful to iterate over every probe to validate safety).
dtrace and mail | Friday, 20 February 2009 |
I will likely attract more spam than I do already because I freely publicise my name, email and web site, and thats fine - I can learn something from the spam, and I can recognize it without opening files.
Sorry to be a pedant !
dtrace progress | Thursday, 19 February 2009 |
The quirks are anoying - maybe not repeatable. E.g. I crashed my laptop tonight; after a reboot dtrace and the same test was fine. (Maybe kmalloc memory pressure causes an issue?) The heavy paging I experienced last night hasnt happened. (Maybe a fight with vmware?)
Added a special case handler for "SUB $12,ESP".
The goal is just about everything in /proc/kallsyms is suitable for FBT probing. (49869 on Fedora Core 8; 58699 in my Ubuntu laptop). Presently only 3000+ entries are probable because many are disallowed until safe instruction emulations are in place.
So, we are looking at quality and polish, and the user community trying it out, feeding back.
Reliability is an issue: if I am doing anything bad, I will be caught out, and only after a lot of hard work to find out where that problem area lays.
I left dtrace doing something boring (intercepting mkdir() syscalls all night). It was fine.
But its torture and stress testing.
I need at some point to go back to 64-bit testing/debugging. At present I am just doing 32-bit testing.
dtrace .. more fixes .. strange VM behavior | Wednesday, 18 February 2009 |
The issue about /bin/ld and binutils is true - it didnt like what dtrace generates for a USDT application, but I only applied a fix for 32-bit ELF files, and not also for 64-bit ones.
This is fixed, and hopefully "make all" should now run to completion.
I loaded dtrace into a non-VMware machine - my main 2GB RAM laptop (32bit Ubuntu). It worked well - I could trace failed open() system calls (scripts/nonexist.d), but what is strange is that when Ctrl-C-ing dtrace, the kernel went funny forcibly evicting every page from the RAM cache.
Maybe I am using a bogus kmalloc/kfree syscall which is triggering that. (My vmware sessions tend to only have 256MB of RAM, so its not so noticable that this may have occurred).
Certainly makes machine unusable til the page flushing has completed.
dtrace build failures | Tuesday, 17 February 2009 |
An annoying error for me to track down is why the USDT userland application fails to build. If you do another make, progress is made and the driver is then built.
There seems to be an issue with an invalid ELF file being generated. Hopefully I can track it down tonight -- it could be a problem with versions of the development kit (/bin/ld, gcc, libelf etc) or maybe a sillyism in the porting of the code.
To those that are trying - the usdt piece doesnt matter, so you can skip over it (e.g. 'make -i'), but hopefully I will get this done.
I am also adding a bug report script to help on some of these - not sure if its that useful, but it will avoid delays to get consistent info from people about platforms and bit-ness.
more dtrace asm | Monday, 16 February 2009 |
I have a little script in the tests directory:
$ cat tests/mkprobeall # Simple script to try out each probe in turn so we can # know where we may die grep fbt.*entry /tmp/probes.current | while read id provider module func startstop do echo $func $startstop dtrace -n "fbt::$func:entry { @[probefunc] = count(); } profile:::tick-1sec { exit(0); } " donewhich is useful to try each function in turn to see where we crash, so I can start doing more work on the others.
If I am lucky, in the future I may be able to create decent regressions on these instruction sequences. (The latest - MOV $nnn,%ESP - needs some fixing since it may not cope with a small stack frame - the registers may overwrite themselves, but my initial test was a $0x48 size frame, so this works -- ext3_free_inode()).
To date, I am only doing i386 rather than x86-64, since this is where the most interest is.
more dtrace and fbt | Sunday, 15 February 2009 |
In Linux and GCC, we have a more difficult life than Solaris with their C compiler. This means more instruction sequences need to be patched and emulated. It works for ext3_mkdir:entry, and I am just getting ext3_mkdir:return to work, and then I can hopefully blitz lots more functions.
Interestingly Solaris uses the LOCK prefix to intercept instructions for the i386, but we are going to use INT3, since LOCK is very dangerous, since we tripped up on the following instruction byte. I dont know why they did it that way, but I kind of get the feeling that Solaris dtrace just-about works, and a change in their compiler technology could cause a fair bit of work to get it working again. (I havent verified what Apple do, since they use GCC but they are restricting themselves to 64-bit only kernels so they may have gotten away with it too).
Again, will update here if I feel the dtrace is usable on an i386 for FBT.
dtrace and fbt and i386 | Saturday, 14 February 2009 |
__asm( "mov %0, %%esp\n""pop %%ebx\n" "pop %%ecx\n" "pop %%edx\n" "pop %%esi\n" "pop %%edi\n" "pop %%ebp\n" "pop %%eax\n" "pop %%ds\n"
"pop %%es\n" "pop %%fs\n"
"push %%eax\n" "mov 8(%%esp), %%eax\n" // EIP "inc %%eax\n" "mov %%eax,4(%%esp)\n" "mov 12(%%esp),%%eax\n" // CS "mov %%eax,8(%%esp)\n" "mov 16(%%esp),%%eax\n" // Flags "mov %%eax,12(%%esp)\n" "mov %%ebp,16(%%esp)\n" // emulated push EBP "pop %%eax\n" "iret\n" : : "a" (regs) );
Ive put a comment in the code (cpu_emulate.c) to describe whats going on.
This has been interesting and frustrating. Half of the code is lifted from opensolaris - or - heavily borrowed. I had written most of the above in plain C, and did all sorts of silly things and experiments to figure out where I was going wrong and none of them worked. Almost certainly, I was nearly close at all times, but this stuff requires more accuracy than normal C code, since we cannot single step the assembler.
Whats almost hilariously funny is that I dont use kgdb. Mainly because I dont have time to set it up - but I should. I relented. I grabbed the latest linux kernel - 2.6.28.5, and built it for kgdb use, but my kernel failed to boot. So I quickly gave up.
Exactly at that point, I spotted the fatal flaw in my emulation code, and doing a FBT on ext3_mkdir works - when a new directory is created, I can see the calls fire, and I *DONT* panic the kernel.
Now, hopefully I can clean up FBT for i386 and reverify 64bit kernel works.
This is it: almost functionally complete. I have more tidyups to do and I need to ensure we cannot probe a dtrace function, and then we are done, with cosmetics left to do.
dtrace stuff | Thursday, 12 February 2009 |
Someone has set up a dtrace git repository based on my nightly tarballs - this is great stuff. Thanks Pete.
http://github.com/pmccormick/dtrace-for-linux/tree/master
This could be great news - people contributing in areas I dont have time for at the moment. (dtrace is a part time hobby and its interfering with other things I need to get on with).
Just some background here: people should download and build dtrace. The more kernels and distros the better. Then changes can be made so it can compile better everywhere. I have a few kernels but not every and not every distro, so the makefile will need some fine tuning if necessary to fix these issues.
At some point in the future I may release dtrace v1.00, but until then, treat all my hype as exactly that. When I say it 'works' it means it works better than it did yesterday for *me*. It may not work for you - or it may break - or it may cause a plague of penguins to visit you.
I remain optimistic (else I would give up). You can remain watching and monitoring and/or helping out, but I need to get the featureset complete first before I can think of what comes next.
As always, if its quiet, I am fighting a battle with bugs.
dtrace and the curse of the interrupt | Wednesday, 11 February 2009 |
The problem is that to emulate the intercepted instruction we cannot 'return' from the C code back to the caller because we need to mangle the stack.
On Solaris, code on the backend of an interrupt does this, but we arent patching Linux remember, so we have to pretend we are running under MS-DOS, do the nasty manipulations ourself and never return, but emulate what would have happened if we had returned (pop the regs, reti).
The problem is that diagnosing what I am doing wrong is tricky (no more printk()'s to trace what happened!). I could call in the troops, e.g. kgdb, but thats too easy :-)
Nope, trial and error so I understand backwards whats going on. (I am surprised the 64bit dtrace works at all in this area, because it shouldnt; I'll diagnose why it works or why it might fail when I have finished the 32bit version).
Mostly in the kernel we deal with a struct pt_regs structure but this structure lies about whats really happening on an invalid opcode trap (and many other traps).
Fortunately, VMwares 'revert to snapshot' is a real time saver, about 3 seconds between a kernel panic and back in action to try the next test is a real timer saver.
More when I know more!
Dtrace - fbt under i386 + Purify | Sunday, 08 February 2009 |
A normal C function has an entry and exit prologue (eg PUSHL %EBP). This may be true for the sun compiler, and no __inline assembler code, but not true for Linux. Many Linux functions in the kernel do not start with that prologue. In addition, use of --regparms=3 when compiling the kernel means theres more permutations.
When FBT patches a functions entry prologue, it notes what the opcode underneath the patch was and then emulates it (since we cannot easily single step in kernel mode).
Suns dtrace supports 5 scenarios -- common to 32/64 bit, but a real kernel has tens of permutations.
I am modifying the code to support the required permutations.
I mention Purify in the title. Why?
Many many years ago (about 10 years) ago, I started writing a Purify emulator (prior to valgrind being available). To avoid the key 1992 patent that Purify uses, I wrote a i386 CPU emulator. It nearly worked - I gave up, because debugging the exceptions was proving time consuming - but I did manage to run a crisp binary under it at one point (albeit at a 200x performance hit, this on a Pentium-233 processor if my memory holds true).
But, that old code is useful because I can lift some of the instruction emulations out for dtrace, so hopefully it wont take too long to get probes in.
In the prior releases of dtrace, it showed many probes under i386, but it shouldnt have: if you enabled FBT tracing, you would panic the kernel, because I allowed thru unsafe translations. This is fixed - it will only allow safe translations thru, but I have to treat each new instruction as a new case, to be emulated, so, as these get done, 'dtrace -l' will show many more fbt probes.
Dtrace progress 20090207 | Saturday, 07 February 2009 |
Now that I am returning to it, life is much easier. The existing driver, along with kernel source, and OpenSolaris code is like having a google-map to wade around, comparing/understanding pieces of code.
Today I understood how fbt works - when a trap is set in kernel space, it works by planting a breakpoint in the kernel code. The trap interrupt happens, but rather than what would normally happen in user space, where we unpatch the code, single step, then plant the breakpoint back in, instead we have to emulate the patched over instruction, which is not too difficult since theres only a handful of potential instructions we can patch (push %ebp,%esp; leave/ret, nop).
Interestingly, this code is not in the Dtrace code, but in the interrupt trap handler.
Linux - especially the later kernels - provides good notifier call chain mechanisms to intercept the traps; I dont have to patch the kernel code - I can just exist as a 'nice' citizen.
So - this bit nearly works - hopefully get this to work later.
A problem is that dtrace makes it easy to place a probe on every function in the kernel in one line. I need to detect which external dependencies we have (like printk()) and avoid us trying to call a patched function from inside dtrace, else we will have a recursion issue or a double fault trap and panic the kernel. On the other hand, the ability to patch each function is good - because its easy to prove the exercise is finished.
Dtrace progress 20090205 | Friday, 06 February 2009 |
I've revamped the directory layout slightly to allow a single src tree to build against all valid kernels on the machine (as found in /lib/modules, but this can be overridden to use any tree you have on the system).
I had spent some effort to see if I can compile for RedHat AS4 (2.6.9 kernel), and a few others too, and the kernel differences, although annoying, meant I needed to clean up. This includes things like the pt_regs register layouts, which is now set up in <sys/privregs.h> to handle 64+32 and differing kernel releases.
Builds now go to the build-2.X.Y directory with build/ being a symlink to the default running kernel.
The bottom line is that I can at least validate compilation issues across kernels before putting out a release, even if I am not yet doing any sanity checking across all kernels.
A number of issues were fixed in USDT (I have a crisp binary with a USDT probe in, which is useful, since it was firing kernel issues when I had forgotten the probes were even present). Also, the issues with a real 32-bit cpu are fixed.
I'm now looking at FBT - which fails because its relying on something which I havent yet plugged together - am just researching if this is an interrupt trap or bad parsing and placement of the patch instruction. (Kernel panics with an unimplemented instruction if you use fbt).
[1] The dtrace author will remain nameless, unless he's happy for me to use his name.
dtrace progress 32-bit cpus | Tuesday, 03 February 2009 |
Linux protects the syscall table (read-only). I put in code to let us patch it so that syscall tracing would work. This worked fine under a VMware guest OS, but on a real x86 CPU, the kernel would trap the fault.
After a lot of printk()ing and resolving it to very basic function calls, I ended up with a solution which is good - direct manipulation of the page table entries (followed by a TLB cache flush!)
32-bit dtrace can now trace calls again, and it will work.
Of course, theres more bugs for me to fix and some other unfinished business buts its rapidly taking final shape.
New release today with these fixes in.