rtdb framework | Monday, 29 June 2009 |
The rtdb functions are interfaces to the runtime linker (ld.so.1), and, although very nice, rely on intimate behavior of the Solaris linker. This doesnt exist on Linux (i.e. the corresponding functions). So, copying the code into dtrace means copying more and more dependencies (avlist, linked list, msg locales and other stuff), for little benefit.
dtrace uses these functions in a very specific way: get the symtab of the target process we are tracing, along with the symtab for the loaded shared libraries.
I am going to draw a line and see how much I can do without dragging it in. (I dragged it in and have kicked it out again, as I just spend more and more time porting Solaris to Linux, which isnt the end goal).
The end goal is making the PID provider and user space stack traces "as they should be".
This will likely take a while, so will update periodically if I feel what I have is no worse than before.
dtrace progress - symtabs | Sunday, 28 June 2009 |
I am currently looking at this:
$ dtrace -n pidXXX::: -p XXXI tried this on my MacOS system, and was intrigued by the fact that for a sample Perl app, tens of thousands of new probes sprang into life. It looks to me that you can DOS attack a kernel with these privs, since if you do this on lots of processes, you can eat the probe memory that dtrace will set aside, and either run out, or affect performance of a system.
At the moment I am knee deep in more ELF/dynamic stuff, so that we can get the symtab of a running process so that the PID provider is more usable.
SDT probes - what? | Thursday, 25 June 2009 |
Thats how I read the SDT: SDT can provide a probe like "received_packet" and provide an argument which represents the packet so you can dissect it.
But, the question is - are they useful ?!
I dont really understand the probes despite staring at the code for a while. I understand lots of the technicalities, but not the rationale. Is my first paragraph spot on? Feel free to send me feedback about why they are a *must*.
Why?
Well, many of the probes in Solaris relate to Solaris internals. The concepts of scheduling on solaris dont match the Linux kernel. Solaris has a process and a lwp (lightweight kernel thread). In Linux, all threads are really processes.
So, if you have a D script written for Solaris, it wont work on Linux, unless I provide as close an emulation as possible. I have found the FBT is more than enough to keep me entertained, but I am trying to find if we need SDT.
There are a lot of values exposed in /proc such as statistic counters. And there is a lot of code in the kernel which increments those counters. But the counters on their own are not directly interesting (you can put an FBT on the functions that manipulate those counters). So, maybe I am missing something, like, with dtrace/linux today, you cannot easily inspect processes, io, vm, packets, etc.
fixed the 32b problems? | Tuesday, 23 June 2009 |
I will have to analyse this more to see why that hook (which shouldnt fire, and we do put it back on a rmmod) causes a problem.
32b drat | Tuesday, 23 June 2009 |
After trying a few experiments, it appears that reloading the dtrace driver will panic/crash/reboot the 32b kernel. (After 3 times for my test machine, and in vmware, a reboot occurs, indicating a likely triple-fault).
I suspect maybe on driver unload, something is not being undone which happened on a load (maybe reset/unhooking the interrupt vectors).
I am investigating.
SDT Progress
Ive done some research on how to get SDT into the kernel without touching the kernel source. I was hoping for key subsystems like the scheduler, VM, NFS, that we would find a structure containing counters which are incremented at key parts of the driver, and the ones exposed in /proc. If we did, we could modify the instruction provider to look for these increments, and auto-create the probes.
What I have found so far in looking around, is that some/all drivers have either a disconnected adhoc collection of counters or have per "instance" counters. (I found references to zones in the MM code), so it wont be as easy as I hoped, but I am continuing to look for a pattern.
dtrace -p now works | Monday, 22 June 2009 |
It still concerns me that you can kill -9 the dtrace and the child can be left stuck in an indeterminate state. Whilst thinking about this, I have a possible solution, namely to let the dtrace driver know what we are doing, and should the dtrace process die, we could force a SIGCONT (PTRACE_CONT) on the debuggee, so all is not lost, and we dont need to do what Solaris does in the /proc filesystem.
So, next up is either ustack() (and user space symbol tables), or the SDT driver. I am still a little confused by SDT and the "transform" keyword in a D script which provides struct-level access to kernel and user space params, but I know what I am expecting to see/work, so I just need to play.
SDT will be interesting - I have a plan to use the Instruction Provider to disassemble the kernel and intercept ADD instructions which apply to a global memory area corresponding to a struct of interest. I hope this will work for some/most of the desired areas, and if so, we have a way to intercept processes which trigger various kernel counters.
One thing to note with dtrace -c/-p - the way dtrace works is to get the process going and then to kick off the kernel rules engine. The kernel doesnt really know whats going in user space - you can elect to monitor probes for the process or any sibling (like truss -f or strace -f) by virtue of your predicates on the probes you write. This really is very powerful, since dtrace can (in theory) do everything strace and truss can do, but via lower level primitives.
Dtrace emulating truss is available as some scripts on the internet show, but some aspects of the way this is done is a little "clunky". I will experiment at a later date to see if we can more closely emulate strace/truss so that dtrace can be a one-stop-shop for these kinds of things.
New release available today whilst I go off and do some more real work.
One step closer - dtrace -c works | Sunday, 21 June 2009 |
You can now do this:
$ dtrace -n 'syscall::mu*:/pid==$target/{printf("%d",pid);}' -c df dtrace: description 'syscall::mu*:' matched 6 probes Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 5874396 5097744 478248 92% / varrun 255460 72 255388 1% /var/run varlock 255460 0 255460 0% /var/lock udev 255460 44 255416 1% /dev devshm 255460 0 255460 0% /dev/shm dtrace: pid 7689 has exited CPU ID FUNCTION:NAME 0 87377 munmap:entry 7689 0 87378 munmap:return 7689 0 87377 munmap:entry 7689 0 87378 munmap:return 7689 0 87377 munmap:entry 7689 0 87378 munmap:return 7689You may see some debug printf's I have left in there, but next thing is to tackle the symtab (stack/ustack) stuff, and consider library probes.
The -c stuff (and -p, which I havent tested yet) may have some issues. Theres a horrible sleep(1) in the child after a fork() to let the parent catch up with the child. I found the Linux kernel seemed to be broken in some areas (I believe threads which inherit ptrace() children have problems).
The sleep can be solved easily with some form of shm mutex or maybe even a futex, but I havent tried.
What is worrying is that in Solaris /proc fs, you can signal a child process to continue on its own if you, the parent, die. On Linux, this isnt there, so, consider:
$ dtrace -n ... -p <pid>If you kill -9 the dtrace process, then the target process may be left in an indeterminate state. This is true for strace. dtrace and strace can work hard to intercept SIGINT/SIGHUP/SIGTERM/etc, but cannot do anything about SIGKILL. I can think of a not-nice to partially solve this (or maybe we could put something into the kernel to handle this), but that is a reason why /proc/pid/ctl wins on Solaris.
Linux user threads - bug ? | Friday, 19 June 2009 |
I found something interesting, which had been confusing me, having spent so long inside the kernel.
In Unix, we have the ptrace() system call - which is the basis of all debuggers. You can attach to a process and do things like set breakpoints or intercept events of interest, like signals.
The way the works is in one of two ways: if you are a debugger (which dtrace, gdb, strace, etc all are), then you fork yourself. The child notifies the kernel it is happy to be traced (via ptrace(PTRACE_ME)), and then forks+exec's the target process.
The parent debugger attaches to the target pid (it knows the pid, because we just forked). It does this via ptrace(PTRACE_ATTACH), and from then on can peek/poke the target process, or continue after an event.
So, here is the bug. In order to ptrace a process you need to attach to it. Two debuggers (eg gdb + strace) cannot attach to the same process at the same time.
Now, consider this. You are a process. You create a new thread. This thread forks() + execs the target. The new thread tries to attach to the process, but fails, because the master thread is considered the 'parent' of the child, and the thread you spawned is considered to be a distinct process - not a thread of the main process.
The issue here is that in Linux, threads are implemented as if you had forked a new process, but the thread shares the address space of the parent. This is not true of a proper multithreaded and POSIX compliant system. E.g. in Solaris, a thread is really a separate 'slice' of a process, and it shares the process id of its parent.
Linux tries to pretend threads exist, but this funky emulation seems to break how ptrace() works.
This is why I have had a hard time getting userland dtrace to work properly in this area - as I have been trying to understand what dtrace is doing and why the target process was stuck in the wrong state.
Now I understand, hopefully the "-c" and "-p" switches to dtrace can be made to work, and this will be a significant feature addition to Linux/dtrace.
Next up... | Monday, 15 June 2009 |
$ dtrace -n syscall:::/pid==$target/{} -c "sleep 100"
This is how to trace the syscalls for a specific process we want to launch - one of the last major features of Linux Dtrace which is missing.
Interestingly, I seem to be hitting an issue with pthreads vs fork/waitpid semantics...Time to read more on who gets the signals on Linux, vs solaris...
Dependencies | Monday, 15 June 2009 |
I am not going to respond to emails for trivial support issues.
Thank you
dtrace and the CALL instruction .. fixed | Friday, 12 June 2009 |
Amazingly, I was staring at a solution in the Linux kernel, but my brain has been hazy the last few days. I had implemented the Instruction Provider which has been a great help to find lots of samples of instructions I care about and try and get a feeling for what is going on.
The issue I was seeing is that when we take the INT3 and INT1 handler - for the initial breakpoint trap and then the single step trap, we would expect the kernel RSP to have moved, because we had just stepped a CALL instruction. But I wasnt seeing this. The "regs" structure on the stack at the point of exception for the same. This didnt make sense.
I hacked it for one 64b kernel, but the others hated my hack. (My hack involved looking at the stack dumps and trying to 'find' the magic values I wanted),
It worked fine on 32b kernels. Imagine an interrupt from kernel space taking place. The cpu pushes RFLAGS, RCS, RIP, in that order onto the existing stack. At this point, our code kicks in and pushes the full register set on to the stack (giving us a "struct pt_regs" structure we can point to and manipulate before returning from the interrupt).
Just above the flags should be the stack where we interrupted. This *is* true on a 32b cpu but not on a 64b cpu. I *think* the reason is that on 64b cpus, Linux sets up a TSS task switch so that on an interrupt, we have a private kernel stack, and this would hopefully avoid stack overflows if we interrupted a deeply nested part of the kernel.
That is why the 'regs' structure is always at the same address, and what we have in the r_rsp field is a POINTER to the original stack, not the stack itself!
A quick experiment and I could run:
$ dtrace instr::*call*:to trap every call instruction in the kernel and it worked. In addition
$ dtrace fbt:::works flawlessly on all three key 64b kernels I was trying, and I hadnt even broken the 32b kernel in fixing this.
Theres still a bogus issue or two to track down. Ctrl-C-ing dtrace can cause kernel problems - not sure why. If you Ctrl-C the dtrace binary, it sends an ioctl to the kernel to ask it to pull apart your probes rather than just exiting. Dont fully understand why they do that but it maybe for when you launch a binary from dtrace and it needs to kill or detach.
So, if this done, I can hopefully return back to user space and get userspace apps to be traced as well, and then we are done....
The Instruction Provider driver is hopefully going to be useful to implement a proper set of probes for the things that avoid patching kernel source.
Instruction Provider now works | Monday, 08 June 2009 |
$ dtrace -n instr::*-nop: 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86742 mutex_trylock-nop:0xffffffff8045b1b7 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86851 _spin_lock-nop:0xffffffff8045c3f5 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86742 mutex_trylock-nop:0xffffffff8045b1b7 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 0 86933 lock_kernel-nop:0xffffffff8045c6d6 0 86870 _spin_lock_irqsave-nop:0xffffffff8045c4b8 ...And another,
$ dtrace -n instr::*-lock: 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86925 __reacquire_kernel_lock-lock:0xffffffff8045c660 0 86883 _spin_trylock-lock:0xffffffff8045c51c 0 86883 _spin_trylock-lock:0xffffffff8045c51c
The Instruction Provider | Monday, 08 June 2009 |
Its hopefully useful and innovative, but my prime goal here was to provide a way to debug targetted opcodes, which are not necessarily in the first location of a function.
I ran a trial - I went from 25,000+ probes to 300,000+ probes. I quickly crashed the kernel (hey, it was a first effort), but hope to debug quickly.
I will probably make it a load-time option to enable it as it really can be destructive to the system under test with so many probes firing. But, if it works, it will also be a good stress of dtrace on linux.
More later!
E8 again... | Saturday, 06 June 2009 |
This is what a relative call does:
E8 nn nn nn nn CALLR offsetWe have a 32-bit relative offset from the next instruction. As a normal subroutine call, this is what should happen: decrement RSP, move return address to (RSP).
Now this is very strange: on the 2.6.9 kernel, we single step the call. The initial breakpoint hits and we look at the RSP. As we single step over the call, we expect the RSP to have decremented by 8 (64-bit return addr).
And it does.
But there is a gap between the RIP/CS/EFlags for the trap exception and the return address of the stepped over instruction. Look at the following debug output:
INT3 PC:ffffffff80110a48 REGS:0000010008eedea8 CPU:0 BEFORE: Regs @ 0000010008eedea8..0000010008eedf50 CPU:0 r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00 rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038 r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100 rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0 rip:ffffffff80110a49 cs:0000000000000010 eflags:0000000000000047 rsp:0000010008eedf58 ss:0000000000000018 00000000006f2840 00000000006f0a00 INT3 ffffffffa0236cff called CPU:0 good finishHere we get an INT3 trap and you can see RSP is set to 0000010008eedf58. The "Regs @" entry in the first case shows the extent of the 'struct pt_regs'. Note that between the printed rsp and the end of the regs area is a difference of 8 bytes. This shouldnt be there.int1 PC:ffffffffa025aff0 regs:0000010008eedea8 CPU:0 AFTER: Regs @ 0000010008eedea8..0000010008eedf50 CPU:0 r15:000033526a5f59d7 r14:ffffffff804dc2a0 r13:0000010008165290 r12:ffffffff804dcc00 rbp:0000010008165290 rbx:0000010014f84d40 r11:ffffffff80110b5a r10:0000000000000038 r9:0000000001200011 r8:0000010008eec000 rax:0000010015e803b0 rcx:00000000c0000100 rdx:0000000000000000 rsi:0000010008165290 rdi:0000010015e803b0 orig_rax:0000010015e803b0 rip:ffffffffa025aff1 cs:0000000000000010 eflags:0000000000000047 rsp:0000010008eedf50 ss:0000000000000018 ffffffffa0236d05 00000000006f2840
After the INT3 breakpoint trap, we single step (int1), and look again at the Regs@ and RSP field. The regs are at the same location - even although we just executed a call instruction and pushed the return address on the stack. In the INT1 register dump, RSP is correctly decremented by 8. Here we have no gap, but for INT3 we do have a gap.
I have been reading and re-reading exception handling on the web and Intels docs and there is no reason for the gap.
What is puzzling is that it works on the other kernels, but INT3 is pushing two extra words on the stack - more than I expect.
Another interesting issue is that when I look at the kernels I have and search for E8 call instructions at the first instruction of a probe, only this one seems to have one. Later kernels (or GCC's) dont seem to emit the instruction, so, if I dont understand what is going on, there is a chance that you will hit one and panic your kernel.
Strange. I am going to put out a new release (at least this fixes the compiler issues people have been complaining about, and hope no-one has an E8 in their kernel).
0xfa and 0xfb - STI and CLI | Friday, 05 June 2009 |
For instance, CLI, which clears the interrupt enable flag will ignore interrupts over the following instruction (as will STI, or, maybe only STI does).
What was happening if process 1 -- init -- would die, and the kernel would scream at me.
I have solved this by pure emulation - no point in single stepping these instructions, and just handle without a single step - which is better from a performance point of view.
I am running on 3 64-bit vmware kernels. 2.6.27.8 runs beautifully. 2.6.27-7-generic - an Ubuntu kernel - runs flawlessly but strangely slowly when all probes are enabled. I would expect both to run at the same speed, so either the first is running fast when it shouldnt or maybe the latter is flawed. (I think the slowdown may be due to calls to mcount which is doubling the overhead per function in the kernel).
The other is 2.6.9 - AS4 kernel. Just shown that to hang, so I need to debug that before making a release.
(32-bit kernels appear to work fine, and the compile issues are resolved).
I have added a special flag to FBT which is interesting/useful.
$ load.pl -opcode
will prefix each probe name with the first byte of the opcode at the probe, so that it is easier to diagnose where the flaws are. Single stepping the breakpoint for a probe works, but many instructions have to be handled specially, such as jumps, calls and rets. So being able to find the offending instruction or scenario is helpful.
This relates back to a prior blog entry where I talked about how nice it would be to have an instruction prober where we could probe by instruction type, rather than function. E.g. imagine probing by virtue of every LOCK instruction. Or REP or CLI. Get the picture?
How about JMP/JMPNE/JMPEQ instructions? That could be ideal for low level kernel profiling -- how many times is a jump taken in *this* function.
This is easy to do - just need a variation of the FBT disassembler which doesnt try to instrument the entry/exit of a function, but the body.
I may try and get this in on the release after this one, just to see what it looks like. Stay tuned.
Hoping to release this weekend or tonight if I can resolve the AS4 issue.
E8 issue - now fixed | Wednesday, 03 June 2009 |
Now...need to fix the compile time issues and a new release is forthcoming.
You can cat /proc/dtrace/trace to get some internal trace debug - I need to tone that done to avoid hitting performance too much. (Its not bad as it is, but I can do better).
E8 nnnnnnnn - CALL Relative | Wednesday, 03 June 2009 |
Its not rocket science, but I have been struggling with a lot of sillyness on just a few lines of code.
This instruction has two issues - (1) we need to adjust the return address since we want to return to the original instruction and not the copied one, and (2) its a jump relative.
In my work, I have managed to get one or both of these stupidly wrong. (One issue looks to be not sign-extending a 32-bit displacement to a 64-bit address).
Hopefully get this fixed and can move on
Some people have raised issues about plain compile errors due to <string.h> and memcpy. I hope to fix this too - very annoying that I did something to break what was working fine. (I replaced calls to bzero with a call to memset, and somehow the #define's conflict with glibcs string header.
I noticed a new solaris release has come out (2009/06) and the most notably change for dtrace is the CPC profiler -- http://wikis.sun.com/display/DTrace/cpc+Provider
This looks neat and really want to get that ported, but I need to finish the current workload before taking this on board.
More in a few days.