DTrace update Saturday, 25 August 2012  
Not much to report this week on DTrace - the pid provider seems pretty stable, and a number of Linux versioning issues were resolved. (I believe there's one more to fix, reported on Centos.)

Using the pid provider is interesting - there are some fixes needed for some platforms, because some rtld (runtime linker) symbols cause an error to be generated - that just needs a little debugging; but it's annoying that, of the thousands of symbols being examined, it only takes one to make dtrace give up.

I also encountered a few 'too many probes in use' errors, e.g. when trying to instrument all functions in all shared libraries - it's easy to blast past the 250,000 max probe limitation. It could probably do with a look at the fasttrap tracepoint data structure, to see if it's possible to shave a few bytes off and increase the limit. (Solaris allows the limit to be changed at start time, and Linux does too, via insmod/modprobe: "modprobe dtracedrv fasttrap_max=nnnn".)

It's probably time to take another look at the static probes and continue working through the Solaris providers, inserting the missing probes.

But a lot of people are using dtrace and, I hope, learning about the kernel or optimising programs which, one day, I may be using.


Posted at 22:10:06 by fox | Permalink
  More pid provider issues Saturday, 11 August 2012  
Was tracing crisp earlier today (it's running inside gdb), and noticed a strange SIGTRAP inside gdb. Given gdb had no breakpoints set, this shouldn't happen, and it's not clear where the SIGTRAP came from. The PID provider uses SIGTRAP (the breakpoint instruction) as part of its implementation but, again, this shouldn't happen.

I was pondering what might cause this, and was wondering about signal delivery. The DTrace fasttrap provider has some acknowledgement of signals, but I am not sure I fully follow what happens, and in any case, the code is aligned to the Solaris kernel.

Under "normal" circumstances, the PID provider puts INT3 instructions at the probe points, and intercepts these before anything else in the kernel sees them. There are two scenarios for a probe - in-kernel emulation of the CPU instruction (for instructions which modify the PC), or, trampolining in user space whereby a copy of the original instruction is executed from a temporary buffer.

Consider the following simple program:

#include <stdio.h>
#include <signal.h>

volatile unsigned cnt;
int tick;

void alarm_handler()
{
        printf("tick: %2d %u\n", tick++, cnt);
}

int main(int argc, char **argv)
{
        while (1) {
                int old_tick = tick;

                signal(SIGALRM, alarm_handler);
                alarm(1);
                for (cnt = 0; ; cnt++) {
                        if (tick != old_tick)
                                break;
                }
        }
}

It sits counting for a duration of 1s, and then prints the count. We use the alarm clock signal to do the printing. Not a strictly ISO-C compliant program, but good enough.

If I use the pid provider to trace this, all is good until the alarm_handler returns from the signal. I need to debug what happens here.
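
For reference, the invocation is along these lines (the pid here is made up; a.out is dtrace's name for the main executable). Leaving the probe-name field empty matches every instruction of every function in the binary:

$ dtrace -p 1234 -n 'pid$target:a.out::'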

Out of curiosity, I tried this on MacOS, and, surprisingly, the application terminates erratically. There's no core dump. gdb tells me the exit code is 0170 (presumably a signal was delivered). The tail end of the dtrace output is

  0  37727                 alarm_handler:37
  0  37728                 alarm_handler:38
  0  37711             alarm_handler:return
  0  37728                 alarm_handler:38
  0  37711             alarm_handler:return
  0  37708                         start:34
  0  37709                         start:36
  0  37756               stub helpers:entry

So, I wonder how thoroughly the PID provider is tested and actually *used*.


Posted at 22:26:24 by fox | Permalink
  New release of dtrace Saturday, 11 August 2012  
I wrote in my previous blog about some issues with the fasttrap provider and some instruction emulation. After testing more thoroughly, I found my original cure wasn't strong enough - there were still some instructions being misinterpreted, leading to application core dumps.

#if linux
        /*********************************************/
        /*   Handle:                                 */
        /*   41 ff 14 c4 callq *(%r12,%rax,8)        */
        /*   41 ff 24 f4 jmpq *(%r12,%rsi,8)         */
        /*********************************************/
        sz = base == 5 ? (mod == 1 ? 1 : 4) : 0;
#else
        sz = mod == 1 ? 1 : 4;
#endif

In case anyone is interested, the above works properly for SIB-indexed instructions where there is no offset.
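
For anyone wanting the general rule: the displacement size after a ModRM+SIB pair depends on the mod field of ModRM and the base field of SIB. A minimal user-space sketch of the decode rule (my own names, not the fasttrap source):

#include <stdio.h>

/* Number of displacement bytes following a ModRM byte whose r/m
   field selects a SIB byte. mod is bits 7:6 of ModRM; base is
   bits 2:0 of SIB.  */
static int
disp_size(int mod, int base)
{
        switch (mod) {
        case 0:  return base == 5 ? 4 : 0; /* disp32 only if base == 101 */
        case 1:  return 1;                 /* [base + disp8]  */
        case 2:  return 4;                 /* [base + disp32] */
        default: return 0;                 /* mod == 3: register direct */
        }
}

int
main(void)
{
        /* 41 ff 14 c4 callq *(%r12,%rax,8): ModRM 0x14 gives mod=0,
           SIB 0xc4 gives base=4 (%r12 with REX.B) - no displacement.  */
        printf("disp bytes: %d\n", disp_size(0, 4)); /* prints 0 */
        return 0;
}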

I can now profile crisp and watch it run a few hundred times slower (if I instrument every instruction in the application). The beauty of dtrace is that I can cherry-pick functions, shared libraries, and entry/return points of functions, so it's possible, for example, to put a trace on a specific function and see if it is called, to easily count how many instructions are executed, or even to profile the sequence of instructions (e.g. looking for abnormally long runtimes).
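
For example, counting calls to every function in one shared library of a running process looks like this (the pid and library name are made up):

$ dtrace -p 1234 -n 'pid$target:libc.so::entry { @[probefunc] = count(); }'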

At the moment, there's a slight slowdown due to some debug printk()s (which you can see in /proc/dtrace/trace); I've removed most of the ones I was using to do my debugging, but a handful of TODOs remain (need to fix up the rw_enter/rw_exit functions).

This leaves one thing to fix:

$ dtrace -c cmd ... -n ....

If dtrace is launching the application, it currently doesn't work properly, because we need to stop the child process as soon as it launches. Solaris/MacOS have the code in to 'walk' the process to the starting line, but this doesn't work on Linux, and I need to debug/replace that code.
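
For the record, the standard Linux idiom for stopping a child on its starting line is PTRACE_TRACEME before the exec: the child is frozen with a SIGTRAP as soon as the exec succeeds, at which point probes can be planted. A minimal sketch of the shape of the replacement (not the dtrace code itself):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int status;

        if (argc < 2) {
                fprintf(stderr, "usage: %s cmd [args]\n", argv[0]);
                exit(1);
        }
        if ((pid = fork()) == 0) {
                /* Child: ask to be traced; the kernel stops us with
                   SIGTRAP when the execvp succeeds.  */
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                execvp(argv[1], &argv[1]);
                perror("execvp");
                exit(1);
        }
        /* Parent: wait for the exec stop. The child is then frozen
           at its entry point - this is where probes get planted.  */
        waitpid(pid, &status, 0);
        printf("child %d stopped at exec\n", (int) pid);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
}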


Posted at 10:07:14 by fox | Permalink
  callq *(%r12,%rax,8) Thursday, 09 August 2012  
This is an amd64 assembler instruction. I have just fixed a bug in the PID provider, where this instruction was not being handled properly.

The Solaris DTrace pid provider (fasttrap) is very cool, but in porting to Linux, I uncovered some instructions not being processed properly, leading to core dumps of traced apps. (It's a relief to be dealing with core dumps and not kernel panics or kernel lock ups!)

I sent a mail to the solaris-dtrace mailing list - I don't know if I did it right or if it will be accepted, but I thought I would highlight this issue, since it affects DTrace on Intel (ie Solaris and Apple, and quite likely FreeBSD).

I have more instruction mishandlings to investigate now.

I'll update the dtrace release over the weekend with this and any other fixes I have in my holding area.

(Why this instruction? Because it's an indirect subroutine call *without* any offset, instruction coding 41 ff 14 c4; DTrace handles offset-based register indirection, but not non-offset based.)


Posted at 22:56:47 by fox | Permalink
  DTrace update Monday, 06 August 2012  
I put out a new release today. This addresses a problem with syscall::rt_sigsuspend: on Centos/RedHat kernels. It's strange that even doing a blatant:

$ dtrace -n syscall:::

didn't pick up this issue before, and no matter how hard I tried, I didn't reproduce the error on Ubuntu (all the way back to Ubuntu 8).

What is interesting about the Centos 5 series of kernels is how bastardised they are. They are based on the 2.6.18 kernel, but with many upstream patches. This means that normal kernel-version conditional compilation won't work, not without taking into account the RedHat major version numbers, and even then, the prolific number of kernels is problematic to support. Anyway, the issue with rt_sigsuspend related to something I had forgotten to do with 32b binaries on 64b kernels.
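
For the curious, the extra dimension looks roughly like this - a sketch only, since the exact macros vary (recent RHEL kernels define RHEL_RELEASE_CODE and RHEL_RELEASE_VERSION in linux/version.h alongside the usual LINUX_VERSION_CODE; older ones may need different detection):

#include <linux/version.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 32)
        /* genuine upstream 2.6.32+ path */
#elif defined(RHEL_RELEASE_CODE) && \
    RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(5, 3)
        /* 2.6.18-based Centos/RedHat kernel carrying the backport */
#else
        /* plain legacy kernel */
#endif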

That *appears* to be resolved; I did have some form of difficult-to-narrow-down regression which may persist in the RedHat/Centos kernels (occasional CPU lock ups - I can't get the info out of the locked kernel to determine what it is, and need to get kdump or kgdb to work properly).

I am back to playing with the PID provider. Tracing every instruction in the CRiSP executable works .. for a while, and then we jump off to location 0 for some reason. Difficult to track down (despite lots of debug in /proc/dtrace/trace), since everything looks right, but we just decided location zero was a good place to go.

I wish CPUs had some form of trace buffer to see where we had jumped from. (Please? Pretty please? Maybe it exists).

The fasttrap instruction emulation is very clever stuff. There's a performance cost for dtrace on Linux, since for some kernels the NX (no-execute) bit is turned on for stacks, which means we have to fudge the page table entry to ensure the trampoline instruction works ok. This potentially involves a TLB flush, which is not nice for performance. (There's still quite a lot of printk() debug in the fasttrap code, so the TLB misses don't hurt as much as the extra debugging code.)

Here's an example of the debug output:

$ cat /proc/dtrace/trace
....
1468.820705312 #0 2343-ffff81001789be28: 4c 89 64 24 e0 ff 25 00 00 00 00 56 bd 4e 00 00
1468.820705312 #0 2343-ffff81001789be38: 00 00 00 4c 89 64 24 e0 cd 7f
1468.820705312 #0 2343-COMMON: 00000000004ebd56
1468.820705312 #0 2343-ffff81001789be28: 49 89 f5
1468.823704856 #0 2343-fasttrap_isa: 1710: pc=00000000004ebd59
1468.823704856 #0 2343-ffff81001789be28: 49 89 f5 ff 25 00 00 00 00 59 bd 4e 00 00 00 00
1468.823704856 #0 2343-ffff81001789be38: 00 49 89 f5 cd 7f
1468.823704856 #0 2343-copyout: line 1723 ffff81001789be28 00007fff775df638 c=22 0
1468.823704856 #0 2343-dtrace_linux.c:rw_exit:1925: TODO:please fill me in

Until this is resolved, please take care if doing instruction level tracing.

I also have a report of a compile issue with Arch Linux, but I have not been able to take the distro release (for i386) and get it to survive more than a minute or two in a VM, so I cannot easily debug the issue.


Posted at 21:35:02 by fox | Permalink
  Quote of the day Thursday, 02 August 2012  
I have had a hectic time, eradicating bugs in dtrace. There are still some gotchas in the PID provider (certain instruction emulation issues). As fast as I can fix them, people are raising new issues.

Supporting older kernels is very labor intensive - I make slow forward progress, but each step forward raises issues with the legacy kernels. The recent taskq work (which uses Linux workqueues) is a good case in point.

The taskq code is very simple - a mechanism for running background tasks away from a user process - to avoid interrupt deadlock. But those few lines of code rely on a mechanism which has many radical changes in the kernel.

The workqueue API is a mixture of #defines and real functions. Many of the functions are GPL-only exports. So it works on kernel N, but not kernel N-1. It's quite time consuming ensuring that the 'fix' I put in for kernel N doesn't affect kernels I, J, K, L, M, etc.
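
As an illustration of what those few lines of taskq code look like - a minimal sketch of taskq-style dispatch on top of workqueues, using the post-2.6.20 INIT_WORK() signature (the names are mine, not the dtrace source; older kernels pass the callback argument differently, which is exactly the pain point):

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/errno.h>

struct taskq_ent {
        struct work_struct work;
        void (*func)(void *);
        void *arg;
};

/* Runs in workqueue context, away from the dispatching process.  */
static void
taskq_trampoline(struct work_struct *work)
{
        struct taskq_ent *ent = container_of(work, struct taskq_ent, work);

        ent->func(ent->arg);
        kfree(ent);
}

/* Queue func(arg) to run later in process context.  */
int
taskq_dispatch(void (*func)(void *), void *arg)
{
        struct taskq_ent *ent = kmalloc(sizeof *ent, GFP_ATOMIC);

        if (ent == NULL)
                return -ENOMEM;
        ent->func = func;
        ent->arg = arg;
        INIT_WORK(&ent->work, taskq_trampoline);
        schedule_work(&ent->work);
        return 0;
}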

Then people report issues on kernels I don't have. Just had a report of issues on Centos 6.3. In fact, in recent days, I have got Centos 5.2, 5.3, 5.5, 5.6 and 6.3 running in VMs (getting quite adept at configuring from scratch). The Centos/RedHat kernels have confusing version numbering, because one cannot rely on the triplet versioning (2.6.32, for instance) to determine which kernel we are compiling under.

Anyway, the quote of the day is attributed to Jeffrey:

You are brought this tool to the Common Man and not the guy who has a
huge yacht.

Thank you, Jeffrey, for that. Despite him having some strange issues on Centos 6.3, it feels worthwhile for that alone.

Now, off to strip the PID provider bare. If I win, I'll report back. If not, I got sucked into a VM and couldn't find the exit.


Posted at 21:30:09 by fox | Permalink