dtrace update 20110531 Tuesday, 31 May 2011  
Its been a while since I put out a dtrace update and thought it worthwhile to give a brief update of what is annoying me.

The most annoying thing in dtrace at the moment is me. I have spent the last two months trying hard to resolve some resilience issues.

At the moment, there are two of them: (1) is the xcall code, and the other is (2) something to do with syscall tracing.

As related on prior blogs, the mapping of dtrace_xcall() (which does inter-cpu synchronisation), doesnt map to Linux very well. On Solaris, the inter-cpu calling code works from interrupt context, but we cannot do that in Linux. (Linux will write a warning to the /var/log/messages files when this happens - although it does mostly work).

I have tried a number of variants of a native IPI system in dtrace and they have failed with various problems. The biggest problem is that on an SMP system, cpu#1 will invoke a call to cpu#2 but cpu#2 wont respond until cpu#1 finishes the xcall (a deadlock). In the code, I have resolved the deadlock by giving up after a suitable period of time, but thats not good enough. Trying to find out what cpu#2 is doing when it refuses to respond to the interrupt is very tricky. Various ad-hoc debug tricks (like using the native smp_call_function() to dump stacks) failed. Additionally, the synchronous order of messages written to /var/log/messages is horrendous when I am doing my implementation of xcall - the cpus write out of order with timestamps going backwards. (I can understand why, but it doesnt help).

I have given up resolving the SMP cross-call issue: and instead have been trying something different. The only place where the xcall issue is a problem is the profile/tick provider hr_timer clock interrupts. So I have modified the code to use a tasklet structure instead. This seems to work (I have some race condition problem to fix before I can release it).

But, during all this testing, I hit another strange and annoying scenario. Doing:

$ dtrace -n syscall:::

and doing intensive things in another window, like:

$ while true ; do date ; done
date: error while loading shared libraries: /lib64/ld-linux-x86-64.so.2: cannot apply additional memory protection after relocation: Error 9

Very occasionally, a system barfs. I have seen the output from dtrace hang (it hangs until I press a key on the keyboard). I have tracked this down : when a write() syscall is being executed, its being turned into a read() syscall.

The event is very rare - 1 in hundreds of thousands of syscalls, but its horrible. And its *this* problem which is likely what prompted me to go on the xcall wild-goose chase. The "make test" regression suite is very good at pushing the cpu load to the max whilst doing dtrace things, but it occasionally would have issues.

So, if I can chase the 1:100,000 issue in syscall tracing, then I can move forward. (I suspect a timer interrupt coming in during a syscall might be causing the issue).

As always, I will release the code when I feel its better than where we are.

Posted at 21:21:36 by fox | Permalink
  Dtrace progress - update on xcall Friday, 20 May 2011  
I have spent the last few weeks on trying to perfect the dtrace_xcall emulation. Its quite possible I have been wasting my time and/or looking at the wrong problem and solution.

What is dtrace_xcall? Very occasionally, dtrace needs to broadcast to the other CPUs to ensure they are in sync. Typically this happens when shutting down the user space application as the trace buffers need to be dropped, but happens under other scenarios.

Solaris provides a cross cpu function mechanism, as does Linux. But they do not work the same.

On later Linux kernels, you may see a BUG warning in the kernel logs, like this:

[  245.215564] WARNING: at kernel/smp.c:421 smp_call_function_many+0x69/0x1b9()

This is the kernel being nice, and warning that the smp_call_function_many() function is being abused -- specifically being called from an interrupt routine.

The problem is to do with locking and rescheduling should delays happen. At worst, this can lead to deadlock and a hang of the system you are using.

In practise, this problem is rare - but that is not good enough. Dtrace needs to be rock solid. As part of the distro, I include a torture test ("make test") which tries various things, whilst putting excessive load on the system (forked processes, opening files and doing things which I have found to cause problems on earlier dtrace releases).

The test runs well, despite the problem described above, but very occasionally, one sees horrible "failed to open shlib" type errors as Linux applications are spawned - rare, but enough to demonstrate we are doing something wrong and/or have race and locking problems.

In reviewing carefully the Solaris code, and the Linux code, I have tried about 3 algorithms to resolve the problem. The latest code is good, and is a fairly close emulation of what Solaris does, but it is difficult to prove correctness. In this latest code, the driver allocates its own distinct interrupt vector, and uses this to send an inter-cpu interrupt (IPI), whilst spin locking waiting for the other CPUs to complete.

It works real well, except occasionally - an attempt to raise an interrupt on the other cpus fails, and we have deadlock. I put some counters and delays into the code and some "get out of jail" logic to avoid total machine hang, but thats not good enough. I have tried various other hacks, but it is demonstrable that we ask an interrupt be fired and sometimes it never arrives. This, I fear is me not fully understanding how the APIC works or having interference with other interrupt sources.

But an important question in all of this is why do we get that BUG warning, as illustrated above?

The reason is the timer interrupt. If you use the tick/profile probes to get periodic probes (and even if you do not), the clock will fire and occasionally this will happen whilst a probe is being serviced. (You get dtrace timer interrupts even without the use of the profile provider, since dtrace internally uses this for deadlock or lack of system responsiveness detection). The more probes you fire the higher the chance of collision.

Add to this the complexity of a multicpu system, and debugging is very difficult. One of the most difficult things to do is to see what another CPU is doing from another. Its easy to litter the driver code with printk() tracing, and to generate a stack trace when interesting events happen - but doing this for another cpu to see what/where it is stuck is not easy (or, at least, I havent found simple code to do it - I am sure its doable, since the kernel provides support via SYSRQ to do this, but, possible using the same smp_call_function_many calls we are trying to debug).

But - back to the timer interrupt -- should they even be happening?

I am about to research this: dtrace_probe() disables interrupts during operation, so the CPU invoking this function cannot have a reentrancy problem.

But the cyclic timer code eventually invokes the dtrace_xcall function:

[  245.215564]  <IRQ>  [<ffffffff8104d9ad>] warn_slowpath_common+0x85/0x9d
[  245.215564]  [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[  245.215564]  [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[  245.215564]  [<ffffffff8104d9df>] warn_slowpath_null+0x1a/0x1c
[  245.215564]  [<ffffffff81077c34>] smp_call_function_many+0x69/0x1b9
[  245.215564]  [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[  245.215564]  [<ffffffff81077da6>] smp_call_function+0x22/0x26
[  245.215564]  [<ffffffffa025575d>] orig_dtrace_xcall+0x35/0x4b [dtracedrv]
[  245.215564]  [<ffffffffa0255a85>] dtrace_xcall+0xe/0x1b [dtracedrv]
[  245.215564]  [<ffffffffa0248c9f>] dtrace_sync+0x1a/0x1c [dtracedrv]
[  245.215564]  [<ffffffffa022f9cc>] dtrace_state_deadman+0x46/0x89 [dtracedrv]
[  245.215564]  [<ffffffffa022c875>] be_callback+0x1d/0x2f [dtracedrv]
[  245.215564]  [<ffffffff810692ed>] __run_hrtimer+0xbb/0x143
[  245.215564]  [<ffffffffa022c858>] ? be_callback+0x0/0x2f [dtracedrv]
[  245.215564]  [<ffffffff81069aa6>] hrtimer_interrupt+0xd4/0x1b3
[  245.215564]  [<ffffffff8146ff95>] smp_apic_timer_interrupt+0x79/0x8c
[  245.215564]  [<ffffffff8100a693>] apic_timer_interrupt+0x13/0x20
[  245.215564]  <EOI>  [<ffffffff81397010>] ? read_pmtmr+0x10/0x17
[  245.215564]  [<ffffffff81397025>] acpi_pm_read+0xe/0x12
[  245.215564]  [<ffffffff8106dce9>] timekeeping_get_ns+0x1b/0x3d
[  245.215564]  [<ffffffff8106e1ce>] getnstimeofday+0x54/0x89
[  245.215564]  [<ffffffff81065f2c>] sys_clock_gettime+0x61/0x90
[  245.215564]  [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b

So, now I question whether the invocation of a timer interrupt should invoke a direct call into dtrace, or whether this should be queued up and deferred. I am 50:50 on this - deferring would imply complication which doesnt appear in the dtrace code. Its possible that in the non-dtrace Solaris code, this scenario is handled in the interrupt handlers - maybe some form of decouple of the interrupt from the probe firing. (I cant figure out if/how that can be done, without the kernel doing occasional checks).

It is possible that Solaris (and MacOS) are both broken and that it is possible for the system to be deadlocked, but I havent found evidence this can happen - so, somehow, this is resolved.

(Note that dtrace linux takes some cheap shortcuts in the hrtimer code and avoids some of the complexity of the cycle timer code - I havent understood enough of the details to see what it does and if we need it in Linux; maybe eventually for accurate and non-skewable clock sources).

Before I release a new dtrace, I need to decide what to do with some of the experimental code (presently, there is missing code on the i386 side, and issues with older kernel compilations, along with a bit of dirtyness in stealing an interrupt rather than properly allocating one via the kernel APIs).

If anyone is still reading this, you can occasionally pick up private beta releases (dtrace-tmp.tar.bz2) in the dtrace ftp dir, but I dont advise touching these - as the code will be in an indeterminate state along with too much debugging enabled.

Posted at 23:02:40 by fox | Permalink
  dtrace update -- xcall interim Tuesday, 10 May 2011  
I hope I have this fixed - am just trying to prove the quality of the implementation. I'll write up the experimentation and results another day when I have time.

I have implemented my own interprocess function call interrupt because the Linux one isnt viable from within an interrupt routine.

Posted at 23:26:09 by fox | Permalink
  Trials and Tribulations of Dtrace Monday, 02 May 2011  
I reported a while back about the issue with cross-cpu function calls (via the smp_call_function() series of functions).

On Solaris, the way dtrace_xcall() works is to invoke an inter-cpu interrupt in order to ensure the CPUs are in sync with certain data structures.

Alas, on Linux, the smp_call_function's are not callable from an interrupt routine. Unluckily, this is necessary and can cause the systems to deadlock / hang, whilst dtrace breaks the API call.

I tried an alternate implementation of dtrace_xcall() using timers, but found problems in getting this to work.

I tried a 3rd variant - not invoking cross-cpu calls, but, instead executing the cross-cpu call 'on-behalf-of' the current cpu. This looks more promising, but requires putting a mutex (actually, a spin-lock) around the dtrace_probe function, to avoid one cpu executing a probe, whilst another cpu is doing the housework to clean up.

This does look better, but I am currently having rare scenarios where some syscalls can break: a repeated series of process forks can occasionally give rise to a problem mmap-ing one of the shared libraries. This is very difficult to debug - as many millions of calls can work correctly, before one of them fails.

By "thinking about the problem" it is likely that some form of stack or register corruption is happening somewhere. What might be worse is CPU cache consistency issues happening - I am fearful of cache consistency issues being very difficult to debug, but lets see what happens.

I have put extra debug code into dtrace to try and spot the issues.

There is another approach which is to intercept the NMI interrupt which is where IPI interrupts come from, but I need to think this one through before attempting this.

Posted at 21:14:26 by fox | Permalink