Dtrace and Ftrace : cross-modification Tuesday, 29 November 2011  
Brendan Gregg does a good write up of dtrace vs system tap, here:


I have been thinking about systemtap and ftrace, and stumbled across a peculiarity in the performance game.

Solaris/dtrace has very little in the way of code when wondering about dropping or removing probes. On x86, a probe translates into a single byte breakpoint instruction. Dumping this on top of an instruction is effectively atomic (forget about barriers and all the other SMP complexities...for a moment).

But in diagnosing issues in linux/dtrace, I had to look at ftrace to check my sanity.

On x86, you can safely write self-modifying code. For future portability, Intel recommends instruction sequences to avoid problems with other CPUs, and the term "cross-modifying code" is used, whereby you are modifying instructions which might be sitting in the i-cache.

Now, in general, replacing an N-byte instruction with an M-byte, where N != M is a problem, depending on if M > 1, and M > N. Atomically modifying memory which might be executed by another cpu is "hard".

Linux 3.x and the ftrace code honor the Intel recommendation. They *stop* all the other CPUs whilst making these probe changes and use NMI interrupts to implement a hard locking region. Neat trick.

But very costly.

Solaris/dtrace doesnt. Neither does Linux/dtrace. We just "plop" our breakpoint wherever we see fit and the CPU does the rest of the work.

This is great. But, Solaris/dtrace has a problem which they havent noticed yet. In Linux/dtrace, I provide the instruction provider. This provides additional ways to drop probes on "interesting instructions" and there is a chance that the user could drop the same address probe via FBT and INSTR.

So, when a breakpoint is hit, which one wins? On Solaris, I dont think this can happen. It could in Linux, and at the moment, FBT will win and INSTR wont get a chance to handle the probe.

Thats not so much a problem.

But consider this: how do you remove a probe? Well, you just overwrite the instruction where you placed a breakpoint with the original opcode byte.

There. Done. Nice. Neat.

Hm...not so...

Solaris/FBT does not have a notion of a probe being enabled or not. It does...but it uses the breakpoint byte to indicate that the probe is set or not. In fact, Solaris/FBT doesnt really care.

So consider this: when disabling an FBT probe by overwriting the instruction, and another CPU, who has yet to see the byte modify, executes that instruction, may in fact, execute the breakpoint trap, even tho the probe is undone. So, now another CPU fires a FBT probe which was disabled. And Solaris/FBT lets it happen. It blindly fires the probe.

But, at this time, *nobody is listening*. This is fine - but a probe is potentially firing when it should not do.

The reason is, that Solaris doesnt flush the other CPUs i-cache.

I noticed this on Linux, that probes were firing, even after dtrace had terminated because I was handling the enabled/disabled state. This caused a problem. If FBT knows the probe cannot fire, then it wont intercept it. And if this happens, then we have a breakpoint trap and the rest of dtrace wont know how to handle the breakpoint - we wont know what the original byte of the instruction was.

I had to disable this feature and allow FBT to process probe traps even if the probe points had been disabled.

Which all comes down to the fact that cross-modifying code is extremely intrusive, but you can get away with it, until one day, you might not. ("One day" might mean on a different CPU architecture or some future Intel chips).

And finally: if Solaris had done the "right thing" it might be very slow at placing lots of traps or removing them. And this might account for ftrace losing some aspect of performance compared to dtrace.

BTW the current latest release of dtrace is proving remarkably resilient. I am up to 5.3 billion probes running the torture tests, on real hardware, and no problems so far.

Posted at 18:28:19 by fox | Permalink
  I am going nuts. Monday, 28 November 2011  
So, Nigel got me to check dtrace with real hardware. And my heart stopped. It ran, and then hung the machine. I narrowed down to an "fbt::form*:" (46 probes) and it kept hanging.

Nothing I could do could make it work. (Relaxing locks, removing all the actual probe capture code). I assumed the worst - a CPU L1/L2 caching issue against the SMP cores.

So, I tried on my VirtualBox VMs, and it was the same! Now, this worked perfectly yesterday, and I hadnt changed anything. Amazingly, "fbt:::" worked well, but a narrow FBT probe did not.

After narrowing it down further, it transpires that some of the calls to printk() (which now map to the internal dtrace_printf, rather than the real kernel printk) were causing some form of recursion fault. After modifying the dtrace_printf() and printk() code to avoid this, it worked as I had expected before. Nicely. On real hardware.

Now, something else is strange. I keep talking about rapid-fbt-teardowns being slow. I had done a lot of work to optimise this, and was proud of my ~10s of execution, down from more than 30-40s. On real hardware, this was coming in at sub-second. Now, on my VirtualBox, it was subsecond too! I dont know if the act of rebooting my laptop and starting afresh had fixed the issue, or some other nastiness (like the recursion above) had caused a performance issue to disappear.

I checked the timer (tick-1ms) again - and thats still bad, at about 350 x 1ms ticks per real second.

(I wonder if my VirtualBox slowdowns could be triggered by a paused VM).

Posted at 22:50:05 by fox | Permalink
  Slaphead time... Monday, 28 November 2011  
Whilst diagnosing something fishy in dtrace the other week (why xcalls are so slow), I was going to write that its obvious: when running in a VM, things are slower.

I got distracted, and have been chasing up issues and breakages in dtrace.

Nigel - such a great person for feeding back breakage and issues, was asking me why I was spending so much time on the teardown code. Its rare that people do "dtrace -n fbt:::" and they get what they deserve. I said, well, it has to be safe.

He then said, well, the figures in my prior blog didnt add up: I was quoting a great achievement of ~10s to teardown, and he was seeing sub-second teardowns. *On Real Hardware*.


So, I just checked. My 10s teardown is about 0.5s on a Pentium Core 2.33GHz machine:

218.620320647 #0 teardown start 1322505021.895618798 xcalls=90677 probes=423796
218.680320738 #0 teardown done 0.60000091 xcalls=181282 probes=291

So, double Duh! I also checked the tick-1 test:

$ dtrace -n '
	int cnt_1ms, cnt_1s;
	tick-1ms {
	        printf("%d %d.%09d", cnt_1ms, timestamp / 1000000000, 
			timestamp % (1000000000));
	tick-1s { cnt_1s++;
	        printf("tick-1ms=%d tick-1s=%d", cnt_1ms, cnt_1s);
		cnt_1ms = 0;
	tick-5s {
	        printf("the end: got %d + %d\n", cnt_1ms, cnt_1s);

And I get around 980 x 1ms ticks per second, vs around 250 x 1ms on a VirtualBox VM (running on an i7 2630 chip).

So, there we have it. Proof that a VM is slower -- very significantly in some areas, and one must be very careful to attribute performance in a VM to anything like real world. (Its not really surprising - playing with timers or doing cross-cpu interrupts, has to be mediated by the host VM). These dtrace tests are extreme and so amplify the weakness of a guest VM.

So, I have partially wasted my time in doing these optimisations, but actually, they are useful, allowing me to refine the port to Linux, but also to look for and find, potential issues which might show up after a much longer period of soak testing on real hardware.

Now...off to fix the next bug....

Posted at 17:28:53 by fox | Permalink
  DTrace and the quest for resilience. Sunday, 27 November 2011  
So, third time lucky at resolving the fast-teardown issue. The fast-teardown was proving too erratic.

This release re-enables the standard teardown. The performance difference here of fast teardown vs classic is about 40+s to terminate an fbt::: vs 3-4s.

I've put in a different optimisation, which not as good as the original fast teardown, is a decent optimisation (around ~10s).

The new optimisation realises that during teardown, not only is the current cpu invoking probes (eg dtrace_xcall invokes the functions to send an IPI interrupt to the other cpus, and the probes for our current cpu are totally pointless), but also the others are just passing the time, doing "stuff" and invoking probes. So, we try to "slow down" the other cpus - as soon as they try to probe, we put them in a small poll loop, checking for xcall calls.

I put some stats into /proc/dtrace/trace to show the tail of the teardown. Heres an example. (4-cpu VM):

46.251921756 #3 teardown start 1322415220.918787912 xcalls=35 probes=2964975
51.106543255 #3 [3] x_call: re-entrant call in progress.
56.645252318 #3 teardown done 10.393330562 xcalls=108922 probes=1161780

The teardown took 10.3 seconds, and during this time, 1,161,780 probes fired. We did 108,922 xcalls. (Previously we did a handful only - the xcalls are very expensive).

Posted at 17:45:16 by fox | Permalink
  gcc 4.6.1 Saturday, 26 November 2011  
The world of compilers should be boring. But I am impressed by GCC 4.6.1.

Having upgraded my Ubuntu system, I now get the pleasure of the latest compiler.

After 25+ years of CRiSP development, the code is pretty stable, and most bad things have been ironed out of the code base. Having been ported to just about every operating system out there (from Cray, to a 512K RAM 80186 MSDOS handheld), and used just about every compiler ever, its refreshing when the compiler takes warning-free code, and starts telling you about variables which are set and never used.

Previously, the compiler would warn on a variable defined, but unused, but wouldn't notice something like:

	void func(void)
	{	int	n;

n = some_other_func(); }

gcc 4.6 does. So, you get to look to see why you are assigning a result but never using it, and, in some instances, the call to the function might hint the call is not needed, especially if that function is a "pure" function.


My only complaint now is that its a bit noisier in a couple of other benign warnings (and hence, its more difficult to see the errors of interest).

Posted at 18:08:03 by fox | Permalink
  New dtrace release Friday, 25 November 2011  
After the release last weekend, more work was done to chase down reliability issues. They never seem to disappear. Lets hope this is better than the last "good one".

The Fedora Core 15 kernel has some "different" stuff in it compared to Ubuntu - I tracked down a few unique instructions at the probe entry points which werent emulated properly.

I also fixed some other bugs in the instruction emulator.

Next, the fast-teardown mode wasnt resilient enough and had to be rewritten. Without this, Ctrl-C a fbt::: probe could take 20+ seconds due to the frantic xcall conversations as each probe is removed one by one. Ubuntu and Fedora would complain about a stuck CPU when this happens, which isnt very social.

The new fast-teardown removes all dtrace_sync() calls, bar a few. (I would like to know how Apple does it; I know how Solaris does it; I dont know if FreeBSD is "as good as" Apple, or very poor). What is very interesting in this area is that I dont believe the Solaris mechanism can work on Linux, since Linux can have interrupt-disabled spinlocks which can prevent cross-cpu interrupts from happening and causing deadlock. My solution was to keep the lockless solution from before, but avoid the 2x calls to dtrace_sync per probe teardown, but "using a better algorithm". For simple one/two probes, it doesnt matter, but when you have 40,000+ probes in flight it makes a big difference.

Now the next issue is the tick/profile provider. I found today that the "profile" provider isnt implemented (it generates a TODO warnings). On re-reading the Solaris tick/profile provider pages, I found they are "wrong". The documentation refers to the fact that you can use, e.g. "tick-1us" to have microsecond level granularity probes, but they are not supported. Dont know why (maybe my source code is out of date, or they felt it too worrisome to support them).

My favorite provider at present is "tick-1ms" - a one millisecond tick. I knew dtrace/linux could "drop" probes when under pressure, but I havent had time to investigate why. A simple test:

dtrace -n '       int cnt_1ms, cnt_1s;
        tick-1ms {
                printf("%d %d.%09d", cnt_1ms, timestamp / 1000000000, timestamp % (1000000000));
        tick-1s { cnt_1s++;
                printf("tick-1ms=%d tick-1s=%d", cnt_1ms, cnt_1s);
        tick-5s {
                printf("the end: got %d + %d\n", cnt_1ms, cnt_1s);

Lets you quickly see if there are 1000 tick-1ms probes in a second.

There arent.

Linux is achieving about 750/sec, and MacOS achieve around 997/second.

I suspect my attempts to get it to "work" have left a gap in that the timer is not a cyclic timer, but a refiring timer, so the latency between handling the end of one tick and repriming the timer, is causing is to expose a window. Thats probably a part of the explanation; there may be more. On an idle system, losing 25% of the ticks seems a bit excessive. (Dtrace/Linux can achieve in excess of 1-2m probes/sec, so, 1000 measily ticks shouldnt be too bad).

I need to experiment further. Some things that could be affecting this include: (a) bad (my) code, (b) virtualisation, (c) high latency kernel scheduling.

I hope it is (a).

Posted at 23:19:24 by fox | Permalink
  1 in a million Tuesday, 22 November 2011  
So, Nigel had reported on the sporadic crashes. Had me very stumped - running the test suite and being really nasty to the kernel caused everything to pass. However, I did get a "simple" test to try:

$ dtrace -n fbt:::entry'{cnt++;}'

Some minutes after leaving this running would cause a kernel assertion about interrupts being disabled when they shouldnt be.

This was a real needle in a haystack - a problem, which occurs after millions of probes is almost impossible to find. Was it one specific probe, or maybe single-step, or a race condition in terminating interrupt routines.

How to find it?

Well, I started by adding /proc/dtrace/fbt - which gives a birds eye view of every FBT probe, along with state, and how many times it had been hit. I augmented this with the instruction which was probed.

Running a full fbt::: would cause a large fraction of all kernel functions to be hit, but, equally, large numbers of functions that are never called (eg for device drivers which are not active).

In a sense, this device gives a coverage view of FBT traps.

By casting out the known "good" probes, I was left with potentially hundreds of distinct instructions/probes to wade through.

I didnt get far, other than to look for patterns. If we have a 1:1,000,000 failure, then something has to be happening infrequently. Using the probe counter for each probe, we get some idea of the rarely called functions.

Whatever was causing the issue wasnt so much a single function being called and blowing up the kernel, but a small piece of "damage" because the trap handler didnt do the right thing.

The kernel kept saying the same thing: interrupts were unexpectedly disabled, but knowing which trap was called just prior to this event was nearly impossible to determine.

Finally, I started the "binary search" mode of attack: Lets ignore all instructions which start 0xFn, then 0xEn, ... At some point, around about 0x9n, the problem seemed to disappear. (Absence of evidence isnt evidence of absence!). Disabling all 0x9n instructions would allow dtrace to run for hours without issue. Re-enabling would cause a warning within minutes.

Ok, so what are the 0x9n family of instructions? Well, 0x90 is a NOP - quite common in the kernel, and nothing of interest. Heres the full list:

       90                      nop
       91                      xchg   %eax,%ecx
       92                      xchg   %eax,%edx
       93                      xchg   %eax,%ebx
       94                      xchg   %eax,%esp
       95                      xchg   %eax,%ebp
       96                      xchg   %eax,%esi
       97                      xchg   %eax,%edi
       98                      cwtl
       99                      cltd
       9a                      (bad)
       9b                      fwait
       9c                      pushfq
       9d                      popfq
       9e                      sahf
       9f                      lahf

The main instruction of interest above is pushfq. So, exactly how many probes have PUSHFQ in the first instruction of the function? Well, exactly two, on my kernel:

# count patchpoint opcode inslen modrm name
21603 0000 ffffffff81032704 9c 1 -1 kernel:end_pv_irq_ops_restore_fl:entry 9c
24295 0010 ffffffff81539c00 9c 1 -1 kernel:native_load_gs_index:entry 9c

Now, env_pv_irq_ops_restore_fl hasnt been called (column 2), but native_load_gs_index, was called 16 times. So, this isnt on every syscall or interrupt, but its a rare occurrence. And that is what we are after.

So, off to examine the code, and I found a bug. PUSHFQ was deliberately disabling interrupts (because of a bug), and PUSHFQ doesnt touch the interrupt flag. So, bingo! We could return back to the kernel with interrupts disabled, and, luckily, the sporadic consistency checks would tell us that interrupts were disabled when they shouldnt be.

When writing code in an SMP kernel, having interrupts disabled and sleeping waiting for an event, can lead to deadlock. Thats why the checks are there. And they helped, very nicely, to say "something is wrong" although, not exactly where.

So far, dtrace is holding up well.

Lets see if Nigel is happy now.

Posted at 22:32:20 by fox | Permalink
  Dtrace bugs... Sunday, 20 November 2011  
Nigel has been letting me know things are not 100% yet. He has a very rare scenario of a crash, and I have some issues with launching firefox whilst tracing all system calls.

Looks like at least one isnt totally correct for some reason.

May take a while to debug these scenarios (syscall tracing should be deterministically easy), but 1:1,000,000,000 is more tricky to diagnose. Maybe I can add enforced long delays to force interrupt traffic to jam up across the cpus.

Posted at 22:18:59 by fox | Permalink
  NMI revisited Sunday, 20 November 2011  
After researching and reminding myself how it works, we can have probe points from an NMI, but first, we need to fix the interrupt handlers.

An NMI interrupt can interrupt a normal interrupt routine. A normal interrupt cannot interrupt the NMI. *But* an NMI can take a trap, e.g. a breakpoint trap.

So, now the fun starts. When an interrupt terminates, the handler executes an IRET instruction. An IRET is very similar to a POPF/RET instruction sequence except for one very subtle point.

The subtle point is that the IRET will dismiss an NMI. If we execute an IRET from a breakpoint trap which trapped inside an NMI interrupt, then chances are that the NMI will be immediately reasserted - from inside the NMI interrupt handler.

This blows up and the CPU will hit a double or triple fault and reboot.

So, to restate, no interrupt can execute an IRET if we are nested inside an NMI. Therefore we need to keep some state.

Heres the hint at what can happen:




which details the tricks the kernel is doing to handle the nested interrupt structure.

Now, I need to figure out how to do the same thing without damaging the kernel. (I have some prototype code but need to fix one issue).

Posted at 22:11:19 by fox | Permalink
  NMI #2 Friday, 18 November 2011  

Just put out a new release which hacks around the NMI issue - whilst dtrace is loaded, we do not propagate the NMI to the kernel. This certainly seems to fix the dtrace-on-real-hardware issue.

At a cost/loss of being able to have NMIs.

Next up is to contemplate only disabling NMI when no probes are active, but better is to figure out how to avoid a double-trap if we hit an NMI probe. I dont know if thats possible, by definition of an NMI.

(Solaris gets away with it, because I suspect the NMI code is such that nothing uses it and/or dtrace wont allow probes to those functions?)

Posted at 22:56:13 by fox | Permalink
  Dtrace and the NMI Interrupt Friday, 18 November 2011  
Nigel has helped greatly in moving us forward on Dtrace. The latest release works well inside a VM, but alas, might not work on real hardware.

I ran some tests on Ubuntu 11.04 on real hardware and dtrace was rock solid (well, it survived 500m+ probes and running the test suite twice over).

But, if the real hardware is generating NMI interrupts then we are toast. Ubuntu doesnt do this by default. Maybe Fedora does, or its a function of the hardware and cpu.

What I found is that if I loaded the oprofile package (which uses NMI interrupts to feed the profiler), then the host will reboot if dtrace is loaded or if its invoked.

The reason is more than likely that within the NMI handler of the kernel, if we place a dtrace probe, then we will trigger a breakpoint trap from inside the NMI handler. I dont believe this is valid or meaningful (nothing should interrupt an NMI - it should only be used for small lightweight and contextless operations, such as watchdogs).

So, we have a problem because we dont know the call graph of an NMI interrupt, so we dont know what is safe to probe (even if we did know, chances are high that common/useful probably routines would have to be excluded).

I will experiment with turning off the NMI whilst dtrace is loaded. Thats a very unfair thing to do (disabling oprofile, or other real hardware events which need NMI), but at least we would be safe.

I am going to research what Solaris does for NMI ints. Maybe that will educate me to the problem.

Posted at 22:35:20 by fox | Permalink
  Dtrace .. new release Thursday, 17 November 2011  
I just put out a new release. Hopefully this fixes a number of stability issues and some silly bugs. I have rewritten the mutex code to decouple from the kernel, and avoid reentrancy issues.

Just as I was about to release this a few days back I found a regression where on repeated driver reloads, the kernel would crash or complain of interrupts being disabled when they shouldnt.

I spent ages trying to drill down to where I had screwed up, and in the end, it was a really stupid typo (sizeof(mp) vs sizeo(*mp)).

Glad to fix that and get back to a degree of sanity.

So this release has been tested exhaustively on Fedora Core 15 and Ubuntu 11.10 (Linux kernel 3.0). I may have broken the build on earlier kernels, and am looking to try and test on the earlier releases.

Please give it a try.

As an aside, I did get an email from someone trying out Dtrace on an Oracle Linux host, and getting some strange complaints when /usr/bin/dtrace didnt like the command line. /usr/bin/dtrace is *Oracle*s dtrace, not mine, and hopefully the person is in a better position now running a working dtrace on the system.

Lets see how much damage people can do to this. Many thanks to Nigel Smith who kept harping on at me for "this doesnt work" which lead me to deep dive a number of corner cases.

I have temporarily disabled some aspects of pid tracing/shadowing and will need to fix that before evaluating the whole pid tracing space.

Posted at 22:25:37 by fox | Permalink
  Scrollbars: Time to banish them Monday, 14 November 2011  
Theres two implementations of scrollbars which need to be fixed.

First, Ubuntu scrollbars are annoying. Very annoying. The Ubuntu scrollbar is invible until you move the mouse close to where the scrollbar is and then the partial scrollbar appears. Why? Its unobvious and annoying having to have your eyes track for the existence of the scrollbar before you can scroll. 30+ years of GUI development chucked down the drain.

Second, an iPod scrollbar. When watching a movie, many shows I record from TV have adverts. They are predictable. If I am watching a 2hr or more show, and an advert appears, then I use my finger to scroll past them. But on a long show, a 2-5min interlude is a tiny fraction of the scroller, and its nearly impossible to do this accurately, especially as the ipod/iphone scrollbar is not analogue, but digital in nature, leveraging 10s-1m+ scrollable fragments. The ipod has a way to fine-scroll, but for a long film, its very difficult to scroll accurately whilst keeping an eye out for where the advert break ends.

Whats really needed for the ipod is a scrollbar which is wider than the screen, so you can quickly "flick thru" without worrying that the longer the show, the more inaccurate the scroller is.

(Another bad design: why cant you see the title of what you are watching on the scroll area? One has to stop watching the show to see what it is you are watching.)

The scroller size on an ipad is not an issue, because there are so many pixels to play with.

If you think scrollbars are easy, then think again. If you are scrolling through a 1-million line file, there are not enough pixels to scroll to an arbitrary line in the file. Even CRiSP uses a fractional position, which for larger files can mean a single pixel move of the scrollbar could map to 1000 lines in a file on a 1000-pixel high scrollbar.

Posted at 21:06:46 by fox | Permalink
  In..Out..In..Out..Shake it all about ! Saturday, 12 November 2011  

In the beginning was the mutex. And lo! This was mapped to a Linux mutex. But this caused problems, because a mutex cannot be used inside an interrupt routine.

On the second day, this was mapped to a semaphore. Semaphores are good. They can be used in an interrupt routine.

On the third day, the semaphores were replaced with custom mutexes (effectively spinlocks). Because a semaphore could suspend the calling process inside a nested interrupt.

On the fourth day, the custom mutexes were replaced by semaphores, because a timer probe would invoke calls to spinlocks and preempt disables, and lead to recursive probe faults.

On the fifth day, the semaphores were replaced with different custom mutexes. Ones which supported nested operation, and avoided depending on Linux functions.

Get the picture? Its complicated.

Why is it so complicated? Because Solaris interrupt management (via splx()) doesnt map to a the cli/sti mode of a processor. Solaris' interrupt mechanism has existed for nearly two decades, based on a processor model defined for the PDP-11. Linux's model is sophisticated, and different.

One thing I realised this week .. why it took me so long, I dont know, is that when you take a breakpoint trap, interrupts are left enabled. This means, during an FBT probe, the timer can fire and you have a nested interrupt.

This week has seen me try to solve Nigels Fedora Core issues where, under load, the system would panic and reboot. The root cause here is nested interrupts, and the dtrace module not segregating itself strongly enough from the real kernel, allowing nested operation and deadlocks to arise.

I have to be very careful to get the key execution paths working in my head (an fbt/breakpoint interrupt, a timer/tick interrupt - both standalone and on top of an fbt/breakpoint interrupt, and the xcall/deadlock issue).

I also believe I have caught another implementation issue. Interestingly I believe Linux ftrace mechanism suffers the same issue. Namely, when single stepping, one has to be careful of 64-bit instructions which use %RIP relative addressing modes. Both dtrace and ftrace almost do the right thing, but can fail if dtrace is more than 4GB away from where the kernel is loaded. (I think this might explain some erraticness on large RAM machines).

I am not sure there is a cure for this (well, not a quick/easy one), and I may have to disable those offending probes (typically only a handful on a normal kernel - not a great loss).

Anyway, let me continue experimenting with mutexes and see how close I can get. (My current experiment is real good, except for problems when kzalloc is called with a mutex...lets see if I can fix that).

Posted at 23:03:02 by fox | Permalink
  Dtrace fixed (hopefully) Monday, 07 November 2011  
I just put out a new release which fixes the terrors of the last release. Hopefully this is more stable.

I found a bug in strace which caused repeated SIGSEGVs in one of the tests (strace fails to trace a process it is launching in about 1:100 test cases), even without dtrace active at the time.

I havent integrated in the fast-teardown/xcall optimisation yet, as I need to tidy that up. Hopefully in the next release.

Posted at 23:28:33 by fox | Permalink
  Dtrace .. what a crock Monday, 07 November 2011  
I put out a new release over the weekend, and all I can say is apologies. It appears to be a very bad release. Initial testing was positive, but Nigel pointed out some glaring errors, and on investigation some of my changes were bad regressions.

I think I have the issues resolved/understood, and hope to put out a new release today to fix this mess.

Posted at 22:35:05 by fox | Permalink
  Dtrace TCP Provider Sunday, 06 November 2011  
Now that some of the hard work of getting 3.0 kernels working is out of the way, I am now looking to add the TCP provider.

Heres the first very simple cut:

$ dtrace -l -P tcp
   ID   PROVIDER            MODULE                          FUNCTION NAME         48        tcp                                                     state-change
$ dtrace -P tcp
dtrace: description 'tcp' matched 1 probe

CPU ID FUNCTION:NAME 0 48 :state-change 0 48 :state-change 0 48 :state-change 0 48 :state-change ...

It doesnt provide the arg0/arg1 values in the probe callback, but will see what I can do.

Nigel just reported that the new release is better (thanks for the quick feedback Nigel), but that this is showing a problem:

$ dtrace -n 'fbt:kernel:: BEGIN {exit(0);}'

On my kernel, it slowed down enormously, but after 60s of no RCU response, the kernel just hangs. So, off to fix that.

Posted at 12:17:37 by fox | Permalink
  How fast is fast? Saturday, 05 November 2011  
In deeper diving the dtrace lockup on 3.x kernels, I have been revisiting the xcall code.

Heres a test for you to try (on FreeBSD, MacOS, Solaris and Linux):

$ dtrace -n fbt:::

This will trace all the FBT probes (on my Linux VM, thats about 48000 probes). "Time" how quick it is to start scrolling the output. (About 1-2s?)

Now ^C (twice, you need to do this twice for some reason in dtrace).

How long til you get your shell prompt back? (On my linux system, its down to 1-2s vs maybe 3-4s on a dual core MacOS box).

Why is it *slower* to exit than it is to invoke?

The answer is: IPI or interprocessor interrupts. Now, this may be fast on Solaris - its engineered nicely. On Mac and Linux at least, its not. Its really difficult to work out "why".

When you ^C dtrace, it has to tear down all the probes. In theory this is easy, and faster than the initial construction. But the Solaris/Dtrace code has a nasty performance issue. For every probe, three dtrace_sync() functions are invoked, and this involves communication with the N-1 other CPUs to process an IPI interrupt.

This is what I emulate on Linux. But its slow. My 48000 probes involve nearly 200k IPI interrupts to the N-1 processors. (I am testing on a 4-cpu VM). And the IPI is either delivered "slowly" or "received slowly" on the target CPUs.

What is worse, far far worse is the Linux 3.0.4 kernel I am using in Ubuntu 11.10 (I compiled my own; the default distro is 3.0.12). If the tear down takes too long, the kernel may notice a "hung" cpu, and after a minute or two, will hang, hard, the kernel due to the lack of responsiveness. (I will need to see if I can find out how it knows the CPU is not-idle, and maybe fool it).

I really dislike this 3.0 kernel - its a very harsh environment for a buggy driver to live in, and dtrace has to work even harder to avoid being caught in the searchlight of the kernel.

I have a hack/optimisation for this problem, which is proving rewarding (if the "other" cpu is not sitting inside a dtrace probe handler, then we have nothing to do, so we can skip the IPI interrupt. But its not bullet-proof in the few lines of code addition).

How does Solaris handle this? Well, on Solaris, direct interrupt disabling does not happen. Instead, a software processor level flag is set, and the interrupt handler can allow interrupts, even if logically, the code in question, does not want to be interrupted. I believe by making direct dtrace checks, that IPIs across cpus can happen even if the other cpu is in a critical section. I wish I could prove my understanding of the code (but that would be a distraction).

Hm. Just took a look at Oracles version:

void dtrace_xcall(processorid_t cpu, dtrace_xcall_t func, void *arg)
        if (cpu == DTRACE_CPUALL) {
               smp_call_function(func, arg, 1);
        } else
               smp_call_function_single(cpu, func, arg, 1);

All I can say is good luck to them if thats what they think is sufficient to do the job. Thats pretty much the code I implemented originally, and it doesnt work.

Oh well.

Posted at 23:26:11 by fox | Permalink
  Dtrace progress Saturday, 05 November 2011  
I've spent the last week or so looking at why Dtrace on the 3.x kernel is misbehaving and found some interesting things on the journey.

Firstly, printk() seems to be broken. In the 3.x kernel, they appear to have added a recursion detection and stronger locking semantics, which means attempts to call printk() from an interrupt (or some other strong locking mode), will hang the kernel.

That has made debugging the other problems very difficult. So be it, I have removed or disabled most of the printk() console output.

I have an internal function - dtrace_printf() which does what printk() does, but writes to an internal circular buffer, which is safe, but isnt much help if the kernel locks up.

I have decided to move totally away from the kernel spin locks and mutex/semaphore support, and use my own mutex implementation. This avoids problems when the kernel mutexes can allow preemption and other re-entrant problems, probably leading to some of the lock up.

I even went the whole hog and added the kernel debuggers to try and debug the problem (kdebug/kgdb). This was totally unuseful for two reasons. One, you cannot set breakpoints in the kernel (despite trying). I had to turn off the kernel RODATA config option since this prevents some parts of the kernel being writable, but I still got errors. I could never get the debugger to wake up on an RCU lockup or allow interactive "break-in" to the locked up kernel.

The kernel rarely panics - it just locks up. Very frustrating.

Having fixed various things, it appears dtrace works much better. The "dtrace -n fbt:::" works very well, EXCEPT ctrl-c can hang the kernel.

What is interesting is that during ctrl-c of dtrace, the amount of work to teardown the probes is huge - specifically, the number of xcalls called on teardown is about 3x the number of probes actually played. So, down an fbt::: probe, which is placing around 40,000 probes, means on the ^C, the system is sluggish for a number of seconds as tens/hundreds of thousands of xcalls are invoked. I think this is problem which needs to be addressed in the real dtrace.

The problem is that as each probe is torn down, a number of dtrace_sync() calls are invoked, and each one is talking to all the other CPUs to ensure synchronisation. For reasons I cannot tell, this is very expensive - presumably the other CPUs may be in a disabled interrupt state, so we have to wait for that cpu to come out of the interrupt region, or leverage the NMI interrupt to break the deadlock.

I'm still trying to understand how xcall works -- if we ask another cpu to take an IPI interrupt and it is stuck in a long spinlock, then the xcall may take a while or forever to be invoked. I dont understand how Solaris breaks this deadlock, but it does appear solaris tries hard to keep interrupts enabled at all times, which is different from Linux. I may have that misunderstood in Solaris.

MacOS does something different/similar in that on a xcall, the invoking CPU keeps an eye out for recursive TLB shootdowns.

I hope to make a new release of dtrace in next few days if I feel happy I havent made things worse.

I really dont like the Ctrl-C/teardown performance - it can lead to the kernel complaining about stuck CPUs.

Posted at 09:35:02 by fox | Permalink