Natty Ubuntu - n.i.c.e ! | Saturday, 30 April 2011 |
Thank you!
My only report of dislike is that iwconfig broke in this release. My prior command to set up the wifi stopped working and took a number of rereadings of the man page to figure out what was wrong.
iwconfig eth1 key XXXXXX
Firstly, my wifi changed from wlan0 to eth1. But the key was not being allowed no matter what I tried. I eventually did this:
iwconfig eth1 key XXXXXX '[1]'
and that fixed it. Strange. Ubuntu 11.04 is a 2.6.38.3 kernel - the same kernel I had previously (or maybe it is 2.6.38.8 - am a bit confused what it did to my /boot and /usr/src directory).
Just found they broke my googlecli install. Had to reinstall python-gdata and the googlecl package to allow me to blog post again.
CRiSP Adding gridlines support | Friday, 29 April 2011 |
I have added a 'set [no]gridlines' command as an interim to turn it on/off, so I can become comfortable with what it shows me.
This is an alternative to the 'ruler' which is more 'in your face' in the way the display is done.
Sometimes, things are too fast. | Tuesday, 26 April 2011 |
I have been using X11 CRiSP over a transatlantic link and the startup time is annoying, along with visible artefacts whilst drawing the color gradients.
On investigation, the problem is our old-friend: latency. Some X11 function calls block, waiting for a reply from the server, e.g. querying color pixel mappings being the core one. The gradient and pixmap drawing code relies on lots of color allocations, forcing round trips - large numbers of them.
On a locally connected machine or network, this happens so fast, you rarely notice it.
CRiSP was written back when monochrome displays were common - certainly color was a very rare thing. Today, we luxuriate in 24/32 bit video displays and rarely even think about it.
So, these round trips are pointless when we know what the RGB mapping will be - no more XQueryColors, or even XAllocColor.
Doing this has a dramatic performance enhancement for these long latency trips. (You wont notice the speedup on a local machine - not unless you use special measuring tools).
The first part of this went into CRiSP 10.0.6, and the pixmap enhancements in 10.0.7.
[Note, I am dropping the trailing letter in CRiSP version numbers - they never served a real purpose, and people sometimes forget to feed this back on error reports; the build number tells me everything I need to track back to specific source code changes].
At the moment, this feature is enabled by setting an environment variable:
$ export CR_XS_DIRECT=1
http://www.crisp.demon.co.uk/download.html
Dtrace bugs | Sunday, 24 April 2011 |
Try the following on a Mac:
$ dtrace -l >/dev/null & dtrace -l >/dev/null & dtrace -l >/dev/null & dtrace -l >/dev/null &
We get interrupted system calls and issues on the major/minor numbers (see 'dmesg' after it fails).
Next up is DTRACE_ENABLEIOC. This calls dtrace_copyin_dof() which validates the main dtrace mutex is not asserted. Normally it isnt, unless someone else is running a heavy handed probe trace, in which case a nasty race condition exists. Unless a VFS lock is applied, this could spell out danger or kernel panic (on Solaris, FreeBSD and MacOS).
dtrace_xcall is another area of potential for kernel deadlock if multiple dtraces are firing on multiple cpus.
I have nearly finished my "safety" checks in dtrace, having written my own dtrace_xcall, and am checking that kernels dont deadlock on you.
But its uncovering some nasty race conditions in dtrace as a whole.
How does this work? (wifi) | Friday, 22 April 2011 |
Tonight, my wifi broke, totally, on my laptop. Spent last few hours slaving over the system, a driver or two and all the config in the world.
Today, was the day I threw out most of my old PCMCIA wifi cards. Typical.
Ok, out with the screwdrivers, take the back of the laptop off (and learnt that my laptop can take two hard drives at the same time).
No joy. Niente. Nothing. Nada.
Same problem.
Use the Android phone to analyse the wifi around me.
Use the iwlist command to probe access routers around me. Someone must have installed a new one.
Next....err....why does my startup script set the channel to 11, yet iwlist is showing it on channel 6?!
Ok, so set the channel to 6. And Viola! It works.
Weird. Very weird.
If the channel was so wrong, how did it work at all?
Just dont *ever* assume I know what I am talking about.
smp_call_function and dtrace_sync | Tuesday, 19 April 2011 |
Linux's implementation seems ok. But it is at right angles to the Solaris implementation. Solaris seems to have a higher level of semantics in this area, allowing interrupt code to invoke inter-cpu synchronisations.
I have a question - and if anyone (Adam?) knows the answer, feel free to educate me. Without an answer, I may have to wrap the Linux APIC interrupt handlers with code that resembles Solaris.
What does dtrace_sync() *actually do*? In reading the code, it is trying to ensure other cpus are synchronised in terms of the probe states, but the implementation *looks wrong*. dtrace_sync() is a way of ensuring that another cpu is either not running the dtrace code, or, if it is, is at a sync point. But there are many sync points in the dtrace driver, and no guarantee that the other cpus are anywhere close to where the invoking cpu is asking for help.
Its a bit like a 3-dimensional goto - trying to prove safety via the various code paths is not easy (maybe not possible).
Normally, mutex's are used to ensure guarded regions of code, along with interrupt enable/disable, to prevent nested interrupts.
But dtrace_sync() is different - it suspends the invoking cpu until the other cpus have acknowledged the interrupt - and the acknowledgement is not done based on where the other cpus are.
The problem that dtrace/linux is having is mostly around the timer interrupt - breaking the kernel contract on interrupts and scheduling state. Its not possible (without hacking or damage) to conform to the contract, which means I need to either stop using smp_call_function or seek some other mechanism.
I need to sleep on this and work out the various permutations of code paths.
[Note, I do have a safe workaround, but the workaround will cause the odd timer tick drop within dtrace (the rest of the kernel is not affected). I may have to release this as a temporary safe fix].
opensuse .. and the differences | Monday, 18 April 2011 |
2.6.31.14-0.6-desktop
Its caused me a few days of frustration, but I didnt want to be beat. I have mostly cracked it now. The issue is around syscalls. The way dtrace intercepts the system calls is by patching the system call table. This normally points to the C code function to implement the activity, and a small amount of assembler glue and C is used to "hook" or "wrap" the system call.
One of the problems is that all system calls in dtrace go through the same hooking code. When a system call is invoked, the %RAX register contains the system call number, and dtrace assumed this register was left intact.
It turns out it isnt. But why it isnt is interesting.
The opensuse kernel appears to be one of the most resilient kernels to kernel crashes. Many a time an array of general protection faults and illegal instruction traps fired, but I have rarely had to reboot the VM.
Two things distinguish this kernel: DWARF stack walking and stack protection. The opensuse kernel has a DWARF stack walker which helps to ensure more accurate stacks are displayed when a fault occurs. (Similar to the work I started, but to which I abandoned. Maybe I can look at that code and see what style of approach they used). [Stack walking is problematic generically, because of the 32 + 64 bit kernels, along with all the permutations of GCC compiler switches which makes it difficult to ensure the code base can handle these variations].
The stack protection ensures that if a buffer overflow or some other bad thing happens, then this is caught very fast.
The approach that GCC takes is to snapshot a random value on the stack at the start of the function and validate the value is still there on exit. This code utilises the %RAX register, which is what was tickling my problem.
After various attempts to "jam" the uncorrupted %RAX into the C arguments to the dtrace handler, I gave up. On 64-bit code, arguments can be passed on the stack or via registers (the first 6 arguments only), which means some degree of register fiddling, but also sensitivity to compiler regimes and kernel compilation modes.
What I did was created a new per-cpu data structure so that the %RAX register could be saved, without corruption by another cpu. This data structure can then be used by the syscall wrapper code and the results look good.
The existing systrace code has to handle 6 of the syscalls specially (eg clone, fork and a few others), because, by definition, these syscalls dont take the normal "exit" from the kernel route, but hopefully I can fix these.
The next steps is to see if this "better" code works for the other kernels I did not have problems with, and then to contemplate looking at the 32-bit code implementation.
Hopefully, an update in a few days (or over the long weekend).
Why? | Friday, 15 April 2011 |
OpenSUSE 11.x - very nice system, very slow network gui configurator (GUI network configurators drive me mad also - everyone is different, none of them as good as the old fashioned SunOS /etc/rc.local files, and all of them bloat. OpenSUSE cannot even have a settings control panel that even looks familiar to any other operating system).
But my beef is the kernel source/images. Everyone has the module include files in /lib/modules/`uname -r`/build/. Except OpenSuse who has them in the /lib/modules/`uname -r`/source directory. (Maybe they are available in the build/ dir if I had installed the right extra package, but its like russian roulette trying to guess where people have stored things or what the package names are).
apt-get, zypper, pkgadd, ... the list of names for proprietary package installers is maddening. WTF is "zypper"? At least "pkgadd" hints what it is. apt-get - which I am now familiar with, is standard across many distros, yet the name of the package installer on Fedora eludes me.
Whilst I am bitching, what is "plymouth", what is "policykit", what is "hald-runner", and the other ten zillion memory hogging wasteful utilities on a Linux system? Go look for "plymouth" on google and see how well you find references.
What happened to meaningful names. Even "dtrace" hints at what it might be, and is unique to not clash with any other English word.
So....why do I care? Because I am setting up a new dual-core VM to debug the smp_call_function issue. I think I know why smp_call_function might be causing me erratic reliability issues, but I need a good kernel to demonstrate/cause the bug I am trying to find.
Incidentally, I have turned up the clock ticks in the tests/tests.d script to 1mS, to force a higher rate of clock-ticks interrupting caught probes to help diagnose the tests. (I might even go higher - I want to torture any system I come into contact with, because that is the only way to validate interrupt-based coding systems; reading code or test-driven scenarios can never handle the multi-dimensionality of interrupts occuring at just the right + wrong points in time).
smp_call_function and friends | Wednesday, 13 April 2011 |
The torture test involves use of copyinstr() against an argument which is not a string pointer. Good ones to test are fbt:::return, where the argN parameters are whatever is left on the stack or in registers. Dereferencing these lead to kernel level GPF or page faults. Dtrace should be resilient to these.
Yet, it still can crash.
Some kernels are better at handling this than others - the later releases generally have better interrupt handling routines or BUG_ON warnings which log a message when a driver breaks a programming contract.
Which brings us to smp_call_function and it friends. From day 1, these functions caused me pain; fortunately, Linux provides a set of these functions, very similar to Solaris, so we won, handsomely.
Or didnt.
What are they? On a multicpu system, it is sometimes necessary for drivers to invoke synchronisation barriers on other cpus. Typically an SMP kernel and driver will utilise arrays of structures - one per cpu, so that each cpu can process work independent of what another cpu is doing. In the case of dtrace, we may be handling an fbt trap on one cpu, whilst another is doing a system call. As the number of cpus goes up, the permutation of scenarios of user code, kernel code, drivers, shared structures, etc goes up.
What does dtrace do to warrant inter-cpu function calls?
Well, dtrace, the user space program works by periodically polling the driver for trace information. Each cpu utilises a double-buffering approach for tracing: traps and probes are recorded in an event buffer. When the buffer is full, it can switch to an alternate buffer. When bin/dtrace asks for a buffer dump, the buffer is emptied, and dtrace asks the kernel driver to switch to the alternate buffer - just like video double buffering.
A single bin/dtrace is effectively asking for data from all cpus. So, the cpu which takes the ioctl() to ask for the buffer dump has to tell the other cpus to "empty their buffers". It could do this by swizzling the pointers in the per-cpu buffer, but this is dangerous - the other cpu may be executing code leveraging these things. You cannot simply disable interrupts and lock them out with a normal type of mutex (you could do, but the number of places to litter mutex can be high). Instead, because this is so rare, the smp_call_function functions are invoked to ask each CPU in turn to do an action on behalf of the invoker. Its quite elegant.
This is all based around the APIC implementation on the Intel/AMD cpus which provides a mechanism to send a forced-interrupt to another cpu. The target cpu takes the interrupt (assuming interrupts are not presently disabled on that cpu), performs the task, and exits the interrupt.
The Solaris and Linux implementations are similar, but different. That difference hurts.
Ok, so now to the tricky bit. Consider at any point in time, what is a cpu doing? It could be in user space, or kernel space. In kernel space, interrupts may be enabled or disabled.
Lets consider kernel running with interrupts enabled - another cpu invokes smp_call_function; the first cpu takes the interrupt and returns.
Now, Linux has a programming contract: when invoking the function, we must honor the following contract (taken from comments in kernel/smp.c):
* You must not call this function with disabled interrupts or from a * hardware interrupt handler or from a bottom half handler. Preemption * must be disabled when calling this function.
It is saying that the invoke of the function cannot be an interrupt handler. Guess where (my) dtrace implementation invokes smp_call_function's from? The hrtimer code. A timer callback, by definition, is an interrupt. So, because of this contract, which we have broken, profile tick probes can interrupt the kernel when it should not do, leading to either kernel warnings, or, kernel deadlocks.
I recently attempted to avoid recursive dtrace probes, but if we are not careful, we will lose timer tick probes, which can break scripts. (My own regression tests, which terminates after 5s, never terminate because the tick we need is discarded).
So, we need to fix this problem. The SMP function calls rely on a fair amount of fabric to control the APICs and other cpu register masks, so its not as simple as reimplementing the code to shield from kernel artefacts.
So, I need to come up with a plan.
Diary of a compiler bug | Sunday, 10 April 2011 |
GCC - an excellent compiler - has been around for many years. I think I first learned of it back in 1987 or so, and not only has a huge number of updates and releases, but works on nearly every CPU/OS combination ever invented.
But, when you hit a problem with the compiler, you are hosed -- you could report a bug, but this may be fruitless, especially when the compiler version is quite a few years out of date.
Today, I hit a compiler bug causing strange behavior in Dtrace on Ubuntu 8.04, using the gcc 4.2.4 compiler. Yes this is old, and there may be patches for it, but who actually runs Ubuntu 8.04 today? Is my Ubuntu even patched up to date? (No, it isnt).
And does this bug exist in many other prior/next compiler versions?
There isnt enough time in the world to validate this. So a kludge, but a nice one, was applied:
if (dp->dtdo_rtype.dtdt_kind == DIF_TYPE_STRING) { char c = '\0' + 1; int intuple = act->dta_intuple; size_t s;for (s = 0; s < size; s++) { if (c != '\0') c = dtrace_load8(val++);
#if linux /***********************************************/ /* This pointless code, which will never */ /* fire, is to work around a gcc compiler */ /* bug which causes a page fault because */ /* 'act' gets overwritten. I havent exactly */ /* figured out whats going on here, but */ /* turning off optimisation (which is not a */ /* good plan for __dtrace_probe()) isnt */ /* viable. I have seen this on Ubuntu 8.04, */ /* gcc 4.2.4, i386. */ /***********************************************/ if (act == valoffs) { printk("defeat compiler bug! %p act=%p s=%x/%x %x %x\n", &act, valoffs, end, act, s, size); } #endif DTRACE_STORE(uint8_t, tomax, valoffs++, c);
if (c == '\0' && intuple) break; }
The above code is embedded into the middle of the very important dtrace_probe() function. (Renamed to __dtrace_probe, to allow/detect re-entrancy problems, which lead to finding this bug).
The code is too large to easily find this bug or refactor. I verified that turning off optimisation avoids the bug, but thats not a good thing, as per the comment above.
What seems to be going wrong is register allocation/spilling in the compiler. I *hope* it only affects this code fragment, but its very difficult to tell.
The effect of the bug was to cause a key variable ("act") to be overwritten. Fortunately, the kernel survived the subsequent panic/page-fault, but its not comforting to have this happening in an unexpected way.
BTW I managed to nuke my first VM last night - Fedora Core 14. After debugging something similar on FC14, I started getting strange filesystem bugs. Even a reboot didnt fix things. I got annoyed and fsck'ed the root filesystem manually, despite fsck telling me "this was a bad idea". And it was right. Because of the LVM, i nuked the root filesystem, and had to reinstall.
I do love VMs - took less than an hour to recover (oh, I wish I had snapshotted the filesystem first, but, well, you know, I didnt!)
New dtrace later tonite to fix the compiler issue above.
Dtrace and the insane dtrace_probe_error call | Saturday, 09 April 2011 |
So, leveraging the fact that I am not insane, I can fix it to.
(Its possible that in looking at the Open Solaris implementation of dtrace, its missing some key features, as I find it difficult to believe real Solaris has this issue).
Either I am insane or dtrace is... dtrace_probe_error() | Friday, 08 April 2011 |
Heres the function prototype for:
void dtrace_probe_error(dtrace_state_t *state, dtrace_epid_t epid, int which, int fault, int fltoffs, uintptr_t illval)
This function, which is (perversely) written in assembler in FreeBSD and Solaris, calls dtrace_probe(). An additional arg is passed back to dtrace_probe -- the internal id of the ERROR probe (dtrace_probeid_error).
All good so far.
But - and its late here, maybe I am too tired - the number of arguments are *wrong*. The last parameter passed to dtrace_probe() is the "illval" argument which is the address of a faulting instruction.
Consider this D script:
BEGIN { x = (int *)NULL; y = *x; trace(y); }
We should get an error from Dtrace telling us we tried to access address 0x0000. But in Linux, we get some other random value. This is because we cannot pass in the illval address - we ran out of arguments (arg0..arg4).
*Yet* FreeBSD/Solaris pass in the extra arg anyway - one arg too many.
I need to read this more to understand what is going on.
Heres the solaris code:
ENTRY(dtrace_probe_error) pushq %rbp movq %rsp, %rbp subq $0x8, %rsp movq %r9, (%rsp) # <=== what is this?! movq %r8, %r9 movq %rcx, %r8 movq %rdx, %rcx movq %rsi, %rdx movq %rdi, %rsi movl dtrace_probeid_error(%rip), %edi call dtrace_probe addq $0x8, %rsp leave ret
New dtrace release | Wednesday, 06 April 2011 |
Initial results look good - although I havent done a diverse validation across my kernels, but it should boost dtrace usability.
I am seeing a lot of these on FC14, indicating some kernel API protocols are not being conformed to...thats next on my list:
[ 311.686851] BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1074 [ 311.687789] in_atomic(): 0, irqs_disabled(): 1, pid: 2553, name: tests.pl [ 311.687789] Pid: 2553, comm: tests.pl Tainted: P D W 2.6.35.6-45.fc14.x86_64 #1 [ 311.687789] Call Trace: [ 311.687789] [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0 [ 311.687789] [<ffffffff8146c374>] do_page_fault+0x15c/0x265 [ 311.687789] [<ffffffff814697f5>] page_fault+0x25/0x30 [ 311.687789] [<ffffffffa023ddfe>] ? dt_try+0x0/0xa [dtracedrv] [ 311.687789] [<ffffffffa021f0b2>] ? dtrace_load8+0x41/0x90 [dtracedrv] [ 311.687789] [<ffffffffa022842b>] dtrace_probe+0x202b/0x2420 [dtracedrv] [ 311.687789] [<ffffffff8111f4d1>] ? path_put+0x22/0x27 [ 311.687789] [<ffffffff811201e9>] ? putname+0x34/0x36 [ 311.687789] [<ffffffff81116511>] ? do_sys_open+0xfe/0x110 [ 311.687789] [<ffffffffa024011e>] dtrace_systrace_syscall2+0x208/0x21b [dtracedrv] [ 311.687789] [<ffffffffa02402ab>] dtrace_systrace_syscall+0xb2/0xb4 [dtracedrv] [ 311.687789] [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b
Tales of an Interrupt -- ud2a | Monday, 04 April 2011 |
So, whilst enabling lots of fbt probes, I was finding AS4 kernels dying on me. Didnt appear to happen on other kernels, but probably could/would do.
On investigation, we had executed an invalid instruction. Weird, as in theory, the dtrace disassembler should just *work*.
But it turns out that because Linux uses judicious use of inlined assembler code, we had hit a problem. The BUG_ON macro in the kernel, used for asserting broken behavior (typically of a driver), is implemented by utilising an invalid instruction. The invalid instruction (UD2A) is great - we handle that ok, but subsequent to this instruction is a 10 byte field which encodes the file/linenumber of where the bug happened.
Interesting.
Rather than have the compiler embed lots of __FILE__ strings into the kernel memory, the strings are uniquified and stored in a separate lookup table, utilising tricks of the GCC compiler.
Of course, Solaris doesnt do this - and so the FBT code need not worry about this. (But FreeBSD and MacOSX *could do* if they so choose).
A quick path to the prov_common.c file stops us tripping over and losing sync with the instruction stream, and this fixes the problem with:
$ dtrace -n fbt::a*: