Happy New Year Dtrace...You ruined my Christmas | Saturday, 31 December 2011 |
This was related to the page-fault interrupt vector. If that was disabled, then all was well.
Strange. I don't recall seeing this error before...
Although most recent work has been on the 64b version of dtrace, I had assumed the 32b one was in sync and all was well. But not so.
It's an interesting trail... I thought the driver reload/reload/... cycle was fixed. It works well on the 64b kernel, but not on the 32b one.
After a lot of searching, and narrowing the problem down to the page-fault interrupt vector, I checked on my rock-solid Ubuntu 8.04 (2.6.28 kernel) and hit the same problem: a double (or even a single) driver reload would hang the system.
I spent a lot of time on the Ubuntu 11.10 kernel - the driver would hang the kernel on the first load after bootup. Eventually, by tinkering with GRUB and turning off the splash screen, I got to a point where the first load would work but the second would hang.
Prior to this point I had no way to debug the code. Any attempt to leave the page fault vector modification in place would hang the kernel, or cause a panic in printk(). I eventually considered this to be a problem where the segment registers were not set up properly. (The kernel uses the segment registers to access kernel data and per-cpu items, so, if these are incorrect on a page fault, you aren't going very far.) I even cut/pasted the existing kernel assembler code for page_fault, but had a lot of problems getting something repeatable.
Whilst investigating this, I had to do a lot of "mind-experiments": what was the CPU up to? Why was it having a hard time?
Well, I realised a number of things:
On an SMP system, each CPU has its own IDT register - each pointing to the same table in memory. We might patch the interrupt vector table, but there is no guarantee the other CPUs will see those changes atomically. In addition, CPU caching might cause the other CPUs to see the old values of these interrupt vectors until enough time has passed to force the cache lines to be flushed.
Bear in mind we are patching vectors in a table behind the CPU's back, so it may not know we have done this. For all we know, the CPU may have cached the page_fault vector and may never notice our changes - and ditto for the other CPUs.
So, Google to the rescue. After a short while, I found these two links:
- http://stackoverflow.com/questions/2497919/changing-the-interrupt-descriptor-table
- http://www.codeproject.com/KB/system/soviet_direct_hooking.aspx
The second link hinted at my problem: if you randomly change the interrupt vector table, then expect problems. The codeproject link didn't offer an explanation, but hinted that the way forward was to create a new IDT, copy the old table into it, switch the CPUs over to the copy, make the updates, then switch back.
The first link confirmed this. (Interestingly, the second link relates to the Windows kernel, but the first echoes the same sentiment for Linux.)
After a lot of fine-tuning and cleaning up of the code, it now works!
The trick is to switch the CPUs away whilst updating the vector table, and switch them back when the updates are done - and to do the same in the driver teardown code, so we can load/unload repeatedly.
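Roughly, the shape of the idea looks like this - a minimal sketch, not the actual driver code, and the helper/variable names are mine. It assumes the stock store_idt/load_idt/write_idt_entry/on_each_cpu kernel helpers:

#include <linux/smp.h>
#include <linux/string.h>
#include <asm/desc.h>

static gate_desc shadow_idt[256];	/* scratch copy of the IDT */
static struct desc_ptr real_idtr;	/* the "live" IDTR contents */

/* Runs on every CPU (via on_each_cpu): load the descriptor passed in. */
static void load_idt_on_cpu(void *arg)
{
	load_idt((struct desc_ptr *) arg);
}

/* Patch one vector while no CPU is running out of the live table. */
static void patch_vector(int vec, gate_desc *new_gate)
{
	struct desc_ptr shadow_idtr;

	store_idt(&real_idtr);
	memcpy(shadow_idt, (void *) real_idtr.address, real_idtr.size + 1);
	shadow_idtr.address = (unsigned long) shadow_idt;
	shadow_idtr.size = real_idtr.size;

	/* 1: every CPU switches to the scratch table... */
	on_each_cpu(load_idt_on_cpu, &shadow_idtr, 1);

	/* 2: ...so the real one can be safely rewritten... */
	write_idt_entry((gate_desc *) real_idtr.address, vec, new_gate);

	/* 3: ...and then everyone switches back. */
	on_each_cpu(load_idt_on_cpu, &real_idtr, 1);
}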
Along the way, I added a /proc/dtrace/idt entry so it's easier to visually inspect the raw interrupt descriptor table.
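The read side of such a /proc entry needs very little; this is a sketch of the approach, not the actual driver source, and the names are mine:

#include <linux/fs.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <asm/desc.h>

/* Dump the raw gate descriptors of the IDT this CPU is currently using. */
static int idt_show(struct seq_file *s, void *unused)
{
	struct desc_ptr idtr;
	unsigned char *p;
	int i, j, nent;

	store_idt(&idtr);
	p = (unsigned char *) idtr.address;
	nent = (idtr.size + 1) / sizeof(gate_desc);

	for (i = 0; i < nent; i++, p += sizeof(gate_desc)) {
		seq_printf(s, "%3d:", i);
		for (j = 0; j < (int) sizeof(gate_desc); j++)
			seq_printf(s, " %02x", p[j]);
		seq_printf(s, "\n");
	}
	return 0;
}

static int idt_open(struct inode *inode, struct file *file)
{
	return single_open(file, idt_show, NULL);
}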
One interesting question here is why the 64b driver didn't suffer the same problems. It feels like we hit a CPU bug/errata in this area, and 64b mode does not suffer from it (or the larger 64b IDT entries simply "move" the problem around).
Now I just need to tidy up the code and release ... the last release of 2011.
Happy new year.
Dtrace / Ubuntu 11.10/i686 | Tuesday, 27 December 2011 |
I was asked about some compile errors on Ubuntu 11.10/i686 and was confident there was no issue here.
What I hit upon really made my stomach churn. Firstly, a simple error stopped compilation because of something not quite right in GCC. That was easily fixed.
But the lack of a /usr/include/sys directory was a stunning shock. That means most apps are not going to compile. Period. Scanning Google showed that the error was common, yet I couldn't find a solution that didn't involve a "hack". The hack is to manually symlink /usr/include/i386-linux-gnu/sys to /usr/include/sys.
Strange, because the 64bit release works perfectly.
But then, the next horror story is that dtrace will quite happily crash the kernel. There seem to be two possible avenues of investigation. Firstly, the kernel is complaining about stack overflow, which suggests it was built with tiny (4K? 8K?) stacks, and that could be a problem for dtrace. If this is the case, we will need to switch to our own stacks; I have not had to do that before.
The other issue, which might be related, is that the interrupt handler does not work. If we blow a stack, we will corrupt whatever abuts our stack pages, and all bets are off. (I am getting quite reliable VirtualBox "Guru Meditation" dialogs, which is a sure sign of a double or triple fault - again hinting at bad karma with the stacks.)
So, off to fix the i386 build. (The Ubuntu 8.04 release works really nicely, by the way.)
dtrace update | Thursday, 22 December 2011 |
The first issue was that 64b ELF binaries using user-space (USDT) probes couldn't correctly notify the kernel of the probe points. This was tracked down to a strange issue in the libelf handling, where attempts to update the relocation entries weren't being committed to the output file. I also found a lack of symmetry in the Solaris code, which probably worked because the Solaris libelf routines tolerate a RELA entry being stored into a REL slot.
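The commit step is the easy bit to miss with libelf. Here is a minimal sketch of the pattern - illustrative only, not the dtrace fix itself, and it assumes the section being edited is SHT_REL:

#include <gelf.h>
#include <libelf.h>

/*
 * Rewrite one relocation entry in place and make sure libelf actually
 * writes it back.  On an SHT_RELA section the entries are a different
 * size, which is roughly where the REL/RELA confusion bites.
 */
static int
rewrite_rel(Elf *elf, Elf_Scn *scn, int ndx, GElf_Addr new_offset)
{
	Elf_Data *data = elf_getdata(scn, NULL);
	GElf_Rel rel;

	if (data == NULL || gelf_getrel(data, ndx, &rel) == NULL)
		return -1;

	rel.r_offset = new_offset;
	if (gelf_update_rel(data, ndx, &rel) == 0)
		return -1;

	/* Without this, the edit can silently stay in memory only. */
	elf_flagdata(data, ELF_C_SET, ELF_F_DIRTY);

	return elf_update(elf, ELF_C_WRITE) < 0 ? -1 : 0;
}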
The second issue was that user-space breakpoints were being ignored by the interrupt routines, thanks to a previous cleanup/change. i386 and x64 are now in sync.
Here's an example of the sample program demonstrating the USDT probes in action:
In one terminal, we run the simple-c tool:
$ build/simple-c
__SUNW_dof header: dofh_flags 00000000 dofh_hdrsize 00000040 dofh_secsize 00000020 dofh_secnum 00000009 dofh_secoff 0x40 dofh_loadsz 0x270 dofh_filesz 0x400
  0: 0008 0001 0001 0000 0000023c 00000034
  1: 0010 0008 0001 0030 00000160 00000060
  2: 0011 0001 0001 0001 000001c0 00000002
  3: 0012 0004 0001 0004 000001c4 0000000c
  4: 000f 0004 0001 0000 000001d0 0000002c
  5: 000a 0008 0001 0018 00000200 00000030
  6: 000c 0004 0001 0000 00000230 0000000c
  7: 0001 0001 0000 0000 00000270 0000000a
  8: 0014 0001 0000 0000 0000027a 00000186
PID:14632 0: here on line 93: crc=00008998
PID:14632 here on line 95
PID:14632 here on line 97
PID:14632 here on line 99
PID:14632 1: here on line 93: crc=00008998
PID:14632 here on line 95
PID:14632 here on line 97
PID:14632 here on line 99
PID:14632 2: here on line 93: crc=00008a4c
PID:14632 here on line 95
PID:14632 here on line 97
PID:14632 here on line 99
PID:14632 3: here on line 93: crc=00008a4c
PID:14632 here on line 95
PID:14632 here on line 97
PID:14632 here on line 99
PID:14632 4: here on line 93: crc=00008a4c
PID:14632 here on line 95
PID:14632 here on line 97
PID:14632 here on line 99
...
Ignore the DOF dump at the top - that was for my benefit, to debug what was being sent to the kernel. Note the "here on line" messages, and the CRC. When the app started, but before the USDT probes were enabled, the code segment had one checksum. After I started dtrace in another window, the checksum changed - which proves something happened to the code segment, namely the NOPs were replaced by breakpoint instructions:
$ dtrace -n :simple-c::
dtrace: description ':simple-c::' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1 312224                    main:saw-line
  1 312225                    main:saw-word
  1 312225                    main:saw-word
  1 312224                    main:saw-line
  1 312225                    main:saw-word
  1 312225                    main:saw-word
....
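For anyone wanting to try this, the shape of such a program is roughly as follows. This is a hypothetical example, not the actual simple-c source - the provider name, probe names and generated macros are all illustrative:

/*
 * Assumes a provider description along these lines, run through
 * "dtrace -h -s simple.d" to generate simple.h:
 *
 *	provider simple {
 *		probe saw__line(int);
 *		probe saw__word(char *);
 *	};
 *
 * The objects are then post-processed with "dtrace -G -s simple.d *.o"
 * and the generated .o is linked into the final binary.
 */
#include <stdio.h>
#include "simple.h"		/* generated by dtrace -h */

int
main(void)
{
	char *words[] = { "hello", "world", NULL };
	int i;

	for (i = 0; words[i] != NULL; i++) {
		SIMPLE_SAW_LINE(i);		/* a NOP site until the probe is enabled */
		SIMPLE_SAW_WORD(words[i]);
		printf("%d: %s\n", i, words[i]);
	}
	return 0;
}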
There are still some problems to resolve. When the simple app terminates, the probes are left active; we need to detect process exit (or exec) and remove them.
There are some more examples/details on USDT at dtrace.org, here:
http://dtrace.org/blogs/dap/2011/12/13/usdt-providers-redux/
Your dtrace fell into my dtrace :-) | Saturday, 17 December 2011 |
Having validated that we have the dtrace print() function, we can do a CTF-style structure dump.
You are partially on your own - go find a structure to print out. (You can't have mine, because it's mine! All mine!)
This simple example shows the CTF-style print in action:
Reference: Eric Schrock's dtrace blog:
http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/
$ build/dtrace -n 'BEGIN{print(*((struct file *)0xffffffff81a156a0)); exit(0);}'
dtrace: description 'BEGIN' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1      1                           :BEGIN struct file {
    union f_u = {
        struct list_head fu_list = {
            struct list_head *next = 0
            struct list_head *prev = 0
        }
        struct rcu_head fu_rcuhead = {
            struct rcu_head *next = 0
            void (*)() func = 0
        }
    }
    struct path f_path = {
        struct vfsmount *mnt = 0
        struct dentry *dentry = 0
    }
    const struct file_operations *f_op = 0xcf
    spinlock_t f_lock = {
        union {
            struct raw_spinlock rlock = {
                arch_spinlock_t raw_lock = {
                    unsigned int slock = 0
                }
            }
        }
    }
    int f_sb_list_cpu = 0
    atomic_long_t f_count = {
        long counter = 0
    }
    unsigned int f_flags = 0
    fmode_t f_mode = 0
    loff_t f_pos = 0
    struct fown_struct f_owner = {
        rwlock_t lock = {
            arch_rwlock_t raw_lock = {
                s32 lock = 0
                s32 write = 0
            }
        }
        struct pid *pid = 0
        enum pid_type pid_type = PIDTYPE_PID
        uid_t uid = 0
        uid_t euid = 0
        int signum = 0
    }
    const struct cred *f_cred = 0xffffffff810267c4
    struct file_ra_state f_ra = {
        unsigned long start = 0
        unsigned int size = 0x17d436
        unsigned int async_size = 0x8
        unsigned int ra_pages = 0x3e8
        unsigned int mmap_miss = 0
        loff_t prev_pos = 0x200fffd058
    }
    u64 f_version = 0x700000000
    void *f_security = 0
    void *private_data = 0xffffffff810267dd
    struct list_head f_ep_links = {
        struct list_head *next = 0xffffffff81026995
        struct list_head *prev = 0
    }
    struct list_head f_tfile_llink = {
        struct list_head *next = 0
        struct list_head *prev = 0xffffffff817ae848
    }
    struct address_space *f_mapping = 0xffffffff00000064
}
For those of you paying attention: I picked a random symbol in the kernel to prove this works OK, so don't treat that file structure as having any meaning!
llquantify | Saturday, 17 December 2011 |
Rather than me trying to do justice to this function, it's better to let Bryan Cantrill give you the low-down, here:
http://dtrace.org/blogs/bmc/2011/02/08/llquantize/
There are some kernel fixes/enhancements in here. (One is a fix for un-garbage-collected ECBs when a module is unloaded, but that is currently disabled until I put in a fix for the timer callback; module unloading is relatively rare, and untested on Linux/dtrace.)
Here's the output of a run (inside a VirtualBox VM):
$ dtrace -n "tick-1ms{@ = llquantize(i++, 10, 0, 6, 20);} tick-1ms/i==1500/{exit(0);}" dtrace: description 'tick-1ms' matched 2 probes CPU ID FUNCTION:NAME 1 279854 :tick-1ms
value ------------- Distribution ------------- count < 1 | 1 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 6 | 1 7 | 1 8 | 1 9 | 1 10 | 5 15 | 5 20 | 5 25 | 5 30 | 5 35 | 5 40 | 5 45 | 5 50 | 5 55 | 5 60 | 5 65 | 5 70 | 5 75 | 5 80 | 5 85 | 5 90 | 5 95 | 5 100 |@ 50 150 |@ 50 200 |@ 50 250 |@ 50 300 |@ 50 350 |@ 50 400 |@ 50 450 |@ 50 500 |@ 50 550 |@ 50 600 |@ 50 650 |@ 50 700 |@ 50 750 |@ 50 800 |@ 50 850 |@ 50 900 |@ 50 950 |@ 50 1000 |@@@@@@@@@@@@@ 500 1500 | 0
Ding dong! Who's there? Anybody? Someone say hello! | Sunday, 11 December 2011 |
I'm not sure if I have found or fixed the original issue... but whilst trying to reproduce it (and I am not happy having to wait 21h to get a feel for it), I found a new bad scenario.
On a 3-CPU VM, if I run a simple syscall:::{exit(0);} type of probe repeatedly, then running *four* of them at once will deadlock the system.
After a lot of poking around, adding debug, and trying to understand it, I think I located the source of the problem.
So, a dtrace process runs on cpu#0 and, whilst holding the locks, is suspended by the kernel. Next, another dtrace process comes along and blocks waiting on the mutex held by the first. Repeat two more times.
Now, the mutex implementation is effectively a spinlock, so we never allow the first process - the one holding the locks - to run. So we have a deadlock and a hung system.
The cure appears to be calling schedule() in the middle of the mutex-wait loop, avoiding cpu starvation.
This appears to work fine.
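Roughly, the shape of the change is as follows - illustrative only; the structure and names here are mine, not the driver's exact code:

#include <linux/sched.h>

struct dmutex {
	struct task_struct *owner;	/* NULL when the mutex is free */
};

static void
dmutex_enter(struct dmutex *mp)
{
	/* Spin trying to claim the mutex... */
	while (cmpxchg(&mp->owner, NULL, current) != NULL) {
		/*
		 * ...but yield the CPU each time around, so a preempted
		 * holder gets a chance to run and release the lock.
		 */
		schedule();
	}
}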
New release with this fix, and Nigel can spend another 21h finding more bugs for me. :-)
Couple of tools | Friday, 09 December 2011 |
I thought it worth documenting a couple of trivial tools I wrote.
The first is a simple "tail -f" tool. It seems that GNU "tail" doesn't work for character-special device files. So:
$ tail -f /proc/dtrace/trace
doesn't track new data appearing in there. A char-special file doesn't have a "size", so when EOF is hit you cannot necessarily do the standard trick of seeking back and re-reading the new data.
Whilst debugging dtrace, it's useful to keep an eye on /proc/dtrace/trace. I had given up using kernel printk() messages for debugging, because when faced with CPU lockups you may never get to see the messages, and earlier lockups had implied they were a by-product of multiple CPUs calling printk(), which in later kernels has its own mutex.
So, tools/tails.pl does the magic of the tail. The algorithm is very simple: keep reading lines, storing them in a hash table; if we haven't seen a line before, print it out; on EOF, sleep for a while and re-read the file. It's CPU intensive (and potentially memory intensive), but for this purpose it fits the bill.
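The same idea, sketched here in C rather than the Perl of tools/tails.pl - so treat it as an illustration of the algorithm, not the tool itself:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAXSEEN 65536
static char *seen[MAXSEEN];
static int nseen;

/* Have we printed this line already?  Remember it if not. */
static int already_seen(const char *line)
{
	int i;

	for (i = 0; i < nseen; i++)
		if (strcmp(seen[i], line) == 0)
			return 1;
	if (nseen < MAXSEEN)
		seen[nseen++] = strdup(line);
	return 0;
}

int main(int argc, char **argv)
{
	char buf[4096];

	if (argc < 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return 1;
	}
	for (;;) {
		FILE *fp = fopen(argv[1], "r");

		if (fp != NULL) {
			/* Re-read the whole file, printing only unseen lines. */
			while (fgets(buf, sizeof buf, fp) != NULL)
				if (!already_seen(buf))
					fputs(buf, stdout);
			fclose(fp);
		}
		fflush(stdout);
		sleep(1);
	}
}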
I also found it useful for tracking /var/log/messages, since in the event of a stuck CPU you may not otherwise get a chance to see the file.
cpustuck.pl
The next tool - which is quite interesting - tries to make a stuck CPU visible. Almost like a black hole, you cannot see it directly: it is likely stuck in a tight loop with interrupts disabled.
Here's how we do it: have a program which prints to the console, once a second, the CPU it is running on. Using the taskset utility, we can ensure that this process *only runs on the nominated CPU*.
So, the algorithm is: for each CPU, spawn a child process which can only run on that CPU, and which prints out the CPU it is on. By running these processes in parallel, we see an output sequence like this (a 3-CPU example):
0120120122100010002222220012012...
The order of printing is indeterminate. Under normal circumstances, we expect every CPU to say something (in any order). If one or more CPUs lock up, their digits go missing and the pattern breaks. Eventually, if all CPUs lock up, we see no more output, usually no response to the keyboard, and pretty soon... the need to reset the machine.
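A sketch of the idea in C (cpustuck.pl itself is a Perl script built around taskset; here the pinning is done directly with sched_setaffinity, so treat this as an illustration rather than the tool):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	int ncpu = sysconf(_SC_NPROCESSORS_ONLN);
	int cpu;

	/* One child per CPU, each pinned to its CPU. */
	for (cpu = 0; cpu < ncpu; cpu++) {
		if (fork() == 0) {
			cpu_set_t set;

			CPU_ZERO(&set);
			CPU_SET(cpu, &set);
			sched_setaffinity(0, sizeof(set), &set);

			/* A CPU that stops echoing its number is stuck. */
			for (;;) {
				printf("%d", cpu);
				fflush(stdout);
				sleep(1);
			}
		}
	}
	for (;;)
		wait(NULL);	/* parent just waits around */
}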
What this allowed was to detect, quite easily, *which* CPU was stuck. Unfortunately, that is all it does: we know a CPU is stuck, but not what or where it is stuck.
Combine this with "tails.pl /var/log/messages" and, if we are lucky, when the kernel issues the stuck-CPU diagnostic we can get a stack trace and a better understanding of what's going on.
It's not sufficient, though - even knowing the stack trace for a stuck CPU, it may not be clear *why* it is stuck. In my case, it seemed to be inside the code for kfree(), on a spinlock.
But being stuck on a spinlock implies some other CPU is doing something - which was not found to be the case.
In the end, the problem was really a deadlock which was not easy to understand. One CPU asks the other CPUs to do something; whilst this is happening, the original CPU is itself asked to do something - but it is not listening (interrupts disabled).
The ultimate cure was to fully understand the two main areas where dtrace blocks (mutexes and xcalls) and ensure they "knew" about each other (the mutex code will drain the xcall queue whilst waiting for a mutex).
Losing my marbles, err..mutexes. | Friday, 09 December 2011 |
But then Nigel had a go, and reported a no-go: slightly better, but not robust enough - it could still crash/hang the kernel within a few minutes. The test was very simple:
$ dtrace -n fbt::[a-e]*:'{exit(0);}'
and
$ dtrace -ln syscall:::
running in parallel. The long teardown from the first dtrace would allow the much shorter syscall::: listing to interleave and, sooner or later, lead to a deadlock.
I modified the first dtrace to fbt::[a-m]*: to improve the chance of hitting the deadlock. (fbt::: is very draconian - enabling probes on every function makes ^C, or any other check of responsiveness, a little excessive; fbt::[a-m]*: was a middle ground - quick to fail.)
In investigating this, I ended up writing a couple of very simple but useful tools. The mode of failure was "stuck CPU" messages on the console. Linux has a periodic timer which can be used to detect CPUs stuck in an infinite loop - a useful feature. When the stuck-CPU message fires, you should get a stack trace in /var/log/messages (Fedora) or /var/log/kern.log (Ubuntu).
Unfortunately, once a CPU is stuck, chances are high that the system is in an unrecoverable state - and attempting to look at the stacks doesn't help. I could occasionally get at the kernel logs.
Each time, there was a suggestion that the CPU was stuck in the kfree() function, freeing memory. As a dtrace session ends, all the probes need to be unwound and the memory freed. For fbt::[a-m]*:, this is around 29,000 probes (and some multiple of that in memory allocations).
Freeing 29,000 memory blocks should take well under a millisecond. But it takes *seconds*. (See prior posts on the slow teardown of fbt.)
The reason it is so slow is that dtrace needs to broadcast to the other CPUs, effectively saying "hands off this memory block". This involves a CPU cross-call to synchronise state, and that is slow. Worse, it does three of these per probe being dismantled.
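To put rough numbers on it (my estimate, not a measurement): 29,000 probes at three cross-calls each is nearly 90,000 cross-calls, and if each one costs somewhere in the region of 50-100 microseconds inside a VM, that alone accounts for several seconds of teardown.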
Previous attempts to optimise this failed. (Nigel had pointed out that it is significantly faster on real hardware than in a VM.) I didn't quite get to the bottom of why they failed...
So, we know we have a stuck CPU, something to do with freeing memory. Very difficult to debug. I wanted to know what the other CPUs were doing: if one CPU is waiting for something (a lock, a semaphore, a spinlock, whatever), then another must be holding whatever it is waiting for. The evidence showed the others weren't really doing much.
During a teardown, the other CPUs may continue to fire probes, so the chances were that another CPU was party to a deadlock.
The big implication was that either my mutex implementation or my xcall implementation was at fault. I spent a lot of time going through both with a fine-tooth comb, and trying various experiments.
It really didn't make sense. To cut a long story short, after despairing that I could ever find it, it started working! Having made lots of little changes, eventually it survived over an hour, versus the 1-3 minutes previously.
In the end, a couple of small changes did the trick (it wasn't obvious which of the many changes was the cause of the fix). By suitable diffing, and cutting back the noise, I found that the xcall code and the mutex code needed to occasionally drain the pending xcall buffers. If a CPU runs with interrupts disabled, and another CPU is trying to cross-call it, then we put a long delay into responding to the function call. Worse, we can deadlock - the other CPU is waiting for us, and we are waiting for it. With suitable "draining" calls, the problems disappeared.
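In outline, the drain looks something like this - illustrative only; xcall_work_pending() and xcall_slave() are my names, not the driver's - and it gets called from inside both busy-wait loops (the mutex spin and the wait for a cross-call acknowledgement):

#include <linux/smp.h>

static void
xcall_drain(void)
{
	int me = smp_processor_id();

	/*
	 * If another CPU has queued a cross-call at us while we were
	 * busy-waiting, service it now rather than leaving that CPU
	 * spinning on us - otherwise two waiting CPUs can deadlock.
	 */
	while (xcall_work_pending(me))
		xcall_slave(me);
}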
So, hopefully it is hardened.
Let's see how Nigel copes with this over the weekend.
It's worth noting that debugging this kind of code is very difficult. I tried using the kernel debuggers, but faced a number of problems. One is that (I believe) you cannot interrupt the kernel to see what's going on if the CPUs have turned off interrupts.
I'll document a couple of the tools I added to the war chest in a follow-up blog post.
CRiSP 10.0.19 | Sunday, 04 December 2011 |
This is because it is the first time (in a long while) that the clean-install startup defaults have changed, to enable these features.
When the config was saved, the internal defaults were defeating the user's selection.
So, in $HOME/.Crisp/crisp8.ini, the "display_mode" call is replaced with two new lines: "set_display_mode" and "set_display_shift".
Most CRiSP primitives use a variety of flags to control behavior, rather than a single get/set (inq) function. This dates back to the days when 4MB was a lot of memory. So it's time to liberate some of the functions, where it makes sense.
What this means is that if you pick up 10.0.19, don't like it, and go back to an older release, those config lines will cause a problem (a startup warning, possibly) until the older defaults are resaved.
I may upgrade crisp8.ini to crisp10.ini if this presents itself as a problem.
You locked me out! No I *didn't*. | Sunday, 04 December 2011 |
The effect was that "dtrace -l & dtrace -l &" would cause havoc and assertion failures.
The new code seems to fix that, and runs much tighter.
Let's see if this cures the problems.
Someone raised an issue today about USDT probes - that they are broken. My response is "probably". I haven't properly validated them in a while, whilst attempting to fix all the other issues.
If nothing comes up, then I shall take a look at this next.
What's a nice byte doing in an instruction like this? | Saturday, 03 December 2011 |
Trying to enable all probes would crash the kernel. After a binary search, the function flush_old_exec was found to be the culprit.
Nothing special about that function makes it stand out, but putting an fbt::flush_old_exec:return probe in would cause the next fork/exec to kill the process. After trying every conceivable thing, nothing worked.
Obviously a bug in dtrace - could it be the trap handler? Interrupts? Pre-emption? CPU rescheduling?
Whilst analysing and trying to resolve this, I did some interesting things.
The nice thing about having localised the error to a single probe was that I could test easily (simply start a new process), and it wouldn't crash the kernel, just kill the new process. So a very controlled environment for making small changes and adding monitoring was possible.
Firstly, which probe was firing? Looking at /proc/dtrace/stats showed that *no* probe was firing. I added some extra debug to the int1 and int3 handlers (single-step and breakpoint), and this, too, showed no probe firing.
Not possible! Really not possible!
OK, so next: had we actually armed the probe? Well, we can use /proc/dtrace/fbt to examine probes, and we can tell if a probe is armed (the tell-tale sign is a "cc" breakpoint opcode at the location). Yes, we were arming the probe; but no, it did not fire.
Next up was to disassemble the function itself. I have found it very annoying that on Linux there is no /vmlinux binary on the system - only /vmlinuz (and the /boot equivalents), which are not proper ELF files but bootable images. Something as simple as examining the instructions and bytes at kernel addresses is tricky. I had written a "vmlinux" extractor, but it never worked reliably.
One trick I use to do this is the following:
$ sudo gdb /bin/ls /proc/kcore
We don't care about /bin/ls - it is only there so gdb has something to load whilst we examine /proc/kcore - and from this we can access kernel memory addresses (e.g. those reported by kernel stack traces or by the dtrace probes), for instance with "x/20i <addr>" to disassemble at an address.
What I found was curious: the two "RET" instructions dtrace had found in the function were actually slap in the middle of a CALL instruction.
This meant the instruction disassembler was wrong. Looking back a few instructions to see why, we come across the infamous UD2 instruction. UD2 is a special instruction that generates an undefined-opcode fault. In the old days lots of opcodes could do this, but Intel formally added UD2 so that compilers and operating systems had a real instruction - one guaranteed never to be redefined in future CPUs - for the purpose of generating an illegal-instruction trap.
The Linux kernel uses this in the BUG and BUG_ON macros. Since these are rarely hit, the kernel maps them to a UD2 instruction, and the fault handler can gracefully report the fault and the location of the error.
When the INSTR provider was implemented, I came across these instructions and put some special "jump over it" code in place to handle them - but either I misread the assembler, or the kernel changed. Whenever a UD2 instruction was met, the disassembler would jump 10 bytes forward and continue from there.
That landing point just so happened to be in the middle of a CALL instruction which had 0xC3 as part of its relative address field. Dtrace took that 0xC3 byte for a RET, slapped a breakpoint on it, and so changed the CALL into something wrong.
As soon as we hit the call, all bets were off, and we were lucky not to crash the kernel.
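A toy illustration of the failure mode - not the real INSTR/FBT provider code; instruction_size() and plant_return_probe() are hypothetical helpers:

#define UD2_SKIP	10	/* the old, hard-coded (and wrong) guess */

static void
scan_function(unsigned char *instr, unsigned char *limit)
{
	while (instr < limit) {
		if (instr[0] == 0x0f && instr[1] == 0x0b) {
			/*
			 * ud2 from BUG()/BUG_ON(): skip the inline metadata.
			 * If this skip is wrong, the walk resynchronises in
			 * the middle of a later instruction...
			 */
			instr += UD2_SKIP;
			continue;
		}
		if (instr[0] == 0xc3) {
			/*
			 * ...and a 0xc3 byte inside a CALL's displacement
			 * then looks like a RET, and gets patched with 0xcc.
			 */
			plant_return_probe(instr);
		}
		instr += instruction_size(instr);	/* disassembler, not shown */
	}
}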
It is interesting that this didn't show up on Ubuntu 11.x or FC15, despite that code not really having changed in a while. It could be that the GCC code generation changed just enough for the bytes to disassemble plausibly, so the bug was never tickled on those kernels.
So, one more bug down for now.