I don't want to release just yet.... | Sunday, 31 May 2009 |
There are certain code paths which can cause probe traps from within the probe handler, and I get different results on different kernels. The best kernels are the ones that crash on me - letting me see an issue, rather than hiding these latent instabilities.
"fbt:::" works, but there's more to it than this, as many probes never fire, or a probe may cause an issue when another probe fires. (The SYSCALL instruction is a prime case - it causes the kernel to enter at 'system_call', but at that point the stack and state of the kernel are not consistent, and we cannot (yet) probe there, so I have added it to the toxic list for now).
There are many entries on the toxic list which could be removed, but I am working through the failure scenarios I can see at present.
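For illustration, a minimal sketch of the kind of check a blacklist implies - skip planting an FBT probe on any function whose name is on the toxic list. The helper name and list contents below are illustrative, not the actual toxic.c entries (although 'system_call' is one case mentioned above):

/* Hypothetical sketch: refuse to plant FBT probes on known-unsafe functions. */
#include <string.h>

static const char *toxic_funcs[] = {
	"system_call",		/* kernel entry point: stack/state not yet consistent */
	"page_fault",		/* illustrative entry only */
	0
};

static int
is_toxic(const char *funcname)
{
	int i;

	for (i = 0; toxic_funcs[i] != 0; i++) {
		if (strcmp(funcname, toxic_funcs[i]) == 0)
			return 1;	/* do not plant a probe here */
	}
	return 0;
}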
I've put in an interrupt handler for interrupt #13 (GPF), since when we do things wrong in dtrace, it's good to get a chance to shut ourselves up and avoid an infinite cascade of console messages ending in a total panic. (If we get a GPF caused by us, we disable all probes to give me a chance to debug what is going on; in theory this should never happen, but for now it will do whilst I iron out some of the thorny issues).
Keep watching the ftp site - I will upload when I feel happy.
dtrace progress | Friday, 29 May 2009 |
I need to validate the 64-bit code - I have added a new file (replacing the older cpu_32bit.c and cpu_64bit.c) to store this emulation.
Interestingly, in looking at kprobes in the kernel for hints about what I was doing wrong, I see that I now do rather more - i.e. I believe kprobes cannot handle many functions or instructions in the kernel, so the kprobes people can borrow from this code if they like, or use it to see what's missing.
I am happy to share the code and ideas, because this way dtrace and its competitors can improve, and people who do not contribute to the code get something for nothing - better quality tools. I don't look at dtrace as the competitor to annihilate all others. I am pleased with the quality and thought Sun put into its construction, and the raw dtrace engine has given me almost zero issues.
Sun had an easy starting point - tight kernel integration and a dumb C compiler that doesn't show up the issues that inlined assembler gives to us Linux people. Maybe they can learn something too.
Someone asked me the other day about the performance differential for Linux dtrace vs Sun/Apple. My response then, as now, is that there is near zero difference. Just because Linux dtrace is not a part of the kernel but an add-on module doesn't deter dtrace from doing things. (Ok, SDT is going to be a challenge, but much less so, I hope, than getting to where we are).
Also, I was asked about the 'alpha' nature of dtrace. I either write optimistic blog entries ("It works!") or short cryptic ones ("No it doesn't!") - depending on how positive I feel.
When dtrace works for more than a few minutes without crashing a kernel, that's good. But we don't have a scheme to handle coverage (maybe I will add that) - so we cannot tell, of the 25000+ probes, which ones fired and which didn't. Certainly, 20-30+% of the kernel is executing all the time, but the rest may depend on circumstances, e.g. whether a CD is in the drive, a TCP packet is dropped, a user space app core dumps, and so on. And even if a probe fires, it may be handled incorrectly.
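A coverage scheme could be as simple as a per-probe "fired" counter, incremented on the dispatch path and dumped on demand. A rough sketch, assuming probe ids are small dense integers (all names and sizes here are invented):

/* Hypothetical coverage sketch: count how often each probe id fires. */
#define MAX_PROBE_ID 32768

static unsigned long probe_fired[MAX_PROBE_ID];

static void
coverage_hit(int id)
{	/* called from the probe dispatch path */
	if (id >= 0 && id < MAX_PROBE_ID)
		probe_fired[id]++;
}

static int
coverage_unfired(void)
{	/* how many probes never fired at all */
	int i, n = 0;

	for (i = 0; i < MAX_PROBE_ID; i++)
		if (probe_fired[i] == 0)
			n++;
	return n;
}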
The truth is, if it stands up to scrutiny from people using it, and I am not writing about "Oops!" moments, then we are making progress.
Next on my to-do list is fixing Ctrl-C handling in dtrace. When I run:
$ dtrace -n fbt:::
it works fine. But Ctrl-C-ing it causes a *long* delay - sometimes 5-10s before the shell comes back. *Sometimes* a kernel GPF is raised, indicating that as the probes are being removed, an interrupt or something is creeping in, firing a probe that is about to be destroyed, and possibly hanging or causing disruption. For small numbers of probes, the window of opportunity is tiny. But for all probes, it is big enough to be a real problem, and it has to be closed if we are to ensure you cannot panic a running production system.
Hopefully this will be easy to fix.
Nuts - I am wrong | Wednesday, 27 May 2009 |
Apologies to Sun - I am glad that I can trust their code!
Now to find why a few F0 instructions aren't stepped properly...
REPZ/REPNZ prefix | Wednesday, 27 May 2009 |
Here's a small dump from the dmesg output after loading dtrace when it can process the F0 series of opcodes (F0..FF).
[52358.586721] fbt:F instr ptype_seq_stop:c02c5ae0 size=1 f3 c3 8d b4 26
[52358.589013] fbt:F instr neigh_stat_seq_stop:c02ccb60 size=1 f3 c3 8d b4 26
[52358.601019] fbt:F instr seq_stop:c02e3f50 size=1 f3 c3 8d b4 26
[52358.601484] fbt:F instr seq_stop:c02e41d0 size=1 f3 c3 8d b4 26
[52358.601760] fbt:F instr rt_cpu_seq_stop:c02e49f0 size=1 f3 c3 8d b4 26
[52358.601828] fbt:F instr ipv4_rt_blackhole_update_pmtu:c02e4ae0 size=1 f3 c3 8d b4 26
[52358.615720] fbt:F instr icmp_address:c030e3b0 size=1 f3 c3 8d b4 26
[52358.615743] fbt:F instr icmp_discard:c030e3c0 size=1 f3 c3 8d b4 26
[52358.623705] fbt:F instr xfrm_link_failure:c0322210 size=1 f3 c3 8d b4 26
[52358.639776] fbt:F instr __read_lock_failed:c0347a90 size=1 f0 ff 00 f3 90
[52358.641537] fbt:F instr kprobe_seq_stop:c034a9a0 size=1 f3 c3 8d b4 26
Note the size=1, which shows that Sun's disassembler has mistreated the f3 prefix (REPZ).
LOCK: Prefix or instruction byte? | Wednesday, 27 May 2009 |
For opcode 0xF0 (the LOCK prefix), the disassembler says we have an instruction length of 1 byte, i.e. it treats it as a standalone instruction, and not as a prefix.
If we plant an FBT probe on this instruction (and there are a few in the kernel), then when we single-step, we will step the LOCK all on its own without the following instruction, leading, more than likely, to a kernel crash or bad semantics.
I am amending the disassembler to detect this, and treat LOCK properly as a prefix.
I found this whilst working through all the x86 instruction bytes, so we can enable the entire lot, and not special-case "known scenarios".
The other prefixes (such as REP/REPNZ, etc.) should likely be treated similarly, but I haven't found an example where they appear in the first instruction of a function.
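The kind of fix involved can be sketched as follows: when computing the length of the instruction we are about to probe, fold any prefix bytes into the instruction that follows them rather than treating them as one-byte instructions. The byte values are straight from the Intel manuals; the helper names are invented and this is not the actual disassembler code:

/* Hypothetical sketch: treat x86 prefix bytes as part of the next instruction. */
static int
is_prefix_byte(unsigned char b)
{
	switch (b) {
	case 0xf0:			/* LOCK */
	case 0xf2:			/* REPNZ */
	case 0xf3:			/* REP/REPZ */
	case 0x2e: case 0x36: case 0x3e:
	case 0x26: case 0x64: case 0x65: /* segment overrides */
	case 0x66:			/* operand-size override */
	case 0x67:			/* address-size override */
		return 1;
	default:
		return 0;
	}
}

/* Instruction length = number of prefixes + length of the prefixed instruction. */
static int
instr_size(const unsigned char *p, int (*dis_size)(const unsigned char *))
{
	int prefixes = 0;

	while (is_prefix_byte(p[prefixes]))
		prefixes++;
	return prefixes + dis_size(p + prefixes);
}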
mcount and gcc -pg | Saturday, 23 May 2009 |
This means we won't call into the kernel, and we can now safely do:
$ dtrace -n fbt::mcount:
and see all the calls to it.
So, we are safe again - we can probe all functions.
I am seeing an oddity on Ubuntu 8.10/64, whereby if I probe too many functions, I get a kernel trace in /var/log/messages like the one below; it looks like, as we enable all probes, we fire before we are really ready, and subsequently don't fire any probes at all. Reloading the driver fixes this, but I don't yet understand how/why this happens well enough to diagnose it:
[74745.164017] Call Trace:
[74745.164017] [] warn_on_slowpath+0x64/0x90
[74745.164017] [] ? dtrace_int3_handler+0x1f5/0x2f0 [dtracedrv]
[74745.164017] [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017] [] ? warn_on_slowpath+0x0/0x90
[74745.164017] [] smp_call_function_mask+0x22c/0x240
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] ? dtrace_int3+0x47/0x53 [dtracedrv]
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] ? do_flush_tlb_all+0x0/0x70
[74745.164017] [] smp_call_function+0x20/0x30
[74745.164017] [] on_each_cpu+0x24/0x50
[74745.164017] [] flush_tlb_all+0x1c/0x20
[74745.164017] [] unmap_kernel_range+0x2cd/0x2e0
[74745.164017] [] remove_vm_area+0x84/0xa0
[74745.164017] [] ? remove_vm_area+0x0/0xa0
[74745.164017] [] __vunmap+0x55/0x120
[74745.164017] [] ? __vunmap+0x0/0x120
[74745.164017] [] vfree+0x2a/0x30
[74745.164017] [] kmem_free+0x50/0x70 [dtracedrv]
[74745.164017] [] dtrace_ecb_create_enable+0x16d/0x20b0 [dtracedrv]
[74745.164017] [] ? dtrace_match_nul+0x9/0x10 [dtracedrv]
[74745.164017] [] ? dtrace_match_probe+0xa2/0x100 [dtracedrv]
[74745.164017] [] dtrace_match+0x1f5/0x2e0 [dtracedrv]
[74745.164017] [] ? dtrace_ecb_create_enable+0x0/0x20b0 [dtracedrv]
[74745.164017] [] ? error_exit+0x0/0x70
[74745.164017] [] ? kfree+0x21/0x100
[74745.164017] [] dtrace_probe_enable+0xb7/0x190 [dtracedrv]
[74745.164017] [] ? dtrace_match_string+0x0/0x50 [dtracedrv]
[74745.164017] [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017] [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017] [] ? dtrace_match_nul+0x0/0x10 [dtracedrv]
[74745.164017] [] dtrace_enabling_match+0x9e/0x200 [dtracedrv]
[74745.164017] [] dtrace_ioctl+0x214e/0x23f0 [dtracedrv]
[74745.164017] [] ? __mod_zone_page_state+0x9/0x70
[74745.164017] [] ? __rmqueue_smallest+0x11c/0x1b0
[74745.164017] [] ? ext3_get_branch+0x21/0x140 [ext3]
[74745.164017] [] ? put_page+0x20/0x110
[74745.164017] [] ? prep_new_page+0x103/0x180
[74745.164017] [] ? buffered_rmqueue+0x1b2/0x2a0
[74745.164017] [] ? get_page_from_freelist+0x2a6/0x380
[74745.164017] [] ? find_get_page+0x23/0xb0
[74745.164017] [] ? find_lock_page+0x37/0x80
[74745.164017] [] ? mark_page_accessed+0xe/0x70
[74745.164017] [] ? filemap_fault+0x1a3/0x430
[74745.164017] [] ? __wake_up_bit+0xd/0x40
[74745.164017] [] ? page_waitqueue+0xa/0x90
[74745.164017] [] ? unlock_page+0x32/0x40
[74745.164017] [] ? __do_fault+0x134/0x440
[74745.164017] [] ? __inc_zone_page_state+0x2a/0x30
[74745.164017] [] ? handle_mm_fault+0x1ee/0x470
[74745.164017] [] ? __up_read+0x8f/0xb0
[74745.164017] [] ? up_read+0xe/0x10
[74745.164017] [] ? do_page_fault+0x372/0x750
[74745.164017] [] dtracedrv_ioctl+0x2d/0x50 [dtracedrv]
[74745.164017] [] vfs_ioctl+0x85/0xb0
[74745.164017] [] do_vfs_ioctl+0x283/0x2f0
[74745.164017] [] sys_ioctl+0xa1/0xb0
[74745.164017] [] system_call_fastpath+0x16/0x1b
[74745.164017]
working again | Saturday, 23 May 2009 |
I had disabled many opcodes, and have been adding them back in. RIP-relative addressing (x86-64) is used in the kernel, and I have been getting that to work (again!).
I've hit a hopefully minor issue with 'mcount' - which is in the 2.6.27+ kernels (the ftrace facility). It starts with a RIP-relative instruction, e.g.
mcount:
	cmpq $ftrace_stub, ftrace_graph_return
	jnz ftrace_graph_caller
	cmpq $ftrace_graph_entry_stub, ftrace_graph_entry
	jnz ftrace_graph_caller
but I don't believe it is single-stepping over that initial CMPQ which is causing the issue; more likely it is whoever is calling it, e.g. the interrupt handlers themselves. Hopefully I will get to fix this shortly, as that would open up the possibility of enabling any instruction at the start of a function in fbt_linux.c.
I've also fixed a couple of CRiSP bugs/issues today, and I have one more to do before I put out an update for that.
some progress | Saturday, 23 May 2009 |
I *think* I have just put out a stable release - it works on 32-bit and 64-bit kernels. Some silly re-entrancy issues were not being handled properly. I need to test it more, but full FBT tracing seems to be working.
What does this mean? Well, if this stands the test of running on my various kernels and my real non-VMware hardware, I need to start moving along.
Next up is maybe to look at userland tracing, or at kernel stack trace operations, since that's a mess - kernels may be compiled with or without frame pointers, and if you don't have frame pointers, a stack trace can only ever be a guess at where you are. (The kernel uses '?' to indicate a stack entry is not necessarily valid - it walks the stack, word by word, to see if anything looks like a kernel text address).
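A rough sketch of that heuristic - walk the stack word by word and treat anything that lands in kernel text as a possible, but unverified, return address. This is illustrative only (the bounds and names are made up), not the kernel's actual unwinder:

/* Hypothetical sketch of a frame-pointerless stack scan. */
#include <stdio.h>

typedef unsigned long addr_t;

static int
looks_like_kernel_text(addr_t a, addr_t text_start, addr_t text_end)
{
	return a >= text_start && a < text_end;
}

static void
guess_stack_trace(const addr_t *sp, int nwords, addr_t text_start, addr_t text_end)
{
	int i;

	for (i = 0; i < nwords; i++) {
		/* '?' marks a word that may be a stale value, not a live frame. */
		if (looks_like_kernel_text(sp[i], text_start, text_end))
			printf(" ? %#lx\n", sp[i]);
	}
}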
I really need to get some CRiSP fixes done this weekend, along with a new driver to provide low-overhead TCP port to PID enumeration. I may go on to describe that in more detail later.
some progress | Wednesday, 20 May 2009 |
I rewrote the trap handlers to clean them up and put in a more powerful state machine, and it's dtracing beautifully on the 64-bit kernel (I need to revalidate 32-bit and other kernels).
I am moving the debug output in /dev/dtrace to /proc/dtrace:
/home/fox/src/dtrace@vmubuntu: ls /proc/dtrace
total 0
0 ./
0 ../
0 debug
0 security
0 stats
0 trace
The key one is /proc/dtrace/trace - I am trying to move away from printk for kernel debugging, and towards an internal printf-to-a-buffer mechanism (like FreeBSD), because debugging the trap handlers is painful if, by virtue of invoking printk, we invoke a recursive fault.
So, /proc/dtrace/trace is like /proc/kmsg - a private internal memory buffer used to log trace info.
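A minimal sketch of that kind of internal trace buffer (not the actual implementation - the names and sizes are invented): format into a fixed circular buffer instead of calling printk, and let the /proc read handler copy it out later.

/* Hypothetical sketch: printf into a private ring buffer instead of printk. */
#include <stdarg.h>
#include <stdio.h>

#define TRACE_BUF_SIZE	(64 * 1024)

static char   trace_buf[TRACE_BUF_SIZE];
static size_t trace_head;		/* next byte to write */

static void
dtrace_printf(const char *fmt, ...)
{
	char	tmp[256];
	int	len, i;
	va_list	ap;

	va_start(ap, fmt);
	len = vsnprintf(tmp, sizeof tmp, fmt, ap);
	va_end(ap);
	if (len > (int)sizeof tmp - 1)
		len = sizeof tmp - 1;	/* vsnprintf reports the untruncated length */

	for (i = 0; i < len; i++) {	/* wrap around when the buffer fills */
		trace_buf[trace_head] = tmp[i];
		trace_head = (trace_head + 1) % TRACE_BUF_SIZE;
	}
}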
More in a while when I feel happier.
reliability issues | Tuesday, 19 May 2009 |
Fixing the 32-bit kernel issues and redoing the INT3 handling via raw interrupt patching has caused the 64-bit kernel to become unreliable. (Unreliable means that within a few seconds of fbt::: probing, we crash the kernel.)
I *think* at this point it's due to a page fault firing whilst handling the breakpoint and single-step traps.
I am therefore revamping this core code to allow a nested page fault, and tidying up the code which had started to become a bit untidy.
Hopefully have an answer in the next day or so...
64-bit issues | Sunday, 17 May 2009 |
The IRETQ (return from interrupt) can be returning to user mode or kernel mode, but the interrupt handler in my code doesn't/didn't handle this distinction.
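The usual way to tell the two cases apart is to look at the saved code segment selector in the trap frame - the low two bits (the RPL) are 3 when the trap arrived from user mode and 0 from kernel mode. A hedged sketch, with a purely illustrative frame layout:

/* Hypothetical sketch: decide user vs kernel return from the saved %cs. */
struct trap_frame {			/* illustrative layout only */
	unsigned long rip;
	unsigned long cs;
	unsigned long rflags;
	unsigned long rsp;
	unsigned long ss;
};

static int
trapped_from_user(const struct trap_frame *tf)
{
	return (tf->cs & 3) == 3;	/* RPL 3 => interrupt arrived from user mode */
}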
More news in a while when I have a fix. (And I need to reverify 32-bit as well).
modrm - RIP relative addressing | Saturday, 16 May 2009 |
Looking at kprobes reminded me of this, so we need to do it too. (I could do with collecting stats on where such an instruction appears as the first instruction of a function; I could be worrying unnecessarily.) I am seeing some 64-bit instability (especially as I bring more probes into scope by enabling more instruction opcodes).
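For reference, in 64-bit mode RIP-relative addressing shows up as a ModRM byte with mod == 00 and r/m == 101, and single-stepping such an instruction out of line needs the displacement fixed up. A rough sketch of the detection step (the helper name is invented, not the actual disassembler interface):

/* Hypothetical sketch: does this ModRM byte use RIP-relative addressing (64-bit)? */
static int
modrm_is_rip_relative(unsigned char modrm)
{
	unsigned char mod = (modrm >> 6) & 3;
	unsigned char rm  = modrm & 7;

	/* mod == 00, r/m == 101 means disp32(%rip) in 64-bit mode. */
	return mod == 0 && rm == 5;
}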
Let me see if this improves things....
a problem... | Friday, 15 May 2009 |
$ dtrace -n fbt:::
immediately after loading the driver. If you give the driver a chance to 'breathe', and do this:
$ dtrace -n fbt::sys_chdir:
control-C it, and then redo the big test, all seems well. I suspect a first-time page fault or something is triggering this, but I am still diagnosing.
Security policy for dtrace now working | Friday, 15 May 2009 |
Here's an example:
/home/fox/src/dtrace@vmubuntu: cat /dev/dtrace
here=0
cpuid=1
all priv_kernel
uid 200 priv_proc
gid 201 priv_kernel
all priv_owner
(The first two entries are just debug.) The other entries form a table which is scanned in order. The first entry - "all priv_kernel" - enables everyone to do everything. This would not normally be put on a system, unless you want to allow anyone to do anything.
The next entry shows that user id 200 has the priv_proc (dtrace_proc) privilege, which allows that user to monitor only their own processes.
Enumerating uids may be too much work, so you can also specify a gid, as in the third entry.
So, the format is:
[uid | gid] <nnn> [priv_user | priv_kernel | priv_proc | priv_owner]
or
all [priv_user | priv_kernel | priv_proc | priv_owner]
Multiple privs can be specified on the same line, e.g.
uid 200 priv_proc priv_owner
(Not sure this is necessary, since the perms form a hierarchy, each one embracing more privilege than the prior, but I am mirroring the Solaris model.)
By default, /etc/dtrace.conf is read and loaded into the kernel. You can cat /dev/dtrace to verify that what it has is what you think it has.
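For illustration, parsing one line of that format might look roughly like this. This is not the actual loader - the struct, flag names, and helper are all invented for the sketch:

/* Hypothetical sketch: parse "uid 200 priv_proc priv_owner" style lines. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PRIV_USER   0x01
#define PRIV_KERNEL 0x02
#define PRIV_PROC   0x04
#define PRIV_OWNER  0x08

struct sec_entry {
	int	is_uid;		/* 1 = uid, 0 = gid, -1 = all */
	int	id;
	int	privs;		/* bitmask of PRIV_* flags */
};

static int
parse_line(char *line, struct sec_entry *e)
{
	char *tok = strtok(line, " \t\n");

	if (tok == NULL)
		return -1;
	if (strcmp(tok, "all") == 0) {
		e->is_uid = -1;
		e->id = 0;
	} else {
		e->is_uid = strcmp(tok, "uid") == 0;
		if ((tok = strtok(NULL, " \t\n")) == NULL)
			return -1;
		e->id = atoi(tok);
	}
	e->privs = 0;
	while ((tok = strtok(NULL, " \t\n")) != NULL) {
		if (strcmp(tok, "priv_user") == 0)   e->privs |= PRIV_USER;
		if (strcmp(tok, "priv_kernel") == 0) e->privs |= PRIV_KERNEL;
		if (strcmp(tok, "priv_proc") == 0)   e->privs |= PRIV_PROC;
		if (strcmp(tok, "priv_owner") == 0)  e->privs |= PRIV_OWNER;
	}
	return 0;
}

int
main(void)
{	/* usage example */
	char line[] = "uid 200 priv_proc priv_owner";
	struct sec_entry e;

	if (parse_line(line, &e) == 0)
		printf("is_uid=%d id=%d privs=%#x\n", e.is_uid, e.id, e.privs);
	return 0;
}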
security time | Wednesday, 13 May 2009 |
However, a defined security model is needed - mostly based on the Solaris equivalent, but we don't do things that way on Linux. Here is a comment added to dtrace_linux.h describing the mechanism.
If people object or have extra wish-list items, feel free to suggest them.
/**********************************************************************/
/* The following implements the security model for dtrace. We have */
/* a list of items which are used to match process attributes. */
/* */
/* We want to be generic, and allow the user to customise to the */
/* local security regime. We allow specific ids to have */
/* priviledges and also to do the same against group ids. */
/* */
/* For each security item, we can assign a distinct set of dtrace */
/* priviledges (set of flags). These are based on the */
/* DTRACE_PRIV_xxxx definitions. On Solaris, these would be set */
/* via policy files accessed as the driver is loaded. For Linux, */
/* we try to generalise the mechanism to provide finer levels of */
/* granularity and allow group ids or groups of ids to have */
/* similar settings. */
/* */
/* The load.pl script will load up the security settings from the */
/* file /etc/dtrace.conf (if available). See the etc/ directory */
/* for an example config file. */
/* */
/* The format of a command line is as follows: */
/* */
/* clear */
/* uid 1 priv_user priv_kernel */
/* gid 23 priv_proc */
/* all priv_owner */
/* */
/* Multiple priviledge flags can be set for an id. The array of */
/* descriptors is searched linearly, so you can specify a */
/* fallback, as in the example above. */
/* */
/* There is a limit to the max number of descriptors. This could */
/* change to be based on a dynamic array rather than a static */
/* array, but it is expected that a typical scenario will use */
/* group mappings, and at most, a handful. (Current array size */
/* limited to 64). */
/**********************************************************************/
BTW, if people wonder how those comments are neatly boxed, it's because I use the CRiSP editor to do my work, and a single hilight followed by Ctrl-F reformats in that mode. (It handles re-formatting also, so it's not tedious to make changes to the comments.) In addition, spell checking inside the comment let me notice that I have misspelled privilege in the above comment (but it's fixed in the source).
fixed predicates | Monday, 11 May 2009 |
dd if=/dev/zero of=/dev/null
showed this up with a simple dtrace call like:
$ dtrace -n 'syscall::read: /pid == 7331/ {printf("pid=%d", pid);}'
The process shadowing code in dtrace_linux.c (par_alloc) had been set to monitor only a single process during earlier debug sessions. I unset that and fixed some other sillinesses, and now it will create a new proc structure as it sees new processes. (There's still no garbage collection for the structs, and the data structure is a horrid linked list, so this can cause some slowdowns, as each probe will go on a search for the process to see if we have seen it before.)
That code needs a hash table, or, at least, to move the newly found proc to the front of the chain, or something, but that's relatively minor.
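The move-to-front variant is only a couple of pointer swaps on lookup. A sketch of the idea - the structure names here are made up, not the real par_alloc types:

/* Hypothetical sketch: move-to-front on a singly linked process list. */
struct shadow_proc {
	int			pid;
	struct shadow_proc	*next;
};

static struct shadow_proc *
proc_lookup(struct shadow_proc **head, int pid)
{
	struct shadow_proc *p = *head, *prev = 0;

	while (p && p->pid != pid) {
		prev = p;
		p = p->next;
	}
	if (p && prev) {		/* found: move to front so hot PIDs stay cheap */
		prev->next = p->next;
		p->next = *head;
		*head = p;
	}
	return p;
}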
Of course, there are likely to be more bugs, and I hope my recent fix here won't destabilise things.
I put in more debug trace hooks (visible in /dev/dtrace), but I will likely remove them or move them around, so don't depend on them for anything consistent.
And so the theory comes to pass... | Monday, 11 May 2009 |
If, for any reason, we hit a probe whilst processing a probe, all bets are off. This is most likely because of the way dtrace folds around the kernel, rather than being a part of the kernel source, but it's frustrating to have a hung kernel, or the CPU taking a triple fault (which simply causes a reboot), and having no way to diagnose it.
I note that the FreeBSD version of dtrace has an in-memory version of printf() to log trace info, which can then be inspected in a stable context. I was going to do that, but haven't gotten around to it. (dtrace_linux.c sort of has that - but only a minimal "this is what happened when I was blind/deaf".)
What it uncovers is that occasionally we get a page fault whilst processing the probe, which was causing the recursion error.
Not sure why - Linux allows kernel page faults - but I thought they wouldn't/couldn't happen inside a probe, since the kernel should be entirely memory resident (but maybe I am misreading).
Ideally, if this is the only reentrancy issue, then I need to allow for a page fault to be handled, because this is a very valuable probe. We may or may not be interested in kernel page faults, but we definitely want user page faults (I am not sure we can get one of those in an INT3 probe handler, since the probe will have fired from kernel space, and even if another CPU is doing user-land stuff, that other CPU will see the fault).
It leads to potentially interesting scenarios, such as a probe firing BECAUSE of a userland page fault, which in turn causes the kernel page fault. Only time and a clear head will tell.
Someone has kindly reported a bug where calls to the read() syscall don't always have matching entry/return probes. Every read should return, unless some form of exception happens.
So, let's see if this is easy to fix, and then I can go back to the security wrapper.
By the way, "cat /dev/dtrace" will give some stats - not many, but a key one is "dcnt", which tells you how many times dtrace_probe() has been called. Useful when trying to see if "anything is happening". There's a simple script - utils/dmon - which monitors this every second (so I can tell if my kernel has hung or is just 'busy').
mutexes and semaphores | Sunday, 10 May 2009 |
To be honest, I am seeing the same behaviour - 64-bit 2-CPU seems to work fine, but 32-bit 2-CPU seems to be flaky.
Next up: off to look at the places where I *don't* think the problem is, because, chances are, that's where the problem is. (My normal way of working: when poking at a piece of code and trying everything, the chances are high the problem is not in the area you thought it was.)
The main difference between 32 and 64 bit dtrace is really in the interrupt handler. Most other code differences are due to the size of registers or odd portability issues in the kernel...
That Coyote moment.... | Sunday, 10 May 2009 |
That's just happened to me with mutexes. I had transliterated the Solaris kmutex and mutex types into Linux mutexes, but this is illegal. A kernel mutex is part of a process context - it can block, and the calling process sleeps, allowing another process to run.
The documentation specifically states you cannot call the mutex API from an interrupt.
Guess what my code does ?! .... *KAPOW*
I believe I should be using semaphores and not mutexes, so am in the process of converting over to see what happens.
I believe this is a part of the puzzle of SMP kernels where under load the kernel will panic, hang or triple-fault the CPU.
More in a few days when I feel happier.
Meep-meep.
glibc: I take some of that back | Saturday, 09 May 2009 |
This is GCC's fault - it compiles in a stack protection mechanism by default, and it looks like it uses the %GS register as part of this mechanism.
Here's an example:
int main()
{
	char buf[10];
	xreadlink("/tmp/x", buf, sizeof buf);
}
and the assembler generated by compiling:
$ gcc -m32 x.c
$ cat x.s
	.file	"x.c"
	.section	.rodata
.LC0:
	.string	"/tmp/x"
	.text
.globl main
	.type	main, @function
main:
	leal	4(%esp), %ecx
	andl	$-16, %esp
	pushl	-4(%ecx)
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ecx
	subl	$36, %esp
	movl	%gs:20, %eax		// HERE
	movl	%eax, -8(%ebp)
	xorl	%eax, %eax
	movl	$10, 8(%esp)
	leal	-18(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	$.LC0, (%esp)
	call	xreadlink
	movl	-8(%ebp), %edx
	xorl	%gs:20, %edx
	je	.L3
	call	__stack_chk_fail
.L3:
	addl	$36, %esp
	popl	%ecx
	popl	%ebp
	leal	-4(%ecx), %esp
	ret
	.size	main, .-main
	.ident	"GCC: (Ubuntu 4.3.2-1ubuntu12) 4.3.2"
	.section	.note.GNU-stack,"",@progbits
This means a binary compiled with this gcc wont run on an AS4 (glibc 2.3) system.
The solution is to turn off this 'feature' (which I thought I had done).
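For what it's worth, the GCC switch that disables it is -fno-stack-protector (Ubuntu's gcc builds enable the protector by default), e.g.:
$ gcc -m32 -fno-stack-protector x.c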
Now off to recheck the gcc man page...
glibc is brain dead | Saturday, 09 May 2009 |
But glibc thinks differently. glibc thinks it can break applications and cause a nightmare for application maintainers.
We're talking CRiSP here, and not dtrace.
What has gotten my goat?
"readlink()" - a relatively rarely used syscall, but the glibc developers decided that when you call readlink, it should "optimise" the call and use the %GS register (used for threads - not sure what else... yet), rather than simply doing a function call, exactly as you would expect.
Why do I care? Because glibc stops you creating a binary on one release of Linux and having it run on any other (upwards or downwards).
Recently, I wrote a tool to remove the stupidity glibc applies to your object and executable files. It works nicely. An application like CRiSP, compiled on, say, glibc 2.7, can run on 2.3 as well.
Unless we call readlink(), in which case we get a segmentation violation.
Now that I know what the dastardly /usr/include files are up to, I shall put in a wrapper, and allow my terminal emulator (fcterm) and CRiSP to be portable.
There... that's off my chest.
Meanwhile...back in dtrace land...
There appears to be a bug in multi-CPU environments, where cross-CPU function calls are broken - leading to crashes or hangs. The function we call isn't safe from an interrupt routine, so I need to do some engineering work to allow the inner parts of dtrace_probe to work without the crash.
Not sure why I don't see this on the 64-bit platform, but you never know what bad things I am doing, which is why we all need to test and prove the validity of the dtrace releases.
32-bit problems .. again | Saturday, 09 May 2009 |
Interesting that I have been running VMware Ubuntu 32-bit on a single virtual CPU - and didn't detect issues, but now I have it set to 2 CPUs, they show up (just like on my real machines).
Hope to fix very soon....
The SDT Provider | Friday, 08 May 2009 |
Reading the recent FreeBSD 7.2 release notes, and looking at other people's dtrace blogs, I see mention made of the various providers which hide in this single/simple source file.
SDT is a provider which lets you compile calls to dtrace into the kernel. Probes at certain points compile to a NOP, unless the probe is enabled.
Remember, we are not going to break the GPL or CDDL licensing terms, and, anyway, we are not a part of the kernel source tree, and patching kernel source code is too tedious and error prone. (See the previous posting on issues with simply validating compiles.)
I have a new, novel, GPL-compatible and patent-free idea to resolve this.
SDT is mostly a set of probes on key parts of the various drivers which increment statistics. For example, the NFS driver has a lot of stats exposed via the "nfsstat" tool.
Patching the source code is painful, but that is only one way to get the desired effect.
Whilst working on FBT, an idea struck me, and I think we can use it for SDT. FBT disassembles the kernel, looking for function entry and exit points. What if we looked for something else? How about all "INC" instructions, or "ADD" instructions?
That would make for a very boring set of probes - probes based on instruction types!
But, say we looked at the relocation info whilst we disassembled - we could intercept the memory reads/writes to certain areas of memory - specifically, anywhere that falls inside a struct full of counters. Now we could instrument every/any such struct reference in the kernel (easier if we talk about a single static struct, not dynamically allocated structs such as the process tree). We can see these symbols in /proc/kallsyms, and we can create a table of the ones we want to intercept, and create on-the-fly providers.
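The first step - building the table of candidate counter structs - can be done entirely from /proc/kallsyms. A user-space sketch of that step; the symbol names in the table are examples of the kind of stats structs meant above, not a definitive list:

/* Hypothetical sketch: find addresses of candidate stats structs in kallsyms. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
	static const char *wanted[] = { "tcp_statistics", "ip_statistics", NULL };
	char line[256], type[8], name[128];
	unsigned long addr;
	FILE *fp = fopen("/proc/kallsyms", "r");
	int i;

	if (fp == NULL)
		return 1;
	while (fgets(line, sizeof line, fp) != NULL) {
		/* each line: address, symbol type, symbol name [, module] */
		if (sscanf(line, "%lx %7s %127s", &addr, type, name) != 3)
			continue;
		for (i = 0; wanted[i] != NULL; i++)
			if (strcmp(name, wanted[i]) == 0)
				printf("%s at %#lx\n", name, addr);
	}
	fclose(fp);
	return 0;
}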
That is my plan - and see what falls out.
Ideally we could do this with any/every struct like this in the kernel - maybe a piece of Perl code to find them, and some other glue (we may not have kernel source, but we will have the headers).
Anyway, I need to finish the cross-compile stuff (a 2.6.22 kernel is on my bug report list), and then write the security wrapper for dtrace, so you can lock down or liberate the use of dtrace in commercial environments.
Then, we play...
cross-compiling | Friday, 08 May 2009 |
Alas, I get build reports of issues on 2.6.19 and 2.6.22 and probably every other release out there. C'est la vie.
Although the included makefile has a target to build against all the kernels in /lib/modules, I have extended it to cross-compile 32-bit versions of your 64-bit kernels. This is mainly for my benefit - the output of this cross-built driver is almost certainly useless, but at least I get to see some of the compile errors and config issues.
Unlike most systems and tools, dtrace can be a better product because it supports the old systems, which, by their very nature, will never get "fixed", because a later kernel is available - but that doesn't help you if you are stuck with an ancient distro.
Success ! dtrace_gethrtime - you naughty boy. | Wednesday, 06 May 2009 |
I tried my normal debug tricks, e.g. a divide-by-two search, to find the problematic functions. When this failed, I had to re-engineer the interrupt hooks for INT1+INT3, so we attach directly to the vectors. The beneficial side effect is that we are more decoupled from the kernel and can trace int1/int3 traps from user space or kprobes.
But this just failed.
Annoying!
Very annoying!
Annoying because it's so difficult to debug - no way to print out where we got to. This required a lot of "thought experiments".
I knew it must be something like "a probe inside an interrupt routine", and/or "a probe interrupt invoking a recursive INT3 probepoint", but some test cases showed it was not so simple. It might run for a few seconds or crash straight away.
Next up was to deliberately force kernel panics to prove which bits of code were reached.
None of this was really conclusive, except to tell me "something is wrong".
I even read up on the Intel architecture manual to refresh my mind on the finer points of kernel traps. (I now know that when a trap is taken, the caller's SS:ESP is pushed onto the stack, but only on the transition from user to kernel. A kernel trap taken whilst already in kernel mode does not do this. It all makes perfect sense if you do a "thought experiment".)
Anyhow, I hit on the perfect test-case theory: let's assume it's something in dtrace_probe() which is causing the issue. My first reaction was, well:
$ nm -u build/driver/dtracedrv.ko
and stare at the list of functions which are undefined (linked in at module load time). There are quite a few, but most of the hairy and simple ones (like strchr, memchr, memcmp, sprintf) are not called (or are they?) at probe time, only during module startup.
Ok, next step: short-circuit dtrace_probe(). Success! No kernel hang. Making it return without doing anything proves the interrupt vector assignments work, and normal kernel interrupts and workloads are unaffected.
I added a counter so I could see we got there (cat /dev/dtrace).
Next, I started letting more and more of the code run.
Then I hit on the call to dtrace_gethrtime(). This is an interesting function. dtrace_probe() keeps track of the time since the userland program was started. If the kernel is looking unresponsive, e.g. because of a bug in dtrace, or, more likely, too-complex probing, then it will disassociate the D program, and you get a message from dtrace saying that the system appears to be unresponsive: an air bag! Very nice design point.
So, now I knew it was dtrace_gethrtime(). Why? Well, how do we get accurate (nanosecond-level) times out of the kernel? In userland we typically use the time() system call or gettimeofday(), and other variants. Internally, the kernel is driven by timer interrupts, and increments an internal counter.
This internal counter is then exposed to the rest of the system via a lock (mutex-like) mechanism. Since time is recorded in nanosecond units, a 32-bit entity would overflow in about 4 seconds, so a 64-bit quantity is used (typically split into seconds + nanoseconds). There is no atomic increment, or atomic read, of that pair: you might read the seconds just as the nanoseconds roll over. From an application point of view, the clock may be seen to go backwards (3.97, 3.98, 3.99, 4.99, 4.00, 4.01).
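The standard trick to avoid that torn read is to read the seconds on both sides of the nanoseconds and retry if they changed. A sketch of the idea - this is illustrative, not the code dtrace or the kernel actually uses:

/* Hypothetical sketch: read a seconds/nanoseconds pair without tearing. */
typedef unsigned long long hrtime_t;

struct ktime_pair {
	volatile unsigned long sec;
	volatile unsigned long nsec;
};

static hrtime_t
read_hrtime(const struct ktime_pair *t)
{
	unsigned long s1, ns, s2;

	do {
		s1 = t->sec;
		ns = t->nsec;
		s2 = t->sec;		/* if the seconds rolled over, try again */
	} while (s1 != s2);

	return (hrtime_t)s1 * 1000000000ULL + ns;
}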
This is not an issue for the later 2.6.20+ kernels - we have code to read the counter directly (but I am missing the dtrace-inspired code to handle correct atomic reads; that's for another day). Anyhow, on 2.6.9, the interface/naming conventions are different.
I cheated. I called do_gettimeofday(), which is effectively the userland system call interface. Well... it worked, the day I implemented it! But it's horrible - we cannot call this from an interrupt routine: if the clock is ticking and we hit a probe (very high probability if we plant probes on every function in the kernel), then we deadlock inside the interrupt routine.
So, that explains why we became unresponsive - no panic. Nothing. Niente.
After reviewing the dtrace_gethrtime() code, and do_gettimeofday() in the earlier kernel, there's a simple workaround.
And voila:
/home/fox/src/dtrace@vmas4: dtrace -n fbt::: | head -50
dtrace: description 'fbt:::' matched 15754 probes
CPU     ID                    FUNCTION:NAME
  0  12452                       schedule:entry
  0    793               recalc_task_prio:entry
  0    794              recalc_task_prio:return
  0     33                    __switch_to:entry
  0    878              remove_wait_queue:entry
  0    879             remove_wait_queue:return
  0  14313            e1000_watchdog_task:entry
  0  14442             e1000_update_stats:entry
  0  14491             e1000_read_phy_reg:entry
  0  14337        e1000_swfw_sync_acquire:entry
  0  14335  e1000_get_hw_eeprom_semaphore:entry
  0  14336  e1000_get_hw_eeprom_semaphore:return
  0  14339        e1000_swfw_sync_release:entry
  0  14492            e1000_read_phy_reg:return
  0  14491             e1000_read_phy_reg:entry
  0  14337        e1000_swfw_sync_acquire:entry
  0  14335  e1000_get_hw_eeprom_semaphore:entry
  0  14336  e1000_get_hw_eeprom_semaphore:return
  0  14339        e1000_swfw_sync_release:entry
  0  14492            e1000_read_phy_reg:return
  0    403      smp_local_timer_interrupt:entry
  0    950                   profile_tick:entry
  0    404     smp_local_timer_interrupt:return
  0    125                         do_IRQ:entry
  0    117               handle_IRQ_event:entry
  0    189                timer_interrupt:entry
  0   1122           update_process_times:entry
....
and this seems to work nicely.
For those of you with eager eyes, look at the functions we call. (I am ssh'ed into the machine, which is why the ethernet driver is busy).
I just got a kernel hang when piping dtrace into 'head', so I will fix that, and hope to do a new release tonight...
nearly there... | Monday, 04 May 2009 |
The current problem *looks like* double-fault handling. In the context of an INT3 interrupt, we may hit another one, and the kernel panics with a double-fault Oops message.
Later kernels seem to perform some magic with TSS states for processes and the kernel, which protects against this. Again, this is difficult to debug, and I am looking to understand/fix the latest issues.
More in a few days.
why did it work before? | Sunday, 03 May 2009 |
Turns out the later kernels are nice and clever. When a kernel-mode interrupt/exception occurs, a new stack is given to the interrupt. This new stack does not abut the caller's area, so when the instruction emulator wrote below %RSP, we never clobbered anything.
Kernels 2.6.28 upwards work really nicely, but 2.6.9 lacks this nested-kernel-stack facility.
Of course, we need this on the earlier kernels even if they don't support it, but we can emulate it, now we know what is going on.
The magic is in entry_32.S and entry_64.S of the kernel, where it judiciously manages the stack frame entry/exit.
Let's see if I can be as good a citizen...
more problems... | Sunday, 03 May 2009 |
I *think* this is due to random interrupts - either interrupts during an INT3 handler or vice versa. What's strange is that it does work for a bit, but then crashes.
I think this may be due to stack switching - my interrupt handler doesn't do anything special for the stack switch (or touch the %GS register), and maybe we end up with a miscommunication with the kernel - it assumes we haven't switched to a debug stack, but in fact we have... sort of.
Difficult to debug, because there's not much to print, and with the kernel crashing, it is hard to pick up the issue.
I am currently debugging on a 32-bit 2.6.27 kernel (a lot easier than the 64-bit 2.6.9 kernel, which just goes silent when things die).
Hope to get this fixed very shortly...
depression... or why is it so difficult?! | Saturday, 02 May 2009 |
Minutes to get this working. Then...nothing.
I kept staring at the code, wondering why it didn't work. (Side note: I was doing this on a 2.6.27 kernel; luckily the panics in the trap handler were recoverable, so I didn't need to reboot or restore my snapshots. I *do* like that about the later kernels - they try really hard to help you with bad drivers. Nice.)
Anyway, having gone a few hours with no forward progress, I stumbled on it today (after a good night's sleep):
	call dtrace_int3_handler
	cmp 1,%rax
	je int3_handled
Call the function, check the return code. Couldn't be easier.
It took me a long time to realise what is wrong in those lines of code above. (See the bottom of this update for the answer!)
Once I fixed that, it worked - enabling all probes on 2.6.27 was fine. I checked userland INT3 traps, i.e. a debugger's normal workload, and made a fix to avoid intercepting those - we can just let the kernel handle user land so that all registers, contexts, and stacks are correct. For kernel trapping, we can short-cut some of the complexity in the kernel, with no loss/effect on anyone else.
Next up is to address a compile issue in patching the debug interrupt gate on 2.6.9, since the structures changed a bit. Then I need to do the same assembler for 32-bit CPUs, and this phase should be complete.
Answer: "1" means address 0x0000000000000001, not the constant 1. It needs to be "$1" in the compare instruction.
The INT1/3 traps, re-entrancy, and reachability | Friday, 01 May 2009 |
The key difference I note between 2.6.9 and the much later kernels is re-entrancy. Think about an INT3 trap - we are in the kernel, we step on a breakpoint instruction, and trap into the kernel.
Now, whilst processing that INT3 trap, we step on another probed function - the path to the int3 handler in dtrace, and the path out of that handler, can double and triple trap. As I mentioned in a prior article, this is handled by detecting the fact that we hit a critical-path probe, and auto-disabling it.
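The auto-disable mechanism can be sketched as a per-CPU "already in a probe" flag checked at the top of the trap handler. This is a simplification of the real logic, and the names are invented:

/* Hypothetical sketch: detect a probe firing while another probe is handled. */
#define NCPU 64

static volatile int in_probe[NCPU];

static int
probe_enter(int cpu)
{
	if (in_probe[cpu]) {
		/* Recursive hit: the caller should disable the offending probe
		   and step past it without running the D program. */
		return 0;
	}
	in_probe[cpu] = 1;
	return 1;
}

static void
probe_exit(int cpu)
{
	in_probe[cpu] = 0;
}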
Now, back to 2.6.9 and AS4. The kernel interrupt code doesn't seem to understand double/triple nested faults, and screws up the stack, causing a crash. This means I am having to revalidate toxic.c to ensure these unanticipated functions/probes are on the blacklist.
I don't like this - so this is an intermediate solution: ensure
$ dtrace -n fbt:::
is stable for the kernels I have, and then fix this properly.
My next solution is to overwrite the int3/int1 trap handler vectors so that we take control, save all the registers, do our stuff, and get out quick.
This means we can put probes on the real int1/int3 handlers, along with the die_chain notifier list, and all the kprobes functions.
The only downside is a little more assembler in dtrace to save the stack and "do the right thing", but it means we are more kernel-independent.
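For reference, a 64-bit interrupt gate descriptor looks like the layout below; repointing INT1/INT3 amounts to rewriting the handler offset split across its three fields. The struct here is an illustrative definition from the architecture manuals, not code from the driver:

/* Hypothetical sketch: layout of an x86-64 IDT interrupt gate descriptor. */
#include <stdint.h>

struct idt_gate64 {
	uint16_t offset_low;	/* handler address bits 0..15  */
	uint16_t selector;	/* kernel code segment selector */
	uint8_t  ist;		/* interrupt stack table index (low 3 bits) */
	uint8_t  type_attr;	/* gate type, DPL, present bit */
	uint16_t offset_mid;	/* handler address bits 16..31 */
	uint32_t offset_high;	/* handler address bits 32..63 */
	uint32_t reserved;
};

static void
set_gate_offset(struct idt_gate64 *g, uint64_t handler)
{	/* spread the new handler address across the three offset fields */
	g->offset_low  = handler & 0xffff;
	g->offset_mid  = (handler >> 16) & 0xffff;
	g->offset_high = (handler >> 32) & 0xffffffff;
}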
Hope to release over the weekend.