DTrace/ARM syscall provider | Tuesday, 30 April 2013 |
Next, tackling the clone() syscall. After a bit of brainstorming and trial and error, it turned out I was doing everything right...but in the wrong order. Once I realised that - clone() works. A few minutes later, execve() is working.
And after a tedious bit of typing, the remaining 7 are good.
So, that effectively completes the first phase of DTrace/ARM.
Next up is to fix some DWARF issues on 3.8 and above kernels and gcc toolchains, and decide what to do next.
I'll hopefully put up a new release tonight if I can validate I haven't broken anything too much.
DTrace/ARM update | Saturday, 27 April 2013 |
--- doc/ARM.txt
DTrace can run on the ARM processor. The ARM CPU exists as a number of variants, from tiny embedded CPUs, to full blown general purpose CPUs, commonly found in smart phones and other systems. As of this writing, ARM/64 is coming.
The earliest ARM processors had limited memory support (many instructions refer to 26-bit addresses); later processors can support 4GB of memory.
The ARM port of DTrace has been done in a KVM virtual machine, targeting a custom kernel (Debian/Wheezy and 3.6.11 kernel) for the RaspberryPi, which is an ARMv6 architecture kernel and CPU.
This specific kernel was chosen, simply so that I could tally the kernel binary with the source code, in order to clarify how the ARM architecture worked, and specifics of debugging probe functions. In theory DTrace should work on earlier and later kernels.
As of this writing, SMP kernels have not been tried. Almost certainly, DTrace will not work on an SMP system (because I have not validated the xcall CPU code). It *might*, but I suspect it will not, and there is a need to verify this.
This port relies on the register_undef_hook() kernel function to intercept the FBT probes. FBT probes are implemented by using an undefined instruction and handling the traps it generates. This is different to x86, where 0xCC (INT3) and single-step mode are used to manage probes which are taken. (ARM appears not to have single-step mode execution).
The file toxic.c is updated to avoid those parts of the interrupt fabric which are needed for the probes to fire, so we are more conservative in what can be probed (this mostly won't matter to most people). "dtrace -n fbt:::" is *safe*, as far as my testing is concerned, and the toxic probe functions reflect areas which have caused trouble. It is possible that more research is needed if you attempt to step outside of what I have personally tested.
Summary:
* ARMv6 architecture only
* No support for Thumb mode (or, not tested with Thumb user apps)
* Validated against RaspberryPi
* Not validated against Android
* Not validated on SMP
* Not validated on ARM/64
* Not validated on < ARMv6 or > ARMv6
* FBT works
* USDT has not been validated
* SYSCALL is dummied out - to be fixed in subsequent release
Future:
Some of the items above will be addressed in future versions, especially SYSCALL, followed by SMP, and eventually, Android.
DTrace/ARM | Monday, 22 April 2013 |
After a lot of head scratching...I finally figured out my mistake. Interestingly, it takes me back to the earliest part of the x86 port, where I made the exact same mistake.
When a breakpoint (in the case of ARM, an undefined-instruction trap) fires, we are sitting on a kernel stack - a nested stack of the point where the breakpoint occurred. As I recounted in the last blog post, we don't have a single-step trap on the ARM, so we cannot do the same logic flow as x86 (single step over the instruction where we placed the breakpoint probe).
Instead, we have to emulate the instruction. The initial tactic was to emulate the PUSH instruction - a class of instructions (ARM encodes a multitude of operations into a single 32-bit instruction, to do things in parallel, compared with x86). When emulating a PUSH, we cannot "push" because the place we want to push is between where the SP register was, and the stack frame for the trap we just took.
    ---------------
    | original    |
    | callers     |
    | stack       |
    |             |
    ---------------  SP at the time of the FBT trap
    | saved       |
    | registers   |
    |             |
    ---------------  SP in the FBT handler
This is solvable, by moving the saved-registers area around. So, the alternate trick is to call into a scratch buffer:
    ---------------
    | orig instr  |
    ---------------
    | jmp OPC+4   |
    ---------------
where OPC is the original trap program counter.
This actually works well, except for two scenarios. The first is that the original instruction may be doing PC-relative addressing, so we either have to emulate that, or rewrite the instruction to avoid the PC-relative addressing.
The second issue is that we cannot have a single scratch buffer per CPU, because the scratch buffer may be modified by another trap or interrupt which fires. So, we need a scratch buffer per probe. (The scratch buffer is sized to be 5 x 32-bit instructions long - the longest "rewrite" requires 3 instruction slots and we need 2 slots to handle a 32-bit JMP instruction).
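To make the scratch buffer layout concrete, here is a rough sketch in Python (a model for illustration, not the DTrace kernel code). The names `build_scratch` and `is_pc_relative_load_store` are made up for this example. It assumes a little-endian ARM and assumes the two-slot "32-bit JMP" is `ldr pc, [pc, #-4]` followed by the target address; the actual port may do this differently. The PC-relative check only covers single-register LDR/STR, which is just one of the cases that would need a rewrite.

```python
import struct

SCRATCH_WORDS = 5                 # slots per probe, as described above
LDR_PC_NEXT_WORD = 0xE51FF004     # ldr pc, [pc, #-4]: jump to the address in the next word


def is_pc_relative_load_store(instr):
    """True for a single-register LDR/STR whose base register is pc (r15).

    Only the single data transfer class is checked (bits 27:26 == 01,
    excluding the ARMv6 media space where bit 25 and bit 4 are both set).
    Other instruction classes that read pc are out of scope for this sketch.
    """
    if ((instr >> 26) & 0x3) != 0b01:
        return False
    if ((instr >> 25) & 1) and ((instr >> 4) & 1):
        return False
    return ((instr >> 16) & 0xF) == 15


def build_scratch(orig_instr, opc):
    """Lay out one probe's scratch buffer: orig instr, then a jump to OPC+4."""
    if is_pc_relative_load_store(orig_instr):
        raise ValueError("pc-relative instruction: rewrite or emulate instead")
    words = [orig_instr, LDR_PC_NEXT_WORD, (opc + 4) & 0xFFFFFFFF]
    words += [0] * (SCRATCH_WORDS - len(words))      # unused slots
    return struct.pack("<%dI" % SCRATCH_WORDS, *words)
```

For example, `build_scratch(0xE92D4010, 0xC0012340)` lays out `push {r4, lr}` followed by a jump back to 0xC0012344, while an instruction such as `ldr r0, [pc, #8]` (0xE59F0008) is rejected because it would read a different pc when run from the scratch buffer.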
Now, with that done, I have modified dtrace to start channelling "proven" instructions through the FBT probe handler - only if we have implemented the rewrite support will we fire. And the results are striking! From a single FBT probe to 1,000,000+ probes (until I ^C'ed dtrace).
As I ramp up coverage of all kernel functions, I am hitting a few exceptions (instructions not being properly rewritten, or functions needed by the undefined-instruction trap handler itself; the latter are handled by the blacklist in toxic.c).
So, we are nearly finished in FBT function tracing - and after that is the syscall provider (which should be much easier to do than FBT, albeit there is a lot of quirky code to handle the various arguments passed to the syscalls).
This DTrace/ARM port does not handle SMP configurations, since my VM (or RaspberryPi) doesn't include SMP support - but this can happen later.
I'll release the new code when I am happy fbt is "done".
Number 3 | Wednesday, 17 April 2013 |
push {r4, lr}
is sufficient to handle a large number of entry points (that push instruction approximately handles all single arg functions). By generalising to handle any PUSH instruction, we can handle a lot more, and so on. (In ARM assembler, you can push any permutation of all 16 registers on the stack in one instruction - the registers are bit encoded).
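As an illustration of the bit encoding, the sketch below decodes the register list of a PUSH. It assumes the multi-register ARM-mode encoding, where `push {...}` is `stmdb sp!, {...}` and the low 16 bits are one bit per register; `decode_push` is a name invented for this example, and only the unconditional ("always") form is matched.

```python
REG_NAMES = ["r%d" % i for i in range(13)] + ["sp", "lr", "pc"]

PUSH_MASK = 0xFFFF0000
PUSH_BITS = 0xE92D0000   # stmdb sp!, {...} with the "always" condition


def decode_push(instr):
    """Return the pushed register names, or None if instr is not a PUSH."""
    if (instr & PUSH_MASK) != PUSH_BITS:
        return None
    # Bit i of the low halfword set means register i is in the list.
    return [REG_NAMES[i] for i in range(16) if instr & (1 << i)]
```

With this, `decode_push(0xE92D4010)` gives `['r4', 'lr']`, which is the instruction shown above.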
However, I hit a problem. Scaling back, I found that:
    $ bash
    $ cd
    $ cd
    $ cd
would cause bash to get a segmentation violation. Why it's the third chdir - I haven't figured out. A simple test app doesn't show this. Internal prints and close code review don't make it obvious what I am doing wrong, even if I distill this down to a basic trap handler.
It seems like something is being corrupted on return back to the invoking process, but the trace is not logical. I have to try really hard to push all preconceptions from my mind and look for the "unobvious". Hopefully, I can find it (I don't think any debugging tool can help, unless I could somehow trace execution forwards, at the instruction level on the third chdir() syscall).
If I can solve this, then it opens up FBT to handle as many sequences as I care to emulate - but, because of the way ARM probes are implemented (by me), I will have to be careful of some functions which could cause a recursion issue (which doesn't exist on the x86 DTrace implementations).
Nobody said this was gonna be simple...
DTrace/ARM .. some progress at last | Sunday, 14 April 2013 |
    /home/fox/src/dtrace@raspberrypi: uname -a
    Linux raspberrypi 3.6.11 #4 Mon Mar 18 21:26:49 GMT 2013 armv6l GNU/Linux
    /home/fox/src/dtrace@raspberrypi: build/dtrace -n fbt::sys_chdir:entry
    dtrace: description 'fbt::sys_chdir:entry' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      0   3170                  sys_chdir:entry
    ^C
My prior attempts at intercepting the invalid opcode handler failed. I don't fully understand why, but having gotten this far, I may have a better understanding of what I did wrong to be able to tackle this again. The prior attempt tried to come in at a low level. The current attempt does what kprobes does and uses the register_undef_hook() kernel function to add a handler for invalid opcode manipulation. This is nicer, in that it is pure C code - no assembler - but it is not so good because it will preclude tracing certain low-level functions.
The example above is special - the ARM does not (does it?) have single-step support, so handling FBT probes requires an ARM emulator to handle the continuation of an FBT probe. The code in dtrace handles the PUSH instruction on entry to sys_chdir. (sys_chdir was chosen as it's easy to fire on demand, and there's no background activity firing it except when I want it to).
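What "emulating the PUSH" means can be sketched as a toy model in Python. This is not the dtrace code: `emulate_push` is a made-up name, `regs` is assumed to be a list of 16 integers in which `regs[15]` holds the address of the probed instruction, and `memory` is a plain dict from address to 32-bit word. The model also ignores where the trap handler's own saved registers live on the kernel stack.

```python
def emulate_push(instr, regs, memory):
    """Apply stmdb sp!, {reglist} to a model register file and memory.

    The lowest-numbered register ends up at the lowest address, sp drops
    by 4 bytes per register, and pc steps past the emulated instruction.
    """
    reg_list = [i for i in range(16) if instr & (1 << i)]
    new_sp = (regs[13] - 4 * len(reg_list)) & 0xFFFFFFFF
    addr = new_sp
    for i in reg_list:
        memory[addr] = regs[i]
        addr += 4
    regs[13] = new_sp                        # write-back to sp
    regs[15] = (regs[15] + 4) & 0xFFFFFFFF   # continue after the probe site
```

The point of the sketch is that only the side effects of one instruction need reproducing: a few stores, an updated sp, and a pc that resumes at the next instruction.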
The next step is to start advancing the ARM emulator (I have been studying what kprobes does to get an idea of how complex this is - it is complex, but it only needs certain instructions to be emulated - those on entry and exit of a function - not every instruction. Now I can start looking at all entry points to see what are the common functions).
I realise that this only targets the raspberrypi cpu (armv6l) and fully expect any other system to require more hard work to handle the various ARM chipsets, along with ARM64. My only other target ARM chip is a galaxy note 2 (Android), so eventually, the goal is to try and get this working on Android, but that's a later step.
(The sources aren't released yet - there's no mileage in doing so, but I will release them in a few days or weeks [most likely], when I feel the substance of the ARM/DTrace port is more functional).
In case anyone asks: Why am I doing this? Because I can. No more to it than that. If anyone wants to pay me or send me appropriate hardware, I am happy to consider prioritising the work, but there's no guarantee progress will be fast.
BTW, I created my first "Hello world" Android app the other day. It didn't work (some cross compilation issue). I want to solve that so I can get CRiSP running on Android. (CRiSP actually works quite nicely with the ConnectBot ssh emulator and running remotely; but that's not really very useful).
/proc/kcore on ARM | Tuesday, 09 April 2013 |
And Linus wonders who on earth uses it, and maybe it should be withdrawn?
Well, here's a cool trick if you do kernel level work:
    $ gdb /bin/ls /proc/kcore
    ....
    (gdb) x/x 0x12345678
See what's happening?
We use any arbitrary binary to allow gdb to run, e.g. /bin/ls. We tell gdb that we are using the "core" file for the currently running kernel.
Once inside gdb, we can peek the active kernel memory, e.g. looking at instructions or data corresponding to the item of interest.
This is useful whilst debugging, say DTrace, to validate what an FBT probe has done to the target locations.
So, if anyone says, let's remove it, please reference this article and show a good use case.
(You don't need gdb to do this - the dd and hd tools can be used to seek around in /dev/kmem, assuming you can get them to work with the right arguments, but you cannot disassemble memory via normal tools, and you need something in an ELF or core file format to use the standard ELF/binutils tools).
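For the curious, the address lookup that gdb does against the core file can be sketched in a few lines of Python. This is only an illustration: it assumes /proc/kcore on the target is a little-endian ELF32 core file (as I would expect on a 32-bit ARM kernel, but have not checked), and `kcore_offset` and `peek_word` are names invented here. It walks the program headers to find the PT_LOAD segment covering a kernel virtual address.

```python
import struct

PT_LOAD = 1


def kcore_offset(f, vaddr):
    """Return the file offset in an ELF32 core file that holds vaddr, or None."""
    f.seek(0)
    ehdr = f.read(52)                            # size of an ELF32 header
    if ehdr[:4] != b"\x7fELF" or ehdr[4] != 1:   # EI_CLASS 1 == ELFCLASS32
        raise ValueError("not a 32-bit ELF file")
    (e_phoff,) = struct.unpack_from("<I", ehdr, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", ehdr, 42)
    for i in range(e_phnum):
        f.seek(e_phoff + i * e_phentsize)
        # ELF32 program header starts: p_type, p_offset, p_vaddr, p_paddr, p_filesz
        p_type, p_offset, p_vaddr, _paddr, p_filesz = struct.unpack("<5I", f.read(20))
        if p_type == PT_LOAD and p_vaddr <= vaddr < p_vaddr + p_filesz:
            return p_offset + (vaddr - p_vaddr)
    return None


def peek_word(f, vaddr):
    """Read one 32-bit word of kernel memory at vaddr through the core file."""
    off = kcore_offset(f, vaddr)
    if off is None:
        raise ValueError("address not covered by any PT_LOAD segment")
    f.seek(off)
    return struct.unpack("<I", f.read(4))[0]
```

Run as root, something like `peek_word(open("/proc/kcore", "rb"), addr)` would then return the word at a kernel address, e.g. to check whether an FBT probe has replaced the instruction there. You still need gdb or objdump to disassemble it.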
2-Months and 4 lines of assembler | Sunday, 07 April 2013 |
There's a useful link here:
http://www.poppopret.org/?p=251
I stumbled on this a few days ago, and it's very similar to what I am doing - it's comforting that it's closely aligned, but DTrace doesn't work that way.
In general, DTrace is doing what a rootkit would do - the same technologies can be used for nefarious purposes or useful purposes.
So, let's start. In order to stop on an FBT probe, we need to patch the instruction with a BREAKPOINT instruction. ARM (now, but not in the early days) has a breakpoint instruction. On the Intel CPUs, a breakpoint generates an interrupt/trap via the IDT. ARM works differently, and tracking down a good place to take over the breakpoint handler proved difficult. (I may have got the code right, but am plagued by the same problem described below).
So, in order to make this simpler, let's consider the following:
    instr1
    instr2
    instr3
    instr4
    ...
Given the above assembler, let's patch it as follows:
        b DTrace_Handler
    label:
        instr3
        instr4
        ...

    DTrace_Handler:
        instr1
        instr2
        b label
All instructions are 32 bits on ARM, and the branch used here is a single instruction plus an additional word containing the target address, so it occupies 2 instruction slots. If we patch a function, we need to execute the two instructions we displaced. There are a lot of restrictions on what we can do here, e.g. neither instr1 nor instr2 may be the target of a jump in the body of the function. This is fine for the single function I am targeting (do_PrefetchAbort) which handles a breakpoint trap. (This patching is part of the mechanism to allow FBT to work, but it is *not* the mechanism of FBT itself).
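The two ways of encoding such a jump can be sketched as follows (Python, for illustration only; `encode_b` and `long_jump` are names made up here, and the `ldr pc, [pc, #-4]` form for the two-word jump is my assumption about what the patch uses). A plain `b` fits in one word but only reaches about +/-32MB from the patch site; the two-word form can reach any 32-bit address.

```python
def encode_b(from_addr, to_addr):
    """Encode an unconditional ARM 'b' placed at from_addr, jumping to to_addr."""
    offset = to_addr - (from_addr + 8)       # pc reads as instruction address + 8
    if offset % 4 != 0:
        raise ValueError("branch target must be word aligned")
    if not -(1 << 25) <= offset < (1 << 25):
        raise ValueError("target out of range for a single b instruction")
    return 0xEA000000 | ((offset >> 2) & 0x00FFFFFF)


def long_jump(to_addr):
    """Two-word jump to any 32-bit address: ldr pc, [pc, #-4] then the address."""
    return [0xE51FF004, to_addr & 0xFFFFFFFF]
```

For instance, `encode_b(0x1000, 0x1000)` produces 0xEAFFFFFE, the familiar branch-to-self encoding, and `long_jump(0xC0008000)` gives the two words that would overwrite instr1 and instr2.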
Given the above sequence, it should be simple. I can prove the DTrace_Handler works (e.g. by adding code to increment a viewable counter) and can see the function get executed lots of times.
If I then set an FBT breakpoint trap (e.g. sys_umask is good - only hit on demand), my kernel hangs. If I avoid modifying do_PrefetchAbort but do the FBT trap, then I get a kernel error due to an unhandled breakpoint trap. This latter evidence is useful - I am not crashing the kernel when the probed function is hit, and I know the FBT code is doing something.
But combine the two together - the patched code above, and an FBT breakpoint handler, and the kernel hangs.
Despite only dealing with 4-5 instructions, it does not work. Ergo, something outside of my expectation area is at work. But I have yet to find it.
The example in the link referenced above uses a similar mechanism, but it is not foolproof or SMP/interrupt safe - it can lose some hits; but the mechanism is similar to what I am doing in DTrace - and just getting the basics to work is surprisingly difficult (I really need a kernel debugger to see what is going on - but I haven't set one up).