What the heck is CTF? A Dead-end? Thursday, 29 July 2010
The CTF part of dtrace is a small library of functions shipped as part of the dtrace package. To date, I have largely ignored it - a few tweaks and it compiles nicely on Linux.

Recently, I have been playing with SDT probes - specifically, formulating a plan to get static probes into the kernel. These static probes act like high level macros, compared to FBT, which plants probes on function entry and exit. When an FBT probe fires, you have access to the raw arguments (arg0, arg1, ...), but you don't really have access to the structures those arguments may represent.

Let's look at "io:::start". In the Solaris kernel, these probes are placed in various places to indicate when a file system driver is about to do I/O. This is a high level probe - you don't care which filesystem is in effect (UFS, NFS, ZFS), and you don't have to work out which functions are the relevant ones to probe - nice and sweet.

But when these probes fire, args[0], args[1] and args[2] are defined to be pointers to structures representing the buffer, the device and the file info of the underlying vnode.

But *how does this work*?

Beats me !

What happens is that the SDT provider knows about these high level probes and the structures to be passed to the user space dtrace application. These structures (bufinfo_t, devinfo_t, fileinfo_t) are "created" by grabbing fields from the relevant internal structures. DTrace has a thing called a "translator" which is used to map from the internal representation to the D style structure. This avoids problems with trying to make the real structures visible to the D application. (One would need kernel level knowledge to get the #includes correct to even make the structures visible.)
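
For a flavour of what a translator looks like, here is a sketch in the style of the Solaris /usr/lib/dtrace/io.d (the member mappings are abbreviated and approximate, not a verbatim copy):

translator bufinfo_t < struct buf *B > {
        b_flags = B->b_flags;
        b_bcount = B->b_bcount;
        b_blkno = B->b_blkno;
        b_edev = B->b_edev;
};

When a script touches args[0] in an io::: probe, the D compiler quietly applies the translator to the raw struct buf pointer.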

What dtrace does is scan /usr/lib/dtrace/*.d and preload these "include files" when your script starts up, to make certain constants and structures visible to you.
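
As an example of what those files hold, the B_READ constant used in the script below comes from the shipped io.d via a D inline, along these lines (the value shown is illustrative - check the real io.d):

inline int B_READ = 0x40;	/* illustrative value */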

But how and where does a fileinfo_t structure get created?

I *think* this is done via the CTF (Compact C Type Format) library. CTF is a simple way to describe structures and members without the full complexity of DWARF debugging info. So, what Sun has done is ensure all the libraries in the system have a special ELF section (.SUNW_ctf), and this section is read from the libraries (for user space apps, or from the kernel for kernel probes) to find out what structures exist.

Alas, we don't have this ELF section in the executables on Linux. So we are going to have to be a bit more clever to get access to the internal structures.
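
You can check for the section yourself (commands from memory):

$ elfdump -c /kernel/genunix | grep -i ctf
$ readelf -S /bin/ls | grep -i ctf

On Solaris the first shows the .SUNW_ctf section header; on Linux the second comes back empty.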

To illustrate what I mean, consider this:

$ cat io.d
#pragma D option quiet

BEGIN { printf("%10s %58s %2s\n", "DEVICE", "FILE", "RW"); }

io:::start
{
        printf("%10s %58s %2s\n", args[1]->dev_statname,
            args[2]->fi_pathname,
            args[0]->b_flags & B_READ ? "R" : "W");
}

Consider:

  1. Where does B_READ come from? (Answer: /usr/lib/dtrace/io.d)
  2. Where does "dev_statname" come from?
  3. How does dtrace know that args[1] is convertible to a structure containing dev_statname?

The answer to the last two questions, I believe, belongs to the CTF scanner.

And that is where I am heading off to -- to see how we can do this on Linux.


Posted at 20:04:43 by fox | Permalink
  Single stepping an MOV %CR3,%EAX Tuesday, 27 July 2010  
Just been hunting down a bug in dtrace which would crash on the following, on an i386 (2.6.24) kernel:
$ dtrace -n fbt::native_flush*:
...

Turns out one of the matched functions contains a "MOV %CR3,%EAX", and when single stepping it, the CPU executes the *next* instruction too before the trap is delivered. The way dtrace is coded, we copy the instruction to be stepped into a buffer (because the original has a 0xcc INT3 as its first byte), but the rest of the buffer is null (0x00) filled. 0x00 isn't a good x86 instruction to randomly execute, leading to a panic.

The cure seems to be to put NOPs in the copied instruction buffer; dtrace is then happy to step over the double instruction, and we regain control and carry on nicely.
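
In code terms, the fix amounts to something like this (a sketch - the function name and buffer size are illustrative, not the actual dtrace symbols):

#include <string.h>

#define SCRATCH_SIZE 16                         /* illustrative size */

/* Prepare the single-step scratch buffer: pad with NOPs rather than
 * zeros, so that if the CPU runs one instruction past the copied one
 * (as after MOV %CR3,%EAX), it executes a harmless NOP instead of
 * decoding 00 00 as "add %al,(%eax)" with a junk pointer. */
static void
prepare_step_buffer(unsigned char *scratch, const unsigned char *insn, size_t len)
{
        memset(scratch, 0x90, SCRATCH_SIZE);    /* 0x90 = 1-byte NOP */
        memcpy(scratch, insn, len);             /* the instruction to step */
}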

Strangely, this didn't happen on amd64 (probably because the first instruction of a probed function never corresponded to the equivalent instruction).


Posted at 19:48:24 by fox | Permalink
  Dtrace progress 20100725 Sunday, 25 July 2010  
Have been busy implementing the io:::start and io:::done probes. The mechanism is different to the way Solaris/Apple do it, because we don't want to annotate kernel (GPL) source code.

In getting this to work, I found a problem with the disassembler code. The issue was handling multi-byte NOP instructions (as emitted by GCC to align jump targets to quad boundaries). Fortunately, Apple (or maybe Sun) had already fixed the issue, so I grabbed an updated dis_tables.c.
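
For reference (my annotation, not from the dtrace source), these are the Intel-recommended multi-byte NOPs - opcode 0f 1f /0 - that GCC emits for padding, and that the old tables tripped over:

static const unsigned char nop3[] = { 0x0f, 0x1f, 0x00 };             /* nopl (%rax) */
static const unsigned char nop4[] = { 0x0f, 0x1f, 0x40, 0x00 };       /* nopl 0x0(%rax) */
static const unsigned char nop5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* nopl 0x0(%rax,%rax,1) */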

In doing so, it exposed many more probe (fbt) points than previously seen. I was seeing about 48,000 fbt probes on my Ubuntu 10.04 kernel; after the above change, the number jumped to 250,000 - I assume because we now successfully find the multiple exit points in a function, rather than aborting when we hit one of these NOP instructions.

Anyway, here's a simple example of the io::: provider working, for just one type of probe.

/home/fox/src/dtrace@vmub10-64: dtrace -l | grep -w io
    4         io                                               start
    5         io                                                done
    6         io                                                done
    7         io                                                done
    8         io                                                done
    9         io                                                done
   10         io                                                done
   11         io                                                done
   12         io                                                done
   13         io                                                done
   14         io                                                done
   15         io                                                done
/home/fox/src/dtrace@vmub10-64: dtrace -n io:::
dtrace: description 'io:::' matched 12 probes
CPU     ID                    FUNCTION:NAME
  0      6                            done:
  0     15                            done:
  0      6                            done:
  0     15                            done:
  0      6                            done:
  0      7                            done:
  0      6                            done:
  0      7                            done:
  0      6                            done:
  0      7                            done:
  0      6                            done:
  0      7                            done:
  0      6                            done:
  0      7                            done:
  ...

Posted at 20:31:14 by fox | Permalink
  Dtrace: Formulaic Providers Thursday, 15 July 2010  
What's a "formulaic" provider?

Well, consider static dtrace providers (sdt). These are implemented in Solaris/MacOSX by annotating the code with calls into dtrace. The way it's done is like USDT (user space dtrace probes): the site of a function call is turned into NOPs, and converted back to a real subroutine call (or trap) when the probe is enabled.
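
Schematically, the life of a USDT call site looks like this (byte patterns illustrative, not lifted from any particular implementation):

/*
 * In the source:     DTRACE_PROBE1(foo, arg);
 * After dtrace -G:   e8 xx xx xx xx   call __dtrace_probe_foo
 *                    becomes
 *                    90 90 90 90 90   nop (x5) in the shipped binary
 * Probe enabled:     the nops are patched back into a call (or trap)
 *                    into the dtrace machinery at run time.
 *
 * A disabled probe therefore costs nothing but the fall-through nops.
 */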

Now lets turn to Linux. Poor Linux.

We can't touch the kernel source - the changes themselves are easy, but getting the kernel guys to adopt them is full of licensing issues. Never mind.

But what is a provider? Many of the core providers, like io:::start, are a bit like "macros": a shorthand convenience for plopping a probe on a number of locations in the kernel (along with an argument calling convention), and then trapping any of those call spots. For example, all "io:::" does is put traps around the read/write blocking paths (a simplified explanation) for each file system type, e.g. UFS, ZFS, NFS. (Internally, Solaris does it at the VFS layer, so the number of places to patch is small.)

So, how are we going to approach this in Linux?

Well, looking at the kernel source shows the likely places to put the probes, but we need to do this at run time (module load). The way to do it is to compute a formula which locates the right probe site, e.g. "the 3rd call instruction inside vfs_read(), and all the exits".

This is what I would call "formulaic". (Linux dtrace already has some formulaic code to allow syscall interception).
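
To make that concrete, here is a hypothetical sketch of the kind of formula table I have in mind - none of these names or values exist in the code today:

/* Hypothetical descriptor for a "formulaic" probe: instead of
 * hard-coding addresses, describe the site relative to a symbol and
 * let the module resolve it at load time by disassembling the
 * function. All names and values here are illustrative only. */
struct sdt_formula {
        const char *sf_func;     /* kernel function to scan, e.g. "vfs_read" */
        int         sf_nth_call; /* plant on the Nth CALL instruction */
        int         sf_all_rets; /* also plant on every RET (the exits) */
        const char *sf_probe;    /* probe name to publish, e.g. "start" */
};

static struct sdt_formula io_formulas[] = {
        { "vfs_read",  3, 1, "start" },  /* illustrative, not measured */
        { "vfs_write", 3, 1, "start" },
};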

Given the kernel can and will change in the future, finding a way to map and annotate these layers in a fairly high level way is the key to adding the static providers.

My first experiment will be the "io:::" provider (because io:::start and io:::done are very useful in many dtrace scenarios).

I will update the blog when I have something that looks reasonable.


Today's dtrace fix is for the interesting scenario of a 32-bit application executing the SYSENTER syscall instruction on a 64-bit kernel.

Posted at 21:43:19 by fox | Permalink
  Dtrace and illegal addresses Wednesday, 14 July 2010  
Been playing more with dtrace and illegal address probes. DTrace is supposed to be totally safe to run on a system: it accepts code from user space and does complex things at probe time and/or interrupt time.

How does it do this?

Nicely and logically, I would say. There's a number of layers which give the desired protection. Firstly, any D script code is validated in a sandbox to ensure the user app is not passing down code that could escape the sandbox. I haven't vetted all that code yet, but a cursory read shows it looks logical, disallowing things like jumps or loops, or invalid p-code instructions.

Next, the protection model distinguishes which uids are allowed to read or write addresses in the kernel.

Assuming you have permission (i.e. probably root), it then validates reads/writes to ensure that only scratch pad registers or known memory blocks are accessed, or that the address is in kernel address space. (Defining kernel-addressable memory gets complex with loadable modules, VM and user vs kernel space, along with PCI-type I/O addresses.)

Assuming we get through all of this, the final layer of protection is the CPU faulting on an invalid memory reference. Wherever dtrace is about to do something suspect, a flag is set (on the cpu we are running on) telling the trap handler to pass back a notification of any GPF and to continue execution. This ensures that the error can be reported without a user app core dumping or the kernel panicking on an unexpected address violation.
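
As a user-space analogue of that final layer (my sketch, not dtrace code - in the kernel the flag lives per-cpu and the trap handler does the check), the pattern looks like this:

/* User-space sketch of dtrace's "expected fault" pattern: arm a flag
 * before a risky dereference; the fault handler checks the flag and
 * recovers instead of letting the process (or kernel) die. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t nofault;   /* "a fault here is expected" */
static sigjmp_buf env;

static void
segv_handler(int sig)
{
        (void) sig;
        if (nofault)
                siglongjmp(env, 1);     /* recover: abandon the bad access */
        _exit(1);                       /* unexpected fault: give up */
}

int
main(void)
{
        long val = 0;

        signal(SIGSEGV, segv_handler);

        nofault = 1;
        if (sigsetjmp(env, 1) == 0)
                val = *(volatile long *)8;      /* an illegal address */
        else
                printf("bad address trapped; reporting an error instead\n");
        nofault = 0;

        printf("still alive (val=%ld)\n", val);
        return (0);
}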

It seems logical and sweet to me.

I have just patched in the bad address logic - I need to validate it (I am not sure if I am going to get a page fault (vector 14) or an int 13 general protection fault, so I will need to handle both).

I need this to debug a problem with the copyinstr() D function.


Posted at 22:07:55 by fox | Permalink
  dtrace progress 20100712 Monday, 12 July 2010  
The clone() syscall on a 2.6.32 kernel (64-bit) now works. I had to rework the assembler/stub glue - and at least I more or less understand why. Unfortunately the code has had to be conditionalised for ">= 2.6.32" kernels - I didn't see a quick win which unified the earlier kernels with this, which makes it a little unobvious.

I haven't tested on a 32-bit 2.6.32 kernel, so maybe someone will tell me whether it works or not.

There's some other syscalls which have a special assembler prologue (fork, iopl, sigaltstack, sigsuspend), and I haven't proved they need reworking. I am expecting them to, but at least

$ dtrace -n syscall:::
appears to be working again for me.

This really illustrates the complex environment dtrace needs to live in -- supporting not only 32 and 64 bit kernels, but all old and new kernels.

It's especially hard given we are running under the Sun CDDL license and not the GPL. Given Oracle now owns dtrace, I wonder if we can convert to something more palatable.

Now I am off to try out some more things to see if they work.


Posted at 22:26:34 by fox | Permalink
  dtrace progress 20100711 Sunday, 11 July 2010  
I've been fixing dtrace to run on the 2.6.32 kernel (hopefully the fixes are applicable to 2.6.33 - but time will tell).

There are two major bugs at present:

  1. The first is that the hrtimer is causing issues; I think I know why, and have put in an interim fix. This seems to have stopped the general panics caused by interrupts being disabled in an interrupt routine (local_irq_disable() called from the hrtimer callback).
  2. The second is more awful; the assembler glue used to launch syscalls changed somewhere between 2.6.30 and 2.6.32, and now a few syscalls will panic the kernel, e.g. "clone" (called by fork()). I nearly understand why, but it's a mess having to support lots of kernels and the 32 and 64 bit variants. For now,
      $ dtrace -n syscall::[d-z]*:
      
    works but not:
      $ dtrace -n syscall:::
      

Not sure when I will fix these - maybe this week, or maybe in August - so keep checking back.


Posted at 21:40:35 by fox | Permalink
  sizeof(long) in Dtrace Saturday, 10 July 2010  
Someone sent me a small dtrace script which I have been poring over for the last week, trying to figure out why it generates an error (dereferencing a typedef).

$ cat tests/dt021
/* Basic test for typedefs/structures. */
typedef unsigned long my_dev_t;
typedef struct {
	my_dev_t dts_dev;			/* device */
	int dts_necbs;				/* total number of ECBs */
} my_dtrace_state_t;

BEGIN { trace(((my_dtrace_state_t *)arg0)->dts_necbs); }

$ dtrace -S -s tests/dt021
DIFO 0x1caa4c0 returns D type (integer) (size 4)
OFF OPCODE    INSTRUCTION
00: 29010601  ldgs DT_VAR(262), %r1   ! DT_VAR(262) = "arg0"
01: 25000002  setx DT_INTEGER[0], %r2 ! 0x0
02: 04010201  sll %r1, %r2, %r1
03: 05010201  srl %r1, %r2, %r1
04: 25000102  setx DT_INTEGER[1], %r2 ! 0x8
05: 07010201  add %r1, %r2, %r1
06: 1e010001  ldsw [%r1], %r1
07: 23000001  ret %r1

NAME                 ID  KND SCP FLAG TYPE
arg0                 106 scl glb r    D type (integer) (size 8)

dtrace: script 'tests/dt021' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 24

I learnt a lot from such a simple script. What is arg0 in a BEGIN clause? Why and how does the kernel handle such things? (Note that DIF instructions are 4 bytes each, so the "DIF offset 24" in the error above is instruction 06 - the ldsw, i.e. the actual load from kernel memory.)

arg0 is a pointer to a structure containing useful info for a dtrace script (a form of introspection). The script above works fine on FreeBSD and MacOS, which I was able to confirm. The bottom line: a bug in the dtrace port meant that the credentials (uid) of the caller were not being passed in an ioctl(), so the kernel was refusing to honor the peek into kernel memory.

It's a shame that dtrace doesn't print a meaningful error message here. I had to scatter the code with lots of printk()s to follow cause and effect.

Having solved this, I came across another issue - namely, sizeof(long) is either misdefined in all the dtrace implementations, or the documentation is wrong:

http://wikis.sun.com/display/DTrace/Types,+Operators+and+Expressions

The docs say that a long is 4 bytes, no matter the platform. But dtrace codifies sizeof(long) as 8 for a 64-bit platform. Maybe I am misreading the code (libdtrace/dt_open.c). You can select a data model on the command line.
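
For example (the -32 and -64 flags select the data model; I would expect the first to trace 8 on a 64-bit kernel, and the second to trace 4):

$ dtrace -64 -n 'BEGIN { trace(sizeof(long)); exit(0); }'
$ dtrace -32 -n 'BEGIN { trace(sizeof(long)); exit(0); }'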

Another interesting thing: on MacOSX, /usr/sbin/dtrace is a dual mode application, containing both a 32 bit and a 64 bit binary. Not sure I understand why (presumably so that if Snow Leopard is installed on older, 32-bit-only hardware, it all still works). It's not clear, when I run on a 64-bit capable cpu, which version gets selected.

Another thing I notice and dislike in dtrace is the lack of C++ style comments. I may just add that; no reason not to, even if the competition (Apple, FreeBSD - we won't talk about Oracle for now) don't.


Posted at 23:30:19 by fox | Permalink
  Dual publishing with blog.pl and blogspot.com Monday, 05 July 2010  
Seems to me that it's best if I dual publish the same blog to both http://www.crisp.demon.co.uk/blog and http://crtags.blogspot.com, as I potentially get better google search coverage by doing this. So I will try and keep them in sync with the same scripts.

Just trying to enhance dtrace to support some user reported issues. And fix at least one of my own.


Posted at 22:00:11 by fox | Permalink
  Dtrace progress .. restarted Saturday, 03 July 2010  
After a long period of not touching dtrace, I need to fix some things. Thanks to the people sending me bug reports and code fixes.

I fired up an Ubuntu 10.04 VirtualBox session, as I know there's something erratic in this kernel causing dtrace to bomb out. I think I know the culprit area, but need to drill down to find out what I am doing wrong. (I recall a problem in using the cpu cross-call functions, which are not interrupt-level safe.) A simple:

$ dtrace -n syscall:::
works for a bit and then generates kernel errors, effectively crashing the machine.

Also I need to fix the struct/typedef user level bug causing a strange segmentation violation.

As an aside, I am temporarily pausing my research into full text indexing. After reviewing my code, there are big bottlenecks in the way hash tables are being manipulated - at exactly the areas I wasn't expecting. It's not the putting of words into the hash tables that's at fault, but sorting the entries into alphabetical order on large hash tables (300k+ entries), and allocating/clearing down these large hashes. I need to do some fine tuning or look at better algorithms. Definitely interesting that this should be the case.


Posted at 21:47:00 by fox | Permalink
  DTrace / CTF Hashing function Friday, 02 July 2010  
Am playing with hashing functions. Why? Because I feel like it. No real reason, other than trying to speed up my full text search engine. (Why am I writing a full text search engine? Err... no real reason either: because I can, because I want to better understand some of the issues around this, because I may be able to incorporate it into CRiSP (it's bundled with CRiSP), and because it may be useful on its own.)

Anyway, I was browsing ctf_hash.c in the DTrace package, and there seems to be something wrong with the hash function:

static ulong_t
ctf_hash_compute(const char *key, size_t len)
{
        ulong_t g, h = 0;
        const char *p, *q = key + len;
        size_t n = 0;

        for (p = key; p < q; p++, n++) {
                h = (h << 4) + *p;

                if ((g = (h & 0xf0000000)) != 0) {
                        h ^= (g >> 24);
                        h ^= g;
                }
        }

        return (h);
}

The first problem is that variable "n" is unused.

The second problem: this seems to be a weak hashing function. "h" is basically just the bytes of the key, each shifted up by 4 bits.

For short strings (less than 7 chars), we never execute the XOR lines - after n bytes the highest set bit is at most 4(n-1)+7, so bits 28-31 only come into play from the 7th byte onward. And the hash doesn't seem very strong or well distributed. I am guessing the assumption here is that most strings in the kernel, or in an application, are longer than 7 bytes. So, some experimentation is required to see if the distribution is flat.

Secondly, since we know the size of the string (len), the loop can be specialised. For strings of less than 7 chars, we could avoid the "g" mask/test/jump entirely. Something like this:

static ulong_t
ctf_hash_compute(const char *key, size_t len)
{
        ulong_t g, h = 0;
        const char *p, *q = key + len;

        if (len < 7) {
                /* Bits 28..31 can never be set this early, so the
                 * mask/test/XOR would never fire; skip it. (Note the
                 * cutoff is "< 7", not "<= 7": for len == 7 the test
                 * can fire, and we must keep results identical.) */
                for (p = key; p < q; p++)
                        h = (h << 4) + *p;
        } else {
                for (p = key; p < q; p++) {
                        h = (h << 4) + *p;

                        if ((g = (h & 0xf0000000)) != 0) {
                                h ^= (g >> 24);
                                h ^= g;
                        }
                }
        }

        return (h);
}
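
As a starting point for that experimentation, a small throwaway harness along these lines (my sketch, not part of the dtrace source) reads words on stdin and prints the bucket occupancy, to eyeball how flat the distribution really is:

/* Throwaway harness: hash each line of stdin into NBUCKETS buckets
 * and print the occupancy of every bucket. */
#include <stdio.h>
#include <string.h>

#define NBUCKETS 1024

typedef unsigned long ulong_t;

static ulong_t
ctf_hash_compute(const char *key, size_t len)
{
        ulong_t g, h = 0;
        const char *p, *q = key + len;

        for (p = key; p < q; p++) {
                h = (h << 4) + *p;
                if ((g = (h & 0xf0000000)) != 0) {
                        h ^= (g >> 24);
                        h ^= g;
                }
        }
        return (h);
}

int
main(void)
{
        static int counts[NBUCKETS];
        char buf[256];
        int i;

        while (fgets(buf, sizeof buf, stdin) != NULL)
                counts[ctf_hash_compute(buf, strcspn(buf, "\n")) % NBUCKETS]++;

        for (i = 0; i < NBUCKETS; i++)
                printf("%4d %d\n", i, counts[i]);
        return (0);
}

Something like "./a.out < /usr/share/dict/words", compared against a known-good hash, should show whether the weak mixing matters in practice.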


Posted at 21:59:00 by fox | Permalink