DTrace and Xen: part #2 Sunday, 28 October 2012  
I wrote a few days ago about DTrace and Amazon EC2 not playing well together. As I have dug into the problem, it has become clear what is wrong.

Xen uses paravirtualisation - meaning that privileged instructions need to go through the correct API, else the VM may crash abruptly. It's quite distracting that Xen does this - there is not even the ability to fall back to more normal semantics at a performance cost.

Over the many iterations of the Linux kernel, these privileged areas have been mapped via #define macros, so that if CONFIG_PARAVIRT is set, a function call is made which invokes the hypervisor; if CONFIG_PARAVIRT is not set, the direct instructions are executed.

This is mostly straightforward; e.g. the IDT instructions (SIDT and LIDT) don't do the right things in a Xen guest, because Xen keeps a copy of the IDT outside of the address space the kernel can see. So, any changes need to be channelled through the correct API. In addition, direct memory tampering with IDT entries is not allowed - you must tell the Xen hypervisor that you just changed an entry.
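
As a rough illustration of that dispatch idea (a toy model only, not kernel code), the sketch below uses two interchangeable backends: one writes the IDT entry directly, the other forwards the change to a stand-in hypervisor. All class and function names here are hypothetical and chosen for the example; the real kernel does this in C.

```python
class Hypervisor:
    """Stand-in for Xen: owns the real IDT, outside the guest's view."""

    def __init__(self):
        self.real_idt = {}
        self.calls = []          # vectors the guest asked us to change

    def set_trap(self, vector, handler):
        self.calls.append(vector)
        self.real_idt[vector] = handler


class NativeOps:
    """Bare metal: the kernel writes the IDT entry itself."""

    def __init__(self):
        self.idt = {}

    def write_idt_entry(self, vector, handler):
        self.idt[vector] = handler


class XenOps:
    """Xen guest: every IDT change is forwarded to the hypervisor."""

    def __init__(self, hypervisor):
        self.hypervisor = hypervisor

    def write_idt_entry(self, vector, handler):
        self.hypervisor.set_trap(vector, handler)


def install_breakpoint_handlers(ops, int1_handler, int3_handler):
    """The caller is identical on both backends; only `ops` differs."""
    ops.write_idt_entry(1, int1_handler)
    ops.write_idt_entry(3, int3_handler)
```

The point of the model is that code which always goes through `ops` works in either environment, whereas code which pokes at the table directly only works on the native backend.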

After making these changes to DTrace, it now loads and unloads, and, with some corrections to the page-table mapping code, syscall tracing works.

I have two or three areas to work on to complete this work.

Firstly, the interrupt handlers (for INT1 and INT3) are broken - I haven't played by the Xen API rules, so I need to fix that. (I also need to make sure that the correct paravirt detection is done; a kernel which is set up as a Xen guest can be run on physical hardware or inside Xen; the API functions hide this detail.)

Second, multi-cpu operation needs to be corrected. The code in xcall.c which invokes the NMI-based IPI cross-cpu calls is too low level and doesn't play well with Xen. If I restrict my guest to a single CPU, things work well; on a multi-cpu guest, the system locks up because the cross-cpu calls are not delivered or processed properly.

Lastly, having made these changes, I need to handle old kernels which lack the Xen API calls, so we don't break compatibility.

I now have a Xen guest on my main machine - a nuisance, because VirtualBox won't run on a Xen kernel. So I either need to migrate my VMs to Xen, or migrate to VMware or KVM.

Posted at 21:12:53 by fox | Permalink
  DTrace and the Art of Xen Wednesday, 24 October 2012  
Someone reported that DTrace was failing on an Amazon EC2 instance. By all accounts, this should work - it's an Ubuntu 12.04 kernel, after all.

Isn't short-sightedness great?! Of course it works - I test on Ubuntu - all the many Ubuntu kernels, as well as Fedora. How dare they report this doesn't work!

Of course, as you slowly unravel the detective story, you realise how right they are (facts don't lie) and how my world is shaped in some imaginary Universe....

So, the issue is the Xen virtualisation. I know a little... very little... about Xen - it's paravirtualisation, and it's in the kernel.

But what does that *mean* ? You can read Wikipedia and many web articles and rarely does the whole picture fit together. And this is where it gets interesting.

DTrace runs in kernel space. Running inside the Linux kernel is like running inside MS-DOS - you can execute any and every instruction, and the good thing about every cpu since the 80286 is that the segmentation and MMU support means that bugs can be trapped when attempting to access out of bounds areas (GPF, page faults, or core dumps).

DTrace, in many respects, is simple and kernel agnostic (it could be ported to Windows, for instance - a rainy day project maybe). DTrace needs to understand the interrupt descriptor table and some aspects of page tables, and occasionally needs to disable interrupts. Most of the bulk of DTrace is implementing the virtual machine for when traps occur.

This applies whether you are on real hardware or inside a VM, such as VMware or VirtualBox (and, I believe, KVM/QEMU).

But Xen is different. Xen runs the guest kernel almost as if it runs in user space, and traps the instructions which require privilege. It's an illusion. Where VMware and VirtualBox trap privileged instructions, like STI/CLI and SIDT/LIDT, Xen can do this, but provides an escape hatch through which the VM guest has to communicate, asking the hypervisor to do things for it. There's complexity over things like page table management - in VMware/VBox, you can modify page table entries and 'the right thing happens'. In Xen, you cannot.

All communication with Xen takes place via a special "portal" - the SYSCALL instruction, sitting in a special page. The Linux kernel wraps the key instructions and operations in an API. On real iron, those instructions execute directly; in a Xen guest, the functions translate to the API calls.

If you attempt to run DTrace (or a guest O/S) without these API wrappers, the wrong things happen. And that's what happens to DTrace - GPFs where none are expected.

I am working through the issues, experimenting to find the right thing to do, and will issue an update to DTrace for Xen when I have concluded this avenue of research.

For anyone who is interested, here is a link which describes in some detail, aspects of page table management in Xen - which helps reinforce that there is a "right way" for Xen.


Posted at 23:27:53 by fox | Permalink
  DTrace update Monday, 15 October 2012  
Haven't really touched dtrace in a while - bar some minor bug reports.

Claudio K. sent me an interesting mail, questioning why this didn't work:

# dtrace -n 'syscall:::entry {
	self->start = timestamp;
	self->file = fds[arg0].fi_pathname;
}
syscall:::return /(timestamp - self->start) > 1073741824/ {
	printf("%d ns on %s", timestamp - self->start, self->file);
	self->start = 0;
}' -p 26544

I briefly glanced at the command, noted the use of "-p" and assumed this was the problem. Claudio highlighted the 3rd line, referring to the fds[] array, and I was scratching my head wondering what was going on here. The code makes sense, but I was trying to figure out how this actually worked.

Research showed that fds[] is handled by this definition in etc/io.d:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

Now this was commented out in etc/io.d a long time back, by me, because I was never ready for it. Those structures are Solaris structures.

An excellent posting by the author is available here on translators:


So far, so good. So I started converting this to Linux structures and, apart from a minor issue in the libdtrace code (to do with ctf access to the kernel structures), it's not far off:

$ dtrace -n syscall::open:'{printf("%p", fds[arg0].fi_offset);}'

So, open, arg0 is a string, and we want the filename. The nice thing about these translators is that they are a recipe for accessing struct-like members without having to precreate the return value for all elements of the structure. (I had fallen foul of this in existing driver code - and can start to discard the horrible code!)

But the *filename*. Well, Solaris has access to this. Linux has access to the filename (but I need to do some homework, because Linux has access via a function, and dtrace won't let us do that at the moment). But on *MacOS*, it's interesting because they simply did not bother.

translator fileinfo_t < struct fileglob *F > {
    fi_name = (F == NULL) ? "<none>" :
        F->fg_type == DTYPE_VNODE ?
                ((struct vnode *)F->fg_data)->v_name == NULL ? "<unknown>" :
        F->fg_type == DTYPE_SOCKET ? "<socket>" :

Example from MacOS:

dtrace -n syscall::open:entry'{printf("%s", fds[arg0].fi_name);}'

dtrace: description 'syscall::open:entry' matched 1 probe

CPU     ID                    FUNCTION:NAME
  1  18659                       open:entry <none>
  1  18659                       open:entry <none>
  0  18659                       open:entry <none>
...

If you try to access the filename of a file descriptor on MacOS, you get an "unknown" output (or "socket", etc). You cannot gain access to the filename - they just haven't implemented that facility. Which is a shame, as it is mightily powerful.

On Linux, despite a function being available in the kernel to convert a file to a name, it is a mutex/blocking function, so we cannot call it directly, and may need a private implementation without blocking semantics (occasionally, this could lead to output corruption, but that should be rare for the scenarios we are normally interested in).

I'll spend some time seeing if I can get something to work in this area, and I will have usefully learnt something whilst adding a new valuable feature to DTrace (or rather, got a facility working as it should do).

Posted at 22:45:42 by fox | Permalink
  Process Groups and fork speed Thursday, 11 October 2012  
Was just trying out an experiment. Am surprised that my i7 laptop CPU (2.0GHz) can only achieve 200 fork/sec on Ubuntu 12.04. I would expect it to do much better.
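
For anyone wanting to repeat this kind of measurement, here is a minimal sketch, assuming a Unix system with Python available. It is not the experiment described above: it forks children one at a time and waits for each, so the figure is really the fork+exit+wait round trip. The function name `measure_fork_rate` is made up for the example.

```python
import os
import time


def measure_fork_rate(n_forks=200):
    """Fork n_forks children that exit at once; return forks per second."""
    start = time.perf_counter()
    for _ in range(n_forks):
        pid = os.fork()
        if pid == 0:
            os._exit(0)        # child: exit immediately, skipping cleanup handlers
        os.waitpid(pid, 0)     # parent: reap the child before forking again
    elapsed = time.perf_counter() - start
    return n_forks / elapsed


if __name__ == "__main__":
    print(f"{measure_fork_rate():.0f} forks/sec")
```

The result depends heavily on how much memory the forking process has mapped, so a small C program, a shell loop and a Perl script can give very different figures on the same machine.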

Why do I care? Well, have been experimenting with process ids and process groups - a part of Unix for decades, yet rarely understood, except by those writing shells or other job control types of activities.

Run the following command:

$ ps -j
  PID  PGID   SID TTY          TIME CMD
  347   345  3179 pts/4    00:00:00 launch.pl
 1374  1371  3179 pts/4    00:00:00 launch.pl
 3179  3179  3179 pts/4    00:00:00 bash

This shows three processes - one is my shell. Note the PGID column. What is it?

Well, the process group mechanism is the thing which ensures that when you hit Ctrl-C, you kill all the child processes, but not the shell itself.

The shell invokes the system call setpgrp() and the child and all its children sit in a group.

The wonderful thing about process groups is they provide a means to allow killing them all, without having to do the equivalent of "ps -aef" to find all the procs in the system. (Imagine you want to kill all the children and grandchildren, even if these children are fork-bombing you; in a fork-bomb type scenario, by the time you have done a "ps" to find the PID, it will have already forked a copy of itself and the PID may no longer be valid).
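
A minimal sketch of that idea, assuming a Unix system: a child makes itself a group leader, forks a few grandchildren, and the parent then kills the lot with a single call, never needing to know the grandchildren's PIDs. The helper names `spawn_group` and `kill_group` are hypothetical; the pipe is only there so the parent can tell when every member has gone.

```python
import os
import signal
import time


def spawn_group(n_children=3):
    """Fork a group leader plus n_children grandchildren.

    Returns (pgid, read_fd). Every group member keeps the pipe's write
    end open, so read_fd reaches EOF only once the whole group is dead.
    """
    read_fd, write_fd = os.pipe()
    leader = os.fork()
    if leader == 0:
        os.close(read_fd)
        os.setpgid(0, 0)                 # become leader of a new group
        for _ in range(n_children):
            if os.fork() == 0:
                time.sleep(60)           # grandchild: idle until signalled
                os._exit(0)
        os.write(write_fd, b"+")         # tell the parent the group is ready
        time.sleep(60)
        os._exit(0)
    os.close(write_fd)
    os.read(read_fd, 1)                  # wait for the ready byte
    return leader, read_fd


def kill_group(pgid, read_fd):
    """SIGKILL every member of the group with one call, then wait for EOF."""
    os.killpg(pgid, signal.SIGKILL)      # same effect as kill(-pgid, SIGKILL)
    leftover = os.read(read_fd, 1)       # b"" once no member holds the pipe
    os.close(read_fd)
    os.waitpid(pgid, 0)                  # reap the leader (our direct child)
    return leftover
```

Because the signal is addressed to the group rather than to individual PIDs, there is no window in which a member can fork and escape a PID list gathered by "ps".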

The PGID is interesting; normally it's set to the PID of the process group leader (root of the tree of processes). You can change it when you like, but you can only change it to your own PID (or, as discussed further down, to a PGID which already exists in your session).

If you do this, and then fork, and have the parent pid terminate, you can end up with a situation (such as the launch.pl procs above) where the PID != PGID.

Now the PGID has an important property. Whilst a PGID of value nnn exists, you cannot fork a new process which gets nnn as its PID. Doing so would mean the new process is joining an existing process group (and this would be a security issue). (I wrote a script to keep forking until we hit a specific PID, but it never happened, and debugging showed this scenario - PGIDs and PIDs exist in the same name space.)

So, you could create 10,000 pids, each with a distinct PGID, and steal 20,000 entries of the pid address space. (Many Linux systems limit you to 10,000 pids per user id.)

I stumbled across this whilst trying to prove a theorem about process killing - and it's good, because it means the real problem I am trying to solve is not amenable to a race condition or attack.

There is a converse issue: the setpgrp() system call *CAN* fail. If we try to join an existing PGID, we can do so *only if* the session-id (SID, 3rd column in the ps listing) is the same. If we are sitting in the same xterm, we can do this; if we are in a different xterm, we cannot.

SID and PGID are confusing ideas, but effectively the SID is acting as a kind of policeman over the PGID address space. And this stops a disparate group of processes merging into the same PGID as another. Although setpgrp() can be used to set a specific PGID, there is no syscall to set a specific session-id. The setsid() syscall takes no arguments.
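
The policing can be seen in a small sketch, assuming Linux or another POSIX system: one child starts its own session with setsid(), and a second child then tries to move into the first one's process group. The function name `try_join_foreign_group` is invented for the example; the attempt is made in a throwaway child so the calling process's own group is left alone.

```python
import errno
import os
import signal


def try_join_foreign_group():
    """Return the errno seen when joining a group in another session (0 if none)."""
    r, w = os.pipe()
    target = os.fork()
    if target == 0:
        os.close(r)
        os.setsid()                  # new session and new group, both = our PID
        os.write(w, b"+")            # tell the parent the session exists
        signal.pause()               # wait here until killed
        os._exit(0)
    os.close(w)
    os.read(r, 1)                    # target has now called setsid()
    os.close(r)

    joiner = os.fork()
    if joiner == 0:
        try:
            os.setpgid(0, target)    # try to move into target's group
            code = 0
        except OSError as e:
            code = e.errno
        os._exit(code)
    _, status = os.waitpid(joiner, 0)

    os.kill(target, signal.SIGKILL)  # tidy up the target
    os.waitpid(target, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(errno.errorcode.get(try_join_foreign_group(), "no error"))
```

On Linux the joiner should get EPERM, since the target group lives in a different session from the caller.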

This potentially leads into trouble, because one could use 10,000 session ids, and then grab 10,000 process-group ids, and sit on 10,000 pids, and the system would (nearly) grind to a halt - Linux actually allows about 33000 unique pids (the default pid_max is 32768) before reusing them. But two userids can collude to eat all the available pids.

Another note on setsid() - it will fail if you are a process group leader (PID == PGID); typically, a child will do the setsid(), in which case the SID is set to the PID of the calling process. (So my prior paragraph doesn't hold true - SIDs are a function of a PID; if the proc which does a setsid() forks+exits, then you can have a situation where no PID exists with the same value as a SID, e.g. a launcher process terminates.) But in any case, you are not going to join someone else's process group whilst you have a distinct SID. This is important - if you are writing forking daemons, setsid() must be called, else you can interfere with the daemon in some way if you carry on launching processes from the same xterm session (technically, the same SID).
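
Both halves of that note can be checked with a short sketch, assuming a POSIX system. A throwaway child first makes itself a group leader and tries setsid(), which should be refused; it then forks, and the grandchild (which is not a group leader) calls setsid() in the usual daemon fashion. The function name `setsid_results` is hypothetical.

```python
import os


def setsid_results():
    """Return (leader_was_refused, child_sid_equals_pid) from throwaway children."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        os.setpgid(0, 0)                 # make ourselves a process group leader
        try:
            os.setsid()
            leader_refused = False
        except PermissionError:          # EPERM: group leaders may not setsid()
            leader_refused = True
        child = os.fork()                # daemon step: fork, then setsid() in the child
        if child == 0:
            os.setsid()
            ok = os.getsid(0) == os.getpid() == os.getpgid(0)
            os.write(w, bytes([leader_refused, ok]))
            os._exit(0)
        os.waitpid(child, 0)
        os._exit(0)
    os.close(w)
    data = os.read(r, 2)                 # two flag bytes written by the grandchild
    os.close(r)
    os.waitpid(pid, 0)
    return bool(data[0]), bool(data[1])


if __name__ == "__main__":
    print(setsid_results())
```

This is why daemon code forks before calling setsid(): the fork guarantees the caller is not a group leader, so the call can succeed and the daemon ends up in a session of its own.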

Posted at 22:33:32 by fox | Permalink