The Case for a Safe XCall Tuesday, 27 November 2012  
I've written about this before, and it's time to write again - dtrace_xcall(). This is the function DTrace uses on multiprocessor systems to sync the CPUs so that they can agree, e.g. on which buffer to use when logging the traced probes.

dtrace_xcall() is the function in the driver to do this. On Linux, this maps to smp_call_function() and friends. A CPU cross-call is an interesting concept - the ability for one CPU to make a function call on another CPU. It is rarely needed, and if it is done in a way that breaks the calling protocol, you can lock up one or more CPUs or crash the system.

On Linux, the cross-call (or IPI, interprocessor-interrupt), can be seen by examining /proc/interrupts and looking for a line like:

$ cat /proc/interrupts
....
CAL:      52770      36972      Function call interrupts
...

My system has had many hours of uptime, yet the calls are rare (the above is showing the calls for each CPU).
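To watch these counters rather than eyeball them, something like the following Python sketch could pull the per-CPU numbers out of that line. The function name cal_counts is made up for this illustration, and the sample text is just the line shown above; on a live system you would feed it the contents of /proc/interrupts instead.

```python
def cal_counts(interrupts_text):
    """Return the per-CPU 'CAL' counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "CAL:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description text
                counts.append(int(field))
            return counts
    return []  # no CAL line present


sample = "CAL:      52770      36972      Function call interrupts"
print(cal_counts(sample))
```

Sampling this twice, a second or so apart, while a heavy dtrace is running would show how quickly the cross-call count climbs.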

When dtrace is called on to do a heavy action, like:

$ dtrace -n fbt:::

tens or hundreds of thousands of probes may be collected per second. DTrace has two internal buffers to log these probes, and the buffers will fill up quickly, and the calls to the IPI xcall code will happen a lot.

I've been pondering how this actually works - both on Linux and Solaris.

Let's take a thought experiment:

Imagine a dual-CPU system. One processor is sitting inside a locked region, with interrupts disabled. The other CPU is trying to access the same lock/region. Now this other CPU is blocked until the first CPU exits the lock.

Now, imagine this again. This time, the first CPU holds the lock for a very long time. This would block the other CPU indefinitely. Normally this is rare - the kernel arranges to never hold locks for long periods of time.

Now, let's modify this scenario. One CPU is holding on to a lock, interrupts disabled, and does an IPI cross-call. The other CPU is holding on to a different lock and has interrupts disabled. The first CPU cannot interrupt the other CPU and so we deadlock. In normal scenarios, this mutual blocking cannot happen (other than through bugs in the kernel or drivers).

IPI interrupts are just like normal interrupts - they are held pending while interrupts are disabled, and processed when interrupts are reenabled.
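The thought experiment can be played out in a toy model. This is only a sketch in Python, not kernel code: the Cpu class, send_xcall() and simulate() are invented for the illustration, and the model assumes each CPU's interrupt state stays fixed (i.e. a CPU with interrupts disabled never gets far enough to reenable them).

```python
class Cpu:
    def __init__(self, name, irqs_enabled):
        self.name = name
        self.irqs_enabled = irqs_enabled
        self.pending = []        # CPUs that sent us an IPI and await an ack
        self.waiting_on = None   # CPU we are spinning on, if any


def send_xcall(sender, target):
    """Post an IPI to `target`; `sender` then spins until it is acknowledged."""
    target.pending.append(sender)
    sender.waiting_on = target


def simulate(cpus, max_steps=10):
    """Return 'done' if every cross-call is acknowledged, else 'deadlock'."""
    for _ in range(max_steps):
        progressed = False
        for cpu in cpus:
            # A pending IPI is only serviced while interrupts are enabled.
            if cpu.irqs_enabled and cpu.pending:
                sender = cpu.pending.pop(0)
                sender.waiting_on = None   # acknowledge the cross-call
                progressed = True
        if all(c.waiting_on is None for c in cpus):
            return "done"
        if not progressed:
            return "deadlock"
    return "deadlock"


# Sender has interrupts off, but the target can still take the IPI.
cpu0, cpu1 = Cpu("cpu0", False), Cpu("cpu1", True)
send_xcall(cpu0, cpu1)
print(simulate([cpu0, cpu1]))

# Both CPUs have interrupts off and cross-call each other.
cpu0, cpu1 = Cpu("cpu0", False), Cpu("cpu1", False)
send_xcall(cpu0, cpu1)
send_xcall(cpu1, cpu0)
print(simulate([cpu0, cpu1]))
```

The first case completes and the second never does, which is the shape of the problem: the IPI is not lost, it just sits pending on a CPU that is itself waiting.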

The kernel smp_call_function() call has a contract: it must not be called with interrupts disabled. Doing so generates a kernel log/BUG warning, and indicates the kernel could deadlock.

When we use DTrace, we can place a probe on any function in the kernel, especially functions which run with interrupts disabled. This means we break the contract.

(I note Oracle UEL Linux DTrace simply calls smp_call_function() and suffers the bug, unless they have fixed it). In my DTrace, I take steps to avoid calling the Linux smp_call_function() and implement my own. It *seems* to work.

Whilst examining Xen, I had great difficulty finding a way to avoid smp_call_function(), so someone invoking fbt probes with interrupts enabled can cause deadlocks or long livelocks. (DTrace will detect a mutual deadlock and break the lock, but this is horrible and can panic the kernel in some extreme circumstances).

DTrace is supposed to be reliable and the above behavior is horrible. Simply *HORRIBLE*.

I have a (new) workaround...see below.

But why doesn't Solaris have this problem? Well, the Solaris xcall code is intimate with the kernel interrupt code, and a CPU waiting in an xcall runs with interrupts enabled. (The whole Solaris/BSD kernel uses interrupt priorities to allow much of the kernel to run with interrupts enabled, and even when interrupts are disabled, deadlocks cannot occur).

I wish I understood the above paragraph more - but experiments and user success stories demonstrate that Solaris has no deadlock issue.

Ok, so the solution for Linux.

If you followed the above carefully, you will note the problem is caused if we try to do a cross-call whilst interrupts are disabled. And interrupts are disabled either when probing a function in interrupt handler code, or inside a locked region.

So, let's disable probes whilst interrupts are disabled. If we did this, then the kernel should be safe for fbt::*: probes and never deadlock. Most interesting scenarios are in the non-locked kernel regions.

But that's bizarre! How could we do this? That defeats one of the deep probing aspects of DTrace.

Well, to resolve this conflict of interest, we can have DTrace run in 'safe' mode by default, and when the user wants to remove this safety barrier, they can do so by sending a message to the driver.
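The idea can be sketched as a small gate in front of every probe firing. This is a Python toy model, not the driver code: the ProbeGate class and the "safe=off" / "safe=on" message strings are invented for the illustration, and the real control interface may look nothing like this.

```python
class ProbeGate:
    """Toy model: drop probe firings that happen with interrupts disabled."""

    def __init__(self):
        self.safe_mode = True   # safe by default
        self.dropped = 0        # firings skipped because of safe mode

    def control(self, message):
        # Hypothetical message format, invented for this sketch.
        if message == "safe=off":
            self.safe_mode = False
        elif message == "safe=on":
            self.safe_mode = True
        else:
            raise ValueError("unknown control message: %r" % message)

    def should_fire(self, irqs_enabled):
        """Decide whether a probe hit in this context is allowed to fire."""
        if self.safe_mode and not irqs_enabled:
            self.dropped += 1
            return False
        return True


gate = ProbeGate()
print(gate.should_fire(irqs_enabled=False))  # safe mode: probe is skipped
gate.control("safe=off")
print(gate.should_fire(irqs_enabled=False))  # user removed the safety barrier
```

Counting the dropped firings is a design choice worth considering: it lets the user see how much is being missed before deciding to turn safe mode off.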

And this is what I am going to do for the next release.


Posted at 21:01:08 by fox | Permalink
  Xen progress Sunday, 18 November 2012  
I think I have finally fixed the Xen issues. After slowly wading through the basics of getting the syscall provider, the fbt provider, and GPF's in kernel space resolved, it appears to work.

Life has been difficult because the key to the issues revolves around paravirtualising certain instructions, and also the way interrupts are handled. The normal interrupt routine sequence doesn't work for INT1, INT3 and PageFault interrupts.

What Linux does is provide two distinct interrupt routines - one for normal hardware, and one for Xen. At a low level in the IDT handler, it decides which one to use. (This is buried in the Xen handler for write_idt_entry()).

Part of the work is to have a similar mechanism - autodetect if we are on a Xen host, and use the correct interrupt handlers. Fortunately, the code in intr_x86-64.S is amenable to parameterisation via the macro assembler, so the code for all the interrupts is one macro, with some conditional assembly.

Another problem area is that when the system dies hard, Xen is very unforgiving: it reports an issue but gives no easy way to diagnose, in simple terms, what happened. (Before I modified the page fault handler to use the correct Xen calling sequence, Xen would kill the guest due to issues in the page table; this appears to be bogus, and not the true cause of the issue, which was that the interrupt stack wasn't correct).

I now need to merge the Xen changes back to the mainline code, and check it still compiles/works on the older kernels.


Posted at 20:38:53 by fox | Permalink
  Xen progress Saturday, 10 November 2012  
I think I have the FBT provider working now on a Xen guest. As usual, one has to do "everything right", and "nearly-right" is not good enough.

Unfortunately, knowing what "right" is, is difficult!

BTW, I want to complain about google.com. Here is a fascinating in-depth article, written by authors of VMware, which gently leads you through how VMware works and VM monitors in general.

web.mit.edu/6.033/www/papers/agesen.pdf

My complaint is that google wraps all links, so you can't just take the URL of a page you jumped to, but need to copy the link from the results page (which is not in a form that is an http embeddable link).

The article above hinted at a problem I was seeing in getting FBT to work *reliably*. FBT uses the INT3 and INT1 interrupt traps. I have had to rework the interrupt handlers to be CONFIG_PARAVIRT compliant (and need to rework again so that the code works on a non-CONFIG_PARAVIRT kernel). Anyhow, a dtrace like:

$ dtrace -n fbt::sys_*:

would run for about 30,000-60,000 syscalls, and then a problem would arise. When doing FBT, we replace the instruction with a breakpoint instruction (INT3), single step the replaced instruction, and then resume execution.

What appears to happen is that occasionally the single-step trap would not fire. Execution of the copied instruction, which we single-stepped, would continue past it, resulting in a kernel page_fault. Now this is strange, because in the copy-buffer, we have:

original-instruction, nop

The nop should never be executed because of the single-step mode; placing a NOP after the valid instruction seems like good practice (rather than random junk), because otherwise the CPU may fetch ahead and try to decode a junk instruction, even if it is not executed.
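As a rough picture of the patching step, here is a Python sketch that works on a plain bytearray standing in for kernel text. It is only a model under stated assumptions: arm_probe() and disarm_probe() are made-up names, the caller is assumed to already know the instruction length, and the real driver obviously has to deal with page protections and instruction decoding that are ignored here.

```python
INT3 = 0xCC  # x86 breakpoint opcode
NOP = 0x90   # x86 no-op opcode


def arm_probe(text, offset, insn_len, pad_nops=1):
    """Patch INT3 over the first byte of the instruction at `offset`.

    Returns the copy buffer: the original instruction bytes followed by
    `pad_nops` NOPs, which is what gets single-stepped later.
    """
    original = bytes(text[offset:offset + insn_len])
    text[offset] = INT3
    return original + bytes([NOP] * pad_nops)


def disarm_probe(text, offset, copy_buffer, insn_len):
    """Restore the original instruction bytes from the copy buffer."""
    text[offset:offset + insn_len] = copy_buffer[:insn_len]


# "push %rbp; mov %rsp,%rbp" as a stand-in function prologue.
text = bytearray([0x55, 0x48, 0x89, 0xE5])
buf = arm_probe(text, 0, insn_len=1)
print(text.hex(), buf.hex())
disarm_probe(text, 0, buf, insn_len=1)
print(text.hex())
```

The pad_nops parameter makes it easy to experiment with how much padding follows the copied instruction.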

I tried the following:

original-instruction, nop, nop

Two nops after the instruction, and it ran much better; the first time, it got to nearly 1,000,000 traces; I then killed it, removed some of my debug, and ran again. Whilst writing this article, it got to about 750,000 traces, but the same thing happened. Here's the dump from the kernel:

[26881.316402] Call Trace:
[26881.316412]  [<ffffffff81664a82>] ? system_call_fastpath+0x16/0x1b
[26881.316417] Code: ff ff ff ff 00 00 00 00 00 10 00 81 ff ff ff ff 
00 10 00 81 ff ff ff ff 55 6e 10 00 03 8e 03 a0 ff ff ff ff 00 00 
00 00 55 90 90 <00> 00 ...

Opcode 55 is "PUSH %RBP"; NOP is 0x90. The instruction after the second 0x90, is an instruction which causes a kernel page fault. So, the cpu went marching on ahead and fell over...despite the trap flag being set.
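Picking the faulting byte out of a "Code:" line by eye gets tedious, so here is a small Python sketch that does it. The parse_code_line() helper is invented for this illustration; it assumes the usual convention that the byte at the faulting address is wrapped in angle brackets, and the sample is just the tail end of the dump above.

```python
def parse_code_line(text):
    """Parse an oops 'Code:' line into (bytes_as_ints, index_of_faulting_byte)."""
    data = []
    fault_index = None
    for tok in text.replace("Code:", "").split():
        marked = tok.startswith("<") and tok.endswith(">")
        hex_part = tok[1:-1] if marked else tok
        try:
            value = int(hex_part, 16)
        except ValueError:
            continue  # skip non-hex tokens such as a trailing "..."
        if marked:
            fault_index = len(data)
        data.append(value)
    return data, fault_index


tail = "Code: 00 00 55 90 90 <00> 00 ..."
data, fault = parse_code_line(tail)
before = data[fault - 3:fault]
print([hex(b) for b in before], hex(data[fault]))
```

The three bytes in front of the marked one are the copied PUSH and the two NOPs, so the CPU was fetching the byte immediately after the copy buffer contents when it faulted.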

At the moment, this is weird, and it looks like Xen is not honoring an IRET (or the hypervisor call equivalent) every time, and ignoring the single step mode.

Now off to do more research ... maybe it's a known bug in Xen, or maybe I have not honored one of the rules (but the "rules" aren't written down anywhere :-) ).


Posted at 22:47:04 by fox | Permalink
  Xen on VirtualBox... Monday, 05 November 2012  
Running Xen->Ubuntu 12.04 has the undesired side effect of stopping VirtualBox from working, which means side by side debugging is a pain, as I need to reboot, and swsusp stops working on my main machine.

So, now I have Ubuntu 12.04 running inside VirtualBox, and inside that, I am now creating a Xen guest running Ubuntu 12.04 (yes, that's 12.04 inside 12.04 inside 12.04). It's kind of mind boggling but hopefully I can try and debug the dtrace issues.


Posted at 22:24:31 by fox | Permalink
  Xen blog ... strangeness Sunday, 04 November 2012  
I am finding that running DTrace in a Xen guest is a painful thing to debug. I haven't managed to get a decent debugger to help diagnose the issue I am currently investigating, but thought it worth writing up. This might help jog my own memory.

I have DTrace working with the various key interrupts (INT1, INT3) and in trying to get the page_fault handler to work, keep breaking the guest. We want the page_fault handler so that DTrace can intercept certain locations within itself, where a user D script might dereference memory incorrectly. Consider:

$ dtrace -n 'syscall::: { printf("%s", stringof(arg0)); }'

when the arg0 to a syscall is not a string pointer, we will get a warning from DTrace about a bad memory reference. (Technically, the kernel generates a GPF but we save ourselves from panicking the kernel).

What is special about the page_fault handler compared to, say, INT1 (single step interrupt)? I don't know.

Looking at the kernel code and google searching is not helpful at all. Let's ignore Xen and just visit some basics of assembler.

In assembler, we have subroutines - a CALL instruction jumps to the target subroutine, and the return address is on the stack. The simplest subroutine is:

func:
	ret // for an interrupt routine, this is an IRET instruction

An interrupt handler has to be careful to preserve all registers as it does its stuff. (In user land we have to be careful too, but we have some registers we can use without having to save them, such as the incoming arg list).

So let's modify the above function, and do something as a no-op:

// Example 1
func:
	push %rax
	pop %rax
	ret

This will crash the Xen guest. The following will not:

// Example 2
silly:
	ret

func:
	call silly
	call silly
	ret

What's the difference between example 1 and 2? I don't know. If I look at example 1, I might hazard a guess that we have an invalid stack, or a non-writable stack. But example 2 seems to work - we write to the stack to call function silly and return.

In the actual Linux page fault handler, it does something slightly weird, along the lines of:

page_fault:
	call *xen_handler // see below
	sub $0x78,%rsp
	call save_regs
	...

save_regs:
	cld
	mov %rdi,0x78(%rsp)
	mov %rsi,0x70(%rsp)
	mov %rdx,0x68(%rsp)
	mov %rcx,0x60(%rsp)
	...
	ret

It's a strange sequence - the initial "sub $0x78,%rsp" decrements the stack pointer, leaving room on the stack for the registers, and calls a subroutine to populate the saved area, rather than a sequence of "push/push/push.." instructions. The kernel is like this with or without Xen, and possibly this is a good thing to do for various reasons.

Now "xen_handler" is a very interesting function; firstly, it's not a function but a pointer to a function. I think it's like this because the same kernel can be a Xen guest or running native, so the target function is either a no-op or some actual code. Inside a Xen guest, the eventual function is:

   0xffffffff8100aae0:  mov    0x8(%rsp),%rcx
   0xffffffff8100aae5:  mov    0x10(%rsp),%r11
   0xffffffff8100aaea:  retq   $0x10

That is a very weird function. Examination of the entry_64.S file in the kernel, shows that registers %RCX and %R11 need to be extracted - the Xen hypervisor is pushing these registers on the stack in addition to the normal semantics of a page fault. The "retq $0x10" is returning from the subroutine, and also *removing* the two extra registers.
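One way to check this reading of the function is to model the stack as a list and walk through the three instructions. The Python below is only a sketch: the function names and the frame labels are made up for the illustration, and index 0 of the list stands for the word at (%rsp).

```python
def call_xen_handler(stack):
    """Model 'call fn' where fn is:
           mov 0x8(%rsp),%rcx ; mov 0x10(%rsp),%r11 ; retq $0x10
    `stack[0]` is the word at (%rsp) in the caller, just before the call.
    """
    s = ["return-address"] + list(stack)  # CALL pushes the return address
    rcx = s[1]   # 0x8(%rsp): first word above the return address
    r11 = s[2]   # 0x10(%rsp): second word above the return address
    s = s[1:]    # RET pops the return address...
    s = s[2:]    # ...and the $0x10 operand discards 16 more bytes (two words)
    return rcx, r11, s


def pop_pop(stack):
    """Model 'pop %rcx ; pop %r11' executed directly by the caller."""
    s = list(stack)
    rcx = s.pop(0)
    r11 = s.pop(0)
    return rcx, r11, s


# Illustrative frame: two extra words on top of a page fault frame.
frame = ["saved-rcx", "saved-r11", "error-code", "rip", "cs", "rflags"]
print(call_xen_handler(frame))
print(pop_pop(frame))
```

In this model the call leaves the registers and the stack in the same state as two plain pops would, so at the level of register and stack contents the two are interchangeable.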

Let's rewrite the code:

page_fault:
	call xen_pop
	sub $0x78,%rsp
	...

xen_pop:
	mov 0x8(%rsp),%rcx
	mov 0x10(%rsp),%r11
	retq $0x10

By simplification, this becomes:

page_fault:
	pop %rcx
	pop %r11
	sub $0x78,%rsp
	...

But this appears not to work. It looks like the Xen hypervisor knows something about the code in a page fault handler, and unless the code obeys what it is expecting, we get a guest reboot.

Debugging here is very difficult - when things are wrong, the guest reboots, with very few, if any, console messages. Various web references point to debug tools which aren't available in the Ubuntu apt cache.


Posted at 18:02:31 by fox | Permalink
  DTrace and Xen...continued Thursday, 01 November 2012  
Work continues on the Xen guest Linux kernel with dtrace. Progress is "middling".

As I recounted in a prior blog entry, there are a number of steps to getting this to work, and it mostly works, but the quality is not what I want in a usable driver...although I may release it sub-par.

Firstly, the syscall provider works. This took some work to get the page tables to be writable - using the correct page table APIs, which in turn map down to the Xen hypervisor calls. A Xen guest is significantly different from a genuine CPU.

A Xen hypervisor call is like a system call, using a special gateway to the hypervisor, and allows the hypervisor and guest to make RPC-like calls. Things like page table modifications, APIC and privileged instruction emulation go through this layer.

This in turn presents a couple of issues. Firstly, the fbt provider is having difficulty doing "fbt:::" where we trap every function in the kernel - the paravirt/hypercall functions must not be intercepted since they are (possibly) needed to take trap calls. In theory this is workaroundable by either excluding them from being probe points (which would be a shame), or by detecting the recursion and auto-disabling them (which would allow some hypercalls to be monitored).
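The second option, detecting the recursion and auto-disabling the offending probes, might look something like this toy Python model. It is a sketch under loud assumptions: the Tracer class and the function names passed to it are hypothetical, and a real implementation would need a per-CPU flag rather than a single one.

```python
class Tracer:
    """Toy model: disable any probe that fires while another probe is running."""

    def __init__(self):
        self.in_probe = False    # stands in for a per-CPU "inside a probe" flag
        self.disabled = set()    # probes auto-disabled after recursing
        self.fired = []          # probes that were actually traced

    def fire(self, func, body=None):
        if func in self.disabled:
            return
        if self.in_probe:
            # Re-entered: this function is needed by the probe path itself.
            self.disabled.add(func)
            return
        self.in_probe = True
        try:
            self.fired.append(func)
            if body is not None:
                body()           # work done while handling the probe
        finally:
            self.in_probe = False


tracer = Tracer()
# Handling the sys_read probe ends up calling a (hypothetical) hypercall helper.
tracer.fire("sys_read", body=lambda: tracer.fire("xen_hypercall_helper"))
tracer.fire("xen_hypercall_helper")   # now disabled, so nothing is recorded
print(tracer.fired, sorted(tracer.disabled))
```

Only the functions that are actually hit from inside probe handling get disabled, so the rest of the hypercall layer would stay traceable.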

The other problem area is multiple CPUs. When we have multiple CPUs, dtrace invokes the APIC inter-CPU calls to do RPCs to synchronise the CPUs. There is no APIC in a Xen guest, or rather there is a very fake one. My DTrace code implements IPI calls in parallel to the kernel's, rather than relying on the kernel support, so that we don't deadlock and so that we can trace the kernel's use of these calls.

With IPI calls in a Xen guest, there is a lot of reliance on function calls to handle the hypervisor communication. The IPI calls in a standard kernel are the lowest level of operation of the kernel and CPU, implemented using the NMI interrupt.

The standard smp_call_function() family of functions can be used in the Xen dtrace, but it possibly exposes a race condition (I have yet to torture test, but it seems to be easily exposed without torture testing).

So, it's a bit like porting to a totally different CPU architecture, and I need to understand these pieces a little more.

Once the above issues are resolved, then I need to validate it isn't broken on older/pre-Xen kernels.

But the end result is being able to use it on the Cloud (e.g. Amazon EC2), so it's definitely an interesting project.


Posted at 23:14:48 by fox | Permalink