Driver unloading .. follow up | Saturday, 28 July 2012 |
The following line
.owner = THIS_MODULE,
is the magic which ensures the module's reference count is incremented on every open, so you cannot unload the module whilst it is in use.
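For context, here is a minimal sketch of where that line lives. The device callbacks and names are hypothetical, but the .owner field is the standard idiom: the VFS takes a reference on the module for every open file and drops it on release, so the driver never does that bookkeeping itself.

#include <linux/fs.h>
#include <linux/module.h>

static int example_open(struct inode *inode, struct file *file)
{
        return 0;
}

static const struct file_operations example_fops = {
        .owner = THIS_MODULE,   /* VFS pins the module for every open fd */
        .open  = example_open,
};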
For various reasons, many drivers didn't have this - probably an issue with out-of-date documentation for the older kernels, which predate this feature.
This link:
http://stackoverflow.com/questions/1741415/linux-kernel-modules-when-to-use-try-module-get-module-put
showed me the error of my ways.
This should stop me accidentally unloading the driver when some user space process still has a reference to the driver(s).
Driver unloading | Saturday, 28 July 2012 |
Pretty simple. "lsmod" can show drivers loaded (usually a lot), and many are unused.
It makes sense that you cannot unload a device driver if it's in use, especially if the device driver implements a filesystem.
Sometimes, the physical world is cruel. No amount of clever coding in the kernel can prevent you from physically/forcibly removing a floppy disk or memory card. And the modern kernel can handle such horrible brute-force actions.
Let's switch to dtrace. We don't want to unload dtrace whilst it's in use. All the code is there (thank you Sun). The Linux port makes various checks before allowing the driver to be unloaded.
[Most people don't care about this, but as a developer, I want to load and unload the driver frequently, without having to reboot.]
So, there are two types of "in-use" state in dtrace: you are running dtrace waiting for probes, or you have placed a PID/USDT provider probe.
In the first case, I can make sure I am not running dtrace when reloading.
In the second, well, things can go awry. If you use:
$ dtrace -c cmd -n ....
Then there are two active parts of dtrace: the dtrace process itself, and the cmd or process being traced. (Remember, user-level probes trap into the kernel.)
If we somehow kill -9 the dtrace, then dtrace will leave the probes intact until the process exits. If the process hits a probe, the probe is redundant - nobody is listening for it.
In reality, we can unload dtrace whilst probes are active in a user process - it will terminate with SIGTRAP when the first probe is hit.
What I *didn't realise* (because I am definitely stupid) is that the module unload code in a driver is a "void" function. It cannot prevent itself being unloaded. Once the kernel wants to unload you, it will happen. And if you don't clean up properly, your kernel is likely to have a problem (GPF or panic).
Dtrace will crash if the device driver is open/in-use, because although it tries to prevent unloading of the driver, nobody is listening. Duh!
Ok, so we can probably just let dtrace unload and stop worrying.
Or we could prevent unloading whilst active probes exist. After some investigation, the kernel function try_module_get() is the function to implement the driver's in-use count (as seen by lsmod). Interestingly, it is rarely called. It is *not* called simply because you opened the device, e.g.
$ sleep 100
It's typically incremented for executables coming from the filesystem. I don't think it's called because a file is open. (Maybe we can panic the system if we hold on to devices which are unloaded?)
(It might be possible to modify the module reference count on open + close, but this is almost certainly impossible to get right; consider what happens on a fork or dup system call - file descriptors can be cloned, but the underlying driver will never know that).
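For reference, the manual pattern would look roughly like this - a minimal sketch only, using the real try_module_get()/module_put() calls, and subject to exactly the fork/dup bookkeeping problem just described:

#include <linux/fs.h>
#include <linux/module.h>

static int example_open(struct inode *inode, struct file *file)
{
        if (!try_module_get(THIS_MODULE))       /* fails if we are mid-unload */
                return -ENODEV;
        return 0;
}

static int example_release(struct inode *inode, struct file *file)
{
        module_put(THIS_MODULE);        /* must balance every successful open */
        return 0;
}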
[And why do I care? Because as I play with the PID provider and add new probes, I keep crashing the kernel if dtrace is running or hung. Normal users shouldn't care about reloading dtrace.]
PID provider...update | Friday, 27 July 2012 |
I've spent a fair bit of time ironing out portability issues introduced in the latest releases (they affected older kernels), and lots of GPL-only symbol issues, nearly all of which are now worked around.
The GPL issues are interesting; I try to stay away from the politics. The GPL and Linux world has its own interests to look after, and given the legal implications of licence pollution, the kernel folk need to be careful in defending themselves. DTrace comes under the CDDL. From my perspective, it's a bunch of source code, and people are welcome to it. But the CDDL can give rise to closed-source derivatives, which is a shame, but understandable.
Anyhow, one of the key issues was that "dtrace -c cmd" did not work. I hadn't tested that.
I also found that the user-land dtrace would hang when trying to attach to a suspended process (e.g. a backgrounded app waiting for terminal input).
I decided I had to move past the ptrace() interface - it's too limiting to allow dtrace to do what it wants.
A while back I had created the /dev/dtrace_ctl driver - it was designed to emulate the Solaris /proc interface, but I shied away from doing that, because it would have been a distraction: emulating lots of Solaris corner cases is made difficult because the Linux kernel is not the Solaris kernel. Avoiding it was a good move. (And someone suggested I not do it, so it wasn't my idea.)
What I did do was resurrect the code in driver/ctl.c and make it into a read/write interface for processes - doing what ptrace() does, but without the semantics of ptrace. Switching dtrace to use this interface instead of ptrace() immediately fixed the stopped-process problem and allowed "dtrace -c cmd" to work.
I have some bugs to fix and another kernel/GPL issue to resolve, but hope to release later today or over the weekend.
I had put off doing the PID provider, because I knew it was going to be difficult - but I was lucky. The original Solaris code works a treat, and the only problem was me reading too much into the complex code.
More later....
Hey VirtualBox...what ya' doin' ?! | Thursday, 26 July 2012 |
I have a nice collection of VMs for lots of versions of Linux, going back to very old releases - originally under VMware, but over the last few years, VirtualBox. I like VirtualBox, but I also have problems with it.
VMware (vmplayer) is nice, but very limited. I got bored of VMware not installing on new kernels, and requiring hacky patches to its source code to make it work.
But I am somewhat annoyed that many of my VMs seem to have "broken" or "gone off". E.g. CentOS 4.7, 5.5, 5.6 - they no longer boot under VirtualBox. I have tried various things - they used to work, but no longer do. So, my supply of VMs is limited. (Each VM has my set of customisations, to make it comfortable to log in to.) It's a nuisance having to set them up again, or try to guess why an old kernel no longer works.
I may have to go back to VMware, or try out KVM, to see which works best for me.
This really is a big problem - maybe not recognised by the industry - but a VM which stops being usable, due to a host upgrade or VM-software upgrade, really diminishes the value of having VMs in the first place.
It might be that I could downgrade my VirtualBox to restore the older VMs, but this is turning into a job-creation scheme, rather than a productivity boost.
I really dislike VirtualBox's nested-snapshot mechanism - despite its power, it's confusing - very confusing - and you can end up reverting a snapshot and losing a lot of data. VMware's snapshot/restore was much simpler to get along with.
PID Provider: Did you call? #4 | Sunday, 22 July 2012 |
Creating a kernel thread in Linux is easy. But I immediately slammed into some issues.
Much of the "workqueue" API for doing this is GPL protected. DTrace is a CDDL driver, and attempts to compile or link against these GPL protected functions caused errors. I found a workaround, similar to the dynamic symbol lookup already in dtrace. The implementation of this is slightly ugly due to the functions I wanted being embedded in #define macros. I didnt want to replicate the macros directly to modify them, as this makes the code frail and subject to breakage in future kernels.
Additionally, the calling sequence of one of the functions changed in recent kernels (3.2 .. 3.4). This means I have to be really careful. I worked around it with a tiny piece of assembler.
But from what I read, the workqueue API is a relatively recent addition to the kernel: it appeared in the 2.5 kernels and changed substantially in 2.6.20. So it's possible that the code I have, which compiles for later kernels, will fail abysmally on older kernels. The community will need to feed back, or we will have to disable the PID provider for older kernels.
So, we are done! PID provider works.
Well, I say it works .. it works for a sample app of mine. It needs a lot more testing, and I daresay reported breakage will be difficult to debug. The good thing is that I made almost zero changes to the Solaris code - only fixing some glue code, and making some changes in libdtrace.
If you have read all of these blog excerpts and understood them, good for you. I learnt a lot debugging this, and I feel more confident about how dtrace works architecturally, and about the changes I have made.
There's still a long road ahead to torture-test the PID provider.
And I need to rewrite the libdtrace process read/write, to avoid the ptrace() issue or avoid leaving a process in the stopped state.
I plan to release the code - once I have done a little cleanup, later today (20120722).
PID Provider: Did you call? #3 | Sunday, 22 July 2012 |
When I ^C'ed the dtrace process, I would often panic the kernel .. badly. A slurry of messages scrolled on the screen telling me an atomic condition was broken.
Huh?
The act of tearing down the potentially many probes in a process takes long enough that various windows of vulnerability exist. If you kill -9 the dtrace process, it will tear down the probes, and it will do so with various mutexes held. If fasttrap tries to dismantle its version of the probes at the same time, a deadlock can arise.
So the fasttrap code, during teardown, uses an optimistic timer to take out the probes (tracepoints). The mechanism is a classic kernel function - timeout(). Up until two weeks ago, timeout() in dtrace4linux was a stub implementation.
I had to implement timeout() and quickly knocked up some code, based on the hrtimer mechanisms in the kernel.
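The shape of it is roughly this - a minimal one-shot timeout() built on an hrtimer (my own names; the real driver code carries more state and cancellation handling):

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer to_timer;
static void (*to_func)(void *);
static void *to_arg;

static enum hrtimer_restart to_fire(struct hrtimer *t)
{
        to_func(to_arg);        /* NB: runs in timer context, not process context */
        return HRTIMER_NORESTART;
}

/* One-shot: call func(arg) after ns nanoseconds. */
static void my_timeout(void (*func)(void *), void *arg, unsigned long ns)
{
        to_func = func;
        to_arg  = arg;
        hrtimer_init(&to_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        to_timer.function = to_fire;
        hrtimer_start(&to_timer, ktime_set(0, ns), HRTIMER_MODE_REL);
}

The comment in to_fire() is the whole story of what follows: the callback interrupts whatever happens to be running at the time.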
This caused me no end of issues, and it took me ages to understand what was going on.
As dtrace was closing the /dev/dtrace device, tearing down the probes it had set, the timer would fire and interrupt the closing dtrace. The timeout function would dismantle the fasttrap probes and assert a mutex held by the terminating dtrace process. Classic deadlock. It also showed up a potential problem in my code (driver/mutex.c), which attempts to call the scheduler if a mutex appears stuck (which led to the kernel issues, since calling the scheduler from a timeout is not the correct thing to do).
I checked the Solaris code, to remind myself how timeout() works. What I found was interesting. A timer interrupt in Solaris doesn't just fire and interrupt the current process. It's fired from a special context, effectively interrupting a dummy process. This resolves the deadlock - a timer can never interrupt a mutex-protected block of code, because it interrupts in the context of another process. So the original process can make progress, release the lock, and allow the timeout to make progress.
We are almost done.
There is a piece of code in driver/dtrace.c, which I never understood, and had commented out:
dtrace_taskq = taskq_create("dtrace_taskq", 1, maxclsyspri, 1, INT_MAX, 0);
It hasn't harmed dtrace4linux having that commented out. The reason I was looking was that, by examining /proc/dtrace/fasttrap, I could see what a PID probe looked like. When the target process and dtrace terminated, these entries were not being cleaned up. fasttrap.c does garbage collection, but it wasn't clear how this happens when locks prevent progress. Function dtrace_unregister() calls this function to actually remove one of these fasttrap probes:
(void) taskq_dispatch(dtrace_taskq, (task_func_t *)dtrace_enabling_reap, NULL, TQ_SLEEP);
What does that mean? I didn't know. I searched Google and was none the wiser. But it slowly dawned on me.
Ever done a ps on Linux and seen entries like this?
$ ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:01 /sbin/init
    2 ?        S      0:00 [kthreadd]
    3 ?        S      0:02 [ksoftirqd/0]
    6 ?        S      0:00 [migration/0]
    7 ?        S      0:00 [watchdog/0]
Those processes in square brackets are kernel processes. This is what taskq_create() is doing - creating a kernel process. It is a tiny part of dtrace, but a very, VERY important part! To avoid timers deadlocking with user processes due to mutex contention, we need the timers to fire from a process which cannot possibly be running dtrace. So taskq_create() creates a kernel process, and when the kernel cannot free a probe (because it looks like it is in use), a timer is fired to retry the cancellation of the probe.
So I now needed to implement taskq_create() on Linux. A quick Google search and I found what I wanted - "workqueue"s. This is the mechanism to create a kernel process and asynchronously handle callbacks. A quick piece of coding, and it was looking good.
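Roughly, the mapping looks like this - a sketch of taskq_create()/taskq_dispatch() over the workqueue API, with my own names, the unused Solaris arguments dropped, and using the very entry points whose GPL-only status bites in part 4:

#include <linux/workqueue.h>
#include <linux/slab.h>

struct taskq_job {
        struct work_struct work;
        void (*func)(void *);
        void *arg;
};

static void taskq_trampoline(struct work_struct *w)
{
        struct taskq_job *j = container_of(w, struct taskq_job, work);

        j->func(j->arg);        /* runs in the kernel thread, not a user process */
        kfree(j);
}

struct workqueue_struct *my_taskq_create(const char *name)
{
        /* This creates the [name] kernel thread visible in ps. */
        return create_singlethread_workqueue(name);
}

int my_taskq_dispatch(struct workqueue_struct *tq, void (*func)(void *), void *arg)
{
        struct taskq_job *j = kmalloc(sizeof(*j), GFP_KERNEL);

        if (j == NULL)
                return -ENOMEM;
        j->func = func;
        j->arg  = arg;
        INIT_WORK(&j->work, taskq_trampoline);
        return queue_work(tq, &j->work);
}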
Continued in part 4....
PID Provider: Did you call? #2 | Sunday, 22 July 2012 |
When the breakpoint fires, fasttrap (the PID provider) looks in its data structures, maps the trap back to a probe, and logs a record for the originating dtrace user.
But we have to get past the breakpoint. What the code in fasttrap does is similar to the in-kernel code. We single step the instruction which was patched. This is quite complex, because, remember, the original instruction has a breakpoint placed on top of it. So, fasttrap arranges to single step the instruction in a scratch buffer.
It took me ages to "get this". What scratch buffer? When doing in-kernel probes, dtrace4linux has a per-cpu scratch buffer for this purpose. But we cannot use that here, for two reasons. Reason 1: it's not visible to the process in user space. Reason 2: processes may be preempted, so we cannot guarantee that the scratch buffer would remain unscathed until one process completes the action, before another process steps on the same thing.
I spent a long time looking at this, trying to figure out how Apple/FreeBSD/Solaris do it. On Solaris, each thread in the system has a scratch buffer in the in-kernel lwp_t structure. Ah! We cannot force this on Linux without rebuilding the kernel, but we just need a private area to dump the scratch instruction into. I looked at the idea of jamming a 4K page into the address space of the process, or leveraging the VDSO system-call page, but both require some thought, because of the need to garbage-collect, or to avoid problems with other users of the area. In the end I decided that the current thread's stack is a good place to do this. Most threads have 1-10MB of stack, and rarely use more than a small fraction. In fact, nobody uses the bottom area of the stack, since doing so might expose the application to random segmentation violations as it runs out of stack. So stack space is allocated to be much larger than any part of a process needs.
So, we can just use the area below the stack. This isn't ideal, but it's simple. It isn't ideal because a process which does not obey the normal stack-frame rules might be perturbed by what we are doing. Tough. :-)
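In code, the idea is roughly this - a hypothetical sketch (my names, an arbitrary offset, x86-64 pt_regs fields), not the actual fasttrap_isa.c implementation:

#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/ptrace.h>

/* Copy the original (patched-over) instruction to a scratch area below
   the user stack pointer, and point the process there to single-step it. */
static int plant_scratch_instr(struct pt_regs *regs, const u8 *orig, size_t len)
{
        unsigned long scratch = regs->sp - 512;         /* below the live stack */

        if (copy_to_user((void __user *)scratch, orig, len))
                return -EFAULT;
        regs->ip = scratch;     /* resume from the scratch copy */
        return 0;
}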
Another long-standing problem I had was figuring out what actually happens during a PID probe. When you use dtrace to plant a probe, the PID provider dynamically constructs a probe for you. (You can see the probes, e.g. pidNNN:::.) These probes disappear when either the target disappears or the probing dtrace disappears.
But how? I spent ages looking at the code. I added debug output to /proc/dtrace/fasttrap. Whilst the PID provider is in action, you can see the three tables which fasttrap keeps (sketched just after this list):
- 1. A table of the probes themselves (tracepoints). Used to map a trapping probe back to the owner process
- 2. A table of processes being probe-provided. When the target process terminates, the owning tracepoints are dismantled.
- 3. A "provider" table. When you attach to a process to probe it, probes are created, but the probes belong to a provider (eg "fbt", "syscall", or "pidNNN"). Each process you attach to is effectively a brand new provider.
Now, I found out some other interesting things. If you probe a process, and that process dies, your dtrace does not terminate (unless you make provision for this in your D script). The dtrace hangs, and it's up to you to ^C it.
Here's an example of the trace tables in fasttrap:
$ cat /proc/dtrace/fasttrap
tpoints=1024 procs=256 provs=256 total=9
# PID VirtAddr
TRCP 5748 000000000040087d
TRCP 5748 000000000040087e
TRCP 5748 0000000000400881
TRCP 5748 0000000000400886
TRCP 5748 000000000040088b
TRCP 5748 0000000000400890
TRCP 5748 0000000000400891
PROV 5748 pid 0 0 9 0 0
PROC 5748 1 1
"5748" is the target PID I was tracing. The TRCP entries show the virtual addresses in the target process where probes lie in wait (one for each instruction of the "do_nothing2" function I attached to). The other fields in the tables are not really interesting (look at the source code to see what they are; I may fix the output to make it more self-describing). The output from /proc/dtrace/fasttrap is three table dumps (the header line above does not reflect that).
Once I had this "view" of what the provider was doing, I could immediately go fix another issue.
I had a lot of trouble with killing the dtrace and the kernel panicking.
Continued in the next blog entry...
PID Provider: Did you call? | Sunday, 22 July 2012 |
dtrace: description 'pid5748::do_nothing2:' matched 9 probes
CPU     ID                    FUNCTION:NAME
  1 250496            do_nothing2:entry
  1 250497                do_nothing2:0
  1 250498                do_nothing2:1
  1 250499                do_nothing2:4
  1 250500                do_nothing2:9
  1 250501                do_nothing2:e
  1 250494               do_nothing2:13
  1 250502               do_nothing2:14
  1 250495           do_nothing2:return
I spent a couple of weeks getting this going, and it's looking positive now. Let me recount the issues.
First, what is the PID provider? What is USDT? What is a normal probe?
Hopefully we all understand a normal probe - it's effectively a breakpoint placed in the kernel, e.g. via the FBT provider. (Syscall tracing doesn't rely on breakpoints, but that doesn't really matter.) When the breakpoint is hit, dtrace maps that to a user-space caller who is waiting for the event.
USDT is similar to the normal in-kernel providers, but the probes occur in user space. They occur because someone put the probe in their code (e.g. in the interpreter loop in Perl or Python, or in malloc() in libc). Because dtrace for Linux isn't widespread, few, if any, Linux apps have user-space probes.
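For illustration, a hand-placed USDT probe looks like this in application source (the provider and probe names are made up; on a system with USDT support, sys/sdt.h supplies the macro, and the double underscore becomes a hyphen in the probe name):

#include <sys/sdt.h>

int main(void)
{
        int i;

        for (i = 0; i < 10; i++) {
                /* Fires the myapp*:::loop-iteration probe when enabled;
                   costs almost nothing when no consumer is attached. */
                DTRACE_PROBE1(myapp, loop__iteration, i);
        }
        return 0;
}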
The PID provider (also known as "fasttrap" for historical reasons; it is also the name of the source file implementing it) is very similar to USDT. But instead of manually littering source code with probe points, a user can drop a probe into a running process, e.g.
$ dtrace -n pid1234::malloc:entry
Along the way to getting the PID provider working, I found some interesting things.
First, although there is a lot of code in libdtrace to support the PID provider, it is actually a lot simpler than I thought. The act of placing a probe requires finding the address in the target process. Once located, a breakpoint instruction is placed at the probe address, or addresses. (The PID provider lets you instrument the entry, the return, or any/all instructions of the target function; in fact, it's very similar to the INSTR provider.)
dtrace itself doesn't need to ptrace(PTRACE_ATTACH) to the process, except to gain write access to the target process. On Solaris, using the /proc subsystem, ptrace() is not used. (Solaris allows two or more processes to debug a process at the same time, or at least to read/write memory and control the process; ptrace() does not.) Although libdtrace in my release uses ptrace(), this is a limitation which I plan to remove. The reason is that you cannot use dtrace to probe a process running under a debugger; Solaris lets you do this. It's a silly limitation of DTrace/Linux.
Another thing I found out which is very interesting is that you *CANNOT* do the following: put a probe on, for example, malloc, for every process in the system, including those which have yet to be created. If one examines:
$ dtrace -n fbt::function:entry
for probing the kernel, you get a hit no matter which process or interrupt causes the function to be called. But there is no syntax to support something like:
$ dtrace -n pid*::malloc:entry
When using the pid provider, we specify an actual PID, not all PIDs. Architecturally, dtrace cannot support this. When you put a PID probe in place, dtrace creates a new probe out of thin air - it's automatically registered in the kernel. These "auto" probes are removed when the target process or dtrace terminates.
I will continue this entry in the next blog piece...
DTrace PID Provider | Sunday, 15 July 2012 |
As I dive in - and it's a deep and scary place to investigate - it's slowly becoming clearer to me how it works.
The PID provider essentially does for processes what normal dtrace probes do for the kernel (it is very similar to the FBT provider).
There are a number of ways to come at the PID provider. Firstly, you could be launching an executable from inside dtrace; or you could be targeting a specific running process; or you could be looking for any process which hits, for example, a libc call.
This is all much more expensive than dealing with the kernel's symbol table, which, although large, is generally smaller than that of most executables, and relatively unchanging. Getting the correct symbol table of a running process involves examining /proc/pid/maps to find the mapped libraries, and then examining the process memory to find the symbols of interest.
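The first step - finding the mapped libraries - is easy enough to sketch in user space (the PID and the "libc" match string are illustrative):

#include <stdio.h>
#include <string.h>

/* Print every mapping of PID 1234 whose path mentions libc. */
int main(void)
{
        char line[512];
        FILE *fp = fopen("/proc/1234/maps", "r");

        if (fp == NULL)
                return 1;
        while (fgets(line, sizeof line, fp) != NULL) {
                if (strstr(line, "libc") != NULL)
                        fputs(line, stdout);
        }
        fclose(fp);
        return 0;
}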
Let's take an example:
$ dtrace -n pid1234::malloc:entry
We locate the process (pid 1234), find the library where malloc lives, find the address of malloc() inside the library, and then we *patch it*. The malloc entry instruction is replaced by a breakpoint instruction. Very similar to the kernel case.
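A sketch of that patch step, done the way the Linux port currently gains write access - via ptrace(); error handling is omitted, and the address would come from the symbol lookup described above:

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Replace the first byte at 'addr' in process 'pid' with INT3 (0xCC),
   returning the original word so the instruction can be restored later. */
static long plant_breakpoint(pid_t pid, unsigned long addr)
{
        long word;

        ptrace(PTRACE_ATTACH, pid, NULL, NULL);
        waitpid(pid, NULL, 0);                  /* wait for the attach stop */
        word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
        ptrace(PTRACE_POKETEXT, pid, (void *)addr,
               (void *)((word & ~0xffL) | 0xcc));
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return word;
}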
But before we do this, we need to tell the kernel that this breakpoint is a DTrace probe; that is handled by the fasttrap.c and fasttrap_isa.c code. Whilst the above dtrace is running, you can see this "on-demand" probe by examining "dtrace -l" or looking at /proc/dtrace/fbt.
Now, a number of things can happen: dtrace terminates, or the process terminates. If the process terminates, we need to rip out the probe, since dtrace has played with the process and knows this probe exists. The fasttrap provider intercepts the fork/exec/exit system calls and should undo the placed probe.
If dtrace exits, it should undo the patch to the process binary, restoring the original instruction, and remove the fasttrap/pid provider probes. (Confusingly, the fasttrap.c code contains both the USDT and PID providers - they are nearly identical, the difference being that for USDT, the process places its own probes, but for the PID provider, a copy of dtrace [i.e. another process] is doing it.)
So far, so good. My DTrace has had a number of bugs in the libdtrace code (not quite fixed, but getting there) which affected the ability to find and place the probes. We can now place the probes, and it's possible to see this happen. In the above example we used "entry", but we could have used "return", or just left the last field blank. (In which case every instruction of the function is defined as a probe point - a very good, if eventual, test case.)
So, eventually the application will hit the breakpoint, and the INT3 breakpoint handler will ask the fasttrap provider to handle the breakpoint.
At this point, things get a little confusing. DTrace is *not* using INT3, but INT 0x7E. INT 0x7E is a two-byte instruction, vs INT3, which is a one-byte one. DTrace (in libdtrace and fasttrap_isa.c) goes to great lengths to handle this by emulating the instruction which was overwritten. (This was in fact my original approach to single-stepping the kernel, but I gave up on it as too hard and a pain to debug; INT3 is a single-byte instruction, so it's easier to step over the instruction. But we mustn't temporarily reset the instruction to step over it, because another thread might hit the same instruction whilst the temporary instruction is in place, and miss the probe.)
Let's go over this again: if we overwrite a user instruction, we need to do it with a single-byte trap (INT3), because if we don't, and the target instruction is one byte long, we can corrupt the subsequent instruction.
We also have to be careful because the process may have multiple threads, running on different CPUs at the same time, which may hit the affected probe point.
I note that the Solaris dtrace code distinguishes an entry point from a return point and uses different INT traps to effect this. At present DTrace for Linux doesn't support these traps, but now that I have uncovered them, I need to understand more about *why* different trap types are used. From my work on the original kernel code, INT3 seems sufficient for all types of traps. (There is a potential issue that if we attach to a process in user space which is being debugged, we can get confusion about whether an INT3 is for dtrace or for the debugger.)
There are some other problematic areas: dtrace locks the process at certain key points, to avoid race conditions which could cause trouble (e.g. we mustn't allow a "kill -9" from someone else to kill the process we are trying to instrument). DTrace for Linux keeps shadow data structures for processes, which the real kernel knows nothing about. So, again, we have to be careful to keep up the "mirage" of security and safety.
I am going to fix the known areas at issue. I have already demonstrated (to myself!) that we can take a PID provider trap; releases up until today nearly do that, but they are missing a few fixes, which I will release when I am happy the next release is better than what is available on my site and github. Hopefully, a few days away.
First, I need to fix /proc/dtrace/fasttrap - I want to dump out the key internal data structures, mostly to prove to myself I understand them. The output will show the tracepoints and PIDs being monitored, but at the moment it is deficient, since it only shows USDT-placed probes.
new dtrace release - pid provider | Wednesday, 11 July 2012 |
That is where the PID provider interface comes to life!
There are a few other bug fixes in the area of userland dtrace getting to the PID provider, but it still doesn't work; at least now I can start to debug the missing piece(s).
If you look at /proc/dtrace/trace, after invoking the PID provider, you will see something like:
#0 4753:ESRCH dtrace_ioctl pvp=00000000
#0 4753:fasttrap_ioctl:2238: here
#0 4753:fasttrap_ioctl:2238: here
#0 4753:fasttrap_ioctl:2238: here
#0 4753:fasttrap_ioctl:2250: here
#0 4753:fasttrap_ioctl:2275: here
#0 4753:here in fasttrap_add_probe
#0 4753:fasttrap_provider_lookup: here
#0 4753:prfind 0
#0 4753:prfind:find_get_pid couldnt locate pid 0
This means we got as far as the driver, but there are a couple of blips. One is that libdtrace is passing down PID 0 no matter what PID you specify (something in the translation from Solaris /proc handling to Linux, where I have forgotten to set the PID); the other is "prfind()".
"prfind()" is the Solaris kernel function to look up a process by PID. That needs to be mapped to the Linux interface; but the code calling it relies on the Solaris "struct proc" layout and fields, so that code has to be walked through to make it do the right thing.
Once this is resolved (and there is some complexity in scheduler and process locking, since Linux/DTrace doesn't modify kernel code), then hopefully the PID provider can spring into action.
This is actually pretty good progress - having spent a long time getting the ELF handling to work, I hadn't realised that some code from near day zero was still a "TODO" item.
Excuse me? | Sunday, 08 July 2012 |
http://www.youtube.com/watch?v=y0C59pI_ypQ
So, I am adding TCP provider probes, and this is what happens on a connection (to an unopened port on the destination host):
$ build/dtrace -n tcp:::
dtrace: description 'tcp:::' matched 4 probes
CPU     ID   FUNCTION:NAME
  1     54   :connect-refused
  1     51   :state-change
  2     51   :state-change
  2     53   :connect-request
The connection is refused, and a little while later we make the connection attempt?! Something very strange is going on. Note that the events arrive on different CPUs, so it's possible there is an ordering problem between the CPUs, but that shouldn't normally happen.
Definitely some form of timing issue. If I connect to a remote or non-existent host, then it looks much "saner".
dtrace progress | Thursday, 05 July 2012 |
I am trying to get dtrace enhanced in some areas, but I am also having to revisit some of my ugly code and hacks - more a factor of the Linux kernel evolving. What works for today's kernels may not be true for older kernels, and it's difficult being careful not to break old or new kernels. I don't like #ifdef spaghetti, but sometimes my interpretation of the kernel's evolution is mistaken, and some code rot creeps in.
Just as a minor update, I am adding a little more TCP provider support. (Just added tcp::connect-established, for instance). More work is done to mirror all the TCP probes, as documented here:
https://wikis.oracle.com/display/DTrace/tcp+Provider
and to eventually support the callback arguments.