dtrace update | Monday, 20 June 2011 |
Some build issues are fixed (2.6.18 kernels confuse the syscall extraction code); it mostly works - but some warnings are present. Additionally, a 'dtrace -n syscall:::' will crash the kernel. I suspect some mismatch on the ptregs syscalls and/or 32b syscalls on this kernel. Need to debug.
I also found that on a 16-core machine, the xcall code generates a lot of noise when things aren't the way it expects. This eventually led to an assertion failure in dtrace.c (on a buffer switch - which is consistent with dtrace_sync() not hitting the expected cpus, i.e. some race condition/bug), and eventually a complaint from the kernel that a vm_free was invalid.
Oh dear.
To date I have been testing on dual-core cpus. I need to get an i7 so I can ramp up to 8 cores and do more heavy torture tests.
So, keep an eye out for updates (which are likely to be slow in coming in the next week or two), whilst I try to pin down the xcall issue.
NMI support added | Sunday, 19 June 2011 |
It looks like the APIC allows specific interrupts to be marked as NMI - which would be great since rather than sharing the NMI with other users of the interrupt, we could just make the IPI interrupt work like an NMI and avoid the deadlock scenario.
For now, the interrupt handler tries to be careful and not trigger when it's uncalled for. It does present a problem if we need the NMI and someone else does at the same time, but I can investigate how the APIC works a little better (or check the Solaris code to see if, indeed, that is what it does).
I also need to update the dtrace_linux.c code so that I don't just grab interrupt vector 0xea (a random interrupt which appears not to be used, but it could be). I am a naughty programmer.
Release 20110619 contains the above fixes.
The Final Phase of dtrace | Sunday, 19 June 2011 |
After a lot of effort, writing, rewriting and rewriting yet again, the code is (nearly) finished. It looks good - it handles arbitrary cpus calling into other cpus and allows for a xcall to be interrupted by another call to xcall (effectively a mesh of NCPU * NCPU callers).
However, I have found a flaw. If I modify the dtrace_sync() function to sync 100-200 times instead of just once, then occasionally there are delays and kernel printk()s from the code - where spinlocks are taking too long.
Turns out, we could deadlock if we try to invoke an IPI on another CPU which has interrupts disabled. Not totally sure how Solaris handles this - I get a little lost in the maze of mutex_enter() and splx() code.
There is a solution to take us to the next level - NMI. The NMI interrupt is not maskable (unless an NMI is in progress). NMIs are typically used by Linux for a watchdog facility - to make sure CPUs aren't locking up - as well as for "danger signals" (like ECC/parity memory errors).
I will experiment to see if I can run via an NMI rather than a normal interrupt and that should help reduce the problems of lock-busting significantly.
At the moment dtrace is pretty good - my ultra-torture tests really are horrible, and most people won't do that in real life.
So, as always, tread carefully until *you* feel happy this is not going to panic your production system.
Update to prior post | Thursday, 16 June 2011 |
More testing to follow and I need to fix AS4 (Linux 2.6.9) kernel compilation issues.
DTrace xcall issue -- fixed? Website of the day. | Thursday, 16 June 2011 |
http://forum.osdev.org/viewtopic.php?f=1&t=21768&start=0
Now, this has been driving me nuts for months. Why was my spanking new cross-cpu code hanging occasionally? I had spent ages building up the courage to write it, and was fairly proud of it. But it just wasn't reliable enough and I disabled it in recent releases of dtrace.
Here's the problem: a cross-cpu synchronisation call is needed in dtrace. Not often, but in key components. I feel like the way this was done in dtrace was almost laziness, because there are other ways to achieve this (I believe). But the single cross call (in dtrace_sync()) is a problem...
Interestingly, I was surprised it was called so often. It's called during the tear-down of /usr/bin/dtrace as the process exits. I had wondered why dtrace intercepts ^C and doesn't die immediately. It does something very curious - it intercepts ^C and asks the driver nicely to tear down the probes we may have set up. Of course, you can kill -9 the process, and it works. *But*. *But*. If you do that the probes aren't torn down! Instead, they are left running. After about 20-30s, since nothing in user land empties the buffers, the kernel auto garbage collects, but it means that in a kill -9 scenario, whatever you were tracing may continue to take effect.
I don't like the way ^C works in dtrace and I may attempt to fix it (e.g. fork a child to tear down the probes; tear down is done by a STOP ioctl(), btw).
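The ^C behaviour described above follows a common pattern: the signal handler only sets a flag, and the main loop notices the flag and performs the teardown itself. The following is a minimal userland sketch of that pattern, not the actual dtrace source. The names `teardown_probes()`, `trace_loop()` and `consume_always_more()` are hypothetical, and the real STOP ioctl() is replaced by a counter. Note that SIGKILL (kill -9) cannot be caught at all, which is why no handler-based scheme can tear down probes in that case.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the signal handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested = 0;

/* Counts teardowns; stands in for the effect of the real STOP ioctl(). */
static int teardown_calls = 0;

/* The handler does as little as possible: it only records the request. */
static void on_sigint(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install the SIGINT handler; returns 0 on success, -1 on failure. */
int install_sigint_handler(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/* Hypothetical placeholder for asking the driver to remove the probes. */
static void teardown_probes(void)
{
    teardown_calls++;
}

/* Example consumer that always reports more data is available. */
static int consume_always_more(void)
{
    return 1;
}

/*
 * Call consume() until it returns 0 or ^C has been seen, then tear down.
 * Returns the number of completed consume() calls.
 */
int trace_loop(int (*consume)(void))
{
    int iterations = 0;

    assert(consume != NULL);
    while (!stop_requested && consume())
        iterations++;
    teardown_probes();
    return iterations;
}
```

A caller would run `install_sigint_handler()` once at startup and then `trace_loop(consume_always_more)`; the teardown runs on the normal exit path whether the loop ended by itself or because of ^C.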
Ok - so cross calls happen a lot especially during tear down (and also during timer/tick interrupt handling).
So .. what happens? Well, on a two cpu system, the cpu invoking cross call deadlocks against the other cpu waiting for the remote procedure call to be acknowledged.
With the original Linux smp_call_function() there were lots of issues in calling it with interrupts disabled (ie from the timer tick interrupt). This is not allowed - two cpus calling each other at the same time will deadlock.
The cross-call code has to run with interrupts enabled and that means being very careful with reentrancy and mutual invocation.
One day I put some debug into the code to try and spot mutual or nested invocations and I got a hit. On a real machine. But never on my VMs.
I modified the code to allow a break-out - after waiting too long, the code gives up and allows the machine to stay intact. Without this, the machine would lock up (deadlock with interrupts disabled).
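The break-out idea can be sketched as a bounded spin: poll for the acknowledgement a limited number of times and report failure instead of spinning forever. This is an illustrative userland version using C11 atomics, not the kernel code. The function name `wait_for_ack()` and the `max_spins` limit are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Poll *ack up to max_spins times.
 * Returns true if the acknowledgement was seen, false if we gave up.
 * Giving up lets the caller log the problem and carry on, rather than
 * locking the machine up.
 */
bool wait_for_ack(atomic_bool *ack, unsigned long max_spins)
{
    assert(ack != NULL);

    for (unsigned long i = 0; i < max_spins; i++) {
        /* Acquire ordering so we also see data written before the ack. */
        if (atomic_load_explicit(ack, memory_order_acquire))
            return true;
    }
    return false;
}
```

The choice of limit is a trade-off: too small and a merely slow CPU is treated as hung, too large and a genuine deadlock takes a long time to detect.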
I fixed the code to handle mutual invocation and recursion.
But I could not figure out what the locked-up CPU was doing. I tried to get stack dumps from the locked CPU - but these would only happen after dtrace had given up waiting. It's as if the other CPU was asleep and wouldn't wake up until the primary CPU had given up looking (a definite Heisenbug!).
The web link at the top of this page illustrates the exact same setup I was seeing. So, I followed the page (it explains that acknowledging an end of interrupt to the APIC prematurely may not work on a VM).
Not only had I spent a huge amount of time to understand, fix and engineer a solution but I almost had a working solution without realising it. I had moved the APIC_EOI code to the end of the interrupt routine previously, but because of the lack of support for mutual invocation, it hadn't worked. So I put it back again.
So I think this is looking good - much better than before. I need to do more torture testing and cleanup before I release.
On the way, I tried, or started trying, lots of things (like using a crash dump to analyse this problem .. which wasn't successful, or using NMI interrupts instead of normal interrupts). I've learnt a lot and been frustrated by a lot too along the way.
Keep an eye on twitter .. I'll report a status update if I think I am not close enough.
My blogs | Sunday, 12 June 2011 |
People may find my blogs a bit confusing. I thought it worth detailing "why".
Originally I set up a series of blog posts, using my own Perl blog code, which was in turn, based on the nanoblogger code. (http://nanoblogger.sourceforge.net/).
The website I publish to (www.crisp.demon.co.uk) is interesting in itself. Demon was the first ISP in the UK (back in early 1990s) to offer access to the Internet. Alas, they have never done anything useful since then, and I pay subscriptions for a near-useless service (teeny amount of web space, no perl or cli or anything else). Because space is so tight, I tend to leave most things, including CRiSP and Dtrace downloads on my internet facing machine at home. The only thing that Demon usefully serves me is the email address, although I do try to get people to switch to my (numerous) gmail accounts.
I was using the Dyndns service for a DNS entry but due to some silliness on my part, I lost the name entry, which put dtrace off the map for many people. I set up a new address (via crisp.dyndns-server.com).
I should just pay for a normal DNS entry but I haven't decided what I want.
The crisp.demon.co.uk site is costly, much more costly than a decent hosted web appliance, so I do need to do something.
At home, I have two main dev machines - and when I blog post, I try to update both the original Demon hosted site, and also blogger. It turns out to be easier to update blogspot first, and the Demon at a later date when I "get around to it". ("Get around to it" means powering on my main PC, running a script, and shutting it down again). Things got confused because I have two dev machines and have to be careful how I sync to and from each other.
So, that's the feeble excuse for me appearing and disappearing in the waves.
dtrace progress | Saturday, 11 June 2011 |
Bear in mind when we think of a kernel - there are multiple views of the kernel:
- 64b kernel running 64b apps
- 32b kernel running 32b apps
- 64b kernel running 32b apps
The apps get to the kernel via system calls. System calls are implemented in a variety of ways - depending on the kernel version and the CPU. (Some older cpus, such as the i386 and i486, don't support instructions like SYSCALL and SYSENTER).
So dtrace traps the system calls by patching the system call table. The code is mostly the same but subtly different for a 32b and 64b kernel.
But when a 32b app is running on a 64b kernel - the app doesn't know any different, but the kernel does. The kernel has two system call tables: a given system call, e.g. "open", has a different index in the two tables. The two kernels developed differently. i386 kernels have had to maintain backwards compatibility, but the amd64 kernel did not and started afresh at the point these cpus became available.
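To make the "different index" point concrete, here is a small lookup table of a few well-known Linux syscall numbers on i386 and x86_64. It is only an illustration of why a tracer needs to know which table a call came through. The struct, enum and function names are made up for this sketch and do not come from the dtrace source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Which syscall table the caller entered the kernel through. */
enum abi { ABI_I386, ABI_X86_64 };

struct syscall_entry {
    const char *name;
    int nr_i386;    /* index in the 32-bit table */
    int nr_x86_64;  /* index in the 64-bit table */
};

/* A handful of entries; the same call has a different number per table. */
static const struct syscall_entry syscalls[] = {
    { "read",    3,   0 },
    { "write",   4,   1 },
    { "open",    5,   2 },
    { "close",   6,   3 },
    { "fork",    2,  57 },
    { "clone", 120,  56 },
};

/* Return the syscall number for name under the given ABI, or -1. */
int syscall_number(const char *name, enum abi abi)
{
    assert(name != NULL);

    for (size_t i = 0; i < sizeof(syscalls) / sizeof(syscalls[0]); i++) {
        if (strcmp(syscalls[i].name, name) == 0)
            return abi == ABI_I386 ? syscalls[i].nr_i386
                                   : syscalls[i].nr_x86_64;
    }
    return -1;
}
```

For example, `syscall_number("open", ABI_I386)` and `syscall_number("open", ABI_X86_64)` give different answers, so a probe keyed only on the number would mix up unrelated calls.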
Dtrace handles that.
Except it didn't handle the special syscalls: when a 32b app invoked fork(), clone(), etc, we usually ended up panicking the kernel.
Most Linux distros are "pure": a 64b distro has 64b apps, so you rarely see the effect of a 32b app.
Linux/dtrace has a nice interface for system calls. The probe name, e.g.
$ dtrace -n syscall:::
matches all system calls. But the 32b and 64b calls are different probes. So, you can intercept all 32b syscalls on a 64b system:
$ dtrace -n syscall:x32::
which is useful in many ways.
I have nearly fixed these special syscalls on the 64b kernel - just have clone() to fix. The symptom of not fixing is a cascade of kernel OOPs and panics (because the kernel stack layout is not what it should be).
I hope to release later today a fix for this problem.
dtrace -- some updates | Sunday, 05 June 2011 |
Very occasionally, Perl would emit a warning about a file handle belonging to a file which couldn't be opened. (The file was /etc/hosts - which always exists).
Similarly, other apps would occasionally fail to start with rtld linker errors.
This proved very hard to track down: I was pretty certain it was related to the xcall work I was doing. The errors were rare - less than 1 in a million - and almost impossible to reproduce.
I moved away from xcall debugging and found that by having two simple perl scripts (on a dual core machine), which continuously opened files and nothing else, that the error rate would increase whilst the two scripts ran.
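The originals were Perl scripts; as a stand-in, here is a minimal C sketch of the same kind of stress loop: open a file that is known to exist, over and over, and count any unexpected failures. The function name `count_open_failures()` and its parameters are assumptions for illustration. Running two copies at once on a dual-core machine would mimic the setup described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open and close path the given number of times.
 * Returns how many open() calls failed. For a file that always exists
 * (such as /etc/hosts), any non-zero result points at a problem
 * somewhere underneath the application.
 */
long count_open_failures(const char *path, long iterations)
{
    long failures = 0;

    assert(path != NULL);
    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            failures++;
            continue;
        }
        close(fd);
    }
    return failures;
}
```

Because the loop does nothing except open and close, the syscall path dominates, which helps turn a 1-in-a-million failure into something that shows up in a reasonable time.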
To try and get a better handle on this, I moved from 64-bit kernel debugging to 32-bit kernel, where the error rate was significantly higher.
After a lot of experimentation, it transpired that the error wasn't to do with xcall, but the syscall provider. Specifically, a piece of assembler glue turned out to be rubbish. I am not sure why it appeared to work, but it didn't. (I had made some changes earlier on which may have broken the syscall tracing on 32-bit kernels).
After recoding the assembler glue - things looked much better. The errors in syscall processing appeared to be gone. But a new problem surfaced - one I wasn't too surprised to see. There are a handful of 32-bit syscalls which use a different calling convention from the others. (The 64-bit code handles this, but not the 32-bit code).
I have nearly finished redoing the 32-bit syscall tracing, and, once done, will need to validate the 64-bit syscall tracing.
If I am lucky, hopefully in the next few days or weeks, the resiliency issues will disappear and I can put out a new release.
The syscall tracing code is horribly ugly - because we have to support different calling conventions across the two types of cpu architecture. I may split the code up into an x86 and x86_64 code file.
Bad websites | Thursday, 02 June 2011 |
I have a beef with a variety of websites - nice websites, let down by the "We don't care" attitude or "We didn't test it".
http://www.tvguide.co.uk
I despair of this web site. It's a great guide to TV channels for UK people. Nice layout. Lots of content.
So, what's wrong with it?
A number of things. One - the menu bar at the top of the screen is over engineered. If you try to do something, like select one of the sub-menu items, it is near impossible to navigate without losing context. Try and select something, e.g. "New series". I leave you to find which submenu that's under (a minor annoyance).
Secondly, the huge amount of real estate given over to pointless banners. These aren't advertising banners, but program banners. On a small screen you have no information content on the first screen at all. On a large screen you barely get 50% of your screen with the TV grid.
The search function is badly over engineered using javascript.
And if you turn off some ad sites via an ad-blocker, the whole page becomes non-functional.
And lastly, the page quite often forgets who you are and your channel selections.
I gave up with this site, and wrote my own TV highlighting application.
BBC RadioTimes
The BBC provides XML files containing 14 days of TV schedules. This is a great source of data (which I use in my TV planning application).
But the reviews are *awful*. No, make that, *truly awful*. When I see a film or a series of potential interest, the paragraph of review is of this form:
This film, made by XXX YYY, is a follow on to his earlier work ZZZ, AAA, BBB. The director did blah, and the actors did bloop. The film won an award at Cannes, and went straight to video.
Can you tell what's wrong with the above? It's totally devoid of any information about what the film or program is *about*. The reviews/write-ups on tvguide.co.uk at least tell you what the program is about.
Here's a real quote from the BBC:
One of two low-budget westerns made by Barbara Stanwyck - the other was 1956's The Maverick Queen - before she found her glorious late-career stride with such titles as Forty Guns and TV's The Big Valley. Aided by thoughtful direction from the prolific and talented Allan Dwan, this movie now has great curiosity value, in that the leading man is former US president Ronald Reagan, a bland and colourless performer when pitted against screen villains Gene Evans and Jack Elam. The location scenery is very attractive, the action sequences well staged, and Stanwyck as tough as ever: it's a shame the script didn't give her or any of the cast more opportunities. Still, this will pass the time nicely, and teenage girls might discover a useful role model.
Gizmodo.com
My first site of the day is http://www.dailymail.co.uk. Yes, I know - that's a poor choice of a news website, but consider it bubblegum for the brain first thing in the morning. My second is Engadget. A very nice and highly fluid website with news stories of interest to me.
And this, *was* Gizmodo. But I have removed the link from my web browser.
On my ipad, I have a cached page dating back to April - I cannot get it to update. I don't know what they did. On my other devices, I don't have the caching problem. Gizmodo used to track Engadget in style and content. But recently, they have overhauled it. And they have not done any user testing as far as I can tell.
First, I would be redirected to the mobile site, even though I don't want that. Now, they have reformatted the website, and it's totally devoid of content on the front page.
It used to work and be a great site, but I waste my monthly bandwidth quota visiting Gizmodo and hoping for something useful to browse.
So, goodbye to Gizmodo. Maybe, when others start linking to it again and it contains useful content (even if it's a rehash of other sites), I will revisit.
Slashdot
This site has been great for years. Until now. The pool of people-powered news stories they have is great. But Slashdot has been playing games with its presentation and - as my 4th choice of read of the day - it is close to being binned as well.
For starters, the three column format is annoying. Very annoying. When browsing on a mobile/small screen device, the left hand column requires you to scroll the screen to view the text. I never look at the left hand column - because I know it never changes. So why waste prime real estate with that, *there*?
Next. Slashdot has tried to avoid slow and large home page loads. I applaud that. But they have done that by limiting the number of visible stories to about 6. Given that they seem to dribble items out at about once per hour, that means it's pointless visiting the site repeatedly during the day. And if you leave it too long, you lose continuity of stories you have read/not-read. (You have to scroll to the bottom of the screen, click on "More", wait, wait, and then you see the stories you saw a few hours ago).
Slashdot seems to have "lost it" - it used to be an interesting place to read non-news stories about technology, but they have taken the Gizmodo approach - reduce the amount of useful info on the page to the point where visiting it has become boring.
BBC
BBC - what a poor website. It used to be awful. Now it is pointless. Another home page devoid of content. It's full of flash cleverness where you can edit the layout, but I don't want to do that. I want to see news. The news page is devoid of information - almost like it is a commodity which is in short supply.
(Compare the BBC news defaults with the Dailymail website - there's enough information in each paragraph on Dailymail to decide if you want to read further. On BBC, you have to guess if the news item says anything useful).
Next, try reading BBC on a mobile device. The customisations do not work (at least, not on my android device). The site is untested in real life. I rarely look at BBC - every few years when I look, I think the same thing. A waste.
There *is* good content on the BBC site - if you spend the time to find the programme schedule and radio information. But using the BBC website is like having an unfaithful lover: things move around so much you are never sure if the site will be the same when you visit it. It would not be so bad if it got better when the changes happen. But it gets worse.
The ratio of information content to real estate is so low that it reminds me of the days of a Teletype (ASR-33 with a paper punch drive).
Can I do better?
I don't for one moment think I can do better than these sites. I have learnt lots of interesting things (both in terms of content and in terms of presentation). But the dilution of news sites, which all feed off each other, has made the internet quite boring.
Which is a shame.