Why I love Xen Sunday, 23 March 2014  
I dont. But I thought my prior blog required balance :-)

I think I found the Xen + DTrace culprit. I have a couple of routines in dtrace which invoke the CLI and STI instructions. They are a bit naughty, and designed to shield dtrace from the complex number of enable/disable interrupt primitives in the kernel. (The kernel has multiple layers of interrupt functions and macros - to try and hide the implementation details of each cpu, along with preemption and scheduling functions).

I spent a lot of time thinking about DTrace and Xen - what was I doing wrong. I spent a lot of time not thinking about DTrace and Xen. As much as I wanted and needed help, to find this horrible bug, I couldnt even phrase the right question.

The evidence was clear: occasional single step traps would not properly single step - it was as if the single-step trap flag was not set in the FLAGS register. I tried molesting the code in the interrupt routines - checking very careful the assembler code.

The Linux kernel has two sets of interrupts for many of the core traps (int1, int3 and others). Which one is used depends on whether Xen is active or not. When Xen is active, there are two extra registers on the interrupt stack.

After much playing with the dtrace code and the linkage from dtrace to the underlying kernel...I started to investigate something else.

I noticed the dtrace dumps were caused at the point we switched from one cpu to another. My VirtualBox VM had 4 cpus. So I set it to 1 cpu. DTrace worked remarkably better - but soon enough, we still core dumped on:

$ dtrace -n fbt::sys_chdir:

in one window and

$ while true
do
  cd /
done

in the other. The fact that we had a single-step trap, which was not single stepping is really curious. So I started grepping for CLI/STI in the driver code, and came across my old friends dtrace_enable_interrupts() and dtrace_disable_interrupts().

These functions use assembler to execute the CLI/STI instructions. Now, what if we had done the wrong thing? Xen wouldnt really know (except by accident) that we had executed these instructions and we may be running without correct disabling interrupts. Xen is a hypervisor and if it doesnt know whats going on, all bets are off.

So I replaced the CLI/STI with arch_local_irq_restore and arch_local_irq_save, and it works! I had gotten away with this for a small duration and this is the Xen issue (I hope its the only issue). Whilst writing this blog, my test case ran with no problems.

Now I can formalise the change and push out a new release.

I still dont like Xen - I would rather have my VM die and tell me that something is wrong immediately, not lie through its teeth that the guest code is working when it isnt.


Posted at 21:38:34 by fox | Permalink
  Why I dislike Xen Friday, 21 March 2014  
I have a hate/hate relationship for Xen. All forms of virtualisation *except* Xen, emulate a CPU in a way where it is extremely difficult to tell the difference. Xen replaces certain instructions with an API and a portal to the outside world. If you look at such instructions as CLI and STI - classic 8080 instructions (yes, all the way back to the original 8-bit cpus), they work in a well understood way.

Fast forward to Xen. Those instructions dont exist - nor any other instruction which can affect the virtual state of the CPU. Instead, the kernels are compiled to call a function via the hypercall portal.

Now, if you do the wrong thing - you are on your own. You wont know what you did wrong. It is so easy to crash the VM *host* if you are not careful.

So, lets look at dtrace. Dtrace appears to work well. But if you are on Xen, as of this writing, the results are erratic - after a few probes, you will crash the process, or the kernel, or the VM.

Heres an example of a process crash:

[ 2390.769639] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2390.769640] IP: [<ffffffffa01cd193>] cpu_core_exec+0x133/0xfffffffffffbcfa0 [dtracedrv]
[ 2390.769641] PGD 17319067 PUD 172b9067 PMD 0
[ 2390.772090] Oops: 0002 [#1] SMP
[ 2390.772090] CPU 1
[ 2390.772090] Modules linked in: dtracedrv(PO) nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc psmouse microcode serio_raw i2c_piix4 e1000 [last unloaded: dtracedrv]
[ 2390.772090]
[ 2390.772090] Pid: 3057, comm: bash Tainted: P        W  O 3.4.0 #1 innotek GmbH VirtualBox/VirtualBox
[ 2390.772090] RIP: e030:[<ffffffffa01cd193>]  [<ffffffffa01cd193>] cpu_core_exec+0x133/0xfffffffffffbcfa0 [dtracedrv]
[ 2390.772090] RSP: e02b:ffff8800173d5f78  EFLAGS: 00010193

When a dtrace probe fires (because we replaced the instruction with a breakpoint), we have to single step the replaced instruction. What is happening above is that the single step is not honored. The EFLAGS register shows bit 0x100 is set - a single step is in effect. But Xen has decided it isnt. So the instruction at the buffer location continues after the instruction. Heres the disassembly:

(gdb) x/i 0xffffffffa01cd190
   0xffffffffa01cd190:  push   %rbp
   0xffffffffa01cd191:  nop
   0xffffffffa01cd192:  nop <=== shouldnt get here
   0xffffffffa01cd193:  add    %al,(%rax)  <=== but we died here

Such sweet joy to figure out what "rule" I broke. I had reported in a prior blog entry that the Xen documentation is very poor.

More than likely I am setting the trap flag by bit-mask operation on the flags and not by telling Xen what we did.

Off to find the code which is stopping Ubuntu 12.04 from working reliably.


Posted at 22:24:54 by fox | Permalink
  Malloc(RSS) Monday, 17 March 2014  
I recently wrote about the rss.pl RSS news crawler I wrote. Majority of it took a few hours to write - the goal was to reduce my mobile bandwidth and avoid annoying adverts being injected into pages. I set my sights very low - this doesnt need to be a technically difficult piece of software, and it gave me something to play with - something I have done numerous times over the years. (CRiSP incorporates a news side-bar, for instance; and CRiSP used to support a web spider - it still does, but theres less need for it in the day of superfast broadband - compared to the early modems it was written against).

I can successfully report that rss.pl exceeded my desire to go on a diet. Previously I was approaching 500MB/month of bandwidth use and it was out of control, due to the obnoxious way adverts were being generated (they were mutilating the HTML of a page - after I had paid the cost of downloading the page - and effectively paying twice for a page, whilst denying compression). In addition, the mechanism would add significant delay - I was lucky to read 1 or 2 web pages in multiple minutes.

With rss.pl, I do a page refresh maybe 2 or 3 times a day. The single (compressed) page load of ~90kb (~250kb uncompressed) lets me wade through maybe 100-200 news items over the course of a few hours - I can read far more, without annoying delays.

In one month, I hit 60MB of use - I had set myself a diet of 100MB/month but I exceeded that. I would have hit 50MB or lower, but occasionally I use my device for email.

Anyway, I started adding stats to the service. Allows me to fine tune how it works. I noticed other people use it (by all means do - rss.pl - but I reserve the right to take it down or restrict usage if I find any abuse).

One of the things I am debating is how much history to carry. I currently allow for the last 2000 new sites in a cache, but of the 10-12 sites I poll, at varying frequencies, some sites are exposing old items. (I dont use date aging - yet - simply URL history). The working set is about 700-800 news items, so within 2 or so days, I will be exposed to stale news items popping through. Its interesting watching the effect. For example, www.osnews.com is not a fast updating site - maybe 1 item per day, yet my code is allowing multiple old items to pop through. "The Old New Thing" (http://blogs.msdn.com/b/oldnewthing/rss.aspx) is a great site, and updates at one update per day. Yet very old items are peeping through.

Somehow, the algorithm for retrieving sites, and the cycle update from these sites is causing a kind of "shearing" effect - something typically seen with a fragmented heap (hence the reference to malloc) or locality issues in a page swapping algorithm.

Its as if the sinusoidal wave of each news feed is leading to some interesting effects.

I can just increase my history limit - doing so wastes more CPU cycles on my server (so I now need to use a better data structure than what I presently do). Maybe I will relent and do If-Modified-Since requests and use the timestamps to keep everything fresh.

At the moment, I discard older news items and pages I generate, but I may decide to keep them for long term archival - for more data mining expeditions.


Posted at 22:48:08 by fox | Permalink
  Latest DTrace news Sunday, 09 March 2014  
I havent much worked on DTrace recently - having been looking at some side projects (adding multiple cursor support to CRiSP, my rss.pl enhancements, and various bugs/enhancements in CRiSP).

Of course, when I stop looking at dtrace, after a while, a trickle of things come in. I thought I would summarise some of my "TODO" items.

First, a user reported an issue on a 3.2 kernel. Seems like a Xen hosted kernel is very unreliable. Although he reported an issue with an fbt::: torture test, I could reproduce with a task switch function - even after a few probes, suggesting the basic handling of an IRET instruction may be having problems on Xen. (Xen does paravirtualisation which is another way of saying, all the work is on the user and not itself, to virtualise. I really do not like Xen - its a work generation scheme!)

Next was an issue with the PID provider. I did start working on this and found some missing functionality in libproc/libdtrace. I started working on this, but its not ready for consumption. I have included the partial work in the next release. At issue is running a process and intercepting the shlib events so that dtrace can look at the shlib and add probes.

Next up is the 3.12 kernel. I usually leave it a while before touching new kernels - Ubuntu 14.04 is due in a few weeks and I can kill two birds with one stone by waiting for that to come out.

Another annoyance is the libdwarf vs libdw confusion which prevents ctfconvert from being built. This is mostly irrelevent unless people want to use kernel struct's in their probes. I need to discard both libraries and use readelf and parse the output from it - its the only portable way to do what is necessary vs the confusion inherent in both dwarf libraries.

So, its unlikely I will release a new dtrace release for a few more weeks, unless I magically fix one or more of the issues above.


Posted at 20:43:00 by fox | Permalink
  candycrush - file format Sunday, 02 March 2014  
After all the recent discussions about candycrush, I thought I would test it out - see what the fuss is about.

Its a nicely put together game - delivering a "Bejewelled" experience which gets you hooked quite early on. Starts really easy and then becomes challenging. You quickly run out of lives. Lives are given back to you at a rate of 1 every 30 mins. If you leave the game for a few hours you can have a few lives to retry the levels again, until you run out.

Great so far. With over 300 levels - the game will keep you hooked for a long time. But paying for the extras and in-app purchases can soon be expensive.

Looking at all the cheats on the web - there seems to be two: one is to use facebook - login/deinstall/reinstall, and you can bump up your lives.

The other is to download a "save" file which is premodified to give you 999 lives and other stuff.

I took a quick look and located that (for Android), the file of interest is something like

/data/data/com.king.candycrushsaga/app_storage/save_NNN.dat

where NNN is some unique number for you.

Looking at this file in a binary editor shows some form of binary structure. Given that I was at level 3, the following pattern showed itself:

0002f0:   0003 0000 00c0 1200  0003 0000 00ac 1c00  ................
000300:   0002 0000 00b4 3c00  0003 0000 003c 8700  ......<......<..
000310:   0003 0000 0020 f800  0003 0000 0034 fb02  ..... .......4..
000320:   0003 0000 00ec fe00  0003 0000 00d4 4301  ..............C.
000330:   0003 0000 00ac 3e01  0002 0000 0038 ee01  ......>......8..
000340:   0003 0000 00a8 5b01  0002 0000 005c 5c01  ......[......\\.
000350:   0003 0000 0008 7d02  0003 0000 0018 c801  ......}.........
000360:   0003 0000 00dc 6c02  0002 0000 0010 0d01  ......l.........
000370:   0003 0000 008c 1103  0003 0000 0014 ed03  ................
000380:   0003 0000 00e4 9800  0002 0000 0088 2b01  ..............+.

After that is a sea of all zeroes. I modified those "3"'s - at 8 byte intervals, and uploaded to my device.

And yes - it now shows that I have completed (or started) those levels. I am assuming the format here is - number of stars, and a high score account. Not clear what the 0000 is after the stars - maybe the stars are an int (i.e. 4 bytes) and 4 bytes for the score.

I presume theres lots more stuff in there to find, including a timer or other attributes you can earn/pay for.


Posted at 17:40:41 by fox | Permalink
  Google and my rss feed site Saturday, 01 March 2014  
I wrote previously about my RSS reader - a simple Perl script which does RSS aggregation and internal web service - very small and simple, designed to reduce bandwidth over my mobile. It has been very effective so far - instead of 50-75+MB/d I achieved 30MB for the *week*.

Its effective because its no frills and supports a zipped payload page - no graphics, no external references, etc. Wow! Is it good :-)

http://crisp.dyndns-server.com:3000/p.html

I'll make gradual changes and enhancements to it over time - so if you use it, you may see changes (and feel free to mail me with ideas or questions. There is a definite fine line between bloat and functionality.

Anyway - I am staring at the http access log and notice something very peculiar. This app - by its nature of aggregating news from a few web sites has attracted the googlebot crawler. Great - why not.

But I notice a request come in with a "Referer" link like this:

http://www.google.com/search?q=Christina+Mari....

Interesting. So, someone clicked from google to get to me? I must be famous! Anyhow I play that link into the browser, and I can see - there I am - on page 1 of the results. Neat! Managed to achieve something most people dream about.

But - isnt this a security hole? If I attract these requests en-masse and start collecting them, then I am finding out something very personal from the requestor and their IP address and browser.

Interesting...will have to look more at this to see what is going on. I have no direct use in such data - but of course, everyone out there, coming from google is getting a referer link.


Posted at 21:50:05 by fox | Permalink