x86_x32 instruction set Wednesday, 25 December 2013  
There was an article on Slashdot today on whether the x32 instruction set architecture is of any use. The comments were equally divided - some loving it for embedded use where every byte or cycle counts, and some saying it's pointless.

In general, I think support for this is a good thing. We mostly won't use it on our desktops, and probably not even on our Androids - the scope for confusion, missing libraries, and other problems is huge.

For those that don't know - the Intel/AMD chips support classic i386 32-bit mode binaries and instructions, whilst x86_64 supports full 64-bit operation. i386 mode is important (or was important) in the transition to full 64-bit chips, since at the time the OS, compilers, and apps didn't support 64-bit and not everything was recompiled "overnight".

Getting a new architecture ready is a lot of work - we see this today in the race for ARM-64. Very typically, what is needed is the OS to support the architecture, followed by the compilers. Actually, the compilers have to come first to support the OS, so there can be a standoff until the two stabilise.

Next comes the glibc library which provides the interface to the OS, and pretty much a bottom-up recompile of every library (X11, GTK, Gnome, KDE, QT, etc) and lastly the apps; initially the core OS/distribution apps, and then vendor apps.

This can take a while to stabilise across the full suite of apps - maybe 1-2 years, optimistically.

Given the maturity of 64-bit architectures, nobody is in a race to support the x32 variant. The x32 variant is simply a 4GB address space version of x86_64. All the instructions and registers of the 64-bit architecture are available, but the address space is limited, which in turn reduces the size of pointers, hence smaller memory demands. On today's multi-gigabyte desktops, that's not a big issue. But it can lead to smaller binaries - smaller footprints, faster to load, fewer pages of memory, and fewer TLB and cache misses. The fact that more code fits in the instruction cache is important and can give additional gains.

In general, the reported gains are small - maybe a few percent - unless the app is "pointer heavy". Many apps hold large datasets in memory that are mostly flat data (e.g. XML, text, or web pages) and gain little; but sophisticated apps will have big complex structures with lots of pointers, so the savings may be good.
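
To make "pointer heavy" concrete, here is a rough sketch of how the size of a C structure changes when pointers shrink from 8 to 4 bytes. The tree node layout is a made-up example (it is not a CRiSP structure), and the sizing function assumes the usual natural-alignment rules rather than querying a real compiler.

```python
# Sketch: size of a pointer-heavy C struct under 64-bit and x32 pointer widths.
# The node layout below is a hypothetical example, not a CRiSP structure.

def struct_size(fields):
    """Return sizeof() for fields given as (size, alignment) pairs,
    assuming natural alignment as commonly used by C compilers."""
    offset = 0
    max_align = 1
    for size, align in fields:
        offset = (offset + align - 1) // align * align  # pad up to alignment
        offset += size
        max_align = max(max_align, align)
    # Round the total up so that arrays of the struct stay aligned.
    return (offset + max_align - 1) // max_align * max_align

def tree_node(pointer_bytes):
    # struct node { struct node *left, *right, *parent; int key; };
    ptr = (pointer_bytes, pointer_bytes)
    return [ptr, ptr, ptr, (4, 4)]

size_64 = struct_size(tree_node(8))   # x86_64: 8-byte pointers
size_x32 = struct_size(tree_node(4))  # x32: 4-byte pointers
print(f"x86_64 node: {size_64} bytes, x32 node: {size_x32} bytes")
```

For this particular node the x32 layout is half the size, so twice as many nodes fit in each cache line. A structure holding mostly character data would barely change.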

I tried recompiling CRiSP as a new linux-x86_x32 architecture to see what the real world effect is. (Side note: I updated my ptrace implementation to support x32 architecture - it required very few lines of code to do so; the kernel system call interface is nearly identical to x86_64, bar a quirk of how syscalls are encoded).
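
As far as I understand it, the encoding quirk is the __X32_SYSCALL_BIT flag (0x40000000) from the Linux kernel headers, which is set in the syscall number when an x32 process makes a call. A tracer can strip it to get back to the familiar x86_64 number. A minimal sketch of that decoding follows; the function name is my own and this is not the actual ptrace source.

```python
# Sketch: classify a raw syscall number as seen by a tracer.
X32_SYSCALL_BIT = 0x40000000  # __X32_SYSCALL_BIT in the Linux x86 headers

def decode_syscall(raw_nr):
    """Split a raw syscall number into (abi, number).

    decode_syscall is a hypothetical helper name used for illustration.
    """
    if raw_nr & X32_SYSCALL_BIT:
        return "x32", raw_nr & ~X32_SYSCALL_BIT
    return "x86_64", raw_nr

# write(2) is syscall 1 on x86_64.
print(decode_syscall(1))
print(decode_syscall(X32_SYSCALL_BIT | 1))
```

This ignores the small set of syscalls that have x32-specific entry points with their own numbers, which a real tracer would need a separate table for.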

Recompiling (and defining) a new crisp variant took around 5-10 minutes of effort - although I had to put a workaround in for <sys/sysctl.h>, which has a pragma complaining that the sysctl() system call is not supported for x32 architecture apps. (I don't think I use that, but had to conditionalise out the #include.)

The results are interesting - x32 seems to win compared to x86_64, with the figures suggesting a 5-10% performance improvement. I attach two runs of my performance benchmark in CRiSP macros - what these macros do doesn't matter much, except to note that each test attempts to run for 5s, counting how many of certain operations can be performed. How this translates into real-world performance is hard to say and probably not worth a comparison. CRiSP is pretty efficient anyhow, and if you run CRiSP, your CPU is going to be mostly idle; and when it isn't, maybe you get up to 5% performance improvement - i.e. you wouldn't notice it. If we were optimising, for example, battery performance on a mobile, tablet or laptop, then having x32 could be beneficial (not just for CRiSP, but for the OS and all the other apps you use). Imagine an extra 30-60 minutes of battery life on your portable device! Worth having, but may not happen.

Anyhow, here's the relevant benchmark data:

   text    data     bss     dec     hex filename
1379459   50844   70212 1500515  16e563 bin.linux-x86_32/cr
1424517   93000   78024 1595541  185895 bin.linux-x86_64/cr
1349955   52064   70692 1472711  1678c7 bin.linux-x86_x32/cr

The above shows the sizes of the 'cr' executable - the x32 variant certainly wins, and is even smaller than the x86_32 variant (due to more registers and better instruction layout).
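
To put a number on "wins", here is a small sketch that works out the relative saving from the "dec" column above. The figures are copied from the size output; the helper name is just for illustration.

```python
# Sketch: relative size saving of the x32 build, from the size(1) output above.
sizes = {  # "dec" column (text + data + bss), in bytes
    "x86_32": 1500515,
    "x86_64": 1595541,
    "x86_x32": 1472711,
}

def saving_percent(baseline, candidate):
    """Percentage by which candidate is smaller than baseline."""
    return 100.0 * (baseline - candidate) / baseline

for arch in ("x86_64", "x86_32"):
    pct = saving_percent(sizes[arch], sizes["x86_x32"])
    print(f"x32 is {pct:.1f}% smaller than {arch}")
```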

Here are the CPU benchmarks. I didn't run the x86_32 benchmark, since most people will run pure 64-bit desktops and use the 64-bit binary. One final note: CRiSP compiled cleanly, exactly as it does for the two other Linux architectures. *Except* for missing X11 libraries - i.e. I cannot build a "crisp" windowed GUI version of CRiSP (maybe I haven't installed the relevant packages, so I will take a look after this blog post to see if it is viable).

PERF3: 25 December 2013 23:14 v11.0.22a -- linux-x86_x32
  1) loop              Time: 5.00   3,635,000/sec
  2) macro_list        Time: 5.01      58,200/sec
  3) command_list      Time: 5.00       4,185/sec
  4) strcat            Time: 5.00      77,000/sec
  5) listcat           Time: 5.01         240/sec
  6) string_assign     Time: 5.00     448,800/sec
  7) get_nth           Time: 4.99       8,890/sec
  8) put_nth           Time: 5.00     142,340/sec
  9) if                Time: 5.00     559,000/sec
  10) trim             Time: 5.00     143,800/sec
  11) compress         Time: 5.00     116,000/sec
  12) loop_float       Time: 5.00   2,950,000/sec
  13) edit_file        Time: 5.00      80,600/sec
  14) edit_file2       Time: 5.00       2,592/sec
  15) macro_call       Time: 5.00      74,100/sec
  16) gsub             Time: 5.00     226,300/sec
  17) sieve            Time: 4.99           0/sec
                   Total: 100.980000  Elapsed:  1:42

PERF3: 25 December 2013 23:20 v11.0.22a -- linux-x86_64
  1) loop              Time: 5.00   3,300,000/sec
  2) macro_list        Time: 5.00      44,450/sec
  3) command_list      Time: 5.00       3,360/sec
  4) strcat            Time: 5.00      72,500/sec
  5) listcat           Time: 5.02         212/sec
  6) string_assign     Time: 5.00     437,200/sec
  7) get_nth           Time: 5.00       8,435/sec
  8) put_nth           Time: 4.99     129,000/sec
  9) if                Time: 5.00     546,000/sec
  10) trim             Time: 5.00     129,300/sec
  11) compress         Time: 5.00     118,600/sec
  12) loop_float       Time: 5.00   2,790,000/sec
  13) edit_file        Time: 5.00      88,600/sec
  14) edit_file2       Time: 5.01       2,340/sec
  15) macro_call       Time: 5.00      66,100/sec
  16) gsub             Time: 5.00     187,100/sec
  17) sieve            Time: 5.00           0/sec
                   Total: 101.420000  Elapsed:  1:42
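
For anyone wanting a single figure out of the two runs, here is a sketch that computes the per-test ratio (x32 over x86_64) and the geometric mean across tests. The numbers are copied from the runs above; sieve is left out because it reports 0/sec in both, and the choice of a geometric mean is mine rather than anything the benchmark itself does.

```python
# Sketch: per-test speed ratio of x32 over x86_64 from the two PERF3 runs.
import math

# (test name, x32 ops/sec, x86_64 ops/sec)
results = [
    ("loop", 3635000, 3300000),
    ("macro_list", 58200, 44450),
    ("command_list", 4185, 3360),
    ("strcat", 77000, 72500),
    ("listcat", 240, 212),
    ("string_assign", 448800, 437200),
    ("get_nth", 8890, 8435),
    ("put_nth", 142340, 129000),
    ("if", 559000, 546000),
    ("trim", 143800, 129300),
    ("compress", 116000, 118600),
    ("loop_float", 2950000, 2790000),
    ("edit_file", 80600, 88600),
    ("edit_file2", 2592, 2340),
    ("macro_call", 74100, 66100),
    ("gsub", 226300, 187100),
]

ratios = {name: x32 / x64 for name, x32, x64 in results}

# Geometric mean, so one very fast or very slow test does not dominate.
geomean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {ratio:6.3f}")
print(f"geometric mean {geomean:6.3f}")
```

A ratio above 1.0 means x32 was faster on that test. These are single runs, so small differences are within the noise.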

Posted at 23:22:19 by fox
  ptrace now available Tuesday, 24 December 2013  
Over recent weeks, I have been working on ptrace - not dtrace. ptrace is a clone of strace (or rather, strace is a clone of truss, and ptrace is a clone of truss - a very old and early tool of mine).

ptrace (like strace) lets you monitor a process or tree of processes and see the system calls. ptrace tries to combine the best bits of truss and strace, along with new functionality - very important functionality.

A particular target of interest is process-termination records. When a process terminates, there is a lot of valuable information available - but normally discarded. This includes:

  • Process size - current and peak
  • CPU time - user + system, along with child user + system
  • Open files - file descriptors in use, and peak open files
  • Parent process id
  • Context switches

There is a lot of information available; existing Unix tools, like the shell "time" or /bin/time command, make much of this available, but are usually focussed on the single command you run. Getting the tree of information from a command script (e.g. make) is not doable without a lot of work.
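
As a point of comparison, this is roughly what the single-command case looks like using the standard wait4() call, which hands back a resource usage record when the child is reaped. It is a sketch of the general mechanism, not of how ptrace collects its termination records, and run_and_measure is a name I made up for the example.

```python
# Sketch: collect termination statistics for one child process via wait4().
import os
import sys

def run_and_measure(argv):
    """Run argv as a child process and return (exit_status, rusage).

    run_and_measure is a hypothetical helper, for illustration only.
    """
    pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)
    _, status, usage = os.wait4(pid, 0)
    exit_status = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_status, usage

status, usage = run_and_measure([sys.executable, "-c", "sum(range(100000))"])
print("exit status      :", status)
print("peak RSS         :", usage.ru_maxrss)  # kilobytes on Linux
print("user CPU (s)     :", usage.ru_utime)
print("system CPU (s)   :", usage.ru_stime)
print("voluntary ctx    :", usage.ru_nvcsw)
print("involuntary ctx  :", usage.ru_nivcsw)
```

This gives one record for the process you launched yourself. It says nothing about which grandchild used the memory or when each process started and stopped, which is the gap a tree-aware tracer fills.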

I have used LD_PRELOAD in the past to collect this information, but this is fairly intrusive, since an LD_PRELOAD library sits inside the process. With ptrace, you can attach to a running process or launch the process yourself, and there are many options to control what to monitor and the amount of output, which makes it an invaluable tool in the toolkit. strace is a very good alternative, but ptrace purports to be better.

I hadn't touched my ptrace implementation in over 6 years - the code was very stale, lacking many system calls and missing the newer features of strace (do you know what the "-y" switch does? Neither did I till I looked at strace in depth).

ptrace comes with good documentation describing the features, but the primary switch is "-exit". When this is used, the process termination records are written and you can use the Perl script (psummary.pl) to print out the most interesting processes, or see a timeline showing the order of process forks and exits in a tree-like fashion, or modify the script to suit yourself.

I am putting this out as the first public release - I may release to github at a later date, but for now, this is a binary-only release. (I am delaying publication of the source simply because ptrace is not free of the CRiSP utility routines I use to wrap malloc and a few other functions.) The source code makefile is ugly - it used to work on Solaris, but I have given that up for now and concentrated on getting ptrace to work, to fulfil the goal of getting useful output.

Here is the link to ptrace. (It is available on my crisp.demon.co.uk site as well).

Download here

And here are some examples of the psummary.pl script:

/home/fox/src/trace@dixxy: psummary.pl
Number of processes      :       245
#1 Longest running:     3.322815  31025 /usr/bin/make
#2 Longest running:     3.283731  31029 /bin/dash
#3 Longest running:     3.271198  31033 /usr/bin/make
#1 Peak RSS:               28184  31209 /home/fox/crisp/bin.linux-x86_64/cr
#2 Peak RSS:                8988  31260 /home/fox/crisp/bin.linux-x86_64/hc
#3 Peak RSS:                8988  31261 /home/fox/crisp/bin.linux-x86_64/hc
Maximum file descriptor  :          14  31209 /home/fox/crisp/bin.linux-x86_64/cr
Maximum RSS              :        4692  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max VmStk                :         140  31033 /usr/bin/make
Max VmData               :           0  31025 /usr/bin/make
Max voluntarty ctx       :        6301  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max non-voluntarty ctx   :          68  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max CPU time             :        0.98  31025 /usr/bin/make
Max cumulative CPU time  :         1.1  31025 /usr/bin/make

/home/fox/src/trace@dixxy: psummary.pl -timeline | head -15
21:44:24.987056 start 31025 /usr/bin/make
21:44:25.009420 start 31026 /bin/rm
21:44:25.013167 end 31026 /bin/rm
21:44:25.014206 start 31027 /bin/rm
21:44:25.018092 end 31027 /bin/rm
21:44:25.019040 start 31028 /bin/rm
21:44:25.023004 end 31028 /bin/rm
21:44:25.025169 start 31029 /bin/dash
21:44:25.028427 start 31030 /bin/dash
21:44:25.028740 end 31030 /bin/dash
21:44:25.030161 start 31031 /bin/dash
21:44:25.030594 end 31031 /bin/dash
21:44:25.031529 start 31032 /usr/bin/basename
21:44:25.035048 end 31032 /usr/bin/basename
21:44:25.036802 start 31033 /usr/bin/make
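
The timeline records (timestamp, start/end, pid, command) are easy to post-process further. Here is a sketch that pairs up start and end records to get each process lifetime. It assumes exactly the four-column layout of the records shown, uses a few of them as sample input, and is not part of psummary.pl.

```python
# Sketch: pair start/end timeline records to compute process lifetimes.
from datetime import datetime

sample = """\
21:44:24.987056 start 31025 /usr/bin/make
21:44:25.009420 start 31026 /bin/rm
21:44:25.013167 end 31026 /bin/rm
21:44:25.025169 start 31029 /bin/dash
21:44:25.028427 start 31030 /bin/dash
21:44:25.028740 end 31030 /bin/dash
"""

def lifetimes(text):
    """Return {pid: (command, seconds)} for processes seen to start and end."""
    started = {}
    finished = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, event, pid, command = line.split(None, 3)
        when = datetime.strptime(stamp, "%H:%M:%S.%f")
        if event == "start":
            started[pid] = when
        elif event == "end" and pid in started:
            delta = when - started.pop(pid)
            finished[pid] = (command, delta.total_seconds())
    return finished

result = lifetimes(sample)
for pid, (command, seconds) in result.items():
    print(f"{pid} {command} ran for {seconds:.6f}s")
```

Processes still running at the end of the input (the make and dash parents here) simply have no entry. The timestamps carry no date, so a trace that crosses midnight would need extra handling.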

/home/fox/src/trace@dixxy: psummary.pl -top 5 -l
Number of processes      :       245
#1 Longest running:     3.322815  31025 /usr/bin/make
#2 Longest running:     3.283731  31029 /bin/dash
#3 Longest running:     3.271198  31033 /usr/bin/make
#4 Longest running:     2.871467  31077 /usr/bin/make
#5 Longest running:     0.808828  31104 /bin/dash
#1 Peak RSS:               28184  31209 /home/fox/crisp/bin.linux-x86_64/cr -batch -mregress
#2 Peak RSS:                8988  31260 /home/fox/crisp/bin.linux-x86_64/hc -compress -m prim/prim.hpj
#3 Peak RSS:                8988  31261 /home/fox/crisp/bin.linux-x86_64/hc -compress -m prog/prog.hpj
#4 Peak RSS:                8988  31262 /home/fox/crisp/bin.linux-x86_64/hc -compress -m relnotes/relnotes.hpj
#5 Peak RSS:                8988  31263 /home/fox/crisp/bin.linux-x86_64/hc -compress -m user/user.hpj
Maximum file descriptor  :          14  31209 /home/fox/crisp/bin.linux-x86_64/cr
Maximum RSS              :        4692  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max VmStk                :         140  31033 /usr/bin/make
Max VmData               :           0  31025 /usr/bin/make
Max voluntarty ctx       :        6301  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max non-voluntarty ctx   :          68  31209 /home/fox/crisp/bin.linux-x86_64/cr
Max CPU time             :        0.98  31025 /usr/bin/make
Max cumulative CPU time  :         1.1  31025 /usr/bin/make

The final piece of ptrace is to print out the peak CPU and memory information, as a guide to how many resources are needed to run a script (typically needed for a compile server or other shared compute resource).

If you are interested in the tool or the source, let me know by email (CrispEditor -a.t- gmail.com). If there is enough interest, I will take it to the next level.

Future ideas for this could be to add dtrace-like functionality (e.g. the ability to run a script on entry/exit to system calls), but I am trying not to mix the two systems for now. ptrace contains facilities for emulating ltrace, allowing tracing of arbitrary shared library entry points - you will see many libXXX.so files in the bin directory - but these aren't debugged as yet, and that is not a goal at present.

If people have other "I wish ..." ideas, please forward them to me. I may have the guts of such tools, or have a way to link ideas together.

Happy Xmas!

Posted at 18:57:21 by fox
  dtrace, ptrace, strace, xtrace, fatrace, ... Saturday, 07 December 2013  
I have been working on my ptrace tool (very similar to strace, but does more... hopefully!). Of course, there are many tracing tools out there, but I was intrigued when I decided to try "ftrace", which wasn't on my system, and got the following recommendations...

/home/fox/src/trace@dixxy: ftrace
No command 'ftrace' found, did you mean:
 Command 'fstrace' from package 'openafs-client' (universe)
 Command 'mftrace' from package 'mftrace' (universe)
 Command 'dtrace' from package 'systemtap-sdt-dev' (universe)
 Command 'itrace' from package 'irpas' (multiverse)
 Command 'ltrace' from package 'ltrace' (main)
 Command 'fatrace' from package 'fatrace' (universe)
 Command 'strace' from package 'strace' (main)
 Command 'btrace' from package 'blktrace' (universe)
 Command 'rtrace' from package 'radiance' (universe)
 Command 'xtrace' from package 'xtrace' (universe)
 Command 'mtrace' from package 'libc-dev-bin' (main)
ftrace: command not found

Ok - so there are a lot. What do they do? I haven't tried all of them, but I tried a few:

xtrace - an X11 tracing utility.

fatrace - a "who is writing to the disk" utility. Very nice. It uses system calls (fanotify_init and fanotify_mark) to access this data - I wasn't aware of these syscalls, but now I am.

ltrace - a tool to monitor shared library calls. dtrace can do this, but ltrace is nicer/easier for the simple scenarios. (My ptrace can do this too, but it needs more work to verify it is still functional.)

Then there's dtrace (systemtap-sdt-dev) - from the man page, it just seems to be a tool to define systemtap equivalents to DTrace static probes.

Posted at 23:31:24 by fox