x86_x32 instruction set | Wednesday, 25 December 2013 |
In general, I think support for this is a good thing. We mostly won't use it on our desktops, and probably not even on our Androids - the scope for confusion, missing libraries, and other breakage is huge.
For those that don't know - the Intel/AMD chips support classic i386 32-bit binaries and instructions, whilst x86_64 supports full 64-bit operation. i386 mode is important (or was important) in the transition to full 64-bit chips, since at the time the OS, compilers, and apps didn't support 64-bit, and not everything could be recompiled "overnight".
Getting a new architecture ready is a lot of work - we see this today in the race for ARM-64. Typically, the OS has to support the architecture, followed by the compilers. Actually, the compilers have to come first to build the OS, so there can be a standoff until the two stabilise.
Next comes the glibc library, which provides the interface to the OS, then pretty much a bottom-up recompile of every library (X11, GTK, GNOME, KDE, Qt, etc.), and lastly the apps: initially the core OS/distribution apps, and then vendor apps.
This can take a while to stabilise across the full suite of apps - maybe 1-2 years, optimistically.
Given the maturity of the 64-bit architectures, nobody is in a race to support the x32 variant. x32 is simply a 4GB-address-space version of x86_64: all the instructions and registers of the 64-bit architecture are available, but the address space is limited, which in turn halves the size of pointers. Hence, smaller memory demands. In today's multi-gigabyte desktops, that's not a big issue, but it can lead to smaller binaries - smaller footprints, faster loading, fewer pages of memory, fewer TLB and cache misses. The fact that more code fits in the instruction cache is important and can give additional gains.
In general, the reported gains are small - maybe a few percent - unless the app is "pointer heavy", with lots of pointers. Many apps have large datasets in memory (e.g. XML, text, or web pages), but sophisticated apps will have big, complex structures with lots of pointers, so the savings there may be worthwhile.
I tried recompiling CRiSP as a new linux-x86_x32 architecture to see what the real-world effect is. (Side note: I updated my ptrace implementation to support the x32 architecture - it required very few lines of code; the kernel system call interface is nearly identical to x86_64, bar a quirk in how syscalls are encoded.)
Recompiling (and defining) a new CRiSP variant took around 5-10 minutes of effort - although I had to put in a workaround for <sys/sysctl.h>, which has a pragma complaining that the sysctl() system call is not supported for x32 apps. (I don't think I use it, but I had to conditionalise out the #include.)
The results are interesting - x32 seems to win compared to x86_64, suggesting a 5-10% performance improvement. I attach two runs of my performance benchmark in CRiSP macros - what these macros do doesn't matter much, except to note that each test runs for about 5s, counting how many of certain operations can be performed. How this translates into real-world performance is hard to say. CRiSP is pretty efficient anyhow, and if you run CRiSP, your CPU is going to be mostly idle; when it isn't, maybe you get up to 5% better performance - i.e. you wouldn't notice it. If we were optimising, for example, battery life on a mobile, tablet or laptop, then having x32 could be beneficial (not just for CRiSP, but for the OS and all the other apps you use). Imagine an extra 30-60 minutes of battery life on your portable device! Worth having, but it may not happen.
Anyhow, here's the relevant benchmark data:
   text    data    bss     dec     hex filename
1379459   50844  70212 1500515  16e563 bin.linux-x86_32/cr
1424517   93000  78024 1595541  185895 bin.linux-x86_64/cr
1349955   52064  70692 1472711  1678c7 bin.linux-x86_x32/cr
The above shows the sizes of the 'cr' executable - the x32 variant certainly wins (due to more registers and better instruction layout compared to the x86_32 variant).
Here are the CPU benchmarks. I didn't run the x86_32 benchmark, since most people will run pure 64-bit desktops and use the 64-bit binary. One final note: CRiSP compiled cleanly, and identically to the two other Linux architectures - *except* for missing X11 libraries, i.e. I cannot build a windowed GUI version of CRiSP (maybe I haven't installed the relevant packages; I will take a look after this blog post to see if it is viable).
PERF3: 25 December 2013 23:14 v11.0.22a -- linux-x86_x32
 1) loop            Time: 5.00   3,635,000/sec
 2) macro_list      Time: 5.01      58,200/sec
 3) command_list    Time: 5.00       4,185/sec
 4) strcat          Time: 5.00      77,000/sec
 5) listcat         Time: 5.01         240/sec
 6) string_assign   Time: 5.00     448,800/sec
 7) get_nth         Time: 4.99       8,890/sec
 8) put_nth         Time: 5.00     142,340/sec
 9) if              Time: 5.00     559,000/sec
10) trim            Time: 5.00     143,800/sec
11) compress        Time: 5.00     116,000/sec
12) loop_float      Time: 5.00   2,950,000/sec
13) edit_file       Time: 5.00      80,600/sec
14) edit_file2      Time: 5.00       2,592/sec
15) macro_call      Time: 5.00      74,100/sec
16) gsub            Time: 5.00     226,300/sec
17) sieve           Time: 4.99           0/sec
Total: 100.980000 Elapsed: 1:42
PERF3: 25 December 2013 23:20 v11.0.22a -- linux-x86_64
 1) loop            Time: 5.00   3,300,000/sec
 2) macro_list      Time: 5.00      44,450/sec
 3) command_list    Time: 5.00       3,360/sec
 4) strcat          Time: 5.00      72,500/sec
 5) listcat         Time: 5.02         212/sec
 6) string_assign   Time: 5.00     437,200/sec
 7) get_nth         Time: 5.00       8,435/sec
 8) put_nth         Time: 4.99     129,000/sec
 9) if              Time: 5.00     546,000/sec
10) trim            Time: 5.00     129,300/sec
11) compress        Time: 5.00     118,600/sec
12) loop_float      Time: 5.00   2,790,000/sec
13) edit_file       Time: 5.00      88,600/sec
14) edit_file2      Time: 5.01       2,340/sec
15) macro_call      Time: 5.00      66,100/sec
16) gsub            Time: 5.00     187,100/sec
17) sieve           Time: 5.00           0/sec
Total: 101.420000 Elapsed: 1:42
ptrace now available | Tuesday, 24 December 2013 |
ptrace (like strace) lets you monitor a process, or a tree of processes, and see the system calls being made. ptrace tries to combine the best bits of truss and strace, along with important new functionality.
A particular target of interest is process-termination records. When a process terminates, there is a lot of valuable information available - but normally discarded. This includes:
- Process size - current and peak
- CPU time - user + system, along with child user + system
- Open files - file descriptors in use, and peak open files
- Parent process id
- Context switches
There is a lot of information available; existing Unix tools, like the shell's "time" built-in or the /bin/time command, make much of this available, but usually focused on the single command you run. Getting the tree of information from a command script (e.g. make) is not doable without a lot of work.
I have used LD_PRELOAD in the past to collect this information, but that is fairly intrusive, since the preloaded library sits inside the process. With ptrace, you can attach to a running process or launch the process, and with many options to control what to monitor and the amount of output, it is an invaluable tool in the toolkit. strace is a very good alternative, but ptrace purports to be better.
I hadn't touched my ptrace implementation in over 6 years - the code was very stale, lacking many system calls, and missing the newer features of strace (do you know what the "-y" switch does? Neither did I until I looked at strace in depth).
ptrace comes with good documentation describing the features, but the primary switch is "-exit". When this is used, the process termination records are written, and you can use the Perl script (psummary.pl) to print out the most interesting processes, see a timeline showing the order of process forks/exits in a tree-like fashion, or modify the script to suit yourself.
I am putting this out as the first public release - I may release it to github at a later date, but for now this is a binary-only release. (I am delaying publication of the source simply because ptrace is not yet free of the CRiSP utility routines I use to wrap malloc and a few other functions, and the makefile is ugly - it used to work on Solaris, but I have given that up for now and concentrated on getting ptrace to work, to fulfil the goal of getting useful output.)
Here is the link to ptrace. (It is available on my crisp.demon.co.uk site as well).
And here are some examples of the psummary.pl script's output:
/home/fox/src/trace@dixxy: psummary.pl
Number of processes     : 245
#1 Longest running: 3.322815 31025 /usr/bin/make
#2 Longest running: 3.283731 31029 /bin/dash
#3 Longest running: 3.271198 31033 /usr/bin/make
#1 Peak RSS: 28184 31209 /home/fox/crisp/bin.linux-x86_64/cr
#2 Peak RSS: 8988 31260 /home/fox/crisp/bin.linux-x86_64/hc
#3 Peak RSS: 8988 31261 /home/fox/crisp/bin.linux-x86_64/hc
Maximum file descriptor : 14 31209 /home/fox/crisp/bin.linux-x86_64/cr
Maximum RSS             : 4692 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max VmStk               : 140 31033 /usr/bin/make
Max VmData              : 0 31025 /usr/bin/make
Max voluntarty ctx      : 6301 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max non-voluntarty ctx  : 68 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max CPU time            : 0.98 31025 /usr/bin/make
Max cumulative CPU time : 1.1 31025 /usr/bin/make

/home/fox/src/trace@dixxy: psummary.pl -timeline | head -15
21:44:24.987056 start 31025 /usr/bin/make
21:44:25.009420 start 31026 /bin/rm
21:44:25.013167 end   31026 /bin/rm
21:44:25.014206 start 31027 /bin/rm
21:44:25.018092 end   31027 /bin/rm
21:44:25.019040 start 31028 /bin/rm
21:44:25.023004 end   31028 /bin/rm
21:44:25.025169 start 31029 /bin/dash
21:44:25.028427 start 31030 /bin/dash
21:44:25.028740 end   31030 /bin/dash
21:44:25.030161 start 31031 /bin/dash
21:44:25.030594 end   31031 /bin/dash
21:44:25.031529 start 31032 /usr/bin/basename
21:44:25.035048 end   31032 /usr/bin/basename
21:44:25.036802 start 31033 /usr/bin/make
/home/fox/src/trace@dixxy: psummary.pl -top 5 -l
Number of processes     : 245
#1 Longest running: 3.322815 31025 /usr/bin/make
#2 Longest running: 3.283731 31029 /bin/dash
#3 Longest running: 3.271198 31033 /usr/bin/make
#4 Longest running: 2.871467 31077 /usr/bin/make
#5 Longest running: 0.808828 31104 /bin/dash
#1 Peak RSS: 28184 31209 /home/fox/crisp/bin.linux-x86_64/cr -batch -mregress
#2 Peak RSS: 8988 31260 /home/fox/crisp/bin.linux-x86_64/hc -compress -m prim/prim.hpj
#3 Peak RSS: 8988 31261 /home/fox/crisp/bin.linux-x86_64/hc -compress -m prog/prog.hpj
#4 Peak RSS: 8988 31262 /home/fox/crisp/bin.linux-x86_64/hc -compress -m relnotes/relnotes.hpj
#5 Peak RSS: 8988 31263 /home/fox/crisp/bin.linux-x86_64/hc -compress -m user/user.hpj
Maximum file descriptor : 14 31209 /home/fox/crisp/bin.linux-x86_64/cr
Maximum RSS             : 4692 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max VmStk               : 140 31033 /usr/bin/make
Max VmData              : 0 31025 /usr/bin/make
Max voluntarty ctx      : 6301 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max non-voluntarty ctx  : 68 31209 /home/fox/crisp/bin.linux-x86_64/cr
Max CPU time            : 0.98 31025 /usr/bin/make
Max cumulative CPU time : 1.1 31025 /usr/bin/make
The final piece of ptrace is printing out the peak CPU/memory information, to help gauge how many resources are needed to run a script (typically needed for a compile server or other shared compute resource).
If you are interested in the tool or the source, let me know by email (CrispEditor -a.t- gmail.com). If there is enough interest, I will take it to the next level.
A future idea is to add dtrace-like functionality - e.g. the ability to run a script on entry/exit to system calls - but I am trying not to mix the two systems for now. ptrace also contains facilities for emulating ltrace (tracing arbitrary shared-library entry points - you will see many libXXX.so files in the bin directory), but these aren't debugged as yet, and that is not a goal at present.
If people have other "I wish ..." ideas, please forward them to me. I may have the guts of such tools, or have a way to link ideas together.
Happy Xmas!
dtrace, ptrace, strace, xtrace, fatrace, ... | Saturday, 07 December 2013 |
/home/fox/src/trace@dixxy: ftrace
No command 'ftrace' found, did you mean:
 Command 'fstrace' from package 'openafs-client' (universe)
 Command 'mftrace' from package 'mftrace' (universe)
 Command 'dtrace' from package 'systemtap-sdt-dev' (universe)
 Command 'itrace' from package 'irpas' (multiverse)
 Command 'ltrace' from package 'ltrace' (main)
 Command 'fatrace' from package 'fatrace' (universe)
 Command 'strace' from package 'strace' (main)
 Command 'btrace' from package 'blktrace' (universe)
 Command 'rtrace' from package 'radiance' (universe)
 Command 'xtrace' from package 'xtrace' (universe)
 Command 'mtrace' from package 'libc-dev-bin' (main)
ftrace: command not found
OK - so there are a lot of them; what do they do? I haven't tried all of them, but here are a few:
xtrace - an X11 tracing utility.
fatrace - a "who is writing to the disk" utility. Very nice. It uses the fanotify_init and fanotify_mark system calls to access this data - I wasn't aware of these syscalls, but now I am.
ltrace - a tool to monitor shared-library calls. dtrace can do this, but ltrace is nicer/easier for the simple scenarios. (My ptrace can do this too, but it needs more work to verify it is still functional.)
Then there's dtrace (systemtap-sdt-dev) - from the man page, it just seems to be a tool to define systemtap equivalents to DTrace static probes.