CRiSP/crtags Optimisation | Saturday, 08 November 2014 |
The initial implementation is getting on for nearly 20y old. The goal originally was to provide the "ctags" facility of vi, but better. Machines of the day were looking like 4-16MB of RAM rather than 4-16GB of RAM which is common today, so effort was made to optimise space usage. The crtags file format is a series of sections of files and items which are tagged. It has been optimised - right from the beginning, to avoid optimise space usage, and avoid bad paging behavior. (Now, a distant artifact ! Systems rarely page or are memory constrained). This attempt to optimise memory goes back to the 1MB machines that CRiSP was originally built on. These optimisations are no longer necessary - but removing them would only offer a small change in performance.
crtags is designed to work with reasonably sized projects and directories. It takes a few seconds to scan the nearly 8000 files in the CRiSP source tree, and I regularly use it on the Linux kernel. The 3.16.1 kernel has 47426 files in it. Scanning that takes a little while. I have benchmarked this over the years.
Recently I did some more work to look at the performance and optimisation facilities in crtags. I collected a series of Linux kernels - so I would have a good/large test case - about 500MB of source files, 67356 files in all. On my i7 laptop (crtags is single threaded), it was taking about 2m20s to scan the files. On investigating the performance, I could see that we had an O(n^2) algorithm on filename matching. This is silly, given the complexity of the language parsers - that the mere filenames were using a lot of the CPU.
I modified the code to put in a hash table for filename matching, and this gave a huge win - down to about 23s for scanning the same files - about a 7x improvement in speed.
In looking at crtags, most of the processing is a constant per file - and each file is handled, one after the other. This opens up a huge win by multithreading the code. Potentially an Nx speedup, on an N cpu system. The code is difficult to convert to multithreading - it would require a lot of edits and refactoring, to ensure each thread is avoiding global state - a common reason why converting a non-threaded application to multithreaded is so difficult.
Its depressing how little attention the C standards bodies and compiler writers have, for converting non-threaded code to threaded. Really, there are a set of transformations (refactoring) and one would think that tools could help identify the major areas at issue (use of global variables, and use of "static" variables). I may create a refactoring macro in CRiSP to handle this.
On Unix, using a fork/join model of operation, one can create the equivalent of a multithreaded code, by use of fork() and wait() system calls. For example, divide all the files to be processed up, into separate groups, processed by individual CPUs. Then the issue of global state and locking disappears - at the expensive of more work on the "join" or merge at the end of processing.
I have modified crtags to use this fork/join model (it is a command line option, and not enabled by default), and reran my test. The above test went from 23s to 4s by using "crtags -j 8" - using the 4 real and 4 hyperthread cpu's on my i7. About a 6x performance increase. (The final code will be slower due to the lack of a merge).
So we went from 2m20s to 4s - a 35x speed up, with just a handful of lines of code.
The depressing thing about Windows (and this is late 2014) is that it still does not support fork(). It does support threads. So the above code will have no effect on Windows, and real threading will have to be implemented or an alternate way of achieving the same result.
I do fail to understand why Windows cant implement fork() - from the user space point of view, there is almost nothing to implement. From the kernel point of view, its not a huge amount of code. Granted, Windows processes may carry more state and forking may be more expensive if Windows could do it, but that would be such a big benefit when writing portable code or porting code to Windows. Oh well. (Cygwin supports fork(), and it is hugely expensive in operation, since it relies on software to copy huge blocks of memory around, rather than relying on the CPU's MMU to do copy-on-write (COW) operation - the key to why fork() is so efficient on Unixes).
Having said that, fork() on Linux is not brillliantly fast. Despite processors being so much faster than years of old - fork() seems to be getting slower, possibly related to the need for all CPUs to synchronise MMU and other state, rather than fork() itself getting more complicated.
To end users of CRiSP, they may see the initial performance optimisation, but unless they are working on extremely huge projects, they may hardly ever notice the change put in place. Also note, that in CRiSP, one rarely tags an entire project - CRiSP does incremental updates to the tag database, as you are editing/saving files.