Code copying (cut + paste) Sunday, 21 July 2013  
I recently started looking into the concept of code copying. The idea is that in a large body of code, it is normal to cut/paste blocks of code, typically small "idioms", e.g. iterate over a list, or connect to a database etc.

There are some tools out there, but I wanted to add this to CRiSP, and thought it wouldn't be too hard. It's not. But the ideas behind it are interesting. If you have never thought about this, then go and do so.

Some people use tools like this to detect plagiarism - that's nice, and may be warranted, but even on your own code base, or someone else's, it gives rise to code quality metrics. (Code quality metrics cannot tell you whether code is good or bad; they can merely hint, and give the reader a chance to inspect areas of code - with some clues about where to start).

Anyway, I ran this on the CRiSP code base - and the results are surprising. CRiSP is closed source (there is an open source very early version out there). There is no "cut/paste" code from external sources, other than the odd API call. But what I found was lots of examples of "code copying". (It reminds me that many algorithms are the same, albeit in different flavors, and I may copy code, but usually this is no more than a few lines, e.g. to iterate over a list or extract some information from a data structure).

Firstly, looking at 100k lines of code over 79 source files, it took 15s of CPU to do a scan. My code is not optimised, and the task itself is CPU intensive.

Next, even in this small sample, there are code similarities. To find them, one effectively needs to compare each region of lines against every other. This is an O(n^3) or worse operation - using a naive algorithm. (A file of 1000 lines, for instance, contains about half a million regions, n(n+1)/2 = 500,500 - consider line #1 can be in a group by itself, or lines 1+2, or 1+2+3, etc).
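To put a number on the region count, here is a small sketch. The function names are mine, purely for illustration. A "region" is taken to mean any contiguous run of lines, so a file of n lines has n(n+1)/2 of them, and insisting on a minimum length trims that only slightly.

```python
def count_regions(n_lines, min_lines=1):
    """Number of contiguous line ranges containing at least min_lines lines."""
    # A range of exactly k lines can start at n_lines - k + 1 places, so the
    # total is 1 + 2 + ... + longest, where longest = n_lines - min_lines + 1.
    longest = n_lines - min_lines + 1
    if longest <= 0:
        return 0
    return longest * (longest + 1) // 2


def count_regions_brute_force(n_lines, min_lines=1):
    """Same count by enumerating every (start, end) pair; for checking only."""
    total = 0
    for start in range(n_lines):
        for end in range(start, n_lines):
            if end - start + 1 >= min_lines:
                total += 1
    return total
```

For 1000 lines this gives 500,500 regions, or 491,536 if only regions of 10 or more lines count. Comparing every region against every other is then a very large number of comparisons, which is why the naive approach hurts.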

Comparing regions of size "1" is silly - all the blank lines or "{" lines would show as code dups. So we need to exclude trivial blocks. I set my block to 10 or more consecutive lines. And I found genuine cut/paste examples.
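A minimal sketch of the idea, and not the CRiSP implementation: rather than comparing every region against every other, slide a fixed window of 10 lines over each file and remember where each window was seen. The function name, the `files` dictionary and the whitespace stripping are all my assumptions for illustration.

```python
from collections import defaultdict


def find_duplicate_blocks(files, min_lines=10):
    """Find runs of min_lines consecutive lines that occur more than once.

    files maps a file name to its list of lines. Returns a list of groups,
    each group being the (file name, 1-based start line) places where one
    particular block was seen.
    """
    seen = defaultdict(list)
    for name, lines in files.items():
        # Ignore indentation and trailing whitespace when comparing lines.
        normalized = [line.strip() for line in lines]
        for start in range(len(normalized) - min_lines + 1):
            window = tuple(normalized[start:start + min_lines])
            seen[window].append((name, start + 1))
    return [places for places in seen.values() if len(places) > 1]
```

A match longer than the window shows up as several overlapping hits, so a real tool would merge those into one report. Ten consecutive blank lines would also match, so some filtering of trivial content is still needed.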

Looking at the dups, they are mostly insignificant - the longest is a 21 line match.

Posted at 20:08:09 by fox | Permalink
  Performance - Javascript and Mobile Monday, 15 July 2013  

The excellent article by Drew Crawford is a good read. It goes into detail about why Javascript on mobile is slow, and why it's so different compared to desktop Javascript.

I thought I would add a comment, since performance has always been one of my hobbies. Performance is what drives a lot of programmers - how to optimise code to be small or efficient. (Nowadays there is a 3rd axis to this - how to be power-efficient).

When coding, one starts with simple code and algorithms. Code is blindingly fast because a lot of the bureaucratic "noise" is omitted. This "noise" is what is responsible for error checking and validation, and creeps into code over time, to address issues - e.g. buffer overflows, race conditions or locking.

For small pieces of code or algorithms, it is possible to be close to optimal. E.g. "sorting" is a classic standard algorithm - people rarely implement their own but use the classes in their language of choice to do this. Sorting is complicated - you either optimise for the small set scenario or the large set, and the differing algorithms have different best/worst/average case performance and/or memory use.

As a simple example of this, consider 256x256 bit multiplication. 256 bits is longer than any normal processor word size, so you could just implement multiplication as a series of shifts and adds, or you could just create a massive 2^256 square table and index directly into the table to get O(1) performance (at a cost of memory which is effectively infinite - each side of the table is 2^256, about 10^77, so the full table has 2^512, about 10^154, entries, i.e. far more than the roughly 10^80 atoms usually quoted for the observable universe).
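Here is a rough sketch of the shift-and-add side of that trade off. Python already has arbitrary precision integers, so this is purely illustrative, and the function name is mine. The last two lines work out how big the lookup table would have to be.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only shifts and adds."""
    result = 0
    while b:
        if b & 1:        # lowest bit of b is set: add the shifted copy of a
            result += a
        a <<= 1          # a doubles on each step
        b >>= 1          # move on to the next bit of b
    return result


# One table entry for every pair of 256-bit operands.
TABLE_ENTRIES = 2 ** 256 * 2 ** 256
TABLE_ENTRIES_DIGITS = len(str(TABLE_ENTRIES))   # number of decimal digits
```

The loop runs once per bit of `b`, so at most 256 times for 256-bit operands. The table, by contrast, needs 2^512 entries, a 155 digit number, before even counting the 512 bits each entry would hold.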

Javascript does an incredible job. Javascript is like a Turing machine - you can do anything with it (implement an x86 emulator, Linux emulator, MSDOS emulator, or a full blown graphics package - all have been done). Yes, it can be done. In each case, one has to ask "why?"

Drew's article makes excellent comparisons about memory use and why mobiles fall short at present. If Javascript is your only language, you will do whatever it takes. With millions of programmers, the boundaries have yet to be found, but the cost is enormous - our battery life, the power used by our laptops, and the fact that 8GB RAM is really a minimum when running a desktop browser.

But some things are easier to do in Javascript - writing HTML/CSS/DOM based web pages is very powerful, and easy to change when you change your mind. Doing the same in C#, Java, C++ is painful - assuming you have a stable graphics library. (FYI, CRiSP avoids the standard visual class libraries and implements its own; it had to - since none of these existed way back when CRiSP was being born; this in turn has led to CRiSP being available across platforms and able to work anywhere - there is no requirement to install package X to run it. On the other hand this adherence to basic principles comes with an implementation price, and CRiSP needs surgery - it actually needs to use more memory and CPU to compete with its competitors; CRiSP still fits into the iCache of most CPUs and is fast in many many areas).

CRiSP's "speed" is impressive (hey, this is an advert, after all!) But there are trade offs. There are things which CRiSP will perform badly at (if anyone is interested, I can demonstrate some of them). These deficiencies are similar to many other apps. For example, in Perl it is easier to write regexp matches than direct string comparisons. That's simply horrible - the Perl interpreter has to work overtime to optimise the regexp-that-isn't cases (and it does a good job). But this "costs".

CRiSP does software virtual memory - allowing huge files to be edited, without needing memory proportional to the size of the file. For small files you don't notice this; for large files, the I/O overhead exceeds your ability to notice what is going on. But some pathological cases will show what's going on.
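To give a flavour of the general technique, here is a sketch. It is emphatically not how CRiSP does it: the class name, page size and least-recently-used eviction are all my assumptions. It reads a file in fixed size pages on demand and keeps only a handful of them in memory.

```python
import os
from collections import OrderedDict


class PagedFile:
    """Read-only view of a file that keeps at most max_pages pages in memory."""

    def __init__(self, path, page_size=4096, max_pages=8):
        self._fh = open(path, "rb")
        self._page_size = page_size
        self._max_pages = max_pages
        self._cache = OrderedDict()   # page number -> bytes, least recent first
        self.size = os.path.getsize(path)

    def _page(self, number):
        if number in self._cache:
            self._cache.move_to_end(number)    # mark as recently used
            return self._cache[number]
        self._fh.seek(number * self._page_size)
        data = self._fh.read(self._page_size)
        self._cache[number] = data
        if len(self._cache) > self._max_pages:
            self._cache.popitem(last=False)    # evict the least recently used
        return data

    def read(self, offset, length):
        """Return up to length bytes starting at offset."""
        end = min(offset + length, self.size)
        chunks = []
        while offset < end:
            number, start = divmod(offset, self._page_size)
            take = min(self._page_size - start, end - offset)
            chunks.append(self._page(number)[start:start + take])
            offset += take
        return b"".join(chunks)

    def cached_pages(self):
        return len(self._cache)

    def close(self):
        self._fh.close()
```

Memory use is bounded by `page_size * max_pages` no matter how large the file is. The pathological case is an access pattern that touches more pages than the cache holds, over and over, so that every read goes back to disk.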

As an experiment, I test out many competing text editors to see "how they are implemented" and it's not difficult to determine how, by suitable probing and pathological tests. Most software evolves over time so that it is doing things the original authors never intended. (Consider Excel - a complaint for many years was its inability to edit files with more than 64k rows - why would anyone do that?! But they do because this data comes from a database or other source; many people have purchased CRiSP exactly because of Excel's limitation).

It is admirable that mobiles can run in 512MB-2GB of RAM, but most of us know the annoying design decisions which prevent us, the users, from doing a better job of optimising our working set than the generic algorithms in iOS or Android (which in turn do a good job, but never quite good enough). Most of the time this doesn't matter. And in 2013, the minimum spec mobile is impressive. If one wrote a web app 2 years ago, the hardware has moved on - what was unlikely a while ago is now "the norm". (This is no comfort for certain websites which consume huge amounts of bandwidth or waste the user's time with a huge graphics oriented page, which contains one or two paragraphs of text followed by a "Next" button to take you to another equally obnoxious page with no information content.)

Some sites like, are good - the information density is palatable. Slashdot, after many false starts, has a better mobile site than the desktop version (no server side fetches to read the story and comments, and the annoying sideways scrolling problem has disappeared). Hm. Just tried, but that fails on my desktop browser.

Posted at 21:49:11 by fox | Permalink
  Why is security hard? Wednesday, 10 July 2013  
I was reading a series of postings on a security blog, and wondering at how much interest there is in this. So, I thought I would embarrass myself and write my own monologue.

Why is security hard? It's easy to be dismissive or make fun, but the answer is similar to that of entropy increasing. It's easier to do security wrong than it is to do it right.

Likewise, there are more ways to create a broken or buggy application than there are ways to do it right.

With the increasing hordes of people trying to do harm in the security world, equally there are people learning how to prevent the harm.

I saw a reference the other day that as each exploit or hole is patched, there are more to take its place (e.g. jailbreaks or other similar holes). Every segmentation violation, or unanticipated exception, is problematic - potentially a way to tunnel into an application. The reference to there being tens of thousands of exploits left to use is worrying. It shows that even for a brand spanking new, fully up to date package, the next hole is just around the corner.

There are great initiatives, software packages and development methodologies to increase the quality of software, but these may be insufficient - or worse, they are sufficient, but expose so much "noise" that they get ignored.

It *should* be possible to automate the detection of software weaknesses, but it's not clear whether this can win against the frailty of the humans who write code and the ease with which these vulnerabilities can be created.

At the heart here is human vanity. Coding is like a drug - you get hooked; you get sucked in when the first dialog box appears, or the first correct answer appears. We stop looking after that. It works! And it requires a detached mind and lack of pressure to look below the surface, and discover that even if your app can do "2+2", it might not handle "3+3" (a contrived but realistic example, especially if you were working on a floating point or bignum package).

Ever tried to parse an IP address in C/C++? Try and write a parser, e.g. in C. And wonder what happens when your "char" does sign extension, or you move across compilers or standards, where the rules changed. Or where you went from 8-bit, to 16-bit, to 32-bit, to 60-bit, to 64-bit CPU architectures. Plug that IP address encoder/decoder into an encryption utility, and you won't be able to tell the difference between "good" and "bad". (Took me ages to debug the CRiSP license manager and remove these kinds of issues when faced with IP addresses that looked like signed numbers, or other non-standards compliant code - even though I thought I knew what I was doing).
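To make the sign extension trap concrete, here is a sketch in Python that mimics what a signed "char" does in C. It is not the license manager code, all the names are mine, and the helper `as_signed_char` only imitates the C behaviour. An octet of 128 or more turns negative, and when it is widened and OR'ed into the address it wipes out the octets already packed.

```python
MASK32 = 0xFFFFFFFF


def as_signed_char(byte):
    """Mimic a C signed char: values 128..255 come out negative."""
    return byte - 256 if byte > 127 else byte


def pack_naive(octets):
    """Mimic `addr = (addr << 8) | c` in C, where c is a signed char."""
    addr = 0
    for octet in octets:
        c = as_signed_char(octet)
        # A negative c widens to 0xFFFFFFxx, so the OR sets all the high bits.
        addr = ((addr << 8) | (c & MASK32)) & MASK32
    return addr


def pack_correct(octets):
    """Same packing, but each octet is masked to 8 bits first."""
    addr = 0
    for octet in octets:
        addr = ((addr << 8) | (octet & 0xFF)) & MASK32
    return addr


def parse_ipv4(text):
    """Parse dotted-decimal text into four octets, rejecting anything odd."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated fields: %r" % text)
    octets = []
    for part in parts:
        # ASCII digits only: no signs, no spaces, no empty fields.
        if not (part.isascii() and part.isdigit()):
            raise ValueError("bad field %r in %r" % (part, text))
        value = int(part)
        if value > 255:
            raise ValueError("field out of range in %r" % text)
        octets.append(value)
    return octets
```

With the octets of 10.0.0.192, `pack_correct` gives 0x0A0000C0, while `pack_naive` gives 0xFFFFFFC0, i.e. 255.255.255.192. Addresses where every octet is below 128 come out the same both ways, which is exactly why this kind of bug survives testing. (This sketch accepts leading zeros such as "010" as decimal 10; some C library routines treat that as octal, which is yet another trap.)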

As security and safety become more and more problematic, most people don't understand it. (If you believe you understand security... you don't. Rule #1!) The best security people (likely - I don't know) understand the area of human psychology, armed with many tools to tackle the mundane to the super serious.

I write this, as I stare at the certifications available for security, and wonder: are you more worried about those that proclaim themselves as security people armed with certificates, or those who declare neither knowledge nor anything about themselves?

Meanwhile...I go hunt for more bugs in my own code. Or shall I go web surfing...

Posted at 23:30:37 by fox | Permalink