Perl’s motto “There’s more than one way to do it” applies to verification too. The proliferation of formal methods may seem haphazard to greenhorns; the claimed differences between variants of the same technique exaggerated. Granted, not every improvement is a groundbreaking breakthrough (anyway, not most of the time). But the space of solutions to the verification problem is huge and the theoretical limits outstanding; hence the interesting trade-offs are many, each sensible given different constraints. One size does not fit all.

Nadia‘s guest post about dimensions in verification covered the big picture. The follow-up point I’d like to make is that there’s a significant variability also between variants of the same verification technique. Francesco Logozzo‘s post about abstract interpretation triggered these thoughts of mine; in fact, I’m going to build on his example to illustrate my point.

Francesco presents the example of a routine that computes the maximum element in an array, using a for loop that scans the array from its first to its last element, to show that his CodeContracts static checker — based on abstract interpretation — is not limited to dealing with “simple” properties such as nonnullness: it is capable of inferring the complete functional postcondition of max, namely (using a more readable syntax):

\forall k\colon 0 \leq k < \texttt{a.Length} \Longrightarrow \texttt{a}[k] \leq \texttt{Result}[/latex] [latex]\exists k\colon 0 \leq k < \texttt{a.Length} \wedge \texttt{a}[k] = \texttt{Result}[/latex]

This is remarkable, but let's try to modify the example to see where we can go from here. First, encode max as a function that operates on an attribute a by scanning it from last to first. (The following code is C#.)

Function max_downward and the original one max have the very same input/output behavior, and indeed they implement two variants of the same simple algorithm. Ran on max_downward, the CodeContracts checker correctly infers that Result is an element of a (the existentially quantified postcondition above), and that it is greater than or equal to a's last element, but it falls short of the complete functional postcondition (it misses the universally quantified postcondition).

I don't know the details of CodeContracts checker's implementation of an abstract interpreter, and hence I cannot track down precisely the source of this behavior. However, there is no reason to be surprised about this restriction: any sound verification technique edges on intractability; as users, it normally is not difficult to push it until undecidability rears its ugly head and, in this case, spills some precision. If anything, we should be surprised whenever a verification technique works flexibly in different conditions!

To continue our example, we accept to give up some automation in exchange for more flexibility. We turn to deductive verification; more precisely, auto-active verifiers. Rustan Leino and Michał Moskal introduced the term "auto-active" to denote tools that are partly automatic and partly interactive. (By the way, my proposal to call them "inter-matic" has yet to catch on.) They are automatic in that they do not require interactive guidance while they try to verify an input program. But they still indirectly interact with users through annotations, such as pre- and postconditions, invariants, and suggestions such as intermediate assertions, which are an additional burden to write and crucially influence their capabilities and performance.

Even among auto-active verifiers, there is room for considerable variations in terms of flexibility and simplicity. Let's first consider Dafny, a state-of-the-art verifier mainly developed by the SMT whisperer — a.k.a. Rustan Leino. Dafny has a relatively simple input language, with built-in support for specification models such as mathematical sequences, of which we're going to take advantage in our example. Here's a variant of the example that verifies in Dafny: have a glance at the code, and then I'll point out the interesting parts.

Besides the array a, class PSList ("Possibly Sorted" List) includes a Boolean attribute sorted, whose role is storing whether the content of a is sorted. One can imagine a complete implementation were this flag is set whenever the array is processed by a sorting algorithm or is populated by elements in order. The new variant of maximum, called max_ps, takes advantage of this additional piece of information: if sorted is true, it immediately returns the value a[a.Length - 1] which is initially assigned to Result. Indeed, the maximum of a sequence that is sorted in increasing order is its last element. The correctness of max_ps relies on this fact; which we encode, following Dafny's idiomatic approach to representation invariants, as part of predicate Valid: if sorted is true then the content a[..] of a is sorted in increasing order. Valid is supposed to be an invariant property, which methods of class PSList can rely on and have to preserve. In fact, method max_ps lists Valid in its pre- and postcondition; and Dafny uses its clause about sortedness in the proof of branch if sorted { ... }. (In case you're wondering: the assert there is needed to help Dafny convert between a's content and its sequence representation a[..].)

This example worked smoothly thanks to Dafny's transparent syntax, which explicitly represents all salient features of the code and its annotations. But suppose we make the example more realistic (and more complex) by adding max_ps to the interface of a full-fledged general-purpose list data structure. In this case, we may prefer an auto-active tool with more options and control knobs, which gives additional flexibility in exchange for a steeper learning curve. AutoProof is a verifier that tries to offer such a trade off; thus, the last version of our maximum example is in Eiffel — AutoProof's input language.

Class PS_LIST builds its functionalities by extending those of some verified library implementation of a list with all the bells and whistles. In practice, FULL_FLEDGED_LIST could be some list implementation among those in EiffelBase2. A couple of hints should suffice to explain the example's (imperfectly highlighted) syntax: across expressions represent universal (across ... all) and existential (across ... some) quantification over integer intervals x |..| y; sequence is an inherited model attribute that describes the list's content as a mathematical sequence (indexed from 1, unlike the C# and Dafny examples).

We encode the sortedness property as part of the class invariant, in a clause named when_sorted, which appears at the bottom of PS_LIST's text. Remember that we are inheriting all of the functionalities of a full-fledged list, including a detailed, complex class invariant to which we add clause when_sorted. Dealing with the kind of full functional specifications that are needed to reason about realistic implementations has some practical consequences: since the invariant is a lot of specification, carrying it around in the proof of every routine of the class is likely to bog down the verifier and to render the proof of even trivial properties a challenge. AutoProof, however, has been developed with these situations in mind. Routine max_ps can recall the only invariant element that is needed precisely where it is needed in the proof, using the assertion check inv_only ("when_sorted") end in the conditional branch if sorted then ... . The many other consistency properties inherited with the invariant are simply ignored to the benefit of a straightforward, agile proof.

We've had a glance at two auto-active verifiers of functional correctness to get an idea of how they target different sweet spots in the sweeping landscape of verification techniques and their implementations. Stay tuned for more posts about AutoProof. In the meanwhile, you can check out its project page or try it out in your web browser. AutoProof's main developers, Nadia Polikarpova and Julian Tschannen, welcome feedback about the tool — especially in the form of comments to this post! 🙂

It’s really a combination of all three. More questions? 🙂

Seriously, the issue of the real “nature” of informatics (I don’t call it “computer science” only to avoid adding to the terminological confusion) — applied mathematics, science of information, or engineering discipline — remains a topic of intense debate in more than one circle. A related discussion is the relationship between informatics theoreticians and practitioners, and whether they can get along (yes they can and do, as Scott Aaronson patiently explains).

I must admit that discussions about these topics leave me mostly indifferent, if not mildly baffled. What we call “informatics” or “computer science” has grown from roots in different branches of science and engineering, so that its true “nature” is not in any single one of them but in their amalgam. This should be evident by looking at the history of modern informatics and its eminent pioneers. Its eclectic origins are a tremendous source of vitality and part of the reason why computer science and its applications have become ubiquitous in pretty much any other science to provide conceptual and practical analytical tools. And this is also a great deal of what makes computer science so interesting!

But perhaps I can make my position more convincing with an example. In the rest of this post, I select a computational problem and follow the train of thought that leads to a possible solution and implementation. My point is demonstrating how solving the problem satisfactorily requires a combination of theoretical and practical tools whose intimate connection cannot be severed without ruining their efficacy.

Take the problem of enumerating prime numbers: a number is prime if it is only divisible by itself and by one; a number that is not prime is called composite. Prompted by this blog’s motto (“A little specification can go a long way”), let’s approach the problem descriptively: we provide a simple characterization of prime numbers, and then use standard tools to match the characterization against any integer that we want to test for primality. The characterization is going to use regular expressions.

Primes is not regular

Wait a minute! Is this even possible? This was my reaction too the first time I read (I can’t remember where or when, but there are several articles exploring this idea) about using regular expression to characterize primality. The notion of prime numbers involves divisibility and integer arithmetic, which are beyond the capabilities of regular expressions or their operational counterparts (finite-state automata), whose expressiveness defines the so-called regular languages.

In order to analyze the problem rigorously, we have to cast it into the framework of formal languages. Using a simple unary alphabet of letters x, the language P of prime numbers can be defined as the set of all strings of x’s whose length is a prime number:

P = \{ x^p \mid p \text{ is prime} \}.

Let’s show that the language P of prime numbers is not regular. We could use the well-known pumping lemma, but a direct proof by contradiction is even more convincing. Assume that P is regular; then there exists a deterministic finite state automaton A_P that accepts precisely strings in P. Automaton A_P consists of a finite number N of states; take any prime p > N. (Side note: this is possible because there are infinitely many primes — if you’re not familiar with Euclid’s neat proof of this fact go check it out.) Consider the accepting path of x^p when it is processed by A_P: each transition consumes exactly one x; since there are more x’s in input than states in A_P, the path must include a loop that goes from some state S back to it. Call \ell the length of such loop, that is the number of x’s consumed while traversing it. We can see that adding a number of x’s multiple of \ell leads to an input that A_P still accepts. This is because \ell more x’s traverse the loop one more time, and the behavior of the automaton from S on does not change if the trailing input is the same because A_P is deterministic. Going back to the definition of P, we have that A_P accepts all strings of length p + k \cdot \ell for any nonnegative integer k. In particular, it accepts the string of p + p \cdot \ell x’s. But this is incorrect behavior: p + p \cdot \ell is not prime because it is divisible by p. This contradicts the hypothesis that A_P recognizes language P — the language of primes is not regular.

While we’re at it, let’s also prove that the language P of primes is not even context free. We can leverage Parikh’s theorem for that. A convenient way to state the theorem is: every context-free language is commutatively equivalent to some regular language. This roughly means that, if P is context-free, there exists a regular language P_r whose strings have the same number of letters as those in P irrespective of order. But order is immaterial for a language over unary alphabet such as P — length is the only distinguishing feature — which entails that P is context-free only if it is also regular. Since we have seen that P is not regular, it’s not even context-free.

Primes as a regex

The trick to circumvent the non-regularity of the language P of primes is switching from regular expressions — à la Kleene, used in formal language theory — to regexes — à la Thompson, used in classic Unix text processing utilities and part of the POSIX standard. Perl’s Larry Wall championed this useful terminological distinction (regular expressions vs. regexes), which we’ll consistently use in this presentation. Crucially, the regex feature that greatly increases the expressive power of regular expressions are backreferences, which can express repetitions of previously captured substrings. In fact, it’s not hard to see that all other features of POSIX regexes are just syntactic sugar expressible using Kleene’s regular expressions.

Back to the problem of expressing the language P of primes using regexes and backreferences. It’s actually easier to look at the complement problem: defining the language \overline{P} of composite numbers expressed as unary strings. A composite number c has at least one divisor (other than one and itself); that is, it can be written as a product c_1 \cdot c_2 for 1 < c_1, c_2 < c[/latex] or, equivalently, as a sum of [latex]c_2[/latex] equal terms [latex]\underbrace{c_1 + \cdots + c_1}_{c_2 \text{ terms}}[/latex]. Visualizing [latex]c[/latex] in unary, we can write it as partitioned into [latex]c_2[/latex] groups, each consisting of [latex]c_1[/latex] letters [latex]x[/latex]. This is a pattern we can express using a regex: a leading substring of two or more (that is, [latex]c_1[/latex]) [latex]x[/latex]’s (<code>xx+</code>, with <code>+</code> denoting one or more repetitions) followed by one or more (that is, [latex]c_2 - 1) of its replicas (\1+, with \1 denoting a backreference to the first matched substring). In all, regex (xx+)\1+ matches the composite numbers greater than one in unary notation. If we want to handle all natural numbers, we can use alternation (the | operator) to specify the special cases of 0 and 1 as composite: (xx+)\1+ | x? (with ? denoting zero or one occurrences, and assuming | has lower precedence than the other operators). We'd better do that as no mathematician since Lebsgue considers 1 a prime according to Wikipedia 🙂

The power of regexes

What is then the expressive power of regexes? Our primes example indicates that backreferences can express languages in classes beyond context-free. Precisely, we've shown that language P is not context-free and that regexes can express the complement language \overline{P}, but regexes are not explicitly closed under complement. However, it's clear that \overline{P} is also neither regular (because regular languages are closed under complement) nor context-free (because unary context-free languages boil down to regular languages). Alternatively, we can consider languages whose strings have the form w\,w for any string w, which are clearly expressible by means of regexes with backreferences ((.*)\1) but are neither regular nor context free (as one can show using, say, the pumping lemmas).

On the other hand, regexes seems incapable of expressing all context-free languages; in particular, they cannot express the languages of balanced characters such as parentheses — known as Dyck languages in formal language theory —, which are paradigmatic examples of context-free languages. Consider D = \{ a^n b^n \mid n > 0 \}, which is well-known to be context-free but not regular. Short of a rigorous proof, an argument that D is inexpressible by regexes could work as follows. As we match a regex against strings in D, each pair of captured pattern and backreference to it must occur either entirely within the a^n substring or entirely within the b^n substring. Otherwise, the backreference would deviate from the structure of strings in D where all b’s follow all a’s. This constraint, however, means that backreferences cannot carry any information between the a^n and the b^n parts of the string. All that is left is then limited to regular languages; hence it cannot express language D.

I have to remark again that this discussion applies to classic Unix/POSIX regexes, whose only non-regular operators are backreferences. Modern scripting languages often support a much larger superset of "Perl Compatible Regular Expressions" (PCREs) whose expressiveness far exceeds that of Unix/POSIX regexes as they include fully recursive patterns and even the possibility of invoking arbitrary code as subpatterns are matched. See this article for a nice overview of the expressiveness of PCREs as they are available in PHP; and this Stack Overflow discussion about how Java regexes can match patterns such a^n b^n.

Enumerating primes in Bash

After this long detour it's time to go back to the original problem of enumerating primes. Equipped with a descriptive characterization of composite numbers — the regex (xx+)\1+ | x? — let's turn ourselves to Unix command line tools that can interpret it. As an additional challenge in the original spirit of solving the problem descriptively, we'll try to avoid explicit loops entirely in our program of sorts for enumerating primes.

First, since our regex works on unary strings, we need a way to convert numbers from decimal to unary. We can use printf's format strings to this end: the format specifier %*s reads a numeric argument followed by a string argument and prints the latter as a space-padded string whose length is the numeric argument. If the string argument is absent, it just prints as many spaces as the value of the numeric argument. Then, we use tr to turn spaces into x's and voilà: our unary representation. For example, to print 24 in unary:

There's an alternative that uses printf's format specifier %.0s without requiring tr: printf 'x%.0s' {1..24}. See printf's documentation for an explanation of how it works.

Second, we have to call the above snippet involving printf for all numbers that we want to test. Instead of an explicit loop, we can generate a sequence {0..N} of nonnegative integers up to N and pass its values on to printf using xargs's option -n1, which dispatches one argument value per invocation of printf. However, printf's format string now needs a trailing newspace character to separate unary numbers by line.

It's finally time to use our regex. There's a minor issue here regarding the distinction between POSIX basic and extended regexes. The former include backreferences but no alternation; the latter include alternation but no backreferences. This is hardly a problem in practice, as most GNU tools (the one I'm assuming we're using) support backreferences using both syntaxes. As for the alternation, we don't really need to include it explicitly in the regex; instead, we use grep's option -e to specify multiple regexes and have the tool match any of them. We need to know about grep's options -E — to use extended regex syntax (just because it's nicer as it doesn't require to slash-escape metacharacters) — and -v to invert the match (because we want to keep the primes and discard the composites).

Don't try the previous snippet yet: it doesn't work! The problem is that grep matches the regexes against any portion of the input string. Since x? trivially matches the empty substring, the inverted match is the empty match. Let's introduce the ^ and $ markers to require that the regexes match the whole substring (which, by the way, is the standard interpretation of regular expressions in formal language theory). You can try and see that the following snippet enumerates primes up to 24 in unary.

The output in unary form is not very readable. We convert it back (without loops) using the expr utility, again with arguments dispatched one at a time using xargs. This is the only non-POSIX compliant part of the snippet. I couldn't find another way that doesn't use loops or variables; if you know one, please leave a comment. Anyway, we have our loopless prime enumerator!

A little empirical performance analysis

We have reasons to suspect that our prime enumerator is not very efficient. Enumerating primes up to 41 takes over 14 seconds on my desktop machine. This is not very good given that primality testing is in P and randomized algorithms such as Miller-Rabin are very fast in practice. It is also unsurprising: regex matching with backreferences uses backtracking to implement nondeterminism, which nullifies the otherwise linear-time performance of matching using finite-state automata. For the same reasons, memory usage should instead be roughly constant or grow very slowly: the size of the automaton used for the matching only depends on the regex to be matched and, linearly, on the length of the captured substring that is backreferenced.

Let's see if we can get some empirical data to corroborate this intuition. If the running time T(n) of our enumerator is exponential in the input, the ratio T(n) / 10^n should remain constant as the input size n increases. Input normally is measured in number of bits (or digits) for numerical algorithms; hence let n = \log(k), where k is the number we test for primality. Let's measure T(k) / k for each input k and see what it looks like. We allow ourselves to use loops this time, so that we can simplify the enumeration snippet using a for Bash loop.

We enumerate from 30 because this is when grep's running time starts to become non-negligible. It's time to use another command-line utility. Not to be confused with the homonym Bash command, utility time is normally located in /usr/bin/time and offers options to customize output using a format string. For each invocation of grep, we print the current value of k, and the CPU user time (%U) and maximum RAM resident set size (%M) taken by the process running grep. Since we now only care about these measures, we add the -q option to grep, which suppresses output. Finally, we redirect these statistics to awk, which will print on each line a triple k, T(k)/k, M(k) (T and M are time and memory). In all, run the command:

In case you're wondering, the 2>&1 is needed because time outputs to standard error, so a normal pipe won't work. In Bash version 4 and later you can write |& instead. Running the command on my machine gives the output:

It seems we were on the right track. Memory consumption doesn't noticeably grow. Time grows instead even more quickly than exponentially in the number of digits. If we play around a bit more, it seems that time approximately goes like a polynomial in k, and certainly asymptotically more slowly than a double exponential. See this article for a detailed discussion of the efficiency of regex matching implementations.


Was this just math? Just science? Just engineering? The sensible conclusion seems that it was a combination of all three of them: the theory of regular languages has mathematical flavor, their practical implementations in Unix tools is an engineering feat, and empirical analysis smells like experimental science. And that's the beauty of it!

Last week I attended the second Workshop on Software Correctness and Reliability organized by Peter Müller and Martin Vechev here at ETH. As in the first edition one year ago, the workshop featured an impressive collection of speakers who talked about diverse topics in verification. Videos of the talks will be publicly available, so there’s no need to write detailed summaries. Instead, I give my telegraphic impressions of each talk in the form of one-liners. If I used Twitter these could’ve been live tweets; except that I, like Machete, don’t tweet 🙂 (not at the moment anyway). As the name suggests, the impressions do not necessarily talk about the main intended message of each talk; but at least they’re short: if anything tickles your fancy go and watch the videos!

James Larus
Improving the software development culture is not the job of academics; their job is solving technical problems.
Martin Rinard
Just because it passes all tests, it doesn’t mean the program is correct.
David Basin
Security is a form of enforceable safety.
Rajeev Alur
Regular functions or: the power of automata with strings.
Armando Solar-Lezama
Synthesis to improve the code of programmers who use loops where they could use SQL.
Anders Møller
Static analysis to check the code of programmers who use jQuery where they could use loops.
Martin Vechev
Machine learning to guess the meaning of minified (or otherwise obscure) programs.
Peter Müller
Teenage mobile app developers benefit from abstract interpretation.
Vijay Saraswat
Failing asynchronously is OK, as long as someone cleans up.
Ranjit Jhala
Piggybacking type systems to prove properties of (post-)functional languages.
(Rustan Leino featured in absentia as the one who knows what to do with universal quantifiers — the “SMT whisperer”.)
Ahmed Bouajjani
Checking linearizability may be hard, but checking trace inclusion need not be.

Since I skipped the last session, I cannot offer my impressions of Mayur Naik‘s and Manfred Broy‘s talks. If you were there or have watched the videos, feel free to do so in the comments.

I’m very glad to host a guest post by Nadia Polikarpova, a recent PhD graduate of our group on her way to a post-doctoral position at MIT. Nadia is definitely into bug counting, even though her ultimate ambitions are more of exterminating them (for the casual reader: we’re talking about software bugs — no animals were harmed in the making of her research). Thanks to this guest post of hers, the blog finally goes back to its core interests!

      — Carlo A.

Dimensions in program verification

Model checking, abstract interpretation, program logics, symbolic execution — if you are new to program verification, the variety of techniques it offers might be intimidating. Each area seems to start from a different set of basic principles and use its own terminology, which makes classification and comparison difficult.

To make matters worse, verification experts are often partisans of their school of thought: for an example, see a slide from a recent talk by Patrick Cousot. Abstract interpretation devotees may look down on other approaches, such as auto-active verification, that do not easily scale to systems with million lines of code. Auto-active verification researches, like myself, may make fun of abstract interpretation folks for being stuck with array-out-of-bounds errors, while we can prove full functional correctness (which, in turn, is uninteresting to model-checking experts, who care about temporal properties). The acolytes of the dark art of interactive theorem proving (such as Coq and Isabelle) don’t take any other approach seriously unless it comes with an air-tight proof of soundness. Meanwhile testing people consider all sound techniques pure intellectual exercise, since real programmers are not going to put up with false positives…

How do you make sense of all this and learn to appreciate relative merits and fundamental limitations of different techniques? Here is my short guide to program verification, which I worked out after studying the area for nearly six years.

There are three dimensions to a program verification technique: soundness, flexibility, and automation. Soundness (level of assurance) determines how certain we are that a program is correct if verification succeeds. Flexibility (expressiveness) measures the diversity of programs and correctness properties that the approach can handle. Automation accounts for the amount of human effort — beyond writing the program — that is needed to carry out verification. As is often the case, you can pick any two, but you cannot have all three.

The verification cube

For example, abstract interpretation is sound and fully automatic, but the kind of properties and programs it can check effectively is limited by a predefined set of abstract domains encoded in the analyzer. Bounded model checking, on the other hand, is flexible and fairly automatic — users still have to formalize requirements — but in general cannot provide a correctness guarantee for all inputs (note that we can also trade flexibility for soundness by limiting ourselves to analyzing programs with a finite state space). Deductive verification techniques are flexible and sound, but require a fair amount of manual guidance; among deductive tools, interactive proof assistants are pushing the first two dimensions to the extreme (with higher-order specification languages and tiny trusted cores), while auto-active tools are prepared to sacrifice a bit of both to retain some automation. Unbounded (CEGAR-based) model checking is sound, automatic, and inflexible; symbolic execution is flexible, automatic, and unsound.

I’m not aware of any technique that doesn’t lie close to one of those three corners of the “verification cube”. Why is this the case? Intuitively, the reason is simple: any program analysis can only consider a finite number of states, so you get the choice between only analyzing programs with finitely many states (and lose flexibility), considering only a subset of all states (and lose soundness), or over-approximating an infinite set of states by a finite set of abstract states. Abstraction, in turn, can be performed manually (this is what loop invariants are for!) or automatically (like in abstract interpretation); the former kills automation, while the latter is limited to a restricted set of properties, which kills flexibility. Naturally, acing all three dimensions has always been the holy grail of program verification. This is where all the approaches are converging, albeit taking very different routes, with each technique improving along its weak dimension little by little. Static analyzers are enriched with new abstract domains; model checkers can handle ever larger state spaces; auto-active tools compete in economy of required user annotations; proof assistants boast more and more sophisticated automated tactics. They are all working towards the same goal — still far away from the finish line — and they still differ vastly in terms of strengths and weaknesses, application domains, and user experience.

So, next time you hear someone say that their program verification technique is better that all the rest, just remember the cube.

      — Nadia

This summer, Sebastian Nanz and I have finally figured out what the best programming language is. The answer is…

Of course you immediately understood that the incipit is a joke. When it comes to complex feature-laden artifacts like general-purpose programming languages there is no such thing as the best tool for the job. In the reality of software development, different programmers with different skills, different attitudes, and different mindsets solve different problems following different methods, different practices, and different processes in different environments, using different tools and different programming languages. As a result, each programming language design strives to find trade-offs that are convenient to someone in the motley crowd of programmers.

Still, the trade-offs of different languages should be demonstrable by impacting measurable features of programs written in those languages. In this recent work [Nanz and Furia, 2014], we have tried to contribute empirical evidence to better our understanding of how programming languages can be used in practice. One of the aspects we were interested in investigating was whether one can find empirical evidence to justify some of the folk knowledge about programming languages, which is very often passed on as a series of ipse dixit that should be self-evident — except that sometimes different authorities have dissenting opinions!

Before summarizing the results of our study, here’s something about the methodology. An important decision was to use Rosetta Code as source of raw data in the form of programs. Rather than hosting full projects — a service provided by other sites such as GitHub and Bitbucket — Rosetta Code focuses on well-defined programming tasks that can be implemented by small programs (the average program in our study is around 30 lines of code); this makes implementations of the same task in different programming languages directly comparable. The revision history and submission practices of Rosetta Code also suggest that programs are often revised by multiple programmers, and hence likely have a good quality on average; and the task list includes many relevant problems that are often part of large real-world projects. This setup helped make sound inter-language comparisons based on proper language usage, thus reducing dispersion and bias in the data. Based on a combination of their popularity in TIOBE and Rosetta Code, we selected 8 languages in four categories: C and Go as procedural languages; C# and Java as object-oriented languages; F# and Haskell as functional languages; and Python and Ruby as object-oriented languages. If your favorite language is not there do not despair: let us know in the comments why you think it deserves to be included; we might consider it for future work.

Let’s have a look at the main results (see the paper for all the details). The biggest surprise is that there are no huge surprises: well-trained programmers and software engineers will recognize several well-known adages about the advantages of certain programming language features over others. To make this apparent, I’ve collected excerpts from classics texts on programming languages that somewhat match our empirical findings.


It is generally understood that practical expressiveness boils down to conciseness:

The main benefit of the use of expressive languages seems to be the ability to abstract from programming patterns with simple statements and to state the purpose of a program in the concisest possible manner.

We have come to believe that the major negative consequence of a lack of expressiveness is the abundance of programming patterns to make up for the missing, non-expressible constructs.

[Felleisen, 1991]

Higher-order features such as list comprehensions, reflection, higher-order functions, and idiomatic support for lists and maps should increase the level of practical expressiveness, and hence conciseness:

Higher-order procedures can serve as powerful abstraction mechanisms, vastly increasing the expressive power of our language. [Pg. 75]

[…] expressive power […] is attained […] by accumulation and filtering [on lists]. [Pg. 81]

Elevate the conceptual level at which we can design our programs [means enhancing] the expressive power of our language. [Pg. 108]

[Abelson and Sussman, 1996]

Such higher-order features are more readily available in functional and scripting languages than imperative languages:

Against Java, we can say that (compared to, say, Python) some parts of it appear over-complex and others deficient.

[Pg. 340 in Raymond, 2003]

We measured conciseness in terms of lines of code, comparing solutions in each language against those in other languages. Our numbers draw a picture that is largely consistent with the above quotations: functional and scripting languages provide significantly more concise code than procedural and object-oriented languages. Their higher-order features increase practical expressiveness, to wit, conciseness. While in principle one can follow a functional style using object-oriented languages, it is idiomatic support that seems to make a tangible difference in practice.


Performance is another often controversial programming language feature. We tried to contribute to the understanding of performance in practice by distinguishing between two kinds of tests. Much of the controversy when discussing performance may derive from conflating these two kinds of problems, which represent very different conditions.

The first kind of performance comparison targets raw execution speed on large inputs; for example, sorting million-element arrays or compressing tens of megabytes of data. The outcome of our experiments using Rosetta Code tasks on such problems is what most people would expect: C is indisputably the fastest — if it was a race, it’d lap all other languages. A bit more generally, language features cost raw speed, and more features tend to cost more speed. In fact, the only runner-up (still from a distance) is Go, a language that is richer than C — it offers automatic memory management and strong typing — but deliberately renounces other expressive features, such as inheritance and genericity, that have become commonplace in modern high-level programming languages.

Programs that require maximum speed […] are good candidates for C. [Pg. 326]

Python cannot compete with C or C++ on raw execution speed. [Pg. 337]

[Raymond, 2003]

[The] main problem [of automatic memory management], however, is that “useful” processing time is lost when the garbage collector is invoked.

[Pg. 168 in Ghezzi and Jazayeri, 1997]

Most of the time, however, the extreme differences in raw speed that emerge with algorithmically-intensive programs on large inputs do not matter much because such jobs are absent or extremely infrequent in the overwhelming majority of applications, which hardly ever have to deal with number crunching. How many million-element arrays did your web browser have to sort while you were browsing the news? To understand performance differences more commonly occurring in everyday conditions, we identified a second kind of targets for comparison, consisting of well-defined problems on input of moderate size, such as checksum algorithms and string manipulation tasks. The results are quite different when we consider this second kind of everyday problems. Scripting and functional languages tend to emerge as the fastest, even surpassing C. More generally, the absolute differences between languages are smallish, which means that every language is usable, and engineering concerns other than raw speed emerge as more relevant.

Most applications do not actually need better performance than Python offers.

[Pg. 337 in Raymond, 2003]

To sum up, the most significant, and somehow neglected, finding that surfaced from our performance comparisons is this distinction between “raw speed” and “everyday” performance requirements, and the corresponding emphasis on the latter for most workaday programming tasks.

Failure proneness

Counting faults (or, more appropriately for this blog, bugs) is often used to measure the quality or success of software projects. The errors that are of principal interest in that context are those resulting from program behavior that diverges from specification; for example, a banking application that increases your balance when you withdraw money (although this is a bug most of us could live with 🙂 ). In contrast, our comparison of programming languages looked for errors that are independent of a program’s specification and have to do with what we might characterize as well-formedness, such as typing and compatibility errors. This is an admittedly restricted notion of error, but it lends itself to effective detection and analysis. The classic view on the problem of detecting such errors is clear:

[Checking for the presence of well-formedness errors] can be accomplished in different ways that can be classified in two broad categories: static and dynamic. […] In general, if a check can be performed statically, it is preferable to do so instead of delaying the check to run time […].

[Pg. 137 in Ghezzi and Jazayeri, 1997]

To see which languages follow this prescription and tend to perform more checks statically, we counted what fraction of programs in each language that compile correctly terminate without error (exit status zero). The compiled strongly-typed languages (that is, all compiled languages but C which is weakly typed) clearly emerged as those with the fewest runtime failures triggered in our experiments; their compilers do a fairly good job at catching errors at compile time by type checking and other static analyses. In contrast, the interpreted languages triggered runtime failures more frequently; nearly all checks but syntactic ones are done at runtime, when things can go wrong in many different ways.

Go was the least failure prone of the compiled strongly-typed languages. Given that we analyzed a large number of programs written by different contributors, we are reluctant to explain this difference mainly by attributing better programming skills to the authors of Go programs in our sample. Instead, this results refines the picture about what compiled strongly-typed languages can achieve: Go’s type system is more restricted than that of functional or object-oriented languages, which may help achieve type safety by minimizing dark corners where errors may originate.

Our results on failure proneness cannot set the debate about static vs. dynamic checks. This is another issue where a multitude of technical and organizational concern concur to creating several different local optima.

Be fruitful and replicate

No single study, no matter how carefully designed and executed, can claim to provide conclusive evidence about complex issues such as the relative merits and practical impact of programming language features. Scientific inductive knowledge grows slowly, one little piece of evidence at a time. Technological factors further complicate the picture, since they may shift the importance of certain features and render some obsolete (for example, the exponential growth of processor speed in past decades has favored the development of feature-rich graphical user interfaces that were previously impractical). Yet, it is encouraging that we have found significant concordance between our empirical results and others’ point of view. A lot of what you learned from the classics about programming language was right — as long as you picked the right classics!


  1. Sebastian Nanz and Carlo A. Furia: A comparative study of programming languages in Rosetta Code. Technical report, September 2014.
  2. Matthias Felleisen: On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35-75, 1991.
  3. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and interpretation of computer programs. 2nd edition, MIT Press, 1996.
  4. Eric S. Raymond: The art of UNIX programming. Addison-Wesley, 2003. Available online.
  5. Carlo Ghezzi and Mehdi Jazayeri: Programming language concepts. 3rd edition, Wiley & Sons, 1997.