It’s really a combination of all three. More questions? 🙂

Seriously, the issue of the real “nature” of informatics (I don’t call it “computer science” only to avoid adding to the terminological confusion) — applied mathematics, science of information, or engineering discipline — remains a topic of intense debate in more than one circle. A related discussion is the relationship between informatics theoreticians and practitioners, and whether they can get along (yes they can and do, as Scott Aaronson patiently explains).

I must admit that discussions about these topics leave me mostly indifferent, if not mildly baffled. What we call “informatics” or “computer science” has grown from roots in different branches of science and engineering, so that its true “nature” is not in any single one of them but in their amalgam. This should be evident by looking at the history of modern informatics and its eminent pioneers. Its eclectic origins are a tremendous source of vitality and part of the reason why computer science and its applications have become ubiquitous in pretty much any other science to provide conceptual and practical analytical tools. And this is also a great deal of what makes computer science so interesting!

But perhaps I can make my position more convincing with an example. In the rest of this post, I select a computational problem and follow the train of thought that leads to a possible solution and implementation. My point is demonstrating how solving the problem satisfactorily requires a combination of theoretical and practical tools whose intimate connection cannot be severed without ruining their efficacy.

Take the problem of enumerating prime numbers: a number is prime if it is only divisible by itself and by one; a number that is not prime is called composite. Prompted by this blog’s motto (“A little specification can go a long way”), let’s approach the problem descriptively: we provide a simple characterization of prime numbers, and then use standard tools to match the characterization against any integer that we want to test for primality. The characterization is going to use regular expressions.

Primes is not regular

Wait a minute! Is this even possible? This was my reaction too the first time I read (I can’t remember where or when, but there are several articles exploring this idea) about using regular expression to characterize primality. The notion of prime numbers involves divisibility and integer arithmetic, which are beyond the capabilities of regular expressions or their operational counterparts (finite-state automata), whose expressiveness defines the so-called regular languages.

In order to analyze the problem rigorously, we have to cast it into the framework of formal languages. Using a simple unary alphabet of letters x, the language P of prime numbers can be defined as the set of all strings of x’s whose length is a prime number:

P = \{ x^p \mid p \text{ is prime} \}.

Let’s show that the language P of prime numbers is not regular. We could use the well-known pumping lemma, but a direct proof by contradiction is even more convincing. Assume that P is regular; then there exists a deterministic finite state automaton A_P that accepts precisely strings in P. Automaton A_P consists of a finite number N of states; take any prime p > N. (Side note: this is possible because there are infinitely many primes — if you’re not familiar with Euclid’s neat proof of this fact go check it out.) Consider the accepting path of x^p when it is processed by A_P: each transition consumes exactly one x; since there are more x’s in input than states in A_P, the path must include a loop that goes from some state S back to it. Call \ell the length of such loop, that is the number of x’s consumed while traversing it. We can see that adding a number of x’s multiple of \ell leads to an input that A_P still accepts. This is because \ell more x’s traverse the loop one more time, and the behavior of the automaton from S on does not change if the trailing input is the same because A_P is deterministic. Going back to the definition of P, we have that A_P accepts all strings of length p + k \cdot \ell for any nonnegative integer k. In particular, it accepts the string of p + p \cdot \ell x’s. But this is incorrect behavior: p + p \cdot \ell is not prime because it is divisible by p. This contradicts the hypothesis that A_P recognizes language P — the language of primes is not regular.

While we’re at it, let’s also prove that the language P of primes is not even context free. We can leverage Parikh’s theorem for that. A convenient way to state the theorem is: every context-free language is commutatively equivalent to some regular language. This roughly means that, if P is context-free, there exists a regular language P_r whose strings have the same number of letters as those in P irrespective of order. But order is immaterial for a language over unary alphabet such as P — length is the only distinguishing feature — which entails that P is context-free only if it is also regular. Since we have seen that P is not regular, it’s not even context-free.

Primes as a regex

The trick to circumvent the non-regularity of the language P of primes is switching from regular expressions — à la Kleene, used in formal language theory — to regexes — à la Thompson, used in classic Unix text processing utilities and part of the POSIX standard. Perl’s Larry Wall championed this useful terminological distinction (regular expressions vs. regexes), which we’ll consistently use in this presentation. Crucially, the regex feature that greatly increases the expressive power of regular expressions are backreferences, which can express repetitions of previously captured substrings. In fact, it’s not hard to see that all other features of POSIX regexes are just syntactic sugar expressible using Kleene’s regular expressions.

Back to the problem of expressing the language P of primes using regexes and backreferences. It’s actually easier to look at the complement problem: defining the language \overline{P} of composite numbers expressed as unary strings. A composite number c has at least one divisor (other than one and itself); that is, it can be written as a product c_1 \cdot c_2 for 1 < c_1, c_2 < c[/latex] or, equivalently, as a sum of [latex]c_2[/latex] equal terms [latex]\underbrace{c_1 + \cdots + c_1}_{c_2 \text{ terms}}[/latex]. Visualizing [latex]c[/latex] in unary, we can write it as partitioned into [latex]c_2[/latex] groups, each consisting of [latex]c_1[/latex] letters [latex]x[/latex]. This is a pattern we can express using a regex: a leading substring of two or more (that is, [latex]c_1[/latex]) [latex]x[/latex]’s (<code>xx+</code>, with <code>+</code> denoting one or more repetitions) followed by one or more (that is, [latex]c_2 - 1) of its replicas (\1+, with \1 denoting a backreference to the first matched substring). In all, regex (xx+)\1+ matches the composite numbers greater than one in unary notation. If we want to handle all natural numbers, we can use alternation (the | operator) to specify the special cases of 0 and 1 as composite: (xx+)\1+ | x? (with ? denoting zero or one occurrences, and assuming | has lower precedence than the other operators). We'd better do that as no mathematician since Lebsgue considers 1 a prime according to Wikipedia 🙂

The power of regexes

What is then the expressive power of regexes? Our primes example indicates that backreferences can express languages in classes beyond context-free. Precisely, we've shown that language P is not context-free and that regexes can express the complement language \overline{P}, but regexes are not explicitly closed under complement. However, it's clear that \overline{P} is also neither regular (because regular languages are closed under complement) nor context-free (because unary context-free languages boil down to regular languages). Alternatively, we can consider languages whose strings have the form w\,w for any string w, which are clearly expressible by means of regexes with backreferences ((.*)\1) but are neither regular nor context free (as one can show using, say, the pumping lemmas).

On the other hand, regexes seems incapable of expressing all context-free languages; in particular, they cannot express the languages of balanced characters such as parentheses — known as Dyck languages in formal language theory —, which are paradigmatic examples of context-free languages. Consider D = \{ a^n b^n \mid n > 0 \}, which is well-known to be context-free but not regular. Short of a rigorous proof, an argument that D is inexpressible by regexes could work as follows. As we match a regex against strings in D, each pair of captured pattern and backreference to it must occur either entirely within the a^n substring or entirely within the b^n substring. Otherwise, the backreference would deviate from the structure of strings in D where all b’s follow all a’s. This constraint, however, means that backreferences cannot carry any information between the a^n and the b^n parts of the string. All that is left is then limited to regular languages; hence it cannot express language D.

I have to remark again that this discussion applies to classic Unix/POSIX regexes, whose only non-regular operators are backreferences. Modern scripting languages often support a much larger superset of "Perl Compatible Regular Expressions" (PCREs) whose expressiveness far exceeds that of Unix/POSIX regexes as they include fully recursive patterns and even the possibility of invoking arbitrary code as subpatterns are matched. See this article for a nice overview of the expressiveness of PCREs as they are available in PHP; and this Stack Overflow discussion about how Java regexes can match patterns such a^n b^n.

Enumerating primes in Bash

After this long detour it's time to go back to the original problem of enumerating primes. Equipped with a descriptive characterization of composite numbers — the regex (xx+)\1+ | x? — let's turn ourselves to Unix command line tools that can interpret it. As an additional challenge in the original spirit of solving the problem descriptively, we'll try to avoid explicit loops entirely in our program of sorts for enumerating primes.

First, since our regex works on unary strings, we need a way to convert numbers from decimal to unary. We can use printf's format strings to this end: the format specifier %*s reads a numeric argument followed by a string argument and prints the latter as a space-padded string whose length is the numeric argument. If the string argument is absent, it just prints as many spaces as the value of the numeric argument. Then, we use tr to turn spaces into x's and voilà: our unary representation. For example, to print 24 in unary:

There's an alternative that uses printf's format specifier %.0s without requiring tr: printf 'x%.0s' {1..24}. See printf's documentation for an explanation of how it works.

Second, we have to call the above snippet involving printf for all numbers that we want to test. Instead of an explicit loop, we can generate a sequence {0..N} of nonnegative integers up to N and pass its values on to printf using xargs's option -n1, which dispatches one argument value per invocation of printf. However, printf's format string now needs a trailing newspace character to separate unary numbers by line.

It's finally time to use our regex. There's a minor issue here regarding the distinction between POSIX basic and extended regexes. The former include backreferences but no alternation; the latter include alternation but no backreferences. This is hardly a problem in practice, as most GNU tools (the one I'm assuming we're using) support backreferences using both syntaxes. As for the alternation, we don't really need to include it explicitly in the regex; instead, we use grep's option -e to specify multiple regexes and have the tool match any of them. We need to know about grep's options -E — to use extended regex syntax (just because it's nicer as it doesn't require to slash-escape metacharacters) — and -v to invert the match (because we want to keep the primes and discard the composites).

Don't try the previous snippet yet: it doesn't work! The problem is that grep matches the regexes against any portion of the input string. Since x? trivially matches the empty substring, the inverted match is the empty match. Let's introduce the ^ and $ markers to require that the regexes match the whole substring (which, by the way, is the standard interpretation of regular expressions in formal language theory). You can try and see that the following snippet enumerates primes up to 24 in unary.

The output in unary form is not very readable. We convert it back (without loops) using the expr utility, again with arguments dispatched one at a time using xargs. This is the only non-POSIX compliant part of the snippet. I couldn't find another way that doesn't use loops or variables; if you know one, please leave a comment. Anyway, we have our loopless prime enumerator!

A little empirical performance analysis

We have reasons to suspect that our prime enumerator is not very efficient. Enumerating primes up to 41 takes over 14 seconds on my desktop machine. This is not very good given that primality testing is in P and randomized algorithms such as Miller-Rabin are very fast in practice. It is also unsurprising: regex matching with backreferences uses backtracking to implement nondeterminism, which nullifies the otherwise linear-time performance of matching using finite-state automata. For the same reasons, memory usage should instead be roughly constant or grow very slowly: the size of the automaton used for the matching only depends on the regex to be matched and, linearly, on the length of the captured substring that is backreferenced.

Let's see if we can get some empirical data to corroborate this intuition. If the running time T(n) of our enumerator is exponential in the input, the ratio T(n) / 10^n should remain constant as the input size n increases. Input normally is measured in number of bits (or digits) for numerical algorithms; hence let n = \log(k), where k is the number we test for primality. Let's measure T(k) / k for each input k and see what it looks like. We allow ourselves to use loops this time, so that we can simplify the enumeration snippet using a for Bash loop.

We enumerate from 30 because this is when grep's running time starts to become non-negligible. It's time to use another command-line utility. Not to be confused with the homonym Bash command, utility time is normally located in /usr/bin/time and offers options to customize output using a format string. For each invocation of grep, we print the current value of k, and the CPU user time (%U) and maximum RAM resident set size (%M) taken by the process running grep. Since we now only care about these measures, we add the -q option to grep, which suppresses output. Finally, we redirect these statistics to awk, which will print on each line a triple k, T(k)/k, M(k) (T and M are time and memory). In all, run the command:

In case you're wondering, the 2>&1 is needed because time outputs to standard error, so a normal pipe won't work. In Bash version 4 and later you can write |& instead. Running the command on my machine gives the output:

It seems we were on the right track. Memory consumption doesn't noticeably grow. Time grows instead even more quickly than exponentially in the number of digits. If we play around a bit more, it seems that time approximately goes like a polynomial in k, and certainly asymptotically more slowly than a double exponential. See this article for a detailed discussion of the efficiency of regex matching implementations.

Conclusion

Was this just math? Just science? Just engineering? The sensible conclusion seems that it was a combination of all three of them: the theory of regular languages has mathematical flavor, their practical implementations in Unix tools is an engineering feat, and empirical analysis smells like experimental science. And that's the beauty of it!

Last week I attended the second Workshop on Software Correctness and Reliability organized by Peter Müller and Martin Vechev here at ETH. As in the first edition one year ago, the workshop featured an impressive collection of speakers who talked about diverse topics in verification. Videos of the talks will be publicly available, so there’s no need to write detailed summaries. Instead, I give my telegraphic impressions of each talk in the form of one-liners. If I used Twitter these could’ve been live tweets; except that I, like Machete, don’t tweet 🙂 (not at the moment anyway). As the name suggests, the impressions do not necessarily talk about the main intended message of each talk; but at least they’re short: if anything tickles your fancy go and watch the videos!

James Larus
Improving the software development culture is not the job of academics; their job is solving technical problems.
Martin Rinard
Just because it passes all tests, it doesn’t mean the program is correct.
David Basin
Security is a form of enforceable safety.
Rajeev Alur
Regular functions or: the power of automata with strings.
Armando Solar-Lezama
Synthesis to improve the code of programmers who use loops where they could use SQL.
Anders Møller
Static analysis to check the code of programmers who use jQuery where they could use loops.
Martin Vechev
Machine learning to guess the meaning of minified (or otherwise obscure) programs.
Peter Müller
Teenage mobile app developers benefit from abstract interpretation.
Vijay Saraswat
Failing asynchronously is OK, as long as someone cleans up.
Ranjit Jhala
Piggybacking type systems to prove properties of (post-)functional languages.
(Rustan Leino featured in absentia as the one who knows what to do with universal quantifiers — the “SMT whisperer”.)
Ahmed Bouajjani
Checking linearizability may be hard, but checking trace inclusion need not be.

Since I skipped the last session, I cannot offer my impressions of Mayur Naik‘s and Manfred Broy‘s talks. If you were there or have watched the videos, feel free to do so in the comments.

I’m very glad to host a guest post by Nadia Polikarpova, a recent PhD graduate of our group on her way to a post-doctoral position at MIT. Nadia is definitely into bug counting, even though her ultimate ambitions are more of exterminating them (for the casual reader: we’re talking about software bugs — no animals were harmed in the making of her research). Thanks to this guest post of hers, the blog finally goes back to its core interests!

      — Carlo A.

Dimensions in program verification

Model checking, abstract interpretation, program logics, symbolic execution — if you are new to program verification, the variety of techniques it offers might be intimidating. Each area seems to start from a different set of basic principles and use its own terminology, which makes classification and comparison difficult.

To make matters worse, verification experts are often partisans of their school of thought: for an example, see a slide from a recent talk by Patrick Cousot. Abstract interpretation devotees may look down on other approaches, such as auto-active verification, that do not easily scale to systems with million lines of code. Auto-active verification researches, like myself, may make fun of abstract interpretation folks for being stuck with array-out-of-bounds errors, while we can prove full functional correctness (which, in turn, is uninteresting to model-checking experts, who care about temporal properties). The acolytes of the dark art of interactive theorem proving (such as Coq and Isabelle) don’t take any other approach seriously unless it comes with an air-tight proof of soundness. Meanwhile testing people consider all sound techniques pure intellectual exercise, since real programmers are not going to put up with false positives…

How do you make sense of all this and learn to appreciate relative merits and fundamental limitations of different techniques? Here is my short guide to program verification, which I worked out after studying the area for nearly six years.

There are three dimensions to a program verification technique: soundness, flexibility, and automation. Soundness (level of assurance) determines how certain we are that a program is correct if verification succeeds. Flexibility (expressiveness) measures the diversity of programs and correctness properties that the approach can handle. Automation accounts for the amount of human effort — beyond writing the program — that is needed to carry out verification. As is often the case, you can pick any two, but you cannot have all three.

The verification cube

For example, abstract interpretation is sound and fully automatic, but the kind of properties and programs it can check effectively is limited by a predefined set of abstract domains encoded in the analyzer. Bounded model checking, on the other hand, is flexible and fairly automatic — users still have to formalize requirements — but in general cannot provide a correctness guarantee for all inputs (note that we can also trade flexibility for soundness by limiting ourselves to analyzing programs with a finite state space). Deductive verification techniques are flexible and sound, but require a fair amount of manual guidance; among deductive tools, interactive proof assistants are pushing the first two dimensions to the extreme (with higher-order specification languages and tiny trusted cores), while auto-active tools are prepared to sacrifice a bit of both to retain some automation. Unbounded (CEGAR-based) model checking is sound, automatic, and inflexible; symbolic execution is flexible, automatic, and unsound.

I’m not aware of any technique that doesn’t lie close to one of those three corners of the “verification cube”. Why is this the case? Intuitively, the reason is simple: any program analysis can only consider a finite number of states, so you get the choice between only analyzing programs with finitely many states (and lose flexibility), considering only a subset of all states (and lose soundness), or over-approximating an infinite set of states by a finite set of abstract states. Abstraction, in turn, can be performed manually (this is what loop invariants are for!) or automatically (like in abstract interpretation); the former kills automation, while the latter is limited to a restricted set of properties, which kills flexibility. Naturally, acing all three dimensions has always been the holy grail of program verification. This is where all the approaches are converging, albeit taking very different routes, with each technique improving along its weak dimension little by little. Static analyzers are enriched with new abstract domains; model checkers can handle ever larger state spaces; auto-active tools compete in economy of required user annotations; proof assistants boast more and more sophisticated automated tactics. They are all working towards the same goal — still far away from the finish line — and they still differ vastly in terms of strengths and weaknesses, application domains, and user experience.

So, next time you hear someone say that their program verification technique is better that all the rest, just remember the cube.

      — Nadia

This summer, Sebastian Nanz and I have finally figured out what the best programming language is. The answer is…

Of course you immediately understood that the incipit is a joke. When it comes to complex feature-laden artifacts like general-purpose programming languages there is no such thing as the best tool for the job. In the reality of software development, different programmers with different skills, different attitudes, and different mindsets solve different problems following different methods, different practices, and different processes in different environments, using different tools and different programming languages. As a result, each programming language design strives to find trade-offs that are convenient to someone in the motley crowd of programmers.

Still, the trade-offs of different languages should be demonstrable by impacting measurable features of programs written in those languages. In this recent work [Nanz and Furia, 2014], we have tried to contribute empirical evidence to better our understanding of how programming languages can be used in practice. One of the aspects we were interested in investigating was whether one can find empirical evidence to justify some of the folk knowledge about programming languages, which is very often passed on as a series of ipse dixit that should be self-evident — except that sometimes different authorities have dissenting opinions!

Before summarizing the results of our study, here’s something about the methodology. An important decision was to use Rosetta Code as source of raw data in the form of programs. Rather than hosting full projects — a service provided by other sites such as GitHub and Bitbucket — Rosetta Code focuses on well-defined programming tasks that can be implemented by small programs (the average program in our study is around 30 lines of code); this makes implementations of the same task in different programming languages directly comparable. The revision history and submission practices of Rosetta Code also suggest that programs are often revised by multiple programmers, and hence likely have a good quality on average; and the task list includes many relevant problems that are often part of large real-world projects. This setup helped make sound inter-language comparisons based on proper language usage, thus reducing dispersion and bias in the data. Based on a combination of their popularity in TIOBE and Rosetta Code, we selected 8 languages in four categories: C and Go as procedural languages; C# and Java as object-oriented languages; F# and Haskell as functional languages; and Python and Ruby as object-oriented languages. If your favorite language is not there do not despair: let us know in the comments why you think it deserves to be included; we might consider it for future work.

Let’s have a look at the main results (see the paper for all the details). The biggest surprise is that there are no huge surprises: well-trained programmers and software engineers will recognize several well-known adages about the advantages of certain programming language features over others. To make this apparent, I’ve collected excerpts from classics texts on programming languages that somewhat match our empirical findings.

Conciseness

It is generally understood that practical expressiveness boils down to conciseness:

The main benefit of the use of expressive languages seems to be the ability to abstract from programming patterns with simple statements and to state the purpose of a program in the concisest possible manner.

We have come to believe that the major negative consequence of a lack of expressiveness is the abundance of programming patterns to make up for the missing, non-expressible constructs.

[Felleisen, 1991]

Higher-order features such as list comprehensions, reflection, higher-order functions, and idiomatic support for lists and maps should increase the level of practical expressiveness, and hence conciseness:

Higher-order procedures can serve as powerful abstraction mechanisms, vastly increasing the expressive power of our language. [Pg. 75]

[…] expressive power […] is attained […] by accumulation and filtering [on lists]. [Pg. 81]

Elevate the conceptual level at which we can design our programs [means enhancing] the expressive power of our language. [Pg. 108]

[Abelson and Sussman, 1996]

Such higher-order features are more readily available in functional and scripting languages than imperative languages:

Against Java, we can say that (compared to, say, Python) some parts of it appear over-complex and others deficient.

[Pg. 340 in Raymond, 2003]

We measured conciseness in terms of lines of code, comparing solutions in each language against those in other languages. Our numbers draw a picture that is largely consistent with the above quotations: functional and scripting languages provide significantly more concise code than procedural and object-oriented languages. Their higher-order features increase practical expressiveness, to wit, conciseness. While in principle one can follow a functional style using object-oriented languages, it is idiomatic support that seems to make a tangible difference in practice.

Performance

Performance is another often controversial programming language feature. We tried to contribute to the understanding of performance in practice by distinguishing between two kinds of tests. Much of the controversy when discussing performance may derive from conflating these two kinds of problems, which represent very different conditions.

The first kind of performance comparison targets raw execution speed on large inputs; for example, sorting million-element arrays or compressing tens of megabytes of data. The outcome of our experiments using Rosetta Code tasks on such problems is what most people would expect: C is indisputably the fastest — if it was a race, it’d lap all other languages. A bit more generally, language features cost raw speed, and more features tend to cost more speed. In fact, the only runner-up (still from a distance) is Go, a language that is richer than C — it offers automatic memory management and strong typing — but deliberately renounces other expressive features, such as inheritance and genericity, that have become commonplace in modern high-level programming languages.

Programs that require maximum speed […] are good candidates for C. [Pg. 326]

Python cannot compete with C or C++ on raw execution speed. [Pg. 337]

[Raymond, 2003]

[The] main problem [of automatic memory management], however, is that “useful” processing time is lost when the garbage collector is invoked.

[Pg. 168 in Ghezzi and Jazayeri, 1997]

Most of the time, however, the extreme differences in raw speed that emerge with algorithmically-intensive programs on large inputs do not matter much because such jobs are absent or extremely infrequent in the overwhelming majority of applications, which hardly ever have to deal with number crunching. How many million-element arrays did your web browser have to sort while you were browsing the news? To understand performance differences more commonly occurring in everyday conditions, we identified a second kind of targets for comparison, consisting of well-defined problems on input of moderate size, such as checksum algorithms and string manipulation tasks. The results are quite different when we consider this second kind of everyday problems. Scripting and functional languages tend to emerge as the fastest, even surpassing C. More generally, the absolute differences between languages are smallish, which means that every language is usable, and engineering concerns other than raw speed emerge as more relevant.

Most applications do not actually need better performance than Python offers.

[Pg. 337 in Raymond, 2003]

To sum up, the most significant, and somehow neglected, finding that surfaced from our performance comparisons is this distinction between “raw speed” and “everyday” performance requirements, and the corresponding emphasis on the latter for most workaday programming tasks.

Failure proneness

Counting faults (or, more appropriately for this blog, bugs) is often used to measure the quality or success of software projects. The errors that are of principal interest in that context are those resulting from program behavior that diverges from specification; for example, a banking application that increases your balance when you withdraw money (although this is a bug most of us could live with 🙂 ). In contrast, our comparison of programming languages looked for errors that are independent of a program’s specification and have to do with what we might characterize as well-formedness, such as typing and compatibility errors. This is an admittedly restricted notion of error, but it lends itself to effective detection and analysis. The classic view on the problem of detecting such errors is clear:

[Checking for the presence of well-formedness errors] can be accomplished in different ways that can be classified in two broad categories: static and dynamic. […] In general, if a check can be performed statically, it is preferable to do so instead of delaying the check to run time […].

[Pg. 137 in Ghezzi and Jazayeri, 1997]

To see which languages follow this prescription and tend to perform more checks statically, we counted what fraction of programs in each language that compile correctly terminate without error (exit status zero). The compiled strongly-typed languages (that is, all compiled languages but C which is weakly typed) clearly emerged as those with the fewest runtime failures triggered in our experiments; their compilers do a fairly good job at catching errors at compile time by type checking and other static analyses. In contrast, the interpreted languages triggered runtime failures more frequently; nearly all checks but syntactic ones are done at runtime, when things can go wrong in many different ways.

Go was the least failure prone of the compiled strongly-typed languages. Given that we analyzed a large number of programs written by different contributors, we are reluctant to explain this difference mainly by attributing better programming skills to the authors of Go programs in our sample. Instead, this results refines the picture about what compiled strongly-typed languages can achieve: Go’s type system is more restricted than that of functional or object-oriented languages, which may help achieve type safety by minimizing dark corners where errors may originate.

Our results on failure proneness cannot set the debate about static vs. dynamic checks. This is another issue where a multitude of technical and organizational concern concur to creating several different local optima.

Be fruitful and replicate

No single study, no matter how carefully designed and executed, can claim to provide conclusive evidence about complex issues such as the relative merits and practical impact of programming language features. Scientific inductive knowledge grows slowly, one little piece of evidence at a time. Technological factors further complicate the picture, since they may shift the importance of certain features and render some obsolete (for example, the exponential growth of processor speed in past decades has favored the development of feature-rich graphical user interfaces that were previously impractical). Yet, it is encouraging that we have found significant concordance between our empirical results and others’ point of view. A lot of what you learned from the classics about programming language was right — as long as you picked the right classics!

References

  1. Sebastian Nanz and Carlo A. Furia: A comparative study of programming languages in Rosetta Code. Technical report arXiv.org:1409.0252, September 2014.
  2. Matthias Felleisen: On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35-75, 1991.
  3. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and interpretation of computer programs. 2nd edition, MIT Press, 1996.
  4. Eric S. Raymond: The art of UNIX programming. Addison-Wesley, 2003. Available online.
  5. Carlo Ghezzi and Mehdi Jazayeri: Programming language concepts. 3rd edition, Wiley & Sons, 1997.

How do you determine, empirically, if a theory is correct?

A standard approach uses the toolset of statistical hypothesis testing, whose modern formulation was introduced by influential figures such as Fisher and Pearson around a hundred years ago. The first step consists in expressing the absence of an effect, the one we’re investigating empirically, as a null hypothesis. Then, a mathematical toolset exists that assesses the likelihood that an observed sample (the empirical data) is drawn under the assumption that the null hypothesis holds. Likelihood is normally given as a p-value — the probability of the observation assuming the null hypothesis. A small p-value (say, p < 0.01[/latex]) means that the observation would be very unlikely to occur under the null hypothesis. This is compelling evidence <em>against</em> the hypothesis, which can be <em>rejected</em> with confidence: evidence suggests the effect is real.       To make the discussion concrete, let me introduce an example taken from my own research. <a href="http://se.inf.ethz.ch/people/nordio/">Martin Nordio</a> conceived the idea of assessing the impact of using <a href="http://en.wikipedia.org/wiki/Agile_software_development">agile</a> development processes (such as <a href="http://en.wikipedia.org/wiki/Scrum_%28software_development%29">Scrum</a>) on the success of software projects carried out by distributed teams. With <a href="http://se.inf.ethz.ch/people/estler/">Christian Estler</a>, he set up a broad data collection effort where they queried information about more than sixty software projects worldwide. The overall goal of the study was to collect empirical evidence that following agile processes, as opposed to more traditional development processes, <em>does make</em> a difference. To this end, the null hypothesis was roughly:   <blockquote>    There is <b>no</b> difference in [latex]X for projects developed following agile processes compared to projects developed using other processes.

We instantiated X for the several aspects of project outcome gathered in the data collection, including "overall success", team "motivation" and "amount of communication", and "importance" for customers. This gave us not one but six null hypotheses (extended to sixteen in the follow-up version of the work) to be rejected by the data.

Much to our dismay, when we sat down to analyze the data we found no evidence whatsoever to disprove any of the hypotheses. The p-values were in the range 0.25..0.97 — even greater than 0.75 for four of the hypotheses! Was the collected data any good?

More generally, what can one conclude from lack of evidence against the null hypothesis, to wit large p-values? In the orthodox Fisherian interpretation: nothing except that the available data is inconclusive. In Fisher's own words [Fisher, 1935, emphasis added]:

the null hypothesis is never proved or established [...] every experiment [...] exist[s] only in order to give the facts a chance of disproving the null hypothesis

Such an approach, were facts are assessed only by disproving hypotheses, resonates with Popper's notion of falsifiability, where only theories that can be conclusively rejected by experiments are considered scientific. The closeness of dates when Fisher's and Popper's works became popular is an inkling that the similarity may not be entirely coincidental.

And yet, the reality of science is a bit more nuanced than that. Regarding falsifiability, it is generally false (ahem 🙂 ) that empirical falsifiability is a sine qua non for a theory to be scientific. As Sean Carroll nicely explains, the most theoretical branches of physics routinely deal with models that are not testable (at least as of yet) but nonetheless fit squarely into the practice of science as structured interpretation of data. Science is as scientist does.

But even for theories that are well within the canon of testability and falsifiability, there are situations where the lack of evidence against a hypothesis is indicative of the hypothesis being valid. To find these, we should look beyond individual experiments, to the bigger picture of accumulating evidence. As experiments searching for an effect are improved and repeated on larger samples, if the effect is genuinely present it should manifest itself more strongly as the experimental methodology sharpens. In contrast, if significance keeps waning, it may well be a sign that the effect is absent. In such contexts, persistent lack of significance may be even more informative than its occasional presence. Feynman made similar observations about what he called "cargo cult science" [Feynman, 1985], where effects becoming smaller as experimental techniques improve is a sure sign of the effects being immaterial.

We drew conclusions along these lines in the study of agile processes. Ours was not the first empirical analysis of the effect of agile practices. Overall, the other studies provided mixed, sometimes conflicting, evidence in favor of agile processes. Our study stood out in terms of size of the samples (most other studies target only a handful of projects) and variety of tested hypotheses. In this context, the fact that the data completely failed to disprove any of the hypothesis was telling us more than just "try again". It suggested that agile processes have no definite absolute edge over traditional processes; they can, and often do, work very well, but so can traditional processes if applied competently.

These results, presented at ICGSE 2012 in Brazil and later in an extended article in Empirical Software Engineering, stirred some controversy — Martin Nordio's presentation, where he posited Maradona a better footballer than Pelé to an audience with a majority of Brazilian people, may have contributed some to the stirring. Enthusiasts of agile processes disputed our conclusions, reporting direct experiences with successful projects that followed agile processes. Our study should not, however, be interpreted as a dismissal of agile processes. Instead, it suggests that agile and other processes can be equally effective for distributed software development. But neither are likely to be silver bullets whose mere application, independent of the characteristics and skills of teams, guarantees success. Overall, the paper was generally well received and it went on to win the best paper award of ICGSE 2012.

There are two bottom lines to this post's story. First: ironically, statistics seems to suggest that Martin might have been wrong about Maradona. Second: when your p-values are too large don't throw them away just yet! They may well be indirectly contributing to sustaining scientific evidence.

References

  1. Ronald A. Fisher: The design of experiments, 1935. Most recent edition: 1971, Macmillan Pub Co.
  2. Richard Feynman, Ralph Leighton, contributor, Edward Hutchings, editor: Surely You're Joking, Mr. Feynman!: Adventures of a Curious Character, 1985. The chapter "Cargo cult science" is available online.