I’m very glad to host a guest post by Nadia Polikarpova, a recent PhD graduate of our group on her way to a post-doctoral position at MIT. Nadia is definitely into bug counting, even though her ultimate ambition is rather to exterminate them (for the casual reader: we’re talking about software bugs — no animals were harmed in the making of her research). Thanks to this guest post of hers, the blog finally goes back to its core interests!

      — Carlo A.

Dimensions in program verification

Model checking, abstract interpretation, program logics, symbolic execution — if you are new to program verification, the variety of techniques it offers might be intimidating. Each area seems to start from a different set of basic principles and use its own terminology, which makes classification and comparison difficult.

To make matters worse, verification experts are often partisans of their school of thought: for an example, see a slide from a recent talk by Patrick Cousot. Abstract interpretation devotees may look down on other approaches, such as auto-active verification, that do not easily scale to systems with millions of lines of code. Auto-active verification researchers, like myself, may make fun of abstract interpretation folks for being stuck with array-out-of-bounds errors, while we can prove full functional correctness (which, in turn, is uninteresting to model-checking experts, who care about temporal properties). The acolytes of the dark art of interactive theorem proving (such as Coq and Isabelle) don’t take any other approach seriously unless it comes with an air-tight proof of soundness. Meanwhile, testing people consider all sound techniques pure intellectual exercise, since real programmers are not going to put up with false positives…

How do you make sense of all this and learn to appreciate relative merits and fundamental limitations of different techniques? Here is my short guide to program verification, which I worked out after studying the area for nearly six years.

There are three dimensions to a program verification technique: soundness, flexibility, and automation. Soundness (level of assurance) determines how certain we are that a program is correct if verification succeeds. Flexibility (expressiveness) measures the diversity of programs and correctness properties that the approach can handle. Automation accounts for the amount of human effort — beyond writing the program — that is needed to carry out verification. As is often the case, you can pick any two, but you cannot have all three.

The verification cube

For example, abstract interpretation is sound and fully automatic, but the kind of properties and programs it can check effectively is limited by a predefined set of abstract domains encoded in the analyzer. Bounded model checking, on the other hand, is flexible and fairly automatic — users still have to formalize requirements — but in general cannot provide a correctness guarantee for all inputs (note that we can also trade flexibility for soundness by limiting ourselves to analyzing programs with a finite state space). Deductive verification techniques are flexible and sound, but require a fair amount of manual guidance; among deductive tools, interactive proof assistants are pushing the first two dimensions to the extreme (with higher-order specification languages and tiny trusted cores), while auto-active tools are prepared to sacrifice a bit of both to retain some automation. Unbounded (CEGAR-based) model checking is sound, automatic, and inflexible; symbolic execution is flexible, automatic, and unsound.

I’m not aware of any technique that doesn’t lie close to one of those three corners of the “verification cube”. Why is this the case? Intuitively, the reason is simple: any program analysis can only consider a finite number of states, so you get to choose between analyzing only programs with finitely many states (thus losing flexibility), considering only a subset of all states (losing soundness), or over-approximating an infinite set of states by a finite set of abstract states. Abstraction, in turn, can be performed manually (this is what loop invariants are for!) or automatically (as in abstract interpretation); the former kills automation, while the latter is limited to a restricted set of properties, which kills flexibility. Naturally, acing all three dimensions has always been the holy grail of program verification. This is where all the approaches are converging, albeit taking very different routes, with each technique improving along its weak dimension little by little. Static analyzers are enriched with new abstract domains; model checkers can handle ever larger state spaces; auto-active tools compete in economy of required user annotations; proof assistants boast more and more sophisticated automated tactics. They are all working towards the same goal — still far away from the finish line — and they still differ vastly in terms of strengths and weaknesses, application domains, and user experience.
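To make the role of loop invariants as manual abstractions a bit more concrete, here is a toy sketch of my own in Python, not tied to any particular verifier: the invariant summarizes the infinitely many possible loop states in a single formula. A deductive tool asks the programmer to state exactly this kind of fact and then proves it once, statically, for all inputs; the asserts below merely check it at runtime for the inputs we happen to try.

```python
# Toy example: summing the first n natural numbers.
def sum_up_to(n: int) -> int:
    total, i = 0, 0
    while i < n:
        # Manually supplied abstraction of the infinitely many possible loop
        # states: total always equals 0 + 1 + ... + (i - 1), whatever n is.
        assert total == i * (i - 1) // 2, "loop invariant violated"
        total += i
        i += 1
    # Postcondition, checked only for this particular run.
    assert total == n * (n - 1) // 2
    return total

print(sum_up_to(10))  # prints 45
```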

So, next time you hear someone say that their program verification technique is better than all the rest, just remember the cube.

      — Nadia

This summer, Sebastian Nanz and I have finally figured out what the best programming language is. The answer is…

Of course you immediately understood that the incipit is a joke. When it comes to complex feature-laden artifacts like general-purpose programming languages there is no such thing as the best tool for the job. In the reality of software development, different programmers with different skills, different attitudes, and different mindsets solve different problems following different methods, different practices, and different processes in different environments, using different tools and different programming languages. As a result, each programming language design strives to find trade-offs that are convenient to someone in the motley crowd of programmers.

Still, the trade-offs of different languages should be demonstrable by impacting measurable features of programs written in those languages. In this recent work [Nanz and Furia, 2014], we have tried to contribute empirical evidence to better our understanding of how programming languages can be used in practice. One of the aspects we were interested in investigating was whether one can find empirical evidence to justify some of the folk knowledge about programming languages, which is very often passed on as a series of ipse dixit that should be self-evident — except that sometimes different authorities have dissenting opinions!

Before summarizing the results of our study, here’s something about the methodology. An important decision was to use Rosetta Code as the source of raw data in the form of programs. Rather than hosting full projects — a service provided by other sites such as GitHub and Bitbucket — Rosetta Code focuses on well-defined programming tasks that can be implemented by small programs (the average program in our study is around 30 lines of code); this makes implementations of the same task in different programming languages directly comparable. The revision history and submission practices of Rosetta Code also suggest that programs are often revised by multiple programmers, and hence likely have a good quality on average; and the task list includes many relevant problems that are often part of large real-world projects. This setup helped make sound inter-language comparisons based on proper language usage, thus reducing dispersion and bias in the data. Based on a combination of their popularity in TIOBE and Rosetta Code, we selected 8 languages in four categories: C and Go as procedural languages; C# and Java as object-oriented languages; F# and Haskell as functional languages; and Python and Ruby as scripting languages. If your favorite language is not there, do not despair: let us know in the comments why you think it deserves to be included; we might consider it for future work.

Let’s have a look at the main results (see the paper for all the details). The biggest surprise is that there are no huge surprises: well-trained programmers and software engineers will recognize several well-known adages about the advantages of certain programming language features over others. To make this apparent, I’ve collected excerpts from classic texts on programming languages that somewhat match our empirical findings.

Conciseness

It is generally understood that practical expressiveness boils down to conciseness:

The main benefit of the use of expressive languages seems to be the ability to abstract from programming patterns with simple statements and to state the purpose of a program in the concisest possible manner.

We have come to believe that the major negative consequence of a lack of expressiveness is the abundance of programming patterns to make up for the missing, non-expressible constructs.

[Felleisen, 1991]

Higher-order features such as list comprehensions, reflection, higher-order functions, and idiomatic support for lists and maps should increase the level of practical expressiveness, and hence conciseness:

Higher-order procedures can serve as powerful abstraction mechanisms, vastly increasing the expressive power of our language. [Pg. 75]

[…] expressive power […] is attained […] by accumulation and filtering [on lists]. [Pg. 81]

Elevate the conceptual level at which we can design our programs [means enhancing] the expressive power of our language. [Pg. 108]

[Abelson and Sussman, 1996]

Such higher-order features are more readily available in functional and scripting languages than imperative languages:

Against Java, we can say that (compared to, say, Python) some parts of it appear over-complex and others deficient.

[Pg. 340 in Raymond, 2003]

We measured conciseness in terms of lines of code, comparing solutions in each language against those in other languages. Our numbers draw a picture that is largely consistent with the above quotations: functional and scripting languages provide significantly more concise code than procedural and object-oriented languages. Their higher-order features increase practical expressiveness, to wit, conciseness. While in principle one can follow a functional style using object-oriented languages, it is idiomatic support that seems to make a tangible difference in practice.
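As a small illustration of what idiomatic higher-order support buys you, consider the following made-up miniature task (it is not one of the Rosetta Code tasks from the study, just a sketch): the comprehension-based version expresses in one line what the explicit-loop version spells out over several.

```python
# Made-up miniature task: keep the squares of the even numbers in a list,
# sorted in descending order.
data = [5, 2, 8, 3, 6]

# Idiomatic style with a comprehension and higher-order sorting:
squares_concise = sorted((x * x for x in data if x % 2 == 0), reverse=True)

# The same computation spelled out imperatively, as one would write it in a
# language without such constructs; several lines for the same result.
squares_verbose = []
for x in data:
    if x % 2 == 0:
        squares_verbose.append(x * x)
squares_verbose.sort()
squares_verbose.reverse()

assert squares_concise == squares_verbose == [64, 36, 4]
```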

Performance

Performance is another often controversial aspect of programming languages. We tried to contribute to the understanding of performance in practice by distinguishing between two kinds of tests. Much of the controversy when discussing performance may derive from conflating them, since they represent very different conditions.

The first kind of performance comparison targets raw execution speed on large inputs; for example, sorting million-element arrays or compressing tens of megabytes of data. The outcome of our experiments using Rosetta Code tasks on such problems is what most people would expect: C is indisputably the fastest — if it were a race, it’d lap all other languages. A bit more generally, language features cost raw speed, and more features tend to cost more speed. In fact, the only runner-up (still from a distance) is Go, a language that is richer than C — it offers automatic memory management and strong typing — but deliberately renounces other expressive features, such as inheritance and genericity, that have become commonplace in modern high-level programming languages.

Programs that require maximum speed […] are good candidates for C. [Pg. 326]

Python cannot compete with C or C++ on raw execution speed. [Pg. 337]

[Raymond, 2003]

[The] main problem [of automatic memory management], however, is that “useful” processing time is lost when the garbage collector is invoked.

[Pg. 168 in Ghezzi and Jazayeri, 1997]

Most of the time, however, the extreme differences in raw speed that emerge with algorithmically-intensive programs on large inputs do not matter much, because such jobs are absent or extremely infrequent in the overwhelming majority of applications, which hardly ever have to deal with number crunching. How many million-element arrays did your web browser have to sort while you were browsing the news? To understand performance differences more commonly occurring in everyday conditions, we identified a second kind of target for comparison, consisting of well-defined problems on inputs of moderate size, such as checksum algorithms and string manipulation tasks. The results are quite different when we consider this second kind of everyday problem. Scripting and functional languages tend to emerge as the fastest, even surpassing C. More generally, the absolute differences between languages are smallish, which means that every language is usable, and engineering concerns other than raw speed emerge as more relevant.

Most applications do not actually need better performance than Python offers.

[Pg. 337 in Raymond, 2003]

To sum up, the most significant, and somewhat neglected, finding that surfaced from our performance comparisons is this distinction between “raw speed” and “everyday” performance requirements, and the corresponding emphasis on the latter for most workaday programming tasks.
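For readers who want to get a feel for the two kinds of workloads, here is a back-of-the-envelope timing sketch in Python; it is only an illustration, not the benchmarking harness we used in the study, and the input sizes are arbitrary.

```python
# Contrast a "raw speed" job on a large input with an "everyday" job on a
# moderate one. Absolute numbers depend on your machine and interpreter.
import hashlib
import random
import time


def timed(label, thunk):
    start = time.perf_counter()
    thunk()
    print(f"{label}: {time.perf_counter() - start:.3f} s")


# Raw-speed kind: algorithmically intensive work on a large input.
big = [random.random() for _ in range(1_000_000)]
timed("sort a million floats", lambda: sorted(big))

# Everyday kind: a checksum over a moderate amount of data.
small = bytes(random.getrandbits(8) for _ in range(100_000))
timed("hash 100 kB", lambda: hashlib.sha256(small).hexdigest())
```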

Failure proneness

Counting faults (or, more appropriately for this blog, bugs) is often used to measure the quality or success of software projects. The errors that are of principal interest in that context are those resulting from program behavior that diverges from specification; for example, a banking application that increases your balance when you withdraw money (although this is a bug most of us could live with 🙂 ). In contrast, our comparison of programming languages looked for errors that are independent of a program’s specification and have to do with what we might characterize as well-formedness, such as typing and compatibility errors. This is an admittedly restricted notion of error, but it lends itself to effective detection and analysis. The classic view on the problem of detecting such errors is clear:

[Checking for the presence of well-formedness errors] can be accomplished in different ways that can be classified in two broad categories: static and dynamic. […] In general, if a check can be performed statically, it is preferable to do so instead of delaying the check to run time […].

[Pg. 137 in Ghezzi and Jazayeri, 1997]

To see which languages follow this prescription and tend to perform more checks statically, we counted, for each language, the fraction of programs that compile correctly and also terminate without error (exit status zero). The compiled strongly-typed languages (that is, all compiled languages but C, which is weakly typed) clearly emerged as those with the fewest runtime failures triggered in our experiments; their compilers do a fairly good job at catching errors at compile time by type checking and other static analyses. In contrast, the interpreted languages triggered runtime failures more frequently; nearly all checks but syntactic ones are done at runtime, when things can go wrong in many different ways.
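To give an idea of how such a measurement can be carried out, here is a rough sketch in Python; the directory layout (solutions/<language>/ containing one executable per task) is hypothetical, and these are not the actual scripts used in the study.

```python
# Run each compiled solution and count how many terminate with exit status zero.
import pathlib
import subprocess

for lang_dir in sorted(pathlib.Path("solutions").iterdir()):
    if not lang_dir.is_dir():
        continue
    runs, clean_exits = 0, 0
    for exe in sorted(lang_dir.iterdir()):
        runs += 1
        try:
            result = subprocess.run([str(exe)], capture_output=True, timeout=60)
            if result.returncode == 0:  # exit status zero: no runtime failure
                clean_exits += 1
        except subprocess.TimeoutExpired:
            pass  # treat a hung program as a failure as well
    if runs:
        print(f"{lang_dir.name}: {clean_exits / runs:.1%} of programs exit cleanly")
```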

Go was the least failure prone of the compiled strongly-typed languages. Given that we analyzed a large number of programs written by different contributors, we are reluctant to explain this difference mainly by attributing better programming skills to the authors of Go programs in our sample. Instead, this result refines the picture about what compiled strongly-typed languages can achieve: Go’s type system is more restricted than that of functional or object-oriented languages, which may help achieve type safety by minimizing dark corners where errors may originate.

Our results on failure proneness cannot settle the debate about static vs. dynamic checks. This is another issue where a multitude of technical and organizational concerns concur to create several different local optima.

Be fruitful and replicate

No single study, no matter how carefully designed and executed, can claim to provide conclusive evidence about complex issues such as the relative merits and practical impact of programming language features. Scientific inductive knowledge grows slowly, one little piece of evidence at a time. Technological factors further complicate the picture, since they may shift the importance of certain features and render some obsolete (for example, the exponential growth of processor speed in past decades has favored the development of feature-rich graphical user interfaces that were previously impractical). Yet, it is encouraging that we have found significant concordance between our empirical results and others’ points of view. A lot of what you learned from the classics about programming languages was right — as long as you picked the right classics!

References

  1. Sebastian Nanz and Carlo A. Furia: A comparative study of programming languages in Rosetta Code. Technical report arXiv.org:1409.0252, September 2014.
  2. Matthias Felleisen: On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35-75, 1991.
  3. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and interpretation of computer programs. 2nd edition, MIT Press, 1996.
  4. Eric S. Raymond: The art of UNIX programming. Addison-Wesley, 2003. Available online.
  5. Carlo Ghezzi and Mehdi Jazayeri: Programming language concepts. 3rd edition, Wiley & Sons, 1997.