This summer, Sebastian Nanz and I have finally figured out what the best programming language is. The answer is…
Of course you immediately understood that the incipit is a joke. When it comes to complex feature-laden artifacts like general-purpose programming languages there is no such thing as the best tool for the job. In the reality of software development, different programmers with different skills, different attitudes, and different mindsets solve different problems following different methods, different practices, and different processes in different environments, using different tools and different programming languages. As a result, each programming language design strives to find trade-offs that are convenient to someone in the motley crowd of programmers.
Still, the trade-offs of different languages should be demonstrable by impacting measurable features of programs written in those languages. In this recent work [Nanz and Furia, 2014], we have tried to contribute empirical evidence to better our understanding of how programming languages can be used in practice. One of the aspects we were interested in investigating was whether one can find empirical evidence to justify some of the folk knowledge about programming languages, which is very often passed on as a series of ipse dixit that should be self-evident — except that sometimes different authorities have dissenting opinions!
Before summarizing the results of our study, here’s something about the methodology. An important decision was to use Rosetta Code as the source of raw data in the form of programs. Rather than hosting full projects — a service provided by other sites such as GitHub and Bitbucket — Rosetta Code focuses on well-defined programming tasks that can be implemented by small programs (the average program in our study is around 30 lines of code); this makes implementations of the same task in different programming languages directly comparable. The revision history and submission practices of Rosetta Code also suggest that programs are often revised by multiple programmers, and hence likely have a good quality on average; and the task list includes many relevant problems that are often part of large real-world projects. This setup helped make sound inter-language comparisons based on proper language usage, thus reducing dispersion and bias in the data. Based on a combination of their popularity in TIOBE and Rosetta Code, we selected 8 languages in four categories: C and Go as procedural languages; C# and Java as object-oriented languages; F# and Haskell as functional languages; and Python and Ruby as scripting languages. If your favorite language is not there, do not despair: let us know in the comments why you think it deserves to be included; we might consider it for future work.
Let’s have a look at the main results (see the paper for all the details). The biggest surprise is that there are no huge surprises: well-trained programmers and software engineers will recognize several well-known adages about the advantages of certain programming language features over others. To make this apparent, I’ve collected excerpts from classic texts on programming languages that somewhat match our empirical findings.
Conciseness
It is generally understood that practical expressiveness boils down to conciseness:
The main benefit of the use of expressive languages seems to be the ability to abstract from programming patterns with simple statements and to state the purpose of a program in the concisest possible manner.
We have come to believe that the major negative consequence of a lack of expressiveness is the abundance of programming patterns to make up for the missing, non-expressible constructs.
[Felleisen, 1991]
Higher-order features such as list comprehensions, reflection, higher-order functions, and idiomatic support for lists and maps should increase the level of practical expressiveness, and hence conciseness:
Higher-order procedures can serve as powerful abstraction mechanisms, vastly increasing the expressive power of our language. [Pg. 75]
[…] expressive power […] is attained […] by accumulation and filtering [on lists]. [Pg. 81]
Elevate the conceptual level at which we can design our programs [means enhancing] the expressive power of our language. [Pg. 108]
[Abelson and Sussman, 1996]
Such higher-order features are more readily available in functional and scripting languages than in imperative languages:
Against Java, we can say that (compared to, say, Python) some parts of it appear over-complex and others deficient.
[Pg. 340 in Raymond, 2003]
We measured conciseness in terms of lines of code, comparing solutions in each language against those in other languages. Our numbers draw a picture that is largely consistent with the above quotations: functional and scripting languages provide significantly more concise code than procedural and object-oriented languages. Their higher-order features increase practical expressiveness, to wit, conciseness. While in principle one can follow a functional style using object-oriented languages, it is idiomatic support that seems to make a tangible difference in practice.
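To make the role of idiomatic higher-order support concrete, here is a small illustration of my own (in Python, and not one of the study’s tasks): the same computation written first in an explicit procedural style, and then as a single idiomatic expression.

```python
# Illustrative only (not a task from the study): sum the squares of the even
# numbers in a list, written twice in Python.
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

# Procedural style: the loop, the accumulator, and the test are all spelled out.
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n * n

# Idiomatic higher-order style: the same computation as one expression,
# stating the purpose of the code rather than the mechanics of the iteration.
total_idiomatic = sum(n * n for n in numbers if n % 2 == 0)

assert total == total_idiomatic
```

The second version is not just shorter; it abstracts the iteration pattern away, which is exactly the kind of practical expressiveness the quotations above describe.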
Performance
Performance is another often controversial programming language feature. We tried to contribute to the understanding of performance in practice by distinguishing between two kinds of tests. Much of the controversy when discussing performance may derive from conflating these two kinds of problems, which represent very different conditions.
The first kind of performance comparison targets raw execution speed on large inputs; for example, sorting million-element arrays or compressing tens of megabytes of data. The outcome of our experiments using Rosetta Code tasks on such problems is what most people would expect: C is indisputably the fastest — if it were a race, it’d lap all other languages. A bit more generally, language features cost raw speed, and more features tend to cost more speed. In fact, the only runner-up (still from a distance) is Go, a language that is richer than C — it offers automatic memory management and strong typing — but deliberately renounces other expressive features, such as inheritance and genericity, that have become commonplace in modern high-level programming languages.
Programs that require maximum speed […] are good candidates for C. [Pg. 326]
Python cannot compete with C or C++ on raw execution speed. [Pg. 337]
[Raymond, 2003]
[The] main problem [of automatic memory management], however, is that “useful” processing time is lost when the garbage collector is invoked.
[Pg. 168 in Ghezzi and Jazayeri, 1997]
Most of the time, however, the extreme differences in raw speed that emerge with algorithmically-intensive programs on large inputs do not matter much, because such jobs are absent or extremely infrequent in the overwhelming majority of applications, which hardly ever have to deal with number crunching. How many million-element arrays did your web browser have to sort while you were browsing the news? To understand the performance differences that occur more commonly in everyday conditions, we identified a second kind of target for comparison, consisting of well-defined problems on inputs of moderate size, such as checksum algorithms and string manipulation tasks. The results are quite different when we consider these everyday problems: scripting and functional languages tend to emerge as the fastest, even surpassing C. More generally, the absolute differences between languages are smallish, which means that every language is usable, and engineering concerns other than raw speed emerge as more relevant.
Most applications do not actually need better performance than Python offers.
[Pg. 337 in Raymond, 2003]
To sum up, the most significant, and somewhat neglected, finding that surfaced from our performance comparisons is this distinction between “raw speed” and “everyday” performance requirements, and the corresponding emphasis on the latter for most workaday programming tasks.
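To get a feel for this distinction on your own machine, here is a minimal sketch in Python (my own illustration, not the study’s benchmarking harness; the particular tasks and input sizes are arbitrary choices):

```python
# A minimal sketch (not the paper's benchmark setup) contrasting the two kinds
# of performance test discussed above.
import hashlib
import random
import time

def timed(label, thunk):
    """Run thunk once and print a rough wall-clock measurement."""
    start = time.perf_counter()
    thunk()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# "Raw speed" test: algorithmically intensive work on a large input.
big = [random.random() for _ in range(1_000_000)]
timed("sort one million floats", lambda: sorted(big))

# "Everyday" test: a checksum over an input of moderate size.
blob = bytes(random.getrandbits(8) for _ in range(100_000))
timed("SHA-256 of 100 kB", lambda: hashlib.sha256(blob).hexdigest())
```

On the second kind of task, absolute running times are small for every language, which is the “everyday” regime where raw speed stops being the dominant engineering concern.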
Failure proneness
Counting faults (or, more appropriately for this blog, bugs) is often used to measure the quality or success of software projects. The errors that are of principal interest in that context are those resulting from program behavior that diverges from specification; for example, a banking application that increases your balance when you withdraw money (although this is a bug most of us could live with 🙂 ). In contrast, our comparison of programming languages looked for errors that are independent of a program’s specification and have to do with what we might characterize as well-formedness, such as typing and compatibility errors. This is an admittedly restricted notion of error, but it lends itself to effective detection and analysis. The classic view on the problem of detecting such errors is clear:
[Checking for the presence of well-formedness errors] can be accomplished in different ways that can be classified in two broad categories: static and dynamic. […] In general, if a check can be performed statically, it is preferable to do so instead of delaying the check to run time […].
[Pg. 137 in Ghezzi and Jazayeri, 1997]
To see which languages follow this prescription and tend to perform more checks statically, we counted, for each language, what fraction of the programs that compile correctly also terminate without error (exit status zero). The compiled strongly-typed languages (that is, all compiled languages but C, which is weakly typed) clearly emerged as those with the fewest runtime failures triggered in our experiments; their compilers do a fairly good job of catching errors at compile time by type checking and other static analyses. In contrast, the interpreted languages triggered runtime failures more frequently: nearly all checks but syntactic ones are done at runtime, when things can go wrong in many different ways.
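In case you are curious how such a measurement can be automated, here is a minimal sketch in Python (an illustration of the general idea, not our actual scripts; in particular, how timeouts are treated is an assumption of the sketch):

```python
# A minimal sketch, not the study's actual tooling: run every program that
# compiled in a given language and count non-zero exit statuses.
import subprocess

def failure_rate(commands, timeout=60):
    """commands: a list of argv lists, one per successfully compiled program."""
    failures = 0
    for argv in commands:
        try:
            result = subprocess.run(argv, capture_output=True, timeout=timeout)
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            failures += 1  # treating a hung program as a failure is a choice of this sketch
    return failures / len(commands)

# Hypothetical usage, one entry per task solution already built for its language:
# print(failure_rate([["./fib"], ["java", "Fib"], ["python3", "fib.py"]]))
```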
Go was the least failure prone of the compiled strongly-typed languages. Given that we analyzed a large number of programs written by different contributors, we are reluctant to explain this difference mainly by attributing better programming skills to the authors of the Go programs in our sample. Instead, this result refines the picture of what compiled strongly-typed languages can achieve: Go’s type system is more restricted than those of functional or object-oriented languages, which may help achieve type safety by minimizing the dark corners where errors may originate.
Our results on failure proneness cannot settle the debate about static vs. dynamic checks. This is another issue where a multitude of technical and organizational concerns concur to create several different local optima.
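To see concretely what kind of error lies behind these numbers, consider a toy example of my own (in Python, not from our data set): a well-formedness error that a static type checker would reject before the program ever runs, but that a dynamically checked language reports only when the faulty branch happens to execute.

```python
# Toy illustration: a type error that static checking would catch at compile
# time, but that dynamic checking only reports when the branch is executed.
def describe(n):
    if n > 100:
        return "big: " + n        # TypeError: str + int, latent until reached
    return "small: " + str(n)

print(describe(7))    # runs fine; the error above goes unnoticed
print(describe(101))  # fails at runtime with a TypeError
```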
Be fruitful and replicate
No single study, no matter how carefully designed and executed, can claim to provide conclusive evidence about complex issues such as the relative merits and practical impact of programming language features. Scientific inductive knowledge grows slowly, one little piece of evidence at a time. Technological factors further complicate the picture, since they may shift the importance of certain features and render some obsolete (for example, the exponential growth of processor speed in past decades has favored the development of feature-rich graphical user interfaces that were previously impractical). Yet, it is encouraging that we have found significant concordance between our empirical results and others’ point of view. A lot of what you learned from the classics about programming languages was right — as long as you picked the right classics!
References
- Sebastian Nanz and Carlo A. Furia: A comparative study of programming languages in Rosetta Code. Technical report arXiv:1409.0252, September 2014.
- Matthias Felleisen: On the expressive power of programming languages. Science of Computer Programming, 17(1-3):35-75, 1991.
- Harold Abelson, Gerald Jay Sussman, and Julie Sussman: Structure and interpretation of computer programs. 2nd edition, MIT Press, 1996.
- Eric S. Raymond: The art of UNIX programming. Addison-Wesley, 2003. Available online.
- Carlo Ghezzi and Mehdi Jazayeri: Programming language concepts. 3rd edition, Wiley & Sons, 1997.
Nice! But why don’t you add the very nice graphs from the paper?
hce #1:
Thanks! I didn’t add material from the report because I didn’t want to make the post too long. Also, I tried to keep the presentation here as complementary as possible to the one in the report. This gives me the chance to remind readers that they should check out the report for all the quantitative details.
Hi, I read the pdf version where you write “Python, in particular, is significantly more failure prone than every other language”.
I couldn’t quite make out from the paper how failure is measured. Would you elaborate?
Are you saying that many of the Python examples, once you had automatically determined which version of Python they should be run on, then returned a non-zero exit value, or are you counting the exit values from cases where the wrong version of Python interpreter was used?
Thanks.
Another “possible threat to external validity” is that Rosetta Code language examples rarely use specialised frameworks. For example, Python has tools to marshal libraries written in other languages as well as Python, improving execution speed as well as programmer productivity; numpy, scipy, Biopython, and PyPy, for example, are used in this way, but hardly appear on Rosetta Code at all, and may not fit well with your metrics. If scipy wraps a Fortran library to solve a problem where the programmer writes in Python, then which language do you attribute the results to? Many scripting languages are made to embed libraries from other languages in this way (or to be embedded themselves). When looking for more speed, many engineers don’t automatically move to C, as there is still a balance to be achieved between accessibility and speed that makes such script-controlled routes attractive.
It would be interesting to add a comparison of old languages, for example Common Lisp and Smalltalk.
The question here is whether newer languages bring improvements for any of your metrics, or if the only difference really is syntax and current popularity.
greetings, eMBee.
paddy3118 #80:
The first interpretation is correct: in the compilation stage, a script tried to determine which version of Python each example compiles under. Then, we ran all examples that compiled successfully in this way (using the appropriate version of the Python interpreter) and counted how many of them terminated with a non-zero exit value.
This process was automated but also supplemented by substantial manual scrutiny. We manually checked all examples that failed compilation (with both Python 2.x and 3.x) to see if there was any immediate problem that prevented compilation and could be easily fixed. We also checked most examples that returned a non-zero exit value to see, for example, whether the compilation script had detected the incorrect Python version. This way, we have reasonable confidence that the bulk of the data about failures is sound.
Finally, note that we use a fairly restricted notion of failure: basically, a failure is any error that the compiler didn’t catch but that occurred at runtime. It’s not surprising that, roughly and all else being equal, the lighter the checks made by the compiler, the more failures manifest themselves at runtime.
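For the curious, the gist of the pipeline can be sketched in a few lines of Python (a simplification for illustration, not our actual script; the interpreter names are placeholders):

```python
# Sketch of the idea: find an interpreter version under which an example at
# least byte-compiles, then run it under that version and record the exit status.
import subprocess

def run_python_example(path, interpreters=("python2", "python3")):
    for python in interpreters:
        # "Compilation" stage: does the source byte-compile under this version?
        check = subprocess.run([python, "-m", "py_compile", path],
                               capture_output=True)
        if check.returncode == 0:
            result = subprocess.run([python, path], capture_output=True)
            return python, result.returncode  # non-zero counts as a runtime failure
    return None, None  # compiled under neither version
```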
Thanks Carlo #124 for taking the time to answer. It would be great if you could give me a list of failing Python examples and I will try and get them updated on RC.
Thanks again.