counting bugs « Bug counting

During the spring semester, I supervised three groups of students working on master’s thesis projects that I proposed. Tackling quite different problems, they all did an excellent job, and it was fun and interesting to supervise their work and to follow their results. In this post, I’m giving a brief account of their work now that they are all done. Below, you can also find links to the official thesis reports in Chalmers’s Library. While we’re at it: if you’re a current student looking for a master’s thesis project, make sure to check out the projects I currently offer or drop me a line.

The spec is out there — Extracting contracts from code

Pontus Doverstav and Christoffer Medin developed a tool that extracts partial JML contracts (a kind of formal specification) from Java code. Their approach differs from the most common techniques for specification inference, which ultimately reason about the semantics of code (for example, abstract interpretation techniques abstract the concrete semantics of a program on simpler domains). In contrast, Pontus and Christoffer’s work is syntax-driven, as it is based on a collection of patterns that capture how certain code implicitly suggests simple contracts. For example, a qualified call t.foo() implicitly requires that t != null; this assertion is propagated to the precondition of the method where the call appears.

The whole approach is deliberately kept simple and best effort. For instance, whenever there is a method call, we try to determine its effects based on the callee’s (partial) contract; if no callee contract is available, we just assume a worst-case behavior to preserve soundness. In the worst case, an analyzed method may be marked as “failure” by the tool, denoting that no meaningful specification could be reliably extracted. An advantage of keeping the analysis so simple is that we can apply it extensively and retain any result that is useful. When they applied their tool to thousands of methods in four open-source projects, Christoffer and Pontus found that it could extract more than one JML specification case per method on average. Many extracted contracts are simple, but a few are quite complex — such as one capturing in detail the behavior of a large switch statement covering 256 cases. The thesis also reports on some preliminary evidence that suggests the contracts extracted this way can be complementary to those a programmer would write, thus helping the analysis of programs and other applications driven by specifications.

Comparing programming languages in Google Code Jam

Alexandra Back and Emma Westman analyzed empirically the solutions submitted to Google’s Code Jam programming contest, with the overall goal of comparing features of different programming languages. The main motivation behind targeting Code Jam was finding a collection of programs that all implement the same functionality (whatever is required by each round in GCJ) in different programming languages (contestants can use any programming language of their choice), while being able to control for quality — the intuition being that a solution’s rank in the contest is strongly indicative of its quality in terms of correctness and efficiency.

The study focused on the five programming languages most widely used in the contest — C, C++, C#, Java, and Python — and analyzed over 210’000 solution programs. Since a contestant’s performance is based only on his or her ability to produce the correct output for a given, large input to the program, there is no guarantee that the program that has been submitted is the same as the one that was used to generate the correct solution. This complicated the study, and ultimately required to drop 38% of the solutions from some analyses, as they could not be compiled and executed. Nonetheless, a large number of usable solutions remained that could be fully analyzed, which gives confidence in the study’s results. Among them, Alexandra and Emma found that the highest-ranked contestants use C++ predominantly but by far not exclusively; that Python programs tend to be the most concise ones, and C# programs the most verbose; that C and C++ dominate the comparison of running time performance, even though the other languages do not fall that far behind; that C and C++ also have the smallest memory footprint, while Java has the largest (probably due to how the JVM manages memory). Check out the thesis for many more details and results.

Incremental deductive verification for a subset of the Boogie language

Leo Anttila and Mattias Åkesson developed a tool that verifies Boogie programs incrementally as they are modified in each new iteration. Boogie is a popular intermediate verification language, combining a simple imperative language and an expressive program logic. Boogie is also the name of a tool that verifies Boogie programs. Whenever a procedure changes implementation or specification, Boogie verifies it from scratch — like most other verifiers; in contrast, Leo and Mattias’s tool is able to reuse the results of previous verification runs. Suppose, for example, we have verified a procedure foo; later, we introduce a small change to foo that is still compatible with the procedure’s specification. The incremental verifier is aware of what has changed, and only verifies that, intuitively, the difference between the current version of foo and the previous one does not affect correctness.

The details of how incremental verification works are quite technical, and rely on a combination of attribute grammars and incremental parsing. My experience working with Boogie (or program verifiers that use it as back-end) suggests that such an incrementality matches well the way so-called auto-active verification tools are used in practice — namely in a stream of small changes to the source code each immediately followed by a call to the verifier, whose responsiveness is of the essence to support a smooth user experience. The thesis describes experiments simulating this mode of interaction, showing that incrementality can often reduce the running time of the verifier by 30% or more.

References

Christoffer Medin and Pontus Doverstav: The spec is out there — Extracting contracts from code, Master’s thesis in computer science. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.
Alexandra Back and Emma Westman: Comparing programming languages in Google Code Jam, Master’s thesis in computer science, algorithms, languages and logic. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.
Leo Anttila and Mattias Åkesson: Incremental deductive verification for a subset of the Boogie language, Master’s thesis in computer science, algorithms, languages and logic. Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, June 2017.

Are you looking for a faculty job? Make sure to read the various guides available online; they’re packed with useful information. Their shortcoming is that they’re but entirely focused on the North American/US job market. If you’re interested in knowing how things work on the other side of the pond, read on.

The following notes are mainly based on my experience during the academic year 2014–2015, when I’ve been searching for a faculty position in Europe. Each section describes a particular aspect of the whole search process, and tries to compare it to the US experience which is generally documented more broadly elsewhere. Whatever aspect I do not comment on, you can assume it’s quite similar to the situation in north America. Keep in mind that, besides the desire to remain in Europe, I was a candidate with significant experience: I completed my PhD in 2007, and I’ve spent several years as senior researcher — a position comparable to assistant professor without tenure. This entails that I was looking for more senior positions, tenured or with a well-defined not-too-long path to tenure. These requirements turned out to be not overly constraining for the European job marked, but they may have skewed may perception of some aspects. Your mileage may vary.

Overall process

The overall process is the usual one. You apply for an advertised position by submitting all required application material by the deadline. After a while, shortlisted candidates are invited for an interview. Following the interviews, selected candidates receive offers. If you’re one of the lucky ones, you get to negotiate your position until you reach an agreement and sign an employment contract (or decline the offer).

The timing and transparency of this process vary wildly between institutions. First, application deadlines are spread out over the whole year. While the majority of deadlines are from September to the following February, several continue well into the spring, and some may even be during the summer. The other phases also follow asynchronously: in some cases, I haven’t heard back for 8 months after submitting the application; others informed me about every step in the selection process, and even shared with me the evaluations of the hiring committee (sometimes about all candidates) before interviewing me. Shortlisting was sometimes done in stages, with a longer list of candidates who are passed on to the next filtering stage. In all, you have to be flexible, and be ready to prioritize the positions that are more interesting for you. Checking with the hiring committee about tentative dates does not hurt, if it can reduce the uncertainty.

The application

What goes into the application material is roughly similar in all places, but different institutions often use different different formats or just different terminology. The cover letter may not always be explicitly required, but make sure to include it anyway as it’s your chance to customize the application and present an overview of your profile that is sufficiently short to be read in its entirety. The research statement (also research plan, research highlights) is a combination of presenting your research in a convincing frame and outlining future work. The latter part is normally the shorter of the two, even though some advertisements seem to suggest that it should be organized like a full-fledged research grant application. I took such notes as a suggestion to be concrete in outlining future work; but I would still recommend to give enough space for describing what you have already done, since that can be much more compelling in a hiring context. The teaching statement is not always required; in a few cases, it had to follow an elaborate format used internally by the organization. I always tried to stick to the format (within reason), but I never had the impression the teaching statement was given much attention. For both research and teaching statement, always stick to the given page limit if they give one (if there’s no page limit, still don’t overdo it; at least, length should be commensurate to achievements). The CV is probably what the hiring committee goes through once they get past the cover letter. It’s important that it’s carefully organized, not just to highlight your achievements but also in a way that it can be skimmed through efficiently. The publications list is another part that may have to stick to a given format (for example, journal and conference papers in different categories). I sometimes took the liberty of ignoring the requirements when I had a good reason to do so; in particular, I listed conference and journal papers in the same category because that’s how we roll in computer science. Make sure publications are numbered in a way that makes it easy to count them — at least until we stop the numbers game (don’t hold your breath).

The practice of requiring (or not) reference letters is customary in English-speaking countries. While it’s catching up in most of Europe as well, there are some countries where referees are not used at all. My suggestion is to include the list of referees in your CV regardless of whether they are required or not. In places where officially using recommendation letters is not customary or not permitted by the regulation, the hiring committee may still decide to get the same information informally by talking to your referees. When I’ve be involved in hiring decisions (in my case, regarding PhD students), I’ve always found that recommendation letters can play a big role especially in the case where the competition among candidates is strong.

Frustratingly, some institutions required applicants to jump through some additional hoops, like requiring to fill in a Word form (so poorly designed that even copy-pasting into it was a challenge), or using a poorly-designed submission system that asks you to enter the publication information using online forms (needless to say with no BibTeX import options). My policy has generally been to ignore as many as these artificial requirements as possible, unless they were relevant to the specific application or there was no other way of providing that information. In one case, I just wrote something like “See the attached material” in the first input form on the online application, and attached a single PDF with my actual application material. Result: they soon invited me for an interview. Provided you have carefully prepared your standard application material and you have a competitive profile, no reasonable person will desk-reject you just because you didn’t use Word. Indeed, sticking to your format and your story may even show some independence and that your time is valuable too (that’s my opinion at least). In any case, make sure to leave enough time before the application deadline to go through the submission system and check any unusual requirements.

The interview

The interview may turn out to be interviews (plural): some places do selection in stages, and may organize a presentation and informal discussion with a few shortlisted candidates, and then invite half of them to a formal interview at a later date. To my great surprise, there are a few places that refuse to reschedule an interview if the invited candidate is not available on the suggested date. This strategy is clearly detrimental to getting the best candidates; it’s a mystery to me why some practice it against their self-interest. Of course you should try to clear your schedule as much as possible during interview season, but complete inflexibility on the other side is not an inviting sign.

The actual interview typically follows a talk by the candidate on their research, teaching, or both. The talk’s length changes wildly — from ten minutes to one hour. Whatever your time budget it, stay within it — you do not want to be cut off. For the actual interview you usually sit around a table with the hiring committee. Unlike what is described about north America, you do not typically get to have one-on-one meeting with administrative figures; everybody sits around the same table and they all take turn asking you questions on their specific area of interest or expertise (research, teaching, career, and so on). It’s useful to have answers to most “standard” questions rehearsed in advance (but not rigidly memorized), so as to be assertive and to-the-point.

In most, but not all, places you also have a chance to have some informal contact with a small number of faculty members, typically those who are closer to your research interests and you may already know. This step is very important because, if no one in the hiring committee has a direct interest in what you do, the chances that they’ll hire you are dim. I also believe that this part of the interviewing process is about you getting to know the place and your potential future colleagues too. If a place does not give you any chance to chat with the faculty over lunch, or to get an idea of how the department works “from below”, it’s hardly a good sign. If you do know somebody on the faculty, you should also go ahead and contact them proactively in advance to make room for such informal meetings.

The offer

With few exception, the decision comes in one or two weeks after you interviewed. It’s unlikely that you have a lot of time to decide: this seems in contrast to the US where offers are often extended without time limits. Still, you should be able to take at least a couple of weeks after the negotiation ends. Thus it’s quite likely that, if you applied to numerous places, you will not have a chance to line up all offers and decide. To prepare for that, it’s useful to rank the places where you applied from best to worst (according to your preferences), as well as according to an estimate of how competitive getting that position is; the rankings will change interview after interview based on your refined impression of each place. Then, if you have to decide on an offer, you can base your decision on which other options are still open or may become open later on. If you feel creative, you may use an optimal stopping strategy :-).

Negotiation is often not a big deal, because conditions tend to be standardized by age and experience. Most of the negotiation is actually learning in details about the employment conditions. But what is offered is not the same everywhere. For instance not all places have a starting package or pay relocation costs. Directly comparing salaries in Europe is also difficult because the cost of living changes significantly from country to country, and so do income taxes. Related to this, it is important to learn about each country’s national research funding system. This also varies considerably from country to country, and it can make a big difference in how many resources you will be able to raise on top of whatever the university grants you.

Competitiveness

In hindsight I was invited to interview wherever there was a clearly good match between my profile and the advertised position in terms of research area and expected seniority. While you cannot pretend you are younger than you actually are (academic age starts running when you get your PhD), some people try to change the research statement to fit some of the topics highlighted in the call for applications. I never did that, and I suggest against doing it. First, because writing a strong research statement requires time and a lot of polishing; tweaking it last-minute for each new application runs the risk of becoming less readable and compelling. Second, because the research statement should present your research according to your vision; feigning expertise in topics that you’re not really expert in is unlikely to convince the real experts who are hiring. However, do customize the cover letter with a few strategic sentences where you show that you can clearly make a connection to the specific topic, and that you have genuine interest in making it. Also, don’t be shy in applying broadly — within reason — without necessarily waiting for the perfect match: a strong application is always well received, even if it does not get you an interview or an offer (I’ve had a couple of cases of positive feedback about my profile, who clearly said that they were not following up because they were looking for a different profile but appreciated my application nonetheless).

Enjoy it!

While academic job search involves a long and stressful process, you should also have a chance to enjoy some moments of it. In fact, interviewing is not that dissimilar from meeting people at conferences and other scientific events. The interviewing panels where you find the most at ease may also provide a strong indication of which places you want to work for. In any case, good luck!

In case the spamming — I mean, discreet, surgically targeted advertising — through the other channels miraculously hasn’t reached you yet, you should know about an open PhD student position in my research group. The official ad is online. In this post I’d like to give a more informal description and clarify a few aspects that may not be obvious to every prospective candidate.

There are a total of 4 openings in the Software Technology division that are currently open; you will often see them advertised together. One of the PhD student positions is under my direct supervision, and it’s the one I describe here.

There are only loose constraints on the research topic for the to-be-hired PhD student. Obviously I expect it to be something around the areas I’m interested in, so software engineering and formal methods. But there’s plenty of room within these general denominations for a variety of concrete topics that span from theoretical to practical. To get an idea, have a look at the description of my past research. If anything there tickles you’re fancy then… great! We have a match. If nothing does… great! It’s a chance to start working on something new and exciting. Seriously, that page describes my past research; it doesn’t predict the future. Of course I still expect a match in the general area of interest. If you want to do research in, say, ornithology you may want to look elsewhere (or not, if you find a good connection to my research interests :-).

I refrain from giving a list of the ideal candidates. You can find plenty of good recommendations, and I subscribe to pretty much all of them. The fact is, there’s no such thing as the ideal skills set for PhD students. Rather, different great students often have incomparable skills: some are better at analytic reasoning, some are outstanding programmers, others excel at communicating; regardless of what they’re less good at, they’ll make up for it by cultivating new skills and honing existing ones to perfection. So perhaps the only universally required qualities are meta-qualities (potentialities): the ability to learn and adapt skills, including meta-skills, and a passion for pawing through new problems “like some starving raccoon!”

A few organizational points. A PhD in Sweden normally takes 5 years, including 20% of the time spent on teaching assistantship. As a PhD student you are an employee not just a student; and you get a salary not a stipend, which is a pretty decent way of making a living. If you’re concerned about not speaking Swedish, don’t be: what they say about everybody in Sweden speaking good English is not a myth.

So should you apply? Yes, everybody with a genuine interest should!

These are good times for formal verification. Amazon Web Services is using formal specifications and model checking to support rigorous design at a very large scale. Facebook, of all companies, is getting serious about static analysis, with Monoidics’s separation logic gurus and Francesco Logozzo of Clousot fame among its recent hires. John Carmack, as a programmer, is aggressively pursuing static analysis. Only thanks to formal verification has a remote but pernicious bug in Java’s standard library sort been discovered. It’s clear that — slowly, one step at a time — the offerings of formal methods are becoming practically helpful.

Still, the transition from pure research to practical applications requires time, in formal verification as in other areas. Among the intermediate steps that have to be covered there’s dealing with instances that are realistic: not necessarily full-blown industrial-strength, but not discounting any features that are expected of a fully usable implementation. Realistic software also moves past the challenges of benchmarks, which typically focus on one or few complicated aspects while eliding features that are considered commonplace and hence not interestingly problematic.

Nadia Polikarpova‘s presentation at the Formal Methods Symposium in Oslo a few weeks ago was precisely about dealing with realistic software in verification: our paper describes the verification of EiffelBase2, a realistic general-purpose data-structure library, against its complete functional specification.

Before you roll your eyes thinking, “Seriously?! Verification of data structures in 2015? Wasn’t the problem solved in the 1970s? Or maybe earlier! And they gave them a best paper award for this stuff? Unbelievable!”, let me explain what’s new and, hopefully, interesting about this work. It is true that the amount of work on verifying data structures is overwhelming. Not only because the effort poses interesting challenges but also because data structures are a domain that is naturally amenable to formalization and analysis: everybody agrees on the expected behavior of, say, a linked stack, and it can be formalized succinctly. In other domains (reactive software for example) even specification is a challenge in terms of both “what” and “how”. Still and all, most data-structure verification work focuses on demonstrating certain aspects of a technique and is applicable to implementations written in a certain way, possibly with some restrictions, and not in others. In contrast, Nadia implemented and specified EiffelBase2 before verification (even if, naturally, some later adjustment were necessary to complete the verification effort) with the goal of providing a usable realistic container collection without compromising on any aspect (object-oriented design, API usability, implementation performance, …) that is to be expected in practice. In fact, EiffelBase2 now ships with the EiffelStudio compiler as an alternative to the old EiffelBase library. (It did not fully replace it only for backward compatibility reasons.)

Here are a few examples of EiffelBase2 features that you would expect in a usable data-structure library (think java.util or .NET Collections, or C++ STL), but which are often simplified or omitted altogether in verification research:

External iterators, with the possibility of statically checking that multiple iterators operating on the same structure are safe, that is modify it consistently.
Hash tables with arbitrary user-defined objects as keys and object-based key comparison, with the possibility of statically checking that modifying objects used as keys does not affect table consistency.
Rich, convenient APIs offering operations for searching, replacing, inserting, and removing elements, merging containers, converting between different containers, object-based comparison (i.e., comparing objects according to their content rather than by references pointing to it), and copy constructors.
Abstract specifications of data structures, consistent with the inheritance hierarchy, and provably complete for functional correctness.
Implementations that do not trade performance off for verifiability: array-based lists use ring buffers, hash tables automatically resize their bucket arrays to avoid performance degradation, and so on.

If some of these features do not sound impressive it is because we are blasé, used as we are to having them in any realistic implementations, but also happy to ignore or abstract them away when performing verification because they do not seem to introduce complex, as opposed to complicated, novel challenges. However the devil’s in the details, and some sources of complexity are emergent from the combination of seemingly plain and innocuous features. Even verifying something as workaday as copy constructors brought its own set of challenges, in ensuring that the copy is consistent with the abstract class model and with its relatives by means of inheritance.

Part of the challenge of verifying EiffelBase2 was achieving all this with a good level of automation and a reasonable amount of annotations (besides specifications) — in other words, with an effort proportional to the magnitude of the challenge. To this end, we developed a verification methodology and supporting tool (AutoProof, which I presented in a previous post) with the goals of providing flexibility and proportionality: verifying simple stuff should be straightforward, verifying complex stuff (such as EiffelBase2) should still be attainable but require a bigger annotation overhead. While I personally think that being able to interact directly with the prover for the most challenging proofs may support even greater levels of flexibility, the auto-active interaction mode of AutoProof may have hit on a sweet spot in the landscape of annotations vs. automation — at least for full functional correctness in the domain of data structure libraries.

You’re welcome to add comments about features of realistic software that are often overlooked in practical verification, as well as about other success stories of formal methods that narrowed the gap between lab and field practice.

Breaking news! A recent study found that Barack Obama is, with high probability, not an American citizen! The study — destined to revive the controversy that emerged during the President’s first presidential campaign — is based on new evidence and a simple analysis using widely accepted statistical inference tools. I’ll leave it to the political pundits to analyze the grave effects that this shocking finding surely will have on the upcoming presidential campaign. This post focuses on the elegant technical machinery used to reach the unsettling conclusion.

The crux of the analysis applies, in a statistical setting, modus tollens, a basic inference rule of logic. Given two facts $X$ and $Y$ such that if $X$ is true then $Y$ is true, modus tollens derives the falsehood of $X$ from the falsehood of $Y$ . In formal notation:

$\begin{matrix} X \Longrightarrow Y, \quad \neg Y \\ \hline \neg X \end{matrix}$

For example, take $X$ to be “It rains” and $Y$ to be “I have an umbrella with me”. From the fact that I am carrying no umbrella, by applying modus tollens, you can conclude that it’s not raining.

The next step introduces a simple generalization of modus tollens to the case where facts are true with some probability: if $X$ is true then $Y$ is true with high probability. Then, when $Y$ happens to be false, we conclude that $X$ is unlikely to be true. If I have an umbrella with me 99% of the times when it rains, there’s only a 1% chance that it rains if I have no umbrella with me.

All this is plain and simple, but it has surprising consequence when applied to the presidential case. A randomly sampled American citizen is quite unlikely to be the President; the odds are just 1 in 321-something millions. So we have that if “a person $p$ is American” (or $X$ ) is true then “ $p$ is not the President” (or $Y$ ) is true with high probability. But Mr. Barack Obama happens to be the President, so he’s overwhelmingly unlikely to be American according to probabilistic modus tollens!

(The ironic part of the post ends here.)

Sure you’re thinking that this was a poor attempt at a joke. I would agree, were it not the case that the very same unsound inference rule is being applied willy-nilly in countless scientific papers in the form of statistical hypothesis testing. The basic statistical machinery, which I’ve discussed in a previous post, tells us that, under a null hypothesis $H_0$ , a certain data $D$ is unlikely to happen. In other words: if “the null hypothesis $H_0$ ” is true then “the data is different than $D$ ” is true with high probability. So far so good. But then the way this fact is used in practice is the following: if we observe the unlikely $D$ in our experiments, we conclude that the null hypothesis is unlikely, and hence we reject it — unsoundly! How’s that for a joke?

Having seen for ourselves that modus tollens does not generalize to probabilistic inference, what is a correct inference from data to hypothesis testing? We can use Bayes’s theorem and phrase it in terms of conditional probabilities. $P(X \mid Y)$ is the probability that $X$ occurs given that $Y$ has occurred. Then $P(H_0 \mid D)$ — the probability that the null hypothesis $H_0$ is true given that we observed data $D$ — is computed as $P(D \mid H_0) \cdot P(H_0) / P(D)$ . Even if we know that $D$ is unlikely under the null hypothesis — $P(D \mid H_0)$ is small — we cannot dismiss the null hypothesis with confidence unless we know something about the absolute prior probabilities of $H_0$ and $D$ . To convince ourselves that Bayes’s rule leads to sound inference, we can apply it to the Barack Obama case: $H_0$ is “a person $p$ is American” and $D$ is “ $p$ is the President”. We plug the numbers in and do the simple math to see that $P(H_0 \mid D)$ , the probability that the President is American, is indeed one:

$(1 / A) \cdot (A / W) / (1 / W) = 1$ , where $A$ is the population of the USA and $W$ is the world population. Bayes 1 – birthers 0.

Now you understand the fuss about statistical hypothesis testing that has emerged in numerous experimental sciences. Sadly, this blunder is not merely a possibility; it is quite likely that it has affected the validity of numerous published experimental “findings”. In fact, the inadequacy of statistical hypothesis testing is compounded by other statistical results such as the arbitrariness of a hard-and-fast confidence threshold, the false hypothesis paradox (when studying a rare phenomenon, that is a phenomenon with low base rates, most positive results are false positives), and self-selection (the few research efforts that detect some rare phenomenon will publish, whereas the overwhelming majority of “no effect” studies will not lead to publication). In an era of big data, these behaviors are only becoming more likely to emerge.

The take home message is simple yet important. Statistical hypothesis testing is insufficient, by itself, to derive sound conclusions about empirical observations. It must be complemented by other analysis techniques, such as data visualization, effect sizes, confidence intervals, and Bayesian analysis. Unless you remain convinced that Obama’s not American, Elvis is alive, and the Apollo moon landings were staged. In this case, this blog is not for you — with high probability.

Bug counting

A little specification can go a long way

Category Archives: counting bugs

Master’s thesis project showcase