It’s really a combination of all three. More questions? 🙂

Seriously, the issue of the real “nature” of informatics (I don’t call it “computer science” only to avoid adding to the terminological confusion) — applied mathematics, science of information, or engineering discipline — remains a topic of intense debate in more than one circle. A related discussion is the relationship between informatics theoreticians and practitioners, and whether they can get along (yes they can and do, as Scott Aaronson patiently explains).

I must admit that discussions about these topics leave me mostly indifferent, if not mildly baffled. What we call “informatics” or “computer science” has grown from roots in different branches of science and engineering, so that its true “nature” is not in any single one of them but in their amalgam. This should be evident by looking at the history of modern informatics and its eminent pioneers. Its eclectic origins are a tremendous source of vitality and part of the reason why computer science and its applications have become ubiquitous in pretty much any other science to provide conceptual and practical analytical tools. And this is also a great deal of what makes computer science so interesting!

But perhaps I can make my position more convincing with an example. In the rest of this post, I select a computational problem and follow the train of thought that leads to a possible solution and implementation. My point is to demonstrate that solving the problem satisfactorily requires a combination of theoretical and practical tools whose intimate connection cannot be severed without ruining their efficacy.

Take the problem of enumerating prime numbers: a number is prime if it is only divisible by itself and by one; a number that is not prime is called composite. Prompted by this blog’s motto (“A little specification can go a long way”), let’s approach the problem descriptively: we provide a simple characterization of prime numbers, and then use standard tools to match the characterization against any integer that we want to test for primality. The characterization is going to use regular expressions.

Primes is not regular

Wait a minute! Is this even possible? This was my reaction too the first time I read (I can’t remember where or when, but there are several articles exploring this idea) about using regular expressions to characterize primality. The notion of prime numbers involves divisibility and integer arithmetic, which are beyond the capabilities of regular expressions or their operational counterparts (finite-state automata), whose expressiveness defines the so-called regular languages.

In order to analyze the problem rigorously, we have to cast it into the framework of formal languages. Using a simple unary alphabet of letters x, the language P of prime numbers can be defined as the set of all strings of x’s whose length is a prime number:

P = \{ x^p \mid p \text{ is prime} \}.

Let’s show that the language P of prime numbers is not regular. We could use the well-known pumping lemma, but a direct proof by contradiction is even more convincing. Assume that P is regular; then there exists a deterministic finite-state automaton A_P that accepts precisely the strings in P. Automaton A_P consists of a finite number N of states; take any prime p > N. (Side note: this is possible because there are infinitely many primes — if you’re not familiar with Euclid’s neat proof of this fact, go check it out.) Consider the accepting path of x^p when it is processed by A_P: each transition consumes exactly one x; since there are more x’s in the input than states in A_P, the path must include a loop that goes from some state S back to it. Call \ell the length of such a loop, that is, the number of x’s consumed while traversing it. We can see that adding a number of x’s that is a multiple of \ell leads to an input that A_P still accepts. This is because \ell more x’s make the computation traverse the loop one more time, and, since A_P is deterministic, its behavior from S onward does not change if the trailing input is the same. Going back to the definition of P, we have that A_P accepts all strings of length p + k \cdot \ell for any nonnegative integer k. In particular, it accepts the string of p + p \cdot \ell x’s. But this is incorrect behavior: p + p \cdot \ell = p \cdot (1 + \ell) is not prime because it is divisible by p (and 1 + \ell > 1). This contradicts the hypothesis that A_P recognizes language P — the language of primes is not regular.
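For a concrete instance: if A_P had N = 4 states and we picked p = 5 with a loop of length \ell = 2, the same argument would force A_P to also accept the string of 5 + 5 \cdot 2 = 15 x’s, even though 15 = 3 \cdot 5 is composite.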

While we’re at it, let’s also prove that the language P of primes is not even context-free. We can leverage Parikh’s theorem for that. A convenient way to state the theorem is: every context-free language is commutatively equivalent to some regular language. This roughly means that, if P is context-free, there exists a regular language P_r whose strings contain the same numbers of occurrences of each letter as the strings of P, irrespective of order. But order is immaterial for a language over a unary alphabet such as P — length is the only distinguishing feature — which entails that P is context-free only if it is also regular. Since we have seen that P is not regular, it’s not even context-free.

Primes as a regex

The trick to circumvent the non-regularity of the language P of primes is switching from regular expressions — à la Kleene, used in formal language theory — to regexes — à la Thompson, used in classic Unix text processing utilities and part of the POSIX standard. Perl’s Larry Wall championed this useful terminological distinction (regular expressions vs. regexes), which we’ll consistently use in this presentation. Crucially, the regex feature that greatly increases the expressive power of regular expressions is backreferences, which can express repetitions of previously captured substrings. In fact, it’s not hard to see that all other features of POSIX regexes are just syntactic sugar expressible using Kleene’s regular expressions.
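For instance, the bounded repetition x{2,4} is just shorthand for the alternation xx|xxx|xxxx, and a character class such as [a-c] abbreviates a|b|c: both can be rewritten as plain regular expressions without backreferences.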

Back to the problem of expressing the language P of primes using regexes and backreferences. It’s actually easier to look at the complement problem: defining the language \overline{P} of composite numbers expressed as unary strings. A composite number c has at least one divisor (other than one and itself); that is, it can be written as a product c_1 \cdot c_2 with 1 < c_1, c_2 < c or, equivalently, as a sum of c_2 equal terms \underbrace{c_1 + \cdots + c_1}_{c_2 \text{ terms}}. Visualizing c in unary, we can see it as partitioned into c_2 groups, each consisting of c_1 letters x. This is a pattern we can express using a regex: a leading substring of two or more (that is, c_1) x’s (xx+, with + denoting one or more repetitions) followed by one or more (that is, c_2 - 1) replicas of it (\1+, with \1 denoting a backreference to the first matched substring). In all, the regex (xx+)\1+ matches the composite numbers greater than one in unary notation. If we want to handle all natural numbers, we can use alternation (the | operator) to specify the special cases of 0 and 1 as composite: (xx+)\1+ | x? (with ? denoting zero or one occurrences, and assuming | has lower precedence than the other operators). We’d better do that, as no mathematician since Lebesgue has considered 1 a prime, according to Wikipedia 🙂
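As a quick sanity check, we can try the regex for composites with GNU grep, which supports backreferences even in extended-regex mode (as a GNU extension); the ^ and $ anchors require the whole string to match, as in formal language theory:

    echo xxxxxx  | grep -E '^(xx+)\1+$'    # 6 = 2 * 3: match (composite)
    echo xxxxxxx | grep -E '^(xx+)\1+$'    # 7 is prime: no match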

The power of regexes

What is then the expressive power of regexes? Our primes example indicates that backreferences can express languages in classes beyond context-free. Precisely, we've shown that language P is not context-free and that regexes can express the complement language \overline{P}; since regexes are not explicitly closed under complement, this does not immediately tell us anything about P itself. However, it's clear that \overline{P} is also neither regular (because regular languages are closed under complement) nor context-free (because unary context-free languages boil down to regular languages). Alternatively, we can consider languages whose strings have the form w\,w for any string w, which are clearly expressible by means of regexes with backreferences ((.*)\1) but are neither regular nor context-free (as one can show using, say, the pumping lemmas).
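A quick way to see the latter in action is again GNU grep, whose -E mode accepts backreferences as an extension:

    echo abcabc | grep -E '^(.*)\1$'    # match: abcabc is abc followed by abc
    echo abcab  | grep -E '^(.*)\1$'    # no match: not of the form ww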

On the other hand, regexes seem incapable of expressing all context-free languages; in particular, they cannot express the languages of balanced characters such as parentheses — known as Dyck languages in formal language theory — which are paradigmatic examples of context-free languages. Consider D = \{ a^n b^n \mid n > 0 \}, which is well known to be context-free but not regular. Short of a rigorous proof, an argument that D is inexpressible by regexes could go as follows. As we match a regex against strings in D, each pair of captured pattern and backreference to it must occur either entirely within the a^n substring or entirely within the b^n substring. Otherwise, the backreference would deviate from the structure of strings in D, where all b’s follow all a’s. This constraint, however, means that backreferences cannot carry any information between the a^n and the b^n parts of the string. What is left is then limited to the power of regular expressions; hence regexes cannot express language D.

I have to remark again that this discussion applies to classic Unix/POSIX regexes, whose only non-regular operators are backreferences. Modern scripting languages often support "Perl Compatible Regular Expressions" (PCREs), a much larger superset whose expressiveness far exceeds that of Unix/POSIX regexes, as they include fully recursive patterns and even the possibility of invoking arbitrary code as subpatterns are matched. See this article for a nice overview of the expressiveness of PCREs as they are available in PHP, and this Stack Overflow discussion about how Java regexes can match patterns such as a^n b^n.

Enumerating primes in Bash

After this long detour it's time to go back to the original problem of enumerating primes. Equipped with a descriptive characterization of composite numbers — the regex (xx+)\1+ | x? — let's turn to Unix command-line tools that can interpret it. As an additional challenge in the original spirit of solving the problem descriptively, we'll try to avoid explicit loops entirely in our program of sorts for enumerating primes.

First, since our regex works on unary strings, we need a way to convert numbers from decimal to unary. We can use printf's format strings to this end: the format specifier %*s reads a numeric argument followed by a string argument and prints the latter as a space-padded string whose length is the numeric argument. If the string argument is absent, it just prints as many spaces as the value of the numeric argument. Then, we use tr to turn spaces into x's and voilà: our unary representation. For example, to print 24 in unary:
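    # a possible incantation; assumes printf accepts the '*' width specifier,
    # as Bash's builtin and GNU printf both do
    printf '%*s' 24 | tr ' ' 'x'    # prints 24 x's (and no trailing newline)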

There's an alternative that uses printf's format specifier %.0s without requiring tr: printf 'x%.0s' {1..24}. See printf's documentation for an explanation of how it works.

Second, we have to call the above snippet involving printf for all numbers that we want to test. Instead of an explicit loop, we can generate a sequence {0..N} of nonnegative integers up to N and pass its values on to printf using xargs's option -n1, which dispatches one argument value per invocation of printf. However, printf's format string now needs a trailing newline character to separate unary numbers by line.
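For example, to list the unary representations of the numbers from 0 up to 24 (xargs invokes the standalone printf utility, so this again assumes a printf that accepts the * width specifier, such as GNU printf):

    echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x'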

It's finally time to use our regex. There's a minor issue here regarding the distinction between POSIX basic and extended regexes. The former include backreferences but no alternation; the latter include alternation but no backreferences. This is hardly a problem in practice, as most GNU tools (the ones I'm assuming we're using) support backreferences in both syntaxes. As for the alternation, we don't really need to include it explicitly in the regex; instead, we use grep's option -e to specify multiple regexes and have the tool match any of them. We need to know about grep's options -E — to use extended regex syntax (just because it's nicer, as it doesn't require backslash-escaping metacharacters) — and -v to invert the match (because we want to keep the primes and discard the composites).
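Putting the pieces together, a first attempt might look something like this:

    echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' \
      | grep -E -v -e '(xx+)\1+' -e 'x?'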

Don't try the previous snippet yet: it doesn't work! The problem is that grep matches the regexes against any portion of each input line. Since x? trivially matches the empty substring, every line matches, and the inverted match is empty. Let's introduce the ^ and $ markers to require that the regexes match the whole line (which, by the way, is the standard interpretation of regular expressions in formal language theory). You can try and see that the following snippet enumerates primes up to 24 in unary.
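    # anchored version; GNU grep accepts \1 backreferences even in -E mode
    echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' \
      | grep -E -v -e '^(xx+)\1+$' -e '^x?$'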

The output in unary form is not very readable. We convert it back (without loops) using the expr utility, again with arguments dispatched one at a time using xargs. This is the only non-POSIX compliant part of the snippet. I couldn't find another way that doesn't use loops or variables; if you know one, please leave a comment. Anyway, we have our loopless prime enumerator!
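In full, a sketch of the loopless enumerator (expr's length operator, which prints the length of its argument, is the non-POSIX part; for a unary string, the length is just the number itself):

    echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' \
      | grep -E -v -e '^(xx+)\1+$' -e '^x?$' \
      | xargs -n1 expr length
    # 2 3 5 7 11 13 17 19 23, one per line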

A little empirical performance analysis

We have reasons to suspect that our prime enumerator is not very efficient. Enumerating primes up to 41 takes over 14 seconds on my desktop machine. This is not very good, given that primality testing is in P and randomized algorithms such as Miller-Rabin are very fast in practice. It is also unsurprising: regex matching with backreferences uses backtracking to implement nondeterminism, which nullifies the otherwise linear-time performance of matching with finite-state automata. Memory usage, in contrast, should be roughly constant or grow very slowly: the size of the automaton used for the matching depends only on the regex to be matched and, linearly, on the length of the captured substring that is backreferenced.

Let's see if we can get some empirical data to corroborate this intuition. If the running time T(n) of our enumerator is exponential in the input size (say, roughly proportional to 10^n), the ratio T(n) / 10^n should remain roughly constant as the input size n increases. For numerical algorithms, input is normally measured in number of bits (or digits); hence let n = \log(k), where k is the number we test for primality. Let's measure T(k) / k for each input k and see what it looks like. We allow ourselves to use loops this time, so that we can simplify the enumeration snippet using a Bash for loop.

We enumerate from 30 because this is when grep's running time starts to become non-negligible. It's time to use another command-line utility. Not to be confused with the Bash keyword of the same name, the time utility is normally located in /usr/bin/time and offers options to customize its output using a format string. For each invocation of grep, we print the current value of k, the CPU user time (%U), and the maximum resident set size in RAM (%M) of the process running grep. Since we now only care about these measures, we add the -q option to grep, which suppresses output. Finally, we redirect these statistics to awk, which will print on each line a triple k, T(k)/k, M(k) (T and M are time and memory). In all, run the command:
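    # a sketch of the measurement loop; assumes GNU time (for -f) and GNU grep
    for k in {30..41}
    do
      printf '%*s' $k | tr ' ' 'x' \
        | /usr/bin/time -f "$k %U %M" grep -q -E -v -e '^(xx+)\1+$' -e '^x?$'
    done 2>&1 | awk '{ print $1, $2 / $1, $3 }'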

In case you're wondering, the 2>&1 is needed because time outputs to standard error, so a normal pipe won't work. In Bash version 4 and later you can write |& instead. Running the command on my machine gives the output:

It seems we were on the right track. Memory consumption doesn't noticeably grow. Time, instead, grows even more quickly than exponentially in the number of digits. If we play around a bit more, it seems that time approximately grows like a polynomial in k, and certainly asymptotically more slowly than a double exponential. See this article for a detailed discussion of the efficiency of regex matching implementations.

Conclusion

Was this just math? Just science? Just engineering? The sensible conclusion seems to be that it was a combination of all three: the theory of regular languages has a mathematical flavor, their practical implementation in Unix tools is an engineering feat, and the empirical analysis smells like experimental science. And that's the beauty of it!

I’m very glad to host a guest post by Nadia Polikarpova, a recent PhD graduate of our group on her way to a post-doctoral position at MIT. Nadia is definitely into bug counting, even though her ultimate ambition is to exterminate them (for the casual reader: we’re talking about software bugs — no animals were harmed in the making of her research). Thanks to this guest post of hers, the blog finally goes back to its core interests!

      — Carlo A.

Dimensions in program verification

Model checking, abstract interpretation, program logics, symbolic execution — if you are new to program verification, the variety of techniques it offers might be intimidating. Each area seems to start from a different set of basic principles and use its own terminology, which makes classification and comparison difficult.

To make matters worse, verification experts are often partisans of their school of thought: for an example, see a slide from a recent talk by Patrick Cousot. Abstract interpretation devotees may look down on other approaches, such as auto-active verification, that do not easily scale to systems with millions of lines of code. Auto-active verification researchers, like myself, may make fun of abstract interpretation folks for being stuck with array-out-of-bounds errors, while we can prove full functional correctness (which, in turn, is uninteresting to model-checking experts, who care about temporal properties). The acolytes of the dark art of interactive theorem proving (such as Coq and Isabelle) don’t take any other approach seriously unless it comes with an air-tight proof of soundness. Meanwhile, testing people consider all sound techniques pure intellectual exercise, since real programmers are not going to put up with false positives…

How do you make sense of all this and learn to appreciate relative merits and fundamental limitations of different techniques? Here is my short guide to program verification, which I worked out after studying the area for nearly six years.

There are three dimensions to a program verification technique: soundness, flexibility, and automation. Soundness (level of assurance) determines how certain we are that a program is correct if verification succeeds. Flexibility (expressiveness) measures the diversity of programs and correctness properties that the approach can handle. Automation accounts for the amount of human effort — beyond writing the program — that is needed to carry out verification. As is often the case, you can pick any two, but you cannot have all three.

The verification cube

For example, abstract interpretation is sound and fully automatic, but the kind of properties and programs it can check effectively is limited by a predefined set of abstract domains encoded in the analyzer. Bounded model checking, on the other hand, is flexible and fairly automatic — users still have to formalize requirements — but in general cannot provide a correctness guarantee for all inputs (note that we can also trade flexibility for soundness by limiting ourselves to analyzing programs with a finite state space). Deductive verification techniques are flexible and sound, but require a fair amount of manual guidance; among deductive tools, interactive proof assistants are pushing the first two dimensions to the extreme (with higher-order specification languages and tiny trusted cores), while auto-active tools are prepared to sacrifice a bit of both to retain some automation. Unbounded (CEGAR-based) model checking is sound, automatic, and inflexible; symbolic execution is flexible, automatic, and unsound.

I’m not aware of any technique that doesn’t lie close to one of those three corners of the “verification cube”. Why is this the case? Intuitively, the reason is simple: any program analysis can only consider a finite number of states, so you get the choice between only analyzing programs with finitely many states (and lose flexibility), considering only a subset of all states (and lose soundness), or over-approximating an infinite set of states by a finite set of abstract states. Abstraction, in turn, can be performed manually (this is what loop invariants are for!) or automatically (like in abstract interpretation); the former kills automation, while the latter is limited to a restricted set of properties, which kills flexibility. Naturally, acing all three dimensions has always been the holy grail of program verification. This is where all the approaches are converging, albeit taking very different routes, with each technique improving along its weak dimension little by little. Static analyzers are enriched with new abstract domains; model checkers can handle ever larger state spaces; auto-active tools compete in economy of required user annotations; proof assistants boast more and more sophisticated automated tactics. They are all working towards the same goal — still far away from the finish line — and they still differ vastly in terms of strengths and weaknesses, application domains, and user experience.

So, next time you hear someone say that their program verification technique is better than all the rest, just remember the cube.

      — Nadia

One of Tolstoy’s least known works [Tolstoy, 1886], published during the ascetic spiritual phase that characterized his late life, criticizes science (as well as art) as irrelevant because it gives no answer to the eponymous question “What shall we do then?”. While Tolstoy’s intentions were right-minded, his criticism remains misplaced: science may not give ready-made direct answers to moral questions, but certainly provides highly relevant information to help answer them.

Tolstoy’s superficial description of the scientist’s work as consisting of “counting invisible bugs and stars” caught Poincaré’s attention. In a talk given in 1906, later published in his trilogy of essays on epistemology [Poincaré, 1914], Poincaré discusses how science goes well beyond merely “counting bugs” or, beyond metaphor, amassing facts. As he says in a famous quotation in another part of the trilogy:

Science is built up with facts, as a house is with stones. But a collection of facts is no more a science than a heap of stones is a house.

This story explains the origin of this blog’s name, but you may still be wondering why a blog mainly about science (and enthusiastic about it!) is named after a misconception.

To address this objection, we have to think about the derivative meaning of “bug” as “defect” or “error” — particularly widespread in computer science to describe software faults, but whose introduction predates even Tolstoy’s observations. Under this new meaning, Tolstoy’s sentence is not so infelicitous. On the contrary, finding and counting errors are efforts fundamental to the progress of science. At another level, there are entire fields devoted to counting bugs: software verification, one of my main research interests and a recurring topic in this blog, is concerned with finding, characterizing, removing, and establishing the absence of software bugs.

I’m sure Poincaré would approve of this view. After all he developed one of his most important breakthroughs — chaotic behavior of nonlinear dynamical systems — as an attempt to patch a mistake (a bug!) in his initial submission of an award-winning paper (interestingly, a reviewer helped catch the mistake, but this makes for another story [Diacu, 1996]).

The slogan “counting bugs” also vindicates the unquantifiable value of science from below, where every effort that satisfies curiosity and contributes, no matter how modestly, to improving knowledge and understanding is worthy regardless of utility or practicality.

This blog will try to follow such an inquisitive but also lightsome spirit while discussing sundry topics in science and beyond. If you’re sympathetic, I hope you will join us!

References

  1. Leo Tolstoy: What shall we do?, 1886. Available here in English.
  2. Henri Poincaré: Science and hypothesis; The value of science; Science and method, 1914. English edition: The value of science, edited by Stephen Jay Gould, Modern Library, 2001.
  3. Florin Diacu: The solution of the n-body problem, The Mathematical Intelligencer, 18(3):66–70, 1996. Available here.