It’s really a combination of all three. More questions? 🙂

Seriously, the issue of the real “nature” of informatics (I don’t call it “computer science” only to avoid adding to the terminological confusion) — applied mathematics, science of information, or engineering discipline — remains a topic of intense debate in more than one circle. A related discussion is the relationship between informatics theoreticians and practitioners, and whether they can get along (yes they can and do, as Scott Aaronson patiently explains).

I must admit that discussions about these topics leave me mostly indifferent, if not mildly baffled. What we call “informatics” or “computer science” has grown from roots in different branches of science and engineering, so that its true “nature” is not in any single one of them but in their amalgam. This should be evident by looking at the history of modern informatics and its eminent pioneers. Its eclectic origins are a tremendous source of vitality and part of the reason why computer science and its applications have become ubiquitous in pretty much any other science to provide conceptual and practical analytical tools. And this is also a great deal of what makes computer science so interesting!

But perhaps I can make my position more convincing with an example. In the rest of this post, I select a computational problem and follow the train of thought that leads to a possible solution and implementation. My point is demonstrating how solving the problem satisfactorily requires a combination of theoretical and practical tools whose intimate connection cannot be severed without ruining their efficacy.

Take the problem of enumerating prime numbers: a number is prime if it is only divisible by itself and by one; a number that is not prime is called composite. Prompted by this blog’s motto (“A little specification can go a long way”), let’s approach the problem descriptively: we provide a simple characterization of prime numbers, and then use standard tools to match the characterization against any integer that we want to test for primality. The characterization is going to use regular expressions.

Primes is not regular

Wait a minute! Is this even possible? This was my reaction too the first time I read (I can’t remember where or when, but there are several articles exploring this idea) about using regular expression to characterize primality. The notion of prime numbers involves divisibility and integer arithmetic, which are beyond the capabilities of regular expressions or their operational counterparts (finite-state automata), whose expressiveness defines the so-called regular languages.

In order to analyze the problem rigorously, we have to cast it into the framework of formal languages. Using a simple unary alphabet of letters x, the language P of prime numbers can be defined as the set of all strings of x’s whose length is a prime number:

P = \{ x^p \mid p \text{ is prime} \}.

Let’s show that the language P of prime numbers is not regular. We could use the well-known pumping lemma, but a direct proof by contradiction is even more convincing. Assume that P is regular; then there exists a deterministic finite state automaton A_P that accepts precisely strings in P. Automaton A_P consists of a finite number N of states; take any prime p > N. (Side note: this is possible because there are infinitely many primes — if you’re not familiar with Euclid’s neat proof of this fact go check it out.) Consider the accepting path of x^p when it is processed by A_P: each transition consumes exactly one x; since there are more x’s in input than states in A_P, the path must include a loop that goes from some state S back to it. Call \ell the length of such loop, that is the number of x’s consumed while traversing it. We can see that adding a number of x’s multiple of \ell leads to an input that A_P still accepts. This is because \ell more x’s traverse the loop one more time, and the behavior of the automaton from S on does not change if the trailing input is the same because A_P is deterministic. Going back to the definition of P, we have that A_P accepts all strings of length p + k \cdot \ell for any nonnegative integer k. In particular, it accepts the string of p + p \cdot \ell x’s. But this is incorrect behavior: p + p \cdot \ell is not prime because it is divisible by p. This contradicts the hypothesis that A_P recognizes language P — the language of primes is not regular.

While we’re at it, let’s also prove that the language P of primes is not even context free. We can leverage Parikh’s theorem for that. A convenient way to state the theorem is: every context-free language is commutatively equivalent to some regular language. This roughly means that, if P is context-free, there exists a regular language P_r whose strings have the same number of letters as those in P irrespective of order. But order is immaterial for a language over unary alphabet such as P — length is the only distinguishing feature — which entails that P is context-free only if it is also regular. Since we have seen that P is not regular, it’s not even context-free.

Primes as a regex

The trick to circumvent the non-regularity of the language P of primes is switching from regular expressions — à la Kleene, used in formal language theory — to regexes — à la Thompson, used in classic Unix text processing utilities and part of the POSIX standard. Perl’s Larry Wall championed this useful terminological distinction (regular expressions vs. regexes), which we’ll consistently use in this presentation. Crucially, the regex feature that greatly increases the expressive power of regular expressions are backreferences, which can express repetitions of previously captured substrings. In fact, it’s not hard to see that all other features of POSIX regexes are just syntactic sugar expressible using Kleene’s regular expressions.

Back to the problem of expressing the language P of primes using regexes and backreferences. It’s actually easier to look at the complement problem: defining the language \overline{P} of composite numbers expressed as unary strings. A composite number c has at least one divisor (other than one and itself); that is, it can be written as a product c_1 \cdot c_2 for 1 < c_1, c_2 < c[/latex] or, equivalently, as a sum of [latex]c_2[/latex] equal terms [latex]\underbrace{c_1 + \cdots + c_1}_{c_2 \text{ terms}}[/latex]. Visualizing [latex]c[/latex] in unary, we can write it as partitioned into [latex]c_2[/latex] groups, each consisting of [latex]c_1[/latex] letters [latex]x[/latex]. This is a pattern we can express using a regex: a leading substring of two or more (that is, [latex]c_1[/latex]) [latex]x[/latex]’s (<code>xx+</code>, with <code>+</code> denoting one or more repetitions) followed by one or more (that is, [latex]c_2 - 1) of its replicas (\1+, with \1 denoting a backreference to the first matched substring). In all, regex (xx+)\1+ matches the composite numbers greater than one in unary notation. If we want to handle all natural numbers, we can use alternation (the | operator) to specify the special cases of 0 and 1 as composite: (xx+)\1+ | x? (with ? denoting zero or one occurrences, and assuming | has lower precedence than the other operators). We'd better do that as no mathematician since Lebsgue considers 1 a prime according to Wikipedia 🙂

The power of regexes

What is then the expressive power of regexes? Our primes example indicates that backreferences can express languages in classes beyond context-free. Precisely, we've shown that language P is not context-free and that regexes can express the complement language \overline{P}, but regexes are not explicitly closed under complement. However, it's clear that \overline{P} is also neither regular (because regular languages are closed under complement) nor context-free (because unary context-free languages boil down to regular languages). Alternatively, we can consider languages whose strings have the form w\,w for any string w, which are clearly expressible by means of regexes with backreferences ((.*)\1) but are neither regular nor context free (as one can show using, say, the pumping lemmas).

On the other hand, regexes seems incapable of expressing all context-free languages; in particular, they cannot express the languages of balanced characters such as parentheses — known as Dyck languages in formal language theory —, which are paradigmatic examples of context-free languages. Consider D = \{ a^n b^n \mid n > 0 \}, which is well-known to be context-free but not regular. Short of a rigorous proof, an argument that D is inexpressible by regexes could work as follows. As we match a regex against strings in D, each pair of captured pattern and backreference to it must occur either entirely within the a^n substring or entirely within the b^n substring. Otherwise, the backreference would deviate from the structure of strings in D where all b’s follow all a’s. This constraint, however, means that backreferences cannot carry any information between the a^n and the b^n parts of the string. All that is left is then limited to regular languages; hence it cannot express language D.

I have to remark again that this discussion applies to classic Unix/POSIX regexes, whose only non-regular operators are backreferences. Modern scripting languages often support a much larger superset of "Perl Compatible Regular Expressions" (PCREs) whose expressiveness far exceeds that of Unix/POSIX regexes as they include fully recursive patterns and even the possibility of invoking arbitrary code as subpatterns are matched. See this article for a nice overview of the expressiveness of PCREs as they are available in PHP; and this Stack Overflow discussion about how Java regexes can match patterns such a^n b^n.

Enumerating primes in Bash

After this long detour it's time to go back to the original problem of enumerating primes. Equipped with a descriptive characterization of composite numbers — the regex (xx+)\1+ | x? — let's turn ourselves to Unix command line tools that can interpret it. As an additional challenge in the original spirit of solving the problem descriptively, we'll try to avoid explicit loops entirely in our program of sorts for enumerating primes.

First, since our regex works on unary strings, we need a way to convert numbers from decimal to unary. We can use printf's format strings to this end: the format specifier %*s reads a numeric argument followed by a string argument and prints the latter as a space-padded string whose length is the numeric argument. If the string argument is absent, it just prints as many spaces as the value of the numeric argument. Then, we use tr to turn spaces into x's and voilà: our unary representation. For example, to print 24 in unary:

There's an alternative that uses printf's format specifier %.0s without requiring tr: printf 'x%.0s' {1..24}. See printf's documentation for an explanation of how it works.

Second, we have to call the above snippet involving printf for all numbers that we want to test. Instead of an explicit loop, we can generate a sequence {0..N} of nonnegative integers up to N and pass its values on to printf using xargs's option -n1, which dispatches one argument value per invocation of printf. However, printf's format string now needs a trailing newspace character to separate unary numbers by line.

It's finally time to use our regex. There's a minor issue here regarding the distinction between POSIX basic and extended regexes. The former include backreferences but no alternation; the latter include alternation but no backreferences. This is hardly a problem in practice, as most GNU tools (the one I'm assuming we're using) support backreferences using both syntaxes. As for the alternation, we don't really need to include it explicitly in the regex; instead, we use grep's option -e to specify multiple regexes and have the tool match any of them. We need to know about grep's options -E — to use extended regex syntax (just because it's nicer as it doesn't require to slash-escape metacharacters) — and -v to invert the match (because we want to keep the primes and discard the composites).

Don't try the previous snippet yet: it doesn't work! The problem is that grep matches the regexes against any portion of the input string. Since x? trivially matches the empty substring, the inverted match is the empty match. Let's introduce the ^ and $ markers to require that the regexes match the whole substring (which, by the way, is the standard interpretation of regular expressions in formal language theory). You can try and see that the following snippet enumerates primes up to 24 in unary.

The output in unary form is not very readable. We convert it back (without loops) using the expr utility, again with arguments dispatched one at a time using xargs. This is the only non-POSIX compliant part of the snippet. I couldn't find another way that doesn't use loops or variables; if you know one, please leave a comment. Anyway, we have our loopless prime enumerator!

A little empirical performance analysis

We have reasons to suspect that our prime enumerator is not very efficient. Enumerating primes up to 41 takes over 14 seconds on my desktop machine. This is not very good given that primality testing is in P and randomized algorithms such as Miller-Rabin are very fast in practice. It is also unsurprising: regex matching with backreferences uses backtracking to implement nondeterminism, which nullifies the otherwise linear-time performance of matching using finite-state automata. For the same reasons, memory usage should instead be roughly constant or grow very slowly: the size of the automaton used for the matching only depends on the regex to be matched and, linearly, on the length of the captured substring that is backreferenced.

Let's see if we can get some empirical data to corroborate this intuition. If the running time T(n) of our enumerator is exponential in the input, the ratio T(n) / 10^n should remain constant as the input size n increases. Input normally is measured in number of bits (or digits) for numerical algorithms; hence let n = \log(k), where k is the number we test for primality. Let's measure T(k) / k for each input k and see what it looks like. We allow ourselves to use loops this time, so that we can simplify the enumeration snippet using a for Bash loop.

We enumerate from 30 because this is when grep's running time starts to become non-negligible. It's time to use another command-line utility. Not to be confused with the homonym Bash command, utility time is normally located in /usr/bin/time and offers options to customize output using a format string. For each invocation of grep, we print the current value of k, and the CPU user time (%U) and maximum RAM resident set size (%M) taken by the process running grep. Since we now only care about these measures, we add the -q option to grep, which suppresses output. Finally, we redirect these statistics to awk, which will print on each line a triple k, T(k)/k, M(k) (T and M are time and memory). In all, run the command:

In case you're wondering, the 2>&1 is needed because time outputs to standard error, so a normal pipe won't work. In Bash version 4 and later you can write |& instead. Running the command on my machine gives the output:

It seems we were on the right track. Memory consumption doesn't noticeably grow. Time grows instead even more quickly than exponentially in the number of digits. If we play around a bit more, it seems that time approximately goes like a polynomial in k, and certainly asymptotically more slowly than a double exponential. See this article for a detailed discussion of the efficiency of regex matching implementations.


Was this just math? Just science? Just engineering? The sensible conclusion seems that it was a combination of all three of them: the theory of regular languages has mathematical flavor, their practical implementations in Unix tools is an engineering feat, and empirical analysis smells like experimental science. And that's the beauty of it!

Last week I attended the second Workshop on Software Correctness and Reliability organized by Peter Müller and Martin Vechev here at ETH. As in the first edition one year ago, the workshop featured an impressive collection of speakers who talked about diverse topics in verification. Videos of the talks will be publicly available, so there’s no need to write detailed summaries. Instead, I give my telegraphic impressions of each talk in the form of one-liners. If I used Twitter these could’ve been live tweets; except that I, like Machete, don’t tweet 🙂 (not at the moment anyway). As the name suggests, the impressions do not necessarily talk about the main intended message of each talk; but at least they’re short: if anything tickles your fancy go and watch the videos!

James Larus
Improving the software development culture is not the job of academics; their job is solving technical problems.
Martin Rinard
Just because it passes all tests, it doesn’t mean the program is correct.
David Basin
Security is a form of enforceable safety.
Rajeev Alur
Regular functions or: the power of automata with strings.
Armando Solar-Lezama
Synthesis to improve the code of programmers who use loops where they could use SQL.
Anders Møller
Static analysis to check the code of programmers who use jQuery where they could use loops.
Martin Vechev
Machine learning to guess the meaning of minified (or otherwise obscure) programs.
Peter Müller
Teenage mobile app developers benefit from abstract interpretation.
Vijay Saraswat
Failing asynchronously is OK, as long as someone cleans up.
Ranjit Jhala
Piggybacking type systems to prove properties of (post-)functional languages.
(Rustan Leino featured in absentia as the one who knows what to do with universal quantifiers — the “SMT whisperer”.)
Ahmed Bouajjani
Checking linearizability may be hard, but checking trace inclusion need not be.

Since I skipped the last session, I cannot offer my impressions of Mayur Naik‘s and Manfred Broy‘s talks. If you were there or have watched the videos, feel free to do so in the comments.