It’s really a combination of all three. More questions? ðŸ™‚

Seriously, the issue of the real “nature” of informatics (I don’t call it “computer science” only to avoid adding to the terminological confusion) — applied mathematics, science of information, or engineering discipline — remains a topic of intense debate in more than one circle. A related discussion is the relationship between informatics theoreticians and practitioners, and whether they can get along (yes they can and do, as Scott Aaronson patiently explains).

I must admit that discussions about these topics leave me mostly indifferent, if not mildly baffled. What we call “informatics” or “computer science” has grown from roots in different branches of science and engineering, so that its true “nature” is not in any single one of them but in their amalgam. This should be evident by looking at the history of modern informatics and its eminent pioneers. Its eclectic origins are a tremendous source of vitality and part of the reason why computer science and its applications have become ubiquitous in pretty much any other science to provide conceptual and practical analytical tools. And this is also a great deal of what makes computer science so interesting!

But perhaps I can make my position more convincing with an example. In the rest of this post, I select a computational problem and follow the train of thought that leads to a possible solution and implementation. My point is demonstrating how solving the problem satisfactorily requires a combination of theoretical and practical tools whose intimate connection cannot be severed without ruining their efficacy.

Take the problem of enumerating *prime* numbers: a number is prime if it is only divisible by itself and by one; a number that is not prime is called *composite*. Prompted by this blog’s motto (“A little specification can go a long way”), let’s approach the problem descriptively: we provide a simple characterization of prime numbers, and then use standard tools to match the characterization against any integer that we want to test for primality. The characterization is going to use regular expressions.

### Primes is not regular

Wait a minute! Is this even possible? This was my reaction too the first time I read (I can’t remember where or when, but there are several articles exploring this idea) about using regular expression to characterize primality. The notion of prime numbers involves divisibility and integer arithmetic, which are beyond the capabilities of regular expressions or their operational counterparts (finite-state automata), whose expressiveness defines the so-called *regular languages*.

In order to analyze the problem rigorously, we have to cast it into the framework of formal languages. Using a simple *unary* alphabet of letters , the language of prime numbers can be defined as the set of all strings of ’s whose length is a prime number:

Let’s show that the language of prime numbers is not regular. We could use the well-known pumping lemma, but a direct proof by contradiction is even more convincing. Assume that is regular; then there exists a deterministic finite state automaton that accepts precisely strings in . Automaton consists of a finite number of states; take any prime . (Side note: this is possible because there are infinitely many primes — if you’re not familiar with Euclid’s neat proof of this fact go check it out.) Consider the accepting path of when it is processed by : each transition consumes exactly one ; since there are more ’s in input than states in , the path must include a loop that goes from some state back to it. Call the length of such loop, that is the number of ’s consumed while traversing it. We can see that adding a number of ’s multiple of leads to an input that still accepts. This is because more ’s traverse the loop one more time, and the behavior of the automaton from on does not change if the trailing input is the same because is deterministic. Going back to the definition of , we have that accepts all strings of length for any nonnegative integer . In particular, it accepts the string of ’s. But this is incorrect behavior: is not prime because it is divisible by . This contradicts the hypothesis that recognizes language — the language of primes is not regular.

While we’re at it, let’s also prove that the language of primes is not even context free. We can leverage Parikh’s theorem for that. A convenient way to state the theorem is: every context-free language is commutatively equivalent to some regular language. This roughly means that, if is context-free, there exists a regular language whose strings have the same number of letters as those in irrespective of order. But order is immaterial for a language over unary alphabet such as — length is the only distinguishing feature — which entails that is context-free only if it is also regular. Since we have seen that is not regular, it’s not even context-free.

### Primes as a regex

The trick to circumvent the non-regularity of the language of primes is switching from regular expressions — à la Kleene, used in formal language theory — to regexes — à la Thompson, used in classic Unix text processing utilities and part of the POSIX standard. Perl’s Larry Wall championed this useful terminological distinction (regular expressions vs. regexes), which we’ll consistently use in this presentation. Crucially, the regex feature that greatly increases the expressive power of regular expressions are *backreferences*, which can express repetitions of previously captured substrings. In fact, it’s not hard to see that all other features of POSIX regexes are just syntactic sugar expressible using Kleene’s regular expressions.

Back to the problem of expressing the language of primes using regexes and backreferences. It’s actually easier to look at the complement problem: defining the language of *composite* numbers expressed as unary strings. A composite number has at least one divisor (other than one and itself); that is, it can be written as a product for ) of its replicas (`\1+`

, with `\1`

denoting a backreference to the first matched substring). In all, regex `(xx+)\1+`

matches the composite numbers greater than one in unary notation. If we want to handle all natural numbers, we can use alternation (the `|`

operator) to specify the special cases of 0 and 1 as composite: `(xx+)\1+ | x?`

(with `?`

denoting zero or one occurrences, and assuming `|`

has lower precedence than the other operators). We'd better do that as no mathematician since Lebsgue considers 1 a prime according to Wikipedia ðŸ™‚

### The power of regexes

What is then the expressive power of regexes? Our primes example indicates that backreferences can express languages in classes beyond context-free. Precisely, we've shown that language is not context-free and that regexes can express the *complement* language , but regexes are not explicitly closed under complement. However, it's clear that is also neither regular (because regular languages are closed under complement) nor context-free (because unary context-free languages boil down to regular languages). Alternatively, we can consider languages whose strings have the form for any string , which are clearly expressible by means of regexes with backreferences (`(.*)\1`

) but are neither regular nor context free (as one can show using, say, the pumping lemmas).

On the other hand, regexes seems incapable of expressing *all* context-free languages; in particular, they cannot express the languages of balanced characters such as parentheses — known as Dyck languages in formal language theory —, which are paradigmatic examples of context-free languages. Consider , which is well-known to be context-free but not regular. Short of a rigorous proof, an argument that is inexpressible by regexes could work as follows. As we match a regex against strings in , each pair of captured pattern and backreference to it must occur either entirely within the substring or entirely within the substring. Otherwise, the backreference would deviate from the structure of strings in where all ’s follow all ’s. This constraint, however, means that backreferences cannot carry any information between the and the parts of the string. All that is left is then limited to regular languages; hence it cannot express language .

I have to remark again that this discussion applies to classic Unix/POSIX regexes, whose only non-regular operators are backreferences. Modern scripting languages often support a much larger superset of "Perl Compatible Regular Expressions" (PCREs) whose expressiveness far exceeds that of Unix/POSIX regexes as they include fully recursive *patterns* and even the possibility of invoking arbitrary code as subpatterns are matched. See this article for a nice overview of the expressiveness of PCREs as they are available in PHP; and this Stack Overflow discussion about how Java regexes can match patterns such .

### Enumerating primes in Bash

After this long detour it's time to go back to the original problem of enumerating primes. Equipped with a descriptive characterization of composite numbers — the regex `(xx+)\1+ | x?`

— let's turn ourselves to Unix command line tools that can interpret it. As an additional challenge in the original spirit of solving the problem descriptively, we'll try to avoid explicit loops entirely in our program of sorts for enumerating primes.

First, since our regex works on unary strings, we need a way to convert numbers from decimal to unary. We can use `printf`

's format strings to this end: the format specifier `%*s`

reads a numeric argument followed by a string argument and prints the latter as a space-padded string whose length is the numeric argument. If the string argument is absent, it just prints as many spaces as the value of the numeric argument. Then, we use `tr`

to turn spaces into `x`

's and voilà: our unary representation. For example, to print 24 in unary:

1 |
printf '%*s' 24 | tr ' ' 'x' |

There's an alternative that uses `printf`

's format specifier `%.0s`

without requiring `tr`

: `printf 'x%.0s' {1..24}`

. See `printf`

's documentation for an explanation of how it works.

Second, we have to call the above snippet involving `printf`

for all numbers that we want to test. Instead of an explicit loop, we can generate a sequence `{0..N}`

of nonnegative integers up to `N`

and pass its values on to `printf`

using `xargs`

's option `-n1`

, which dispatches one argument value per invocation of `printf`

. However, `printf`

's format string now needs a trailing newspace character to separate unary numbers by line.

1 |
echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' |

It's finally time to use our regex. There's a minor issue here regarding the distinction between POSIX *basic* and *extended* regexes. The former include backreferences but no alternation; the latter include alternation but no backreferences. This is hardly a problem in practice, as most GNU tools (the one I'm assuming we're using) support backreferences using both syntaxes. As for the alternation, we don't really need to include it explicitly in the regex; instead, we use `grep`

's option `-e`

to specify multiple regexes and have the tool match any of them. We need to know about `grep`

's options `-E`

— to use extended regex syntax (just because it's nicer as it doesn't require to slash-escape metacharacters) — and `-v`

to invert the match (because we want to keep the primes and discard the composites).

1 |
echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' | grep -vE -e '(xx+)\1+' -e 'x?' |

Don't try the previous snippet yet: it doesn't work! The problem is that `grep`

matches the regexes against *any portion of the input string*. Since `x?`

trivially matches the empty substring, the inverted match is the empty match. Let's introduce the `^`

and `$`

markers to require that the regexes match the whole substring (which, by the way, is the standard interpretation of regular expressions in formal language theory). You can try and see that the following snippet enumerates primes up to 24 in unary.

1 |
echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' | grep -vE -e '^(xx+)\1+$' -e '^x?$' |

The output in unary form is not very readable. We convert it back (without loops) using the `expr`

utility, again with arguments dispatched one at a time using `xargs`

. This is the only non-POSIX compliant part of the snippet. I couldn't find another way that doesn't use loops or variables; if you know one, please leave a comment. Anyway, we have our loopless prime enumerator!

1 |
echo {0..24} | xargs -n1 printf '%*s\n' | tr ' ' 'x' | grep -vE -e '^(xx+)\1+$' -e '^x?$' | xargs -n1 expr length |

### A little empirical performance analysis

We have reasons to suspect that our prime enumerator is not very efficient. Enumerating primes up to 41 takes over 14 seconds on my desktop machine. This is not very good given that primality testing is in P and randomized algorithms such as Miller-Rabin are very fast in practice. It is also unsurprising: regex matching with backreferences uses backtracking to implement nondeterminism, which nullifies the otherwise linear-time performance of matching using finite-state automata. For the same reasons, memory usage should instead be roughly constant or grow very slowly: the size of the automaton used for the matching only depends on the regex to be matched and, linearly, on the length of the captured substring that is backreferenced.

Let's see if we can get some empirical data to corroborate this intuition. If the running time of our enumerator is *exponential* in the input, the ratio should remain constant as the input size increases. Input normally is measured in number of bits (or digits) for numerical algorithms; hence let , where is the number we test for primality. Let's measure for each input and see what it looks like. We allow ourselves to use loops this time, so that we can simplify the enumeration snippet using a `for`

Bash loop.

1 |
for k in {30..45}; do printf '%*s' "$k" | tr ' ' 'x' | grep -vE -e '^(xx+)\1+$' -e '^x?$'; done |

We enumerate from 30 because this is when `grep`

's running time starts to become non-negligible. It's `time`

to use another command-line utility. Not to be confused with the homonym Bash command, utility `time`

is normally located in `/usr/bin/time`

and offers options to customize output using a format string. For each invocation of `grep`

, we print the current value of `k`

, and the CPU user time (`%U`

) and maximum RAM resident set size (`%M`

) taken by the process running `grep`

. Since we now only care about these measures, we add the `-q`

option to `grep`

, which suppresses output. Finally, we redirect these statistics to `awk`

, which will print on each line a triple ( and are time and memory). In all, run the command:

1 |
for k in {30..45}; do printf '%*s' "$k" | tr ' ' 'x' | `which time` --quiet -f "$k : %U : %M" grep -qvE -e '^(xx+)\1+$' -e '^x?$'; done 2>&1 | awk -F: '{printf("%d\t%1.3f\t%d\n", $1, $2/$1, $3)}' |

In case you're wondering, the `2>&1`

is needed because `time`

outputs to standard error, so a normal pipe won't work. In Bash version 4 and later you can write `|&`

instead. Running the command on my machine gives the output:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
30 0.001 4320 31 0.001 4320 32 0.002 4320 33 0.004 4320 34 0.006 4176 35 0.009 4304 36 0.014 4304 37 0.022 4176 38 0.034 4320 39 0.053 4320 40 0.083 4208 41 0.130 4208 42 0.202 4304 43 0.319 4208 44 0.499 4320 45 0.785 4304 |

It seems we were on the right track. Memory consumption doesn't noticeably grow. Time grows instead even more quickly than exponentially in the number of digits. If we play around a bit more, it seems that time approximately goes like a polynomial in , and certainly asymptotically more slowly than a double exponential. See this article for a detailed discussion of the efficiency of regex matching implementations.

### Conclusion

Was this just math? Just science? Just engineering? The sensible conclusion seems that it was a combination of all three of them: the theory of regular languages has mathematical flavor, their practical implementations in Unix tools is an engineering feat, and empirical analysis smells like experimental science. And that's the beauty of it!