Bug counting « A little specification can go a long way

How do you determine, empirically, if a theory is correct?

A standard approach uses the toolset of statistical hypothesis testing, whose modern formulation was introduced by influential figures such as Fisher and Pearson around a hundred years ago. The first step consists in expressing the absence of an effect, the one we’re investigating empirically, as a null hypothesis. Then, a mathematical toolset exists that assesses the likelihood that an observed sample (the empirical data) is drawn under the assumption that the null hypothesis holds. Likelihood is normally given as a $p$ -value — the probability of the observation assuming the null hypothesis. A small $p$ -value (say, $p < 0.01[/latex]) means that the observation would be very unlikely to occur under the null hypothesis. This is compelling evidence against the hypothesis, which can be rejected with confidence: evidence suggests the effect is real. To make the discussion concrete, let me introduce an example taken from my own research. <a href="http://se.inf.ethz.ch/people/nordio/">Martin Nordio</a> conceived the idea of assessing the impact of using <a href="http://en.wikipedia.org/wiki/Agile_software_development">agile</a> development processes (such as <a href="http://en.wikipedia.org/wiki/Scrum_%28software_development%29">Scrum</a>) on the success of software projects carried out by distributed teams. With <a href="http://se.inf.ethz.ch/people/estler/">Christian Estler</a>, he set up a broad data collection effort where they queried information about more than sixty software projects worldwide. The overall goal of the study was to collect empirical evidence that following agile processes, as opposed to more traditional development processes, does make a difference. To this end, the null hypothesis was roughly: <blockquote> There is no difference in [latex]X$ for projects developed following agile processes compared to projects developed using other processes.

We instantiated $X$ for the several aspects of project outcome gathered in the data collection, including “overall success”, team “motivation” and “amount of communication”, and “importance” for customers. This gave us not one but six null hypotheses (extended to sixteen in the follow-up version of the work) to be rejected by the data.

Much to our dismay, when we sat down to analyze the data we found no evidence whatsoever to disprove any of the hypotheses. The $p$ -values were in the range $0.25..0.97$ — even greater than $0.75$ for four of the hypotheses! Was the collected data any good?

More generally, what can one conclude from lack of evidence against the null hypothesis, to wit large $p$ -values? In the orthodox Fisherian interpretation: nothing except that the available data is inconclusive. In Fisher’s own words [Fisher, 1935, emphasis added]:

the null hypothesis is never proved or established […] every experiment […] exist[s] only in order to give the facts a chance of disproving the null hypothesis

Such an approach, were facts are assessed only by disproving hypotheses, resonates with Popper’s notion of falsifiability, where only theories that can be conclusively rejected by experiments are considered scientific. The closeness of dates when Fisher’s and Popper’s works became popular is an inkling that the similarity may not be entirely coincidental.

And yet, the reality of science is a bit more nuanced than that. Regarding falsifiability, it is generally false (ahem 🙂 ) that empirical falsifiability is a sine qua non for a theory to be scientific. As Sean Carroll nicely explains, the most theoretical branches of physics routinely deal with models that are not testable (at least as of yet) but nonetheless fit squarely into the practice of science as structured interpretation of data. Science is as scientist does.

But even for theories that are well within the canon of testability and falsifiability, there are situations where the lack of evidence against a hypothesis is indicative of the hypothesis being valid. To find these, we should look beyond individual experiments, to the bigger picture of accumulating evidence. As experiments searching for an effect are improved and repeated on larger samples, if the effect is genuinely present it should manifest itself more strongly as the experimental methodology sharpens. In contrast, if significance keeps waning, it may well be a sign that the effect is absent. In such contexts, persistent lack of significance may be even more informative than its occasional presence. Feynman made similar observations about what he called “cargo cult science” [Feynman, 1985], where effects becoming smaller as experimental techniques improve is a sure sign of the effects being immaterial.

We drew conclusions along these lines in the study of agile processes. Ours was not the first empirical analysis of the effect of agile practices. Overall, the other studies provided mixed, sometimes conflicting, evidence in favor of agile processes. Our study stood out in terms of size of the samples (most other studies target only a handful of projects) and variety of tested hypotheses. In this context, the fact that the data completely failed to disprove any of the hypothesis was telling us more than just “try again”. It suggested that agile processes have no definite absolute edge over traditional processes; they can, and often do, work very well, but so can traditional processes if applied competently.

These results, presented at ICGSE 2012 in Brazil and later in an extended article in Empirical Software Engineering, stirred some controversy — Martin Nordio’s presentation, where he posited Maradona a better footballer than Pelé to an audience with a majority of Brazilian people, may have contributed some to the stirring. Enthusiasts of agile processes disputed our conclusions, reporting direct experiences with successful projects that followed agile processes. Our study should not, however, be interpreted as a dismissal of agile processes. Instead, it suggests that agile and other processes can be equally effective for distributed software development. But neither are likely to be silver bullets whose mere application, independent of the characteristics and skills of teams, guarantees success. Overall, the paper was generally well received and it went on to win the best paper award of ICGSE 2012.

There are two bottom lines to this post’s story. First: ironically, statistics seems to suggest that Martin might have been wrong about Maradona. Second: when your $p$ -values are too large don’t throw them away just yet! They may well be indirectly contributing to sustaining scientific evidence.

References

Ronald A. Fisher: The design of experiments, 1935. Most recent edition: 1971, Macmillan Pub Co.
Richard Feynman, Ralph Leighton, contributor, Edward Hutchings, editor: Surely You’re Joking, Mr. Feynman!: Adventures of a Curious Character, 1985. The chapter “Cargo cult science” is available online.

Good technical writing is largely an acquired skill. Like other elements of expertise necessary for research, one learns it by reading the classics, following expert‘s advice, and practicing a lot. After sufficiently extensive training and experience, one interiorizes rules and tips as second nature, and applies them whenever they are writing a technical piece.

In this post, I want to discuss what sometimes may happen as a side effect of extensively practicing the necessary imitative efforts that one goes through to master technical writing. Some assimilated rules may survive that are mere relics of outdated practices; others misinterpret the original intentions behind sound advice, or its right scope, and become unnecessary constraints that make for duller rather than more convincing writing. I call these “myths” of technical writing, because they are uncritically assumed in spite of lacking a convincing justification only because they appear, or are believed, to be common accepted practice. Some of them are about writing style, others are about typographical conventions.

Not all the myths I discuss have the same relevance or widespread usage. The selection is also somewhat a matter of personal opinion. My ultimate intent is to revisit some knee-jerk habits rather than uncritically perpetuating them without a convincing understanding of their rationale — as I certainly have done on occasion too. Since good writing is also a matter of taste, I believe that for every rule, and for every anti-rule, there are plausible exceptions. The bottom line really is: let’s be critical in choosing our writing style as we should be in our research activity.

Two columns are better than one.: A specter is haunting scientific publishing — the specter of two-column layouts. Multi-column page layouts were introduced at times when printing was expensive with the goal of cramming as much content as possible in the fewest pages. Compared to single-column layouts, two-column layouts require text in smaller fonts, complicate the placing of figures and tables, and make formatting of special text (such as equations and program code) and text justification more troublesome (a narrow horizontal space for text can accommodate few extra spaces for justification before it becomes ugly). All these features translate into less readable texts. I suspect the reason many publishers of scientific articles still insist on multi-column layouts is largely out of outdated practices rather than conscious decisions. I say we try to use single-column layouts whenever we have a choice, such as for technical reports and extended versions.

You can’t use contracted forms of verbs.: When I started writing research papers, I was told to absolutely avoid contracted verbal forms (can’t, won’t, aren’t, …); the reason given for this restriction was that they are not acceptable in formal, rigorous writing. You may have heard and followed similar rules; indeed, it seems that a large part of scientific writing avoids contracted forms. However, the justification for this recommendation is questionable: there’s nothing inherently informal, let alone sloppy, about contracted forms. Modern English writing, including technical texts, admits contracted forms as legitimate; up-to-date copyeditors do not correct them away. The recommendation to avoid them is still sensible when a true ambiguity may arise. For instance, “it’s” is the contracted form of both “it is” and “it has”: if the intended meaning is not clear from the context, it’s advisable to disambiguate by conjugating the verb in full. When ambiguity is not an issue, using contracted or full forms is a matter of flow and pace that each sentence should have. Not abusing contracted forms is the real rule to follow; but there is no need to ban them.

Italicizing foreign words is de rigueur.: Maybe in Elizabethan English; not anymore in the twenty-first century. Italic should mainly be used to emphasize words; just because a word has foreign origin it does not mean one should emphasize it. I recommend the following rule of thumb to decide whether a foreign word should be typeset in italic: if it’s included in a standard English dictionary, do not italicize; if it’s not included in the dictionary, are you sure you want to use the word at all?

The passive voice is to be avoided: we always avoid it.

I believe passive forms used to be quite common in scientific writing (they may still be so in some scientific fields other than computer science). The rationale: scientific content should be objective; since the passive voice can deemphasize the subject by omitting it, it makes it easy to achieve an impersonal style that can be mistaken for an objective one. However, a sweeping usage of passive forms tends to communicate indecisiveness and vagueness; therefore it has lost popularity in technical writing, where clarity and assertiveness are paramount.
Unfortunately, these considerations sometimes are turned into a ban of the passive voice, whereas they should be indications regarding when and how to use it to the best effect. Another incongruous reaction to the suggestion of limiting the usage of passive is overusing the first person plural — we — as sentence subject. Ironically, an indiscriminate usage of “we” has the same problem as an indiscriminate usage of passive voice: the subject becomes unclear and impersonality prevails. Indeed, some papers may use the first person plural for such diverse subjects as:

the paper’s text: “we describe in this paragraph” — this paragraph’s text describes;
the paper’s authors: “we hope that the results are useful” — the authors hope;
a subset of the paper’s authors: “we ran the experiments” — the graduate student ran;
an algorithm, program, or technique: “we generate a set of unit tests” — the test-case generation algorithm generates;
the paper’s authors together with readers: “we can see that the correlation is strong” — anyone looking at the figures can see.

The last form is the preferred meaning to be attributed to the first person plural. In all other cases, we’d better clarify the subject or use the passive voice to deemphasize it.

Different floats should be numbered independently: is Figure 2 before or after Table 1?: Another mainstream practice of typography is the habit of using independent numberings for different kinds of floats (such as figures, tables, and listings) and environments (such as numbered theorems, definitions, and remarks). For example, if a document contains two tables and two figures, they are referred to as Table 1, Table 2, Figure 1, and Figure 2. The problem with this practice is that the numbering does not carry any information about the relative order of the various floats: the first figure is Figure 1 regardless of whether other floats appeared before it. If it were readily available, the missing information would help readability by suggesting whether one should search forward or backward for a certain float given any current position in the paper. A similar argument applies to numbered environments. There are few authors who dare to change the common practice and adopt more practical numbering schemes. The classic Concrete mathematics [Graham, 1994], for example, numbers each theorem by the page of the book where it appears; it looks outlandish the first time you encounter it, but then it’s easy to understand and very convenient for browsing. I try to override the numbering scheme to one that is more reasonable whenever I write documents past a certain length — even if some copyeditors have occasionally enforced the standard, inconvenient scheme.

Never use citations as nouns; see [7] for an example.

Using citations as nouns — writing “see [3]” instead of “see Graham et al. [3]” — is possibly the only writing practice among those introduced to save every bit of space to comply with the strict page limits common to conference publishing that I find attractive for general usage even if it is normally shunned by copyeditors. It may not be very elegant, but there are cases in which none of the alternatives seems any better.
Of course, if we have no space constraints whatsoever, and are citing few references by well-known authors, writing their names out in full is clearer. But things are very different when we’re writing a “Related work” section full of citations and we are trying to combine readability and brevity. As a simple example, imagine three references [A], [B], and [C] all originating in the same research group, lead by Prof. Smith, but with slightly different (and long) authors lists. We would like to give a general introduction to the three works, and then perhaps single out [C] as the most closely related to our own. Which solution do you prefer?

Smith’s group has worked on baz in [A], [B], and [C]. [B] describes technique foo which is most similar to ours.
Prancer et al. [A], Donner et al. [B], and Blitzen et al. [C] have worked on baz. Donner et al. [B] describe technique foo which is most similar to ours.

I find the first option much more readable; I’m willing to accept a tad of informality in exchange for that.

Avoid repetitions; do not reiterate but use synonyms.: Of course superfluous repetitions should be avoided. But the suggestion to use synonyms is easily misunderstood as a requirement to uncritically use up the thesaurus as extensively as possible.
Italian journalist Cesare Marchi told a funny story [Quartu, 1986] to illustrate the perils of overdoing searching for synonyms. A student overly committed to avoiding repetitions had to write a passage on Hannibal’s armed expedition across the Alps during the second Punic war. The student mentioned that the Carthaginian general “crossed the Alps with elephants“, in the hope that Romans would be “shocked by those behemoths“; he went on telling the challenges of “climbing mountains with those mastodons“; he continued by describing the “death of many ungulate trunked herbivore mammals“. As you can see, avoiding repetitions is not a protection against unintended comedy.
Strunk and White [1999] give a great recommendation about repetitions: express coordinate ideas in coordinate form. Indiscriminately using synonyms may blur the connections between parts and thus render a text less clear and less cogent. There is another point to make in favor of repetition that is particularly relevant to technical writing: using a synonym for words that should have a precise well-defined meaning incurs the risk of insinuating the doubt regarding what the real writer’s intentions were. If I write of bugs in section 2 of my paper and go on about faults in section 3, am I referring to the same things in both sections? A little repetitiveness is worth having for greater clarity.

Paper structure is rigid; in particular, the first section is Introduction, the last section is Conclusion(s).: Have you noticed that the first section of most papers, invariably titled “Introduction”, is often not really introductory in character? More commonly, it contains motivations for the work presented in the paper and, more extensively, an overview or summary of the main content — more detailed than the abstract but still not requiring too much background or definitions. If this is the case, how come we always title that first section “Introduction”? This stale convention misses the opportunity to give it a more informative title, preferably connected to the actual content rather than completely generic. Or, if it really requires no title, why not omitting the title altogether? Though less strongly, similar observations apply to other standard sections of scientific articles. For example, it’s not compulsory to have a “Conclusion(s)” section unless there is some interesting material to be put there. In this respect, I welcome a practice of fields such as mathematics (as well as theoretical computer science) where there’s no stigma in ending a paper abruptly with a theorem or a proof. Provided, of course, that is the best usage of the available space.

IOKTUAP (i.e., It’s OK To Use Abbreviations Profusely): Computer science jargon is already rife with acronyms and abbreviations, possibly more than any other technical discipline; this should be enough a reason already to limit the introduction of new abbreviations to few cases of necessity. As a minimum, every abbreviation or acronym but the most widely used ones should be spelled out in full the first time it is introduced in a paper. However, if an abbreviation is used only once or twice in a whole document, it’s advisable to avoid introducing it in the first place.
A related problem when using abbreviations is how to choose indefinite articles: “an LTL formula” or “a LTL formula”? The rule of thumb I follow chooses according to whether the acronym is usually pronounced as such or read out in full. “LTL” is usually read “/ell tee ell/”, and hence the indefinite article should be “an”; in contrast, I prefer “a FOL formula” to “an FOL (/eff oh ell/) formula” since “FOL” is normally read out in full as “First Order Logic”.
As usual, transgressions against any of these rules may be excusable if trying to save space to conform to rigid page limits.

Sentences should be kept short.: Unqualified criticism about sentences being “too long” reminds me of the (possibly apocryphal, but popularized by the movie Amadeus) comment “too many notes” attributed to Emperor Joseph II after watching Mozart’s Die Entführung aus dem Serail (by the way, it’s in italic because it’s a title, not because it’s German ;-). I do not condone, let alone encourage, habits of writing long, convoluted, obscure sentences, but I object to the notion that there is a limit to how long a sentence should be that cannot be trespassed without sacrificing clarity. Sentence length, like many other issues discussed in this post, is a matter of style and content. Dry, somewhat minimalistic writing à la Dalton Trumbo does not best suit every text.
If we diligently polish our writing aiming for clarity and crispness and calibrate our style to maximize expressiveness, we will be able to quip in response to nonspecific criticism about excessive sentence length, like Mozart quipped in response to the Emperor, that, in our sentences, there are just as many words as there should be.

If you have more myths that you’d like to discuss, or if you disagree with parts of my assessment, you are welcome to leave a comment.

References

B. M. Quartu, editor: Dizionario dei sinonimi e dei contrari, Rizzoli, 1986. Foreword by Cesare Marchi.
William Strunk Jr. and E. B. White: The elements of style, 4th edition, Longman, 1986.
Ronald L. Graham, Donald E. Knuth, and Oren Patashnik: Concrete mathematics: A foundation for computer science, 2nd edition, Addison-Wesley, 1994.

Bug counting

A little specification can go a long way

What good are large p-values?

References

Ten myths about technical writing

References

Poetical programming: an attempt