It was thanks to a recommendation from Michael Bolton that I came across the book Calling Bullsh*t by Carl T. Bergstrom and Jevin D. West. While it’s not a book specifically about software testing, there are some excellent takeaways for testers as I’ll point to in the following review of the book. This book is a must read for software testers in my opinion.
The authors’ definition of bullshit (BS) is important to note before digging into the content (appearing on page 40):
Bullshit involves language, statistical figures, data graphics, and other forms of presentation intended to persuade or impress an audience by distracting, overwhelming, or intimidating them with a blatant disregard for truth, logical coherence, or what information is actually being conveyed.
I was amazed to read that the authors already run a course at a US university on the same topic as this book:
We have devoted our careers to teaching students how to think logically and quantitatively about data. This book emerged from a course we teach at the University of Washington, also titled “Calling Bullshit”. We hope it will show you that you do not need to be a professional statistician or econometrician or data scientist to think critically about quantitative arguments, nor do you need extensive data sets and weeks of effort to see through bullshit. It is often sufficient to apply basic logical reasoning to a problem and, where needed, augment that with information readily discovered via search engine.
The rise of the internet and particularly social media are noted as ways that BS has proliferated in more recent times, spreading both misinformation (claims that are false but not deliberately designed to deceive) and disinformation (deliberate falsehoods).
…the algorithms driving social media content are bullshitters. They don’t care about the messages they carry. They just want our attention and will tell us whatever works to capture it.
Bullshit spreads more easily in a massively networked, click-driven social media world than in any previous social environment. We have to be alert for bullshit in everything we read.
As testers, we tend to have a critical thinking mindset and are hopefully alert to stuff that just doesn’t seem right, whether that’s the way a feature works in a product or a claim made about some software. It seems to me that testers should naturally be good spotters of BS more generally and this book provides a lot of great tips both for spotting BS and learning how to credibly refute it.
Looking at black boxes (e.g. statistical procedures or data science algorithms), the authors make the crucial point that understanding the inner workings of the black box is not required in order to spot problems:
The central theme of this book is that you usually don’t have to open the analytic black box in order to call bullshit on the claims that come out of it. Any black box used to generate bullshit has to take in data and spit results out.
Most often, bullshit arises either because there are biases in the data that get fed into the black box, or because there are obvious problems with the results that come out. Occasionally the technical details of the black box matter, but in our experience such cases are uncommon. This is fortunate, because you don’t need a lot of technical expertise to spot problems with the data or results. You just need to think clearly and practice spotting the sort of thing that can go wrong.
The first big topic of consideration looks at associations, correlations and causes and spotting claims that confuse one for the other. The authors provide excellent examples in this chapter of the book and a common instance of this confusion in the testing arena is covered by Theresa Neate‘s blog post, Testing and Quality: Correlation does not equal Causation. (I’ve also noted the confusion between correlation and causality very frequently when looking at big ag-funded “studies” used as ammunition against veganism.)
The chapter titled “Numbers and Nonsense” covers the various ways in which numbers are used in misleading and confusing ways. The authors make the valid point that:
…although numbers may seem to be pure facts that exist independently from any human judgment, they are heavily laden with context and shaped by decisions – from how they are calculated to the units in which they are expressed.
It is all too common in the testing industry for people to hang numbers on things that make little or no sense to look at quantitatively, counting “test cases” comes to mind. The book covers various ways in which numbers turn into nonsense, including summary statistics, percentages and percentage points. Goodhart’s Law is mentioned (in its rephrased form by Marilyn Strathern):
When a measure becomes a target, it ceases to be a good measure
I’m sure many of us are familiar with this law in action when we’re forced into “metrics programmes” around testing for which gaming becomes the focus rather than the improvement our organizations were looking for. The authors introduce the idea of mathiness here: “mathiness refers to formulas and expressions that may look and feel like math – even as they disregard the logical coherence and formal rigour of actual mathematics” and testing is not immune from mathiness either, e.g. “Tested = Checked + Explored” is commonly quoted from Elisabeth Hendrickson‘s (excellent) Explore It! book. Another concept that will be very familiar to testers (and others in the IT industry) is zombie statistics, viz.
…numbers that are cited badly out of context, are sorely outdated, or were entirely made up in the first place – but they are quoted so often that they simply won’t die.
There are many examples of such zombie statistics in our industry, Boehm’s so-called cost of change curve being a prime example (claiming that the cost of changes later in the development cycle is orders of magnitude higher than earlier in the cycle) and is of one the examples covered beautifully in Laurent Bossavit’s excellent book, The Leprechauns of Software Engineering.
The next statistical concept introduced in the book is selection bias and I was less familiar with this concept (at least under this name):
Selection bias arises when the individuals that you sample for your study differ systematically from the population of individuals eligible for your study.
This sort of non-random sampling leads to statistical analyses failing or becoming misleading and there are again some well-considered examples to explain and illustrate this bias. Reading this chapter brought to mind my recent critique of the Capgemini World Quality Report in which I noted that both the size of organizations and roles of participants in the survey was problematic. (I again note that from my vegan research that many big ag-funded studies suffer from this bias too.)
A hefty chapter is devoted to data visualization, with the authors noting the relatively recent proliferation of charts and data graphics in the media due to the technology becoming available to more easily produce them. The treatment of the various ways that charts can be misleading is again excellent with sound examples (including axis scaling, axis starting values, and the “binning” of axis values). I loved the idea of glass slippers here, viz.
Glass slippers take one type of data and shoehorn it into a visual form designed to display another. In doing so, they trade on the authority of good visualizations to appear authoritative themselves. They are to data visualizations what mathiness is to mathematical equations.
The misuse of the periodic table visualization is cited as an example and, of course, the testing industry has its own glass slippers in this area, for example Santhosh Tuppad’s Heuristic Table of Testing! This chapter also discusses visualizations that look like Venn diagrams but aren’t, and highlights the dangers of 3-D bar graphs, line graphs and pie charts. A new concept for me in this chapter was the principle of proportional ink:
Edward Tufte…in his classic book The Visual Display of Quantitative Information…states that “the representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.” The principle of proportional ink applies this rule to how shading is used on graphs.
The illustration of this principle by well-chosen examples is again very effective here.
It’s great to see some sensible commentary on the subject of big data in the next chapter. The authors say “We want to provide an antidote to [the] hype” and they certainly achieve this aim. They discuss AI & ML and the critical topic of how training data influences outcomes. They also note how machine learning algorithms perpetuate human biases.
The problem is the hype, the notion that something magical will emerge if only we can accumulate data on a large enough scale. We just need to be reminded: Big data is not better; it’s just bigger. And it certainly doesn’t speak for itself.
The topics of Big Data, AI and ML are certainly hot in the testing industry at the moment, with tool vendors and big consultancies all extoling the virtues of these technologies to change the world of testing. These claims have been made for quite some time now and, as I noted in my critique of the Capgemini World Quality Report recently, the reality has yet to catch up with the hype. I commend the authors here for their reality check in this over-hyped area.
In the chapter titled “The Susceptibility of Science”, the authors discuss the scientific method and how statistical significance (p-values) is often manipulated to aid with getting research papers published in journals. Their explanation of the base rate fallacy is excellent and a worthy inclusion, as it is such a common mistake. While the publication of dodgy papers and misleading statistics are acknowledged, the authors’ belief is that “science just plain works” – and I agree with them. (From my experience in vegan research, I’ve read so many dubious studies funded by big ag but these don’t undermine my faith in science, rather my faith in human nature sometimes!) In closing:
Empirically, science is successful. Individual papers may be wrong and individual studies misreported in the popular press, but the institution as a whole is strong. We should keep this in perspective when we compare science to much of the other human knowledge – and human bullshit – that is out there.
In the penultimate chapter, “Spotting Bullshit”, the discussion of the various means by which BS arises (covered throughout the book) is split out into six ways of spotting it, viz.
- Question the source of information
- Beware of unfair comparisons
- If it seems too good or bad to be true…
- Think in orders of magnitude
- Avoid confirmation bias
- Consider multiple hypotheses
These ways of spotting BS act as a handy checklist I think and will certainly be helpful to me in refining my skills in this area. While I was still reading this book, I listened to a testing panel session online and one of the panelists was from testing tool vendor, Applitools. He briefly mentioned some claims about their visual AI-powered test automation tool. These claims piqued my interest and I managed to find the same statistics on their website:
I’ll leave it as an exercise for the reader to decide if any of the above falls under the various ways BS manifests itself according to this book!
The final chapter, “Refuting Bullshit”, is really a call to action:
…a solution to the ongoing bullshit epidemic is going to require more than just an ability to see it for what it is. We need to shine a light on bullshit where it occurs, and demand better from those who promulgate it.
The authors provide some methods to refute BS, as they themselves use throughout the book in the many well-chosen examples used to illustrate their points:
- Use reductio ad absurdum
- Be memorable
- Find counterexamples
- Provide analogies
- Redraw figures
- Deploy a null model
They also “conclude with a few thoughts about how to [call BS] in an ethical and constructive manner”, viz.
- Be correct
- Be charitable
- Admit fault
- Be clear
- Be pertinent
In summary, this book is highly recommended reading for all testers to help them become more skilled spotters of BS; be that from vendors, testing consultants or others presenting information about testing. This skill will also come in very handy in spotting BS in claims made about the products you work on in your own organization!
The amount of energy needed to refute bullshit is an order of magnitude bigger than [that needed] to produce it.Alberto Brandolini (Italian software engineer, 2014)
After reading this book, you should have the skills to spot BS and I actively encourage you to then find inventive ways to refute it publicly so that others might not get fooled by the same BS.
Our industry needs those of us who genuinely care about testing to call out BS when we see it, I’m hoping to see more of this in our community! (My critique of the Capgemini World Quality Report and review of a blog post by Cigniti are examples of my own work in this area as I learn and refine these skills.)