A Microsoft Research paper from back in 2008 has recently been getting renewed attention after a blog post about it did the rounds on Twitter, Reddit, etc. The paper is titled “The Influence of Organizational Structure on Software Quality: An Empirical Case Study” and it looks at defining metrics to measure organizational complexity, and at whether those metrics are better at predicting the “failure-proneness” of software modules (specifically, those comprising the Windows Vista operating system) than other metrics such as code complexity.
The authors end up defining eight such “organizational metrics”, as follows (a rough sketch of how a few of the simpler ones might be computed appears after the list):
- Number of engineers – “the absolute number of unique engineers who have touched a binary and are still employed by the company”. The claim here is that higher values for this metric result in lower quality.
- Number of ex-engineers – similar to the first metric, but defined as “the total number of unique engineers who have touched a binary and have left the company as of the release date of the software system”. Again, higher values for this metric should result in lower quality.
- Edit frequency – “the total number of times the source code, that makes up the binary, was edited”. Again, the claim is that higher values for this metric suggest lower quality.
- Depth of Master Ownership – “This metric (DMO) determines the level of ownership of the binary depending on the number of edits done. The organization level of the person whose reporting engineers perform more than 75% of the rolled up edits is deemed as the DMO.” Don’t ask me (read the paper for more on this one), but the idea is that the lower the level of ownership, the higher the quality.
- Percentage of Org contributing to development – “The ratio of the number of people reporting at the DMO level owner relative to the Master owner org size.” Higher values of this metric are claimed to point to higher quality.
- Level of Organizational Code Ownership – “the percent of edits from the organization that contains the binary owner or if there is no owner then the organization that made the majority of the edits to that binary.” Higher values of this metric are again claimed to point to higher quality.
- Overall Organization Ownership – “the ratio of the percentage of people at the DMO level making edits to a binary relative to total engineers editing the binary.” Higher values of this metric are claimed to point to higher quality.
- Organization Intersection Factor – “a measure of the number of different organizations that contribute greater than 10% of edits, as measured at the level of the overall org owners.” Low values of this metric indicate higher quality.
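To make the simpler, edit-based metrics a bit more concrete, here is a minimal sketch (in Python) of how they might be computed from a per-binary edit log. Everything in it is hypothetical (the record structure, the names, the data); it only illustrates the definitions above, not the paper’s actual tooling.

```python
from collections import Counter

# Hypothetical edit log for one binary: each record notes who made the edit,
# whether that engineer is still employed at release, and their organization.
edits = [
    {"engineer": "alice", "employed": True,  "org": "OrgA"},
    {"engineer": "bob",   "employed": False, "org": "OrgA"},
    {"engineer": "carol", "employed": True,  "org": "OrgB"},
    {"engineer": "alice", "employed": True,  "org": "OrgA"},
    {"engineer": "dave",  "employed": True,  "org": "OrgB"},
]

# Edit frequency: total number of edits to the source that makes up the binary.
edit_frequency = len(edits)

# Number of engineers: unique editors still employed by the company.
engineers = {e["engineer"] for e in edits if e["employed"]}

# Number of ex-engineers: unique editors who have since left the company.
ex_engineers = {e["engineer"] for e in edits if not e["employed"]}

# Organization Intersection Factor: how many organizations contribute
# more than 10% of the edits to this binary.
edits_per_org = Counter(e["org"] for e in edits)
oif = sum(1 for count in edits_per_org.values() if count / edit_frequency > 0.10)

print(edit_frequency, len(engineers), len(ex_engineers), oif)
```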
These metrics are then used in a statistical model to predict failure-proneness of the over 3,000 modules comprising the 50m+ lines of source code in Windows Vista. The results apparently indicated that this organizational structure model is better at predicting failure-proneness of a module than any of these more common models: code churn, code complexity, dependencies, code coverage, and pre-release bugs.
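I won’t reproduce the paper’s modelling details here, but to illustrate the general idea: a standard way of predicting a yes/no outcome (failure-prone or not) from a handful of numeric metrics is logistic regression. The sketch below uses scikit-learn with entirely made-up data; it is not the paper’s actual model or dataset, just an illustration of how organizational metrics could feed such a classifier and be evaluated on precision and recall.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Made-up feature matrix: one row per binary, one column per organizational
# metric (eight in total), plus a made-up failure-proneness label.
rng = np.random.default_rng(0)
X = rng.random((300, 8))
y = (rng.random(300) < 0.3).astype(int)  # 1 = failure-prone, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple logistic regression classifier on the organizational metrics.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on held-out binaries using precision and recall.
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:", recall_score(y_test, y_pred, zero_division=0))
```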
I guess this finding is sort of interesting, if not very surprising or indeed helpful.
One startling omission from this paper is what constitutes a “failure”. Complicated statistical models are built from these eight organizational metrics and compared to other models (and really the differences in predictive power between them are not exactly massive), but nowhere does the paper explain what a “failure” actually is. This seems like a big problem to me. I literally don’t know what they’re counting – which is maybe just a problem for me – but, much more significantly, I don’t know whether the different models are all counting the same thing (which would be a big deal when comparing their outputs against one another).
Now, a lot has changed in our industry since 2008 in terms of the way we build, test and deploy software. In particular, agile ways of working are now commonplace and I imagine this has a significant organizational impact, so these organizational metrics might not offer as much value as they did when this research was undertaken (if indeed they did even then).
But, after reading this paper and the long discussions that have ensued online recently after it came back into the light, I can’t help but ask myself what value we get from becoming better at predicting which modules have “bugs” in them. On this, the paper says:
More generally, it is beneficial to obtain early estimates of software quality (e.g. failure-proneness) to help inform decisions on testing, code inspections, design rework, as well as financial costs associated with a delayed release.
I get the point they’re making here, but the information provided by this organizational-metrics model is not very useful in informing such decisions compared to, say, a coherent testing story revealed by exploratory testing. Suppose I predict that module X likely has bugs in it: then what? This data point tells me nothing about where to look for issues, or whether it’s even worth my while to do so given my mission for my stakeholders.
We spend a lot of time and effort in software development as a whole – and in testing specifically – trying to put numbers against things, perhaps as a means of appearing more scientific or accurate. When faced with questions about quality, though, such measurements are problematic. I thank James Bach for his very timely blog post in which he encourages us to assess quality rather than measure it; taking the time to read his post is, I suggest, time better spent than trying to make sense of over-complicated and meaningless pseudo-science such as that presented in the paper I’ve reviewed here.
(The original 11-page MS Research paper can be found at https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-11.pdf)