Email correspondence: Sai to Office for Statistics Regulation (13 December 2024)
Dear OSR,
I respectfully request that you review the methodology of the statistics published by the MoJ Justice Data Lab, for (a) failure to conduct multiple comparisons correction (MCC) or otherwise account for family-wise error (FWE), and (b) possible p-hacking through failure to disclose all considered comparisons (e.g. ones that were unfavourable).
As background, I refer you to the classic and concise paper by Bennett, Baird, Miller, & Wolford, Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction, 47 NeuroImage S125 (2009) (PDF: https://www.psychology.mcmaster.ca/bennett/psy710/readings/BennettDeadSalmon.pdf).
By way of example, I direct your attention to the recent MoJ official statistics publication Justice Data Lab statistics: April 2024 (https://www.gov.uk/government/statistics/justice-data-lab-statistics-april-2024), and in particular its HMPPS CFO report (https://assets.publishing.service.gov.uk/media/66602fffdc15efdddf1a8756/HMPPS_CFO_JDL_Report.pdf). See also its general annex (https://assets.publishing.service.gov.uk/media/6628f8683b0122a378a7e5a0/Justice_Data_Lab_general_annex.pdf).
The report states at the bottom “Our statistical practice is regulated by the Office for Statistics Regulation (OSR).”
It lists 59 p-values, claiming that p-values of <0.05 (of which 26 are listed) are “significant”.
Neither the report nor the general annex makes any mention of MCC, nor of how many (if any) potential comparisons were evaluated but not published.
Supposing generously that there were no unpublished values (contrary to the annex, which suggests 14 measures are considered, multiplied by any number of possible comparison groups), one would still expect, by chance alone, that some of those 26 ‘significant’ p-values are in fact false positives. At minimum, the 8 listed p-values of >= 0.01 could not have been statistically significant (though they are claimed to be): the 59-way Bonferroni-corrected threshold for an MCC-adjusted p<0.05 would be p<0.00085, and even a very lenient FWE correction method for 59 comparisons would not place the threshold anywhere near 0.01. The p-values listed as <0.01 are not given to enough significant figures to assess whether those results are also spurious, but given the failure mode it seems very likely that most of those supposed p<0.01 results were, after MCC, not statistically significant.
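To make the arithmetic concrete, here is a minimal sketch in Python (the count of 59 comparisons and the 0.05 threshold are as described above; the assumption of independent tests is an idealisation):

    # Minimal sketch: family-wise error for 59 uncorrected comparisons at alpha = 0.05.
    # Assumes independent tests, which is an idealisation; real outcomes are correlated.
    m = 59          # number of p-values listed in the report
    alpha = 0.05    # nominal per-comparison significance threshold

    # Bonferroni-corrected per-comparison threshold for a family-wise alpha of 0.05
    print(f"Bonferroni threshold: {alpha / m:.5f}")                      # ~0.00085

    # Expected number of 'significant' results by chance alone, with no correction
    print(f"Expected false positives under the null: {m * alpha:.1f}")   # ~3.0

    # Probability of at least one false positive across the family (under independence)
    print(f"FWE rate without correction: {1 - (1 - alpha) ** m:.2f}")    # ~0.95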
All of Justice Data Lab’s published statistics and reports have the same methodological problems. See: https://www.gov.uk/government/collections/justice-data-lab-pilot-statistics.
I respectfully submit that these are examples of clearly incorrect methods, resulting in claims of programme effects that are not trustworthy, contrary to practices Q2.1, Q2.3, Q2.4, Q3.1, & V3.2 of the OSR’s Code of Practice for Statistics. Given this very serious methodological failure, it is impossible to tell which programmes (if any) are actually effective, or how effective they are; perhaps some are, but the published reports are inadequate to show it either way. False claims of efficacy fundamentally undermine MoJ’s (and other official statisticians’) credibility and lead the reader to question all other claims made.
I request that you please review MoJ’s statistical methods, publications, and underlying data, and — should you agree that they are faulty — require them to retract any published claims of statistical significance that are not actually justified and publish corrected statistics (including all considered but unpublished values).
If you do find that MoJ JDL methodology was faulty, I request that you further investigate any other official statistics, by MoJ or otherwise, that may share the same methodological faults, and that you issue policies, guidance, training, etc. that would help to prevent such failures in future.
I request that you please email me with any updates on this matter.
Sincerely,
Sai
Email correspondence: Office for Statistics Regulation to Sai (6 February 2025)
Dear Sai,
Thank you for raising concerns about the statistical methods used by the Ministry of Justice (MoJ) to produce the Justice Data Lab statistics.
We recognise that multiple comparison correction (MCC) is widely considered to be best practice when carrying out a large number of comparisons on the same data. In line with the Code of Practice for Statistics, it is the responsibility of official statistics producers like MoJ to determine the methods that are used and to explain their choices. As such, we do not typically specify or require a particular method to be used, but we do require the producer to explain its decisions and set out why its chosen methods are appropriate for producing the statistics.
We note that some, but not all, Justice Data Lab reports explain MoJ’s rationale for not applying MCC; see for example the 2023 TSP reoffending report (pg. 45), which states that “While multiple correction methods can be applied to reduce the risk of incorrectly finding a positive treatment effect, they can also increase the likelihood that real differences will not be detected.” MoJ told us that it will look to include this explanation in future Justice Data Lab publications where there are a large number of comparisons.
In examining these statistics, we have identified some improvements that we consider MoJ could make to the presentation and communication of the statistics. We will take forward these issues, and revisit your concerns about MCC, as part of a broader regulatory review of the trustworthiness, quality and value of the Justice Data Lab statistics in 2025/26. We are happy to keep you updated on this work.
Thank you again for raising this issue with us.
Kind regards,
OSR
Email correspondence: Sai to Office for Statistics Regulation (6 February 2025)
Dear OSR,
Thank you for your email.
“While multiple correction methods can be applied to reduce the risk of incorrectly finding a positive treatment effect, they can also increase the likelihood that real differences will not be detected.”
Stated at that level of generality, I agree that is technically true, but it does not really address the concern, and no ordinary reader will understand the impact of that caveat. I respectfully suggest that it is not merely misleading, but outright false, to claim that the outcome statistics are “significant” when the FWE rate is so high. A clearer statement of the likelihood of both false negatives and false positives, on each line item and in the summaries, would communicate this better.
Bonferroni correction is one of the strictest MCCs, and I do not suggest that it is the only reasonable one to apply in all cases. There are many others, suitable for different particular situations, which reduce the false negative rate while maintaining an acceptable FWE false positive rate. Their choice requires nuance and tailoring to the situation.
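As an illustrative sketch only (Python, using the statsmodels library; the raw p-values below are hypothetical and not taken from any JDL report), the same set of p-values can be adjusted by Bonferroni, by Holm (which also controls FWE but is uniformly less conservative), or by Benjamini-Hochberg (which controls the false discovery rate rather than FWE):

    # Illustrative sketch: comparing multiple-comparison corrections on hypothetical p-values.
    # Requires statsmodels; the raw p-values are invented for illustration only.
    from statsmodels.stats.multitest import multipletests

    raw_p = [0.0003, 0.004, 0.009, 0.02, 0.03, 0.04, 0.2, 0.6]

    for method, label in [("bonferroni", "Bonferroni (controls FWE; strictest)"),
                          ("holm", "Holm (controls FWE; less conservative)"),
                          ("fdr_bh", "Benjamini-Hochberg (controls FDR, not FWE)")]:
        reject, adjusted, _, _ = multipletests(raw_p, alpha=0.05, method=method)
        print(label)
        for p, adj, sig in zip(raw_p, adjusted, reject):
            print(f"  raw p = {p:<6}  adjusted p = {adj:.4f}  significant = {sig}")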
I note that a simultaneously necessary step, which MoJ seems to have entirely omitted (given that it appears to consider several hundred possible outcomes, choosing which outcomes to report only after looking at the data), is simply to designate a limited number of specific metrics ahead of time as the expected outcomes, and to claim significance only if those metrics are MCC-significant.
That way, the corrected p-value thresholds required for the pre-determined outcomes to be considered significant are not as stringent. If possibly-significant outcomes show up in other metrics, they should be reported as being “of interest but not statistically significant”, and then evaluated on the basis of follow-up data collection. The same can also be achieved by treating a one-time random blind subset of the data as an initial pilot, which makes no claims of significance, in order to determine which outcomes to evaluate, and then evaluating only the remaining data against those chosen outcomes. This can be done pre hoc (as with an actual pilot) or post hoc (by one-time random sample), provided the analysis is absolutely rigorous and trustworthy, preferably by firewalling the jobs of data partitioning and data analysis so that the analyst only sees the test data set after they have committed to an analysis choice based on the sample data set.
It is, however, absolutely critical that the choice of metrics to evaluate for significance be made before any data are known; otherwise it is the sharpshooter fallacy. They must also report all data looked at, for the same reason, especially data that are not favourable to their preferred outcomes, including null results.
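As a minimal sketch of the one-time random split described above (Python; the record identifiers, seed, and split proportion are purely hypothetical):

    # Minimal sketch of a one-time random split into an exploratory 'pilot' subset
    # (used only to choose which outcomes to test) and a held-out confirmatory subset
    # (used for the actual significance claims). Identifiers and proportions are hypothetical.
    import numpy as np

    rng = np.random.default_rng(seed=20241213)  # fixed seed: the split is made exactly once
    record_ids = np.arange(10_000)              # hypothetical participant records
    rng.shuffle(record_ids)

    n_pilot = len(record_ids) // 5              # e.g. 20% pilot, 80% held out
    pilot_ids = record_ids[:n_pilot]
    confirmatory_ids = record_ids[n_pilot:]

    # The analyst sees only pilot_ids, commits in writing to a small set of outcomes,
    # and only then is given confirmatory_ids; MCC is applied over that committed set.
    print(len(pilot_ids), len(confirmatory_ids))  # 2000 8000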
Lastly, I suggest that it would help if MoJ were to report all t-scores, z-scores, p-values, MCC-adjusted p-values, etc., and to do so to full precision, not merely reported as “<0.01” when below 0.01. In almost any analysis with multiple outcomes, “p ≤ 0.01” is simply not sufficient to distinguish true positives from multiple-comparisons noise, and the resulting reports are therefore impossible to evaluate independently.
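To illustrate why (with hypothetical figures): two raw values that would both be reported only as “<0.01” lead to opposite conclusions after a 59-way Bonferroni correction, and only the exact figures allow a reader to recompute the adjustment:

    # Illustration with hypothetical raw p-values: "<0.01" is not enough information
    # once a 59-way Bonferroni correction is applied.
    m = 59
    for raw_p in (0.009, 0.0005):   # both would be reported only as "p < 0.01"
        adjusted = min(1.0, raw_p * m)
        print(f"raw p = {raw_p}: Bonferroni-adjusted p = {adjusted:.3f}, "
              f"significant at 0.05 = {adjusted < 0.05}")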
“In examining these statistics, we have identified some improvements that we consider MoJ could make to the presentation and communication of the statistics.”
I would appreciate it if you could please tell me what those are.
“We are happy to keep you updated on this work.”
I would appreciate that as well.
I respectfully ask that you please consider issuing government-wide guidance on MCC and other sharpshooter fallacy & p-hacking issues (such as choosing which statistics to publish or evaluate only after looking at the data, which irretrievably corrupts an analysis). I ask that this guidance cover both how and when to apply which methods, as well as appropriate, accurate language about false negatives and false positives on both a per-item and summary basis (which should vary depending on the analytical approach adopted).
Lastly, I respectfully ask that you publish your actions (etc.) in this matter publicly. You have my permission to publish my emails in this correspondence to whatever extent you see fit.
Sincerely,
Sai
Email correspondence: Office for Statistics Regulation to Sai (18 February 2025)
Dear Sai,
Thank you for following up. We will consider your detailed points about MCC methods when we review the Justice Data Lab statistics.
You asked us about the improvements that we identified. We think that MoJ could be more transparent about its analysis plan and design, for example, by setting out what each intervention intends to measure. We also think that MoJ could be more careful in how it summarises the results.
Responsibility for producing government-wide guidance on statistical methods sits with the Government Statistical Service (GSS). We will seek broader methodological advice and raise the idea of guidance on MCC and p-hacking with GSS methodologists as part of our review.
We are happy to publish our correspondence.
Kind regards,
OSR