Reproducible Analytical Pipelines: Overcoming barriers to adoption

Executive Summary

Introduction to the review

Official statistics produced by governments should uphold the highest standards of trustworthiness, quality and value in order to serve the public good. In 2017 we championed the Reproducible Analytical Pipeline (RAP), a new way of producing official statistics developed by the Department for Culture, Media and Sport and the Government Digital Service. This approach involved using programming languages to automate manual processes, version control software to robustly manage code and code storage platforms to collaborate, facilitate peer review and publish analysis.
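The shift described above, from manual spreadsheet operations to scripted, re-runnable analysis, can be illustrated with a deliberately minimal sketch. The data, column names and figures below are invented for illustration; a real pipeline would read from a governed data source and sit under version control (for example in git) so that every published figure can be traced to the code and data that produced it.

```python
# Minimal sketch of a RAP-style production step: a calculation that might
# once have been done by hand in a spreadsheet, expressed as a script that
# gives the same output every time it is run. All data here is invented.
import csv
import io
import statistics

# Stand-in for a governed input file (e.g. data/regional_values.csv)
raw = io.StringIO(
    "region,value\n"
    "North,10\n"
    "South,14\n"
    "East,12\n"
)

rows = list(csv.DictReader(raw))
values = [int(row["value"]) for row in rows]

# The published summary is produced by code rather than manual
# copy-and-paste, so it is reproducible and open to peer review.
summary = {"n": len(values), "mean": statistics.mean(values)}
print(summary)  # → {'n': 3, 'mean': 12}
```

In a full pipeline, the script, its tests and the resulting outputs would be committed to a version-controlled repository and reviewed before publication.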

Since then, we have seen some excellent examples of RAP principles being applied across the Government Statistical Service (GSS), the cross-government network of all those who work on official statistics. However, through our regulatory work we have seen that there are often common barriers for teams and organisations wishing to implement RAP. These include access to the right tools and training, and statisticians having the time and support to carry out development work.

In Summer 2020 we set out our intention to further advocate for RAP principles in government statistics as part of our Automation and Data programme. We consider that RAP principles support all three pillars of the Code of Practice for Statistics: trustworthiness, quality and value.

In Autumn 2020 we launched this review. Our aim was to explore the current use of RAP principles across the GSS, identify what enables successful implementation and to understand what prevents statistics producers implementing RAP. We spoke to a variety of organisations that produce official statistics. This included the Office for National Statistics, UK government departments, devolved administrations, arms-length-bodies and voluntary adopters of the Code of Practice for Statistics. We also engaged with users of official statistics and stakeholders with a supportive or leadership role in this area, such as the GSS Best Practice and Impact Team and the office of the National Statistician. Finally, we drew on other available sources of evidence. These included Civil Service and GSS surveys and findings from our previous regulatory work. More information about how we carried out the review is provided in Annex 1: Approach to the review.

Our findings and recommendations

To enhance the trustworthiness, quality and value of official statistics through increased use of RAP principles, and to see RAP become the default approach to statistics, we make the following recommendations.

A consistent shared understanding of RAP and RAP principles is needed across the GSS. Building on their previous work to promote RAP, the Best Practice and Impact Team and RAP champions network should ensure that there is widespread awareness within the GSS of the recently developed minimum standard of RAP.
RAP is not only a change in tools – it involves a cultural change to the way that analysis is approached and carried out. The Analysis Function board and Directors of Analysis should consider how best to foster a culture where reproducible analysis is prioritised across government.
RAP principles support the highest standards of trustworthiness, quality and value and should be used as a way to enhance compliance with the Code of Practice for Statistics. The leadership of the GSS, including the National Statistician, should set a strategic direction for the use of RAP principles in official statistics.
Support and encouragement from senior leaders allows statistics producers to successfully and sustainably implement RAP. Organisations in the GSS should ensure that RAP principles are included in their analytical strategies.
Senior leaders responsible for strategies in their organisations must develop a good understanding of what RAP is, why it is required, and support an open culture of innovation.
The implementation of RAP principles is most successful when producers carry out their own development work and when a planned approach is taken – for example, having a good understanding of skill levels, training needs and existing processes. Statistics producers should take a managed approach to implementing RAP. Projects should be underpinned by senior support, sufficient resource and the required skills, training and mentoring support.
RAP is not all or nothing: implementing just some RAP principles will result in improvements. Statistics producers should consider what can be achieved easily and build on developments iteratively over time.
Programming and code management skills are essential for modern statistical analysis. The GSS People Committee should ensure that RAP-related skills such as coding and code management are considered core skills for statistics producers and included in future career frameworks, such as the competency framework.
Bespoke and targeted training is most successful, and statistics producers need access to advanced training on programming as well as introductory courses. The GSS should invest in advanced and bespoke training on RAP and RAP-related skills through the Analytical Learning Team. This should build on existing resources and be developed in collaboration with the Best Practice and Impact Team. Availability of training must be effectively communicated across the GSS so that everyone is aware of it.
Support from experts has a significant impact on the success of RAP projects. The GSS needs to invest in expert mentoring, for example through the Best Practice and Impact Team. Organisations that have the required skills and knowledge should support those that do not.
Access to the tools required for RAP, such as programming languages, version control software and code storage platforms, varies across organisations, and organisations are tackling the same technical problems with different results. A strategy for implementing RAP principles across the GSS should recommend tools which should be available to statistics producers. It should also provide guidance on the best approaches to solving common technical problems.

Ensuring statistical models command public confidence

Learning lessons from the approach to developing models for awarding grades in the UK in 2020

Executive Summary

Purpose of this report

In March 2020 the ministers with responsibility for education in England, Scotland, Wales and Northern Ireland announced the closure of schools as part of the UK’s response to the coronavirus outbreak. Further government announcements then confirmed that public examinations in summer 2020 would not take place.

The four UK qualification regulators – Ofqual (England), Scottish Qualifications Authority (Scotland), Qualifications Wales (Wales) and the Council for the Curriculum, Examinations & Assessment (Northern Ireland) – were directed by their respective governments to oversee the development of an approach to awarding grades in the absence of exams. While the approaches adopted differed, all approaches involved statistical algorithms.

When grades were released in August 2020, there was widespread public dissatisfaction centred on how the grades had been calculated and the impact on students’ lives. The grades in all four countries were re-issued based on the grades that schools and colleges had originally submitted as part of the process for calculating grades.

The public acceptability of algorithms and statistical models had not been such a prominent issue for so many people before, despite the rise in their use. As the regulator of official statistics in the UK, it is our role to uphold public confidence in statistics.

Statistical models and algorithms used by government and other public bodies are an increasingly prevalent part of contemporary life. As technology and the availability of data increase, there are significant benefits from using these types of models in the public sector.

We are concerned that public bodies will be less willing to use statistical models to support decisions in the future for fear of a public acceptability backlash, potentially hindering innovation and development of statistics and reducing the public good they can deliver. This is illustrated by the emphasis placed on not using algorithms during discussions of how grades would be awarded in 2021 following the cancellation of exams this year. For example, the Secretary of State for Education, when outlining the approach to awarding grades in January 2021, stated that “This year, we will put our trust in teachers rather than algorithms.”[1]

It is important therefore that lessons are learned for government and other public bodies who may wish to use statistical models to support decisions. This review identifies lessons for model development to support public confidence in statistical models and algorithms in the future.

The broader context: Models and algorithms

Throughout this report we have used the terms statistical model and algorithm when describing the various aspects of the models used to deliver grades. It should be noted, however, that terms such as statistical model, statistical algorithm, data-driven algorithms, machine learning, predictive analytics, automated decision making and artificial intelligence (AI), are frequently used interchangeably, often with different terms being used to describe the same process.

We consider that the findings of this review apply to all these data-driven approaches to supporting decisions in the public sector whatever the context.

Our approach: Lessons on building public confidence

This review centres on the importance of public confidence in the use of statistical models and algorithms and looks in detail at what it takes to achieve public confidence. The primary audiences for this review are public sector organisations with an interest in the use of models to support the delivery of public policy, both in the field of education and more broadly. This includes statisticians and analysts; regulators; and policy makers who commission statistical models to support decisions.

In conducting our review, we have adopted the following principles.

  • Our purpose is not to pass definitive judgments on whether any of the qualification regulators performed well or badly. Instead, we use the experiences in the four countries to explore the broader issues around public confidence in models.
  • The examples outlined in this report are included for the purposes of identifying the wider lessons for other public bodies looking to develop or work with statistical models and algorithms. These examples are therefore not an exhaustive description of all that was done in each country.
  • In considering these case studies, we have drawn on the principles of the Code of Practice for Statistics. While not written explicitly to govern the use of statistical algorithms, the Code principles have underpinned how we gathered and evaluated evidence, namely:

Trustworthiness: the organisational context in which the model development took place, especially looking at transparency and openness

Quality: appropriate data and methods, and comprehensive quality assurance

Value: the extent to which the models served the public good.

  • We considered the end-to-end processes, from receiving the direction from Ministers to the awarding of grades and the planned appeals processes, rather than just the technical development of the algorithms themselves.
  • We have drawn on evidence from several sources. This included meeting with the qualification regulators and desk research of publicly available documents.
  • We have undertaken this review using our regulatory framework, the Code of Practice for Statistics. It is outside our remit to form judgments on compliance or otherwise with other legal frameworks.

We have also reviewed the guidance and support that is available to organisations developing statistical models and algorithms to identify whether it is sufficient, relevant and accessible and whether the available guidance and policies are coherent. Independent reviews of the grade awarding process have been commissioned by the Scottish Government, Welsh Government and Department of Education in Northern Ireland. Whilst there are some overlaps in scope with our review, there are also key differences – most notably, the reviews sought to review the approach to awarding grades in order to make recommendations for the approach to exams in 2021. Our review goes wider: it seeks to draw lessons from the approaches in all four countries to ensure that statistical models, whatever they are designed to calculate, command public confidence in the future.


The approaches to awarding grades were regulated by four bodies:

  • In England, Office of Qualifications and Examinations Regulation (Ofqual)
  • In Scotland, Scottish Qualifications Authority (SQA)
  • In Wales, Qualifications Wales
  • In Northern Ireland, Council for the Curriculum, Examinations & Assessment (CCEA).

Although the specific approaches differed in the four countries, the overall concepts were similar, in that they involved the awarding of grades based on a mix of teacher predicted grade, rankings of students within a subject, and prior attainment of the 2020 students and/or previous cohorts at the same centre (i.e. school or college).
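The overall concept can be made concrete with a deliberately simplified, purely illustrative sketch. This is not the method used by any of the regulators: the student names, historical grade shares and rounding rule below are all invented, and the real models involved many more inputs and safeguards.

```python
# Purely illustrative sketch of the broad concept: students are placed in
# rank order by their centre, and grades are assigned so that the cohort's
# distribution matches the centre's (hypothetical) historical shares.
# This is NOT any regulator's actual algorithm.
historical_share = {"A": 0.25, "B": 0.50, "C": 0.25}  # invented centre history

students_ranked = ["Asha", "Ben", "Chloe", "Dev"]  # best first, per teacher ranking


def assign_grades(ranked, shares):
    """Assign grades so the cohort matches the historical grade shares."""
    n = len(ranked)
    grades = {}
    cumulative = 0.0
    idx = 0
    for grade, share in shares.items():
        cumulative += share
        upper = round(cumulative * n)  # crude boundary; real models differed
        for student in ranked[idx:upper]:
            grades[student] = grade
        idx = upper
    return grades


print(assign_grades(students_ranked, historical_share))
# → {'Asha': 'A', 'Ben': 'B', 'Chloe': 'B', 'Dev': 'C'}
```

Even this toy version shows why individual-level review matters: a student's final grade depends on the centre's history and their rank position, not only on their own predicted grade.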

It was always going to be extremely difficult for a model-based approach to grades to command public confidence

The task of awarding grades in the absence of examinations was very difficult. There were numerous challenges that the qualification regulators and awarding organisations had to overcome. These included, but are not limited to:

  • The novelty of the approach, which meant that it was not possible to learn over multiple iterations and that best practice did not already exist.
  • The constraints placed on the models by the need to maintain standards and not disadvantage any groups.
  • The variability in exams results in a normal year due to a range of factors other than student ability as measured by prior attainment.
  • Tight timescales for the development and deployment of the model.
  • Decisions about young people’s lives being made on the day the grades were released.
  • Limited data on which to develop and test the model.
  • The challenges of developing the models while all parts of the UK were in a lockdown.
  • Teacher estimated grades that varied significantly from historic attainment for some schools or colleges.

These challenges meant that it was always going to be difficult for a statistical algorithm to command public confidence.

Whilst we understand the unique and challenging context in which the models were developed, we also recognise that the grade awarding process in summer 2020 had a fundamental impact on young people’s lives.

Public confidence was influenced by a number of factors

Against the background of an inherently challenging task, the way the statistical models were designed and communicated was crucial. This demonstrates that the implementation of models is not simply a question of technical design. It is also about the overall organisational approach, including factors like equality, public communication and quality assurance.

Many of the decisions made supported public confidence, while in some areas different choices could have been made. In our view, the key factors that influenced public confidence were:

The teams in all of the qualification regulators and awarding organisations acted with honesty and integrity. All were trying to develop models that would provide students with the most accurate grade and enable them to progress through the education system. This is a vital foundation for public confidence.

  • Confidence in statistical models in this context – whilst we recognise the unique time and resource constraints in this case, a high level of confidence was placed in the ability of statistical models to predict a single grade for each individual on each course whilst also maintaining national standards and not disadvantaging any groups. In our view, the limitations of statistical models, and the uncertainty in their results, were not fully communicated. More public discussion of these limitations and the mechanisms being used to overcome them, such as the appeals process, may have helped to support public confidence in the results.
  • Transparency of the model and its limitations – whilst the qualification regulators undertook activities to communicate information about the models to those affected by them and published technical documentation on results day, full details of the methodology to be used were not published in advance. This was due to a variety of reasons, including short timescales for model development, a desire not to cause anxiety amongst students and concerns about the impact on the centre assessed grades had the information been released sooner. The need to communicate about the model, whilst also developing it, inevitably made transparency difficult.
  • Use of external technical challenge in decisions about the models – the qualification regulators drew on expertise within the qualifications and education context and extensive analysis was carried out in order to make decisions about the key concepts in the models. Despite this, there was, in our view, limited professional statistical consensus on the proposed method. The methods were not exposed to the widest possible audience of analytical and subject matter experts, though we acknowledge that time constraints were a limiting factor in this case. A greater range of technical challenge may have supported greater consensus around the models.
  • Understanding the impact of historical patterns of performance in the underlying data on results – in all four countries the previous history of grades at the centre was a major input to calculating the grades that the students of 2020 received for at least some of their qualifications. The previous history of grades would have included patterns of attainment that are known to differ between groups. There was limited public discussion ahead of the release of results about the likely historical patterns in the underlying data and how they might impact on the results from the model. All the regulators carried out a variety of equality impact analyses on the calculated grades for potentially disadvantaged categories of students at an aggregate level. These analyses were based on the premise that attainment gaps should not widen, and their analyses showed that gaps did not in fact widen. Despite this analytical assurance, there was a perception when results were released that students in lower socio-economic groups were disadvantaged by the way grades were awarded. In our view, this perception was a key cause of the public dissatisfaction.
  • Quality Assurance – in the exam case, there were clear examples of good quality assurance of both input and output data. For input data, centres were provided with detailed guidance on the data they should supply. For output data, the regulators undertook a wide range of analysis, largely at an aggregate level. There was limited human review of outputs of the models at an individual level prior to results day. Instead, the appeal process was expected to address any issues. There was media focus on cases where a student’s grade was significantly different from the teacher prediction. In our view, these concerns were predictable and, whilst we recognise the constraints in this scenario, such cases should be explored as part of quality assurance.
  • Public engagement – all the qualification regulators undertook a wide range of public engagement activities, particularly at the outset. They deployed their experience in communicating with the public about exams and used a range of communication tools including formal consultations and video explainers, and the volume of public engagement activity was significant. Where acceptability testing was carried out, however, the focus was primarily on testing the process of calculating grades, and not on the impact on individuals. This, and the limited testing in some countries, may have led to the regulators not fully appreciating the risk that there would be public concern about the awarding of calculated grades.
  • Broader understanding of the exams system – in a normal year, individuals may not get the results they expect. For example, they may perform less well in an exam than anticipated. Statistical evidence and expert judgments support the setting of grade boundaries in a normal year. These may not be well understood in general but, as well-established processes, they are able to command public confidence. As a result, when the unfamiliar 2020 approach was presented publicly, people may have assumed that an entirely new, machine-led approach was being introduced, and this may have raised their concerns. This issue of broader understanding would have been very hard for the regulators to address in the time available.

Overall, what is striking is that, while the approaches and models in the four countries had similarities and differences, all four failed to command public confidence. This demonstrates that there are key lessons to be learned for government and public bodies looking to develop statistical models to support decisions. These lessons apply to those that develop statistical models, policy makers who commission statistical models and the centre of government.

Lessons for those developing statistical models

Our review found that achieving public confidence is not just about delivering the key technical aspects of a model or the quality of the communication strategy. Rather, it arises through considering public confidence as part of an end-to-end process, from deciding to use a statistical model through to deploying it.

We have identified that public confidence in statistical models is supported by the following three principles:

  • Be open and trustworthy – ensuring transparency about the aims of the model and the model itself (including limitations), being open to and acting on feedback and ensuring the use of the model is ethical and legal.
  • Be rigorous and ensure quality throughout – establishing clear governance and accountability, involving the full range of subject matter and technical experts when developing the model and ensuring the data and outputs of the model are fully quality assured.
  • Meet the need and provide public value – engaging with commissioners of the model throughout, fully considering whether a model is the right approach, testing acceptability of the model with all affected groups and being clear on the timing and grounds for appeal against decisions supported by the model.

Specific learning points, which are of relevance to all those using data-driven approaches to support decisions in the public sector, underpin each principle. These are detailed in Part 3 of this report.

Lessons for policy makers who commission statistical models

We have identified lessons for ensuring public confidence for commissioners of statistical models from the perspective of supporting those developing them.

  • A statistical model might not always be the best approach to meet your need. Commissioners of statistical models and algorithms should be clear what the model aims to achieve and whether the final model meets the intended use, including whether, even if they are “right”, they are publicly acceptable. They should ensure that they understand the likely strengths and limitations of the approach, take on board expert advice and be open to alternative approaches to meeting the need.
  • Statistical models used to support decisions are more than just automated processes. They are built on a set of assumptions and the data that are available to test them. Commissioners of models should ensure that they understand these assumptions and provide advice on acceptability of the assumptions and key decisions made in model development.
  • The development of a statistical model should be regarded as more than just a technical exercise. Commissioners of statistical models and algorithms should work with those developing the model throughout the end-to-end process to ensure that the process is open, rigorous and meets the intended need. This should include building in regular review points to assess whether the model will meet the policy objective.

Lessons for the centre of Government

For statistical models used to support decisions in the public sector to command public confidence, the public bodies developing them need guidance and support to be available, accessible and coherent.

The deployment of models to support decisions on services is a multi-disciplinary endeavour. It cuts across several functions of Government, including the Analysis Function (headed by the National Statistician) and the Digital and Data Function, led by the new Central Digital and Data Office, as well as others including operational delivery and finance. As a result, there is a need for central leadership to ensure consistency of approach.

The Analysis Function aims to improve the analytical capability of the Civil Service and enable policy makers to easily access advice, analysis, research and evidence, using consistent, professional standards. In an environment of increasing use of models, there is an opportunity for the function to demonstrate the role that analysis standards and professional expertise can play in ensuring these models are developed and used appropriately.

Our review has found that there is a fast-emerging community that can provide support and guidance on statistical models, algorithms, AI and machine learning. However, it is not always clear what is relevant and where public bodies can turn for support – the landscape is confusing, particularly for those new to model development and implementation. Although there is an emerging body of practice, there is only limited guidance, and few practical case studies, on the public acceptability and transparency of models. More needs to be done to ensure that public bodies have access to available, accessible and coherent guidance on developing statistical models.

Professional oversight should be available to support public bodies developing statistical models. This should include a clear place to go for technical and ethics expertise.

Our recommendations

These recommendations focus on the actions that organisations in the centre of Government should take. Those taking forward these recommendations should do so in collaboration with the administrations in Scotland, Wales and Northern Ireland, which have their own centres of expertise in analysis, digital and data activities.

Recommendation 1: The Heads of the Analysis Function and the Digital Function should come together and ensure that they provide consistent, joined-up leadership on the use of models.

Recommendation 2: The cross-government Analysis and Digital functions, supported by the Centre for Data Ethics and Innovation, should work together, and in collaboration with others, to create a comprehensive directory of guidance for Government bodies that are deploying these tools.

Recommendation 3: The Analysis Function, the Digital Function and the Centre for Data Ethics and Innovation should develop guidance, in collaboration with others, that supports public bodies that wish to test the public acceptability of their use of models.

Recommendation 4: In line with the Analysis Function’s Aqua Book, in any situation where a model is used, accountability should be clear. In particular, the roles of commissioner (typically a Minister) and model developer (typically a multi-disciplinary team of officials) should be clear, and communications between them should also be clear.

Recommendation 5: Any Government body that is developing advanced statistical models with high public value should consult the National Statistician for advice and guidance. Within the Office for National Statistics there are technical and ethical experts that can support public bodies developing statistical models. This includes the Data Science Campus, Methodology Advisory Service, National Statistician’s Data Ethics Committee and The Centre for Applied Data Ethics.

We will produce our own guidance in 2021 which sets out in more detail how statistical models should meet the Code of Practice for Statistics. In addition, we will clarify our regulatory role when statistical models and algorithms are used by public bodies.


The grade awarding process in 2020 was a high-profile example of public bodies using statistical models to make decisions.

In our view, the teams within the qualification regulators and awarding organisations worked with integrity to try to develop the best method in the time available to them. In each country there were aspects of the model development that were done well, and aspects where a different choice may have led to a different outcome. However, none of the models were able to command public confidence, and there was widespread public dissatisfaction with how the grades had been calculated and with the impact on students’ lives.

Our review has identified lessons to ensure that statistical models, whatever they are designed to calculate, can command public confidence in the future. The findings of this review apply to all public bodies using data-driven approaches to support decisions, whatever the context.

Our main conclusion is that achieving public confidence in statistical models is not just about the technical design of the model – taking the right decisions and actions with regard to transparency, communication and understanding public acceptability throughout the end-to-end process is just as important.

We also conclude that guidance and support for public bodies developing models should be improved. Government has a central role to play in ensuring that models developed by public bodies command public confidence. This includes directing the development of guidance and support, ensuring that the rights of individuals are fully recognised and that accountabilities are clear.

[1] The Secretary of State for Education, Covid-19: Educational Settings

Volume 686: debated on Wednesday 6 January 2021, Hansard

COVID-19 surveillance and registered deaths data review

Information available on COVID-19 cases and deaths has been developed rapidly in a constantly shifting environment. The work being done by analysts to get this information into the public domain is commendable. There will always be a desire for improvements to the timeliness and completeness of data, but this should not undermine the huge efforts being made by individuals and organisations to deliver timely data to support decision making and inform the public.

Our vision is statistics that serve the public good. We aim to support producers of statistics and data to achieve this while championing the needs of the public. We have undertaken a short review of the data releases on COVID-19 cases and deaths – at a UK level and for each country within the UK – to help understanding of the available sources and to highlight strengths and our view on areas for improvement. This document outlines the findings from our review, which is necessarily only a snapshot of what are very fast-moving developments.

In reviewing the various statistical outputs, we have been guided by the three pillars of the Code of Practice for Statistics: Trustworthiness, Quality and Value. Trustworthiness refers to the governance that surrounds the production of statistics; Quality refers to the characteristics of the data; and Value considers the extent to which the statistics answer users’ questions.

Summary of findings

There have been many developments to the data and supporting information available on COVID-19. Analysts have made huge efforts to deliver the information and have shown a willingness to address concerns and make rapid improvements.

There is great value in having timely data, such as the daily surveillance data covering the UK that is published less than 24 hours after the data reporting period. It provides an important leading indicator of the trend in COVID-19 testing, cases and deaths, which is essential to inform operational decisions being made at pace. However, the speed at which these data are made available means there has been a trade-off with completeness, and the limitations of the UK data have not been fully explained.

The nature and extent of the uncertainty around the UK estimates of deaths associated with COVID-19 has not so far been made clear. However, we are aware of efforts being made to improve the clarity and transparency of the material that accompanies the daily briefing, including drawing on support from the Government Statistical Service (GSS).

In contrast, the weekly death statistics published for England and Wales, Scotland and Northern Ireland provide a more complete measure of the number of deaths associated with COVID-19, but these statistics are released with a greater time lag.

ONS’s publication of its forward workplans in this area is a helpful development for stakeholders and it is important that other nations provide detail about their plans to keep users of the statistics informed. We understand that the GSS is considering the accessibility of all the information on COVID-19 to allow users to navigate all outputs from a central hub, such as the GSS health and care statistics landscape.

Areas for further development

  1. It is important to maintain public confidence in, and the trustworthiness of, statistics that are used to inform public debate. The nature and extent of the uncertainty around the UK estimates of deaths associated with COVID-19 should be clarified.
  2. All statistics producers should show they are actively considering the diverse and changing user needs for COVID-19 statistics by publishing detailed plans for improvements – for example, information about the occupancy of intensive care units or beds, or on person characteristics, such as ethnicity.
  3. The GSS should consider the accessibility of the information and allow users to navigate all COVID-19 related outputs from a central hub, such as the GSS landscape.

National Statistics Designation Review – Phase 1 Exploratory Review

The Office for Statistics Regulation (OSR) has conducted an exploratory review to see whether the time is right to look at the meaning and value of the National Statistics (NS) designation: does it meet the needs of official statistics in serving the public good in a data-abundant world? And, if required, what further development work should be undertaken?

This paper summarises the findings from the exploratory review in which we spoke to a range of stakeholders, to get an initial steer on the value and usefulness of the NS designation. It presents recommendations that OSR and official statistics producers can consider, to improve the information for users about the status of official statistics.

Please see our review page for further information about the National Statistics designation review.