“Wouldn’t it be cool if…

…we could look at this against x! And y. And maybe a, b and c too…”

This felt like quite a common conversation with my team, back when I was analysing data in the Department for Digital, Culture, Media and Sport (DCMS) circa 2015.

The number of interesting questions and analyses we could do with our data, if we could only put it together with other data, felt potentially limitless. And what an amazing benefit these analyses could have for society – we’d basically be able to understand and improve everything!

But it wasn’t meant to be. We did try and match our survey data with data held by one other department and… it was painful! It took months to get to the point of being able to physically share and receive data and, once we had some data, getting it ready to analyse proved tricky too. In fact, it proved so difficult that, I’m ashamed to admit, I moved roles before I managed it.

OSR continues to emphasise the power of linked data to produce better statistics. On paper, linking data sets might sound simple but, in practice, it is often difficult. This is why I’m so excited about the recent work we’ve seen from the Ministry of Justice (MoJ). MoJ is taking great steps to link up the administrative data sets it generates in its operational work, and to make them available for analysis by people outside of the department. This means that MoJ, and other interested parties, can more easily do analysis across different parts of the justice system, and beyond, to understand the journeys individuals take.

There are two projects I’d like to highlight:

1. Data First

In collaboration with ADR UK (Administrative Data Research UK), MoJ is undertaking an ambitious data linkage project called ‘Data First’. OSR’s 2018 review of The Public Value of Justice Statistics highlighted the need for statistics that move from counting people as they interact with specific parts of the justice system to telling stories about the journeys people take. Data First is doing just that! It will anonymously link data from across the family, civil and criminal courts in England and Wales, enabling research on how the justice system is used and enhancing the evidence base to understand ‘what works’ to help tackle social and justice policy issues.

In June, we were delighted to hear that Data First reached its first major milestone. The first research-ready dataset – a de-identified, case-level dataset on magistrates’ court use – was made available for accredited researchers through the Office for National Statistics (ONS) Secure Research Service (SRS). This data provides insight into the magistrates’ court user population, including the nature and extent of repeat users. For the first time, it enables researchers to establish whether a defendant has entered the courts on more than one occasion, and it will drive better policy decisions to reduce frequent use of the courts. In August, a second output followed, this time a de-identified, research-ready dataset on Crown Court use. This dataset is also available through the SRS.

2. Data shares with the Department for Education (DfE)

To improve understanding of the potential links between individuals’ educational outcomes and characteristics and their involvement, or risk of involvement, with crime and the criminal justice system, MoJ and DfE have created a de-identified, individual-level dataset, which links data from the Police National Computer (MoJ) and the National Pupil Database (DfE)[1]. The DfE data spans educational attainment, absence from school, exclusions and characteristics like special educational needs and free school meals eligibility. The MoJ data includes information on criminal histories and reoffending, court proceedings, prison and assessments of offenders. Linking this data will allow analysis that has previously not been possible, including: longitudinal analysis of trends in individuals’ characteristics and outcomes; analysis to inform the design of policies and processes that better support those at risk; and evaluations of the effectiveness of interventions. Accredited researchers can apply to access the data via the ONS SRS or MoJ’s Justice MicroData Lab.

This work follows The Children in Family Justice Data Share (CFJDS)[2], which started in 2012 and has resulted in a database of child-level data linked from across the MoJ, DfE and the Children and Family Court Advisory and Support Service (Cafcass). The CFJDS provides, for the first time, longitudinal data on the short and medium-term outcomes for children who experience the family justice system. The data are being used to build understanding of how different experiences and decisions made within the family court can impact on children’s educational outcomes, and subsequently, their life chances. In turn, they will provide more robust evidence on which to make policy decisions for children and their families.

What’s really exciting about both these projects is the way that the teams involved are tackling the challenges of data linkage. Instead of creating a big new IT system to try and join up the data, these projects are starting from a position of, “let’s take what’s in the current databases and see what we can get through anonymised matching.” The exact tools used vary between teams and departments but include established tools such as SAS Data Management Studio and SQL Server Management Studio (SSMS), which were used by MoJ and DfE respectively for linking crime and justice and NPD data. For data linkage done as part of Data First, MoJ has developed a new tool called Splink, which was written in the programming language Python. Splink is an open-source library for probabilistic record linkage at scale: it’s free, and MoJ hopes others in government (and beyond) will find it useful for their own data linkage and deduplication tasks. Rule-based matching algorithms, including ‘fuzzy-matching’ algorithms – rules used to link data based on non-perfect matches between data variables – have been used to link individuals within and between data sets.
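To give a flavour of what a rule-based ‘fuzzy’ match can look like, here is a minimal sketch in Python. It is not MoJ’s or DfE’s actual method and it does not use Splink’s API. It relies only on Python’s standard library, and the records, field names (`id`, `name`, `dob`), function names and the 0.85 threshold are all made up for illustration. The rule is simply: dates of birth must agree exactly, and names must be similar enough.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> bool:
    """Illustrative rule: exact date of birth AND a close-enough name."""
    if rec_a["dob"] != rec_b["dob"]:
        return False
    return similarity(rec_a["name"], rec_b["name"]) >= name_threshold


def link(left: list[dict], right: list[dict]) -> list[tuple[str, str]]:
    """Compare every left record with every right record; keep pairs that match."""
    return [(a["id"], b["id"]) for a in left for b in right if is_match(a, b)]


# Entirely fictional records, standing in for two separately held data sets.
courts = [
    {"id": "C1", "name": "Jonathan Smith", "dob": "1990-04-12"},
    {"id": "C2", "name": "Maria Garcia", "dob": "1985-11-02"},
]
schools = [
    {"id": "S1", "name": "Jonathon Smith", "dob": "1990-04-12"},
    {"id": "S2", "name": "Maria Garcia", "dob": "1979-01-30"},
]

links = link(courts, schools)
print(links)  # [('C1', 'S1')]
```

Here the misspelt ‘Jonathon’ still links to ‘Jonathan’ because the dates of birth agree and the names are nearly identical, while the two Maria Garcias are kept apart because their dates of birth differ. A probabilistic approach, such as the one Splink takes, weighs the evidence from several fields rather than applying a fixed pass-or-fail rule, and comparing every record with every other record, as this sketch does, would not scale to large administrative data sets.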

These projects show what can be achieved when government departments, agencies and external organisations work together, and will help us start to achieve what my team and I hoped we could back in 2015. They will enable us to better understand individuals and society and, in turn, to make better decisions and policies, which will improve the justice system and outcomes for all individuals. I’m looking forward to seeing what comes next.


[1] To ensure the confidentiality and protection of data about children, access to DfE data extracts from the NPD is managed through tightly controlled processes.

[2] https://www.gov.uk/government/statistics/family-court-statistics-quarterly-october-to-december-2017, published 29 March 2018

Joining Up Data

Jeni Tennison, CEO of the Open Data Institute, responds to our Joining Up Data for Better Statistics report.

Data is moving from being scarce and difficult to process to being abundant and easy to use. But harnessing its value for economic and social benefit – in ways that support innovation and deliver social justice – is not straightforward.

At the Open Data Institute (ODI), we would like to see a future where people, organisations and communities use data to make better decisions, more quickly. This would help our economies and societies to thrive. Using data and statistics well underpins research; enables us to innovate; informs the creation of more effective products, services and policies; and fuels discovery, economic growth and productivity.

In the future we would like to see, people can trust organisations to manage data ethically and benefits arising from data are distributed fairly. Data is used to meet the needs of individuals, communities and societies.

The Joining Up Data for Better Statistics review from the Office for Statistics Regulation (OSR) focuses on an essential part of this open, trustworthy data ecosystem: how to safely link together and share data from across different data stewards for analysis, research and generating statistics.

Data as roads

At the ODI, we often use the analogy of data being like roads. Where we use roads to navigate to a location, we use data to navigate to a decision.

The road analogy highlights the importance of joining up data. A single road only takes us between two locations; the real value of roads comes from being part of a network. Data works in the same way: it is not just having more data that unlocks its value, but linking it together. Data is not individual datasets, it is a network: a data infrastructure.

We can apply the ‘data as roads’ analogy to the Code of Practice for Statistics’ three pillars:

  • Roads are valuable when they go to places people want to go to; similarly, data and statistics add value when they help answer society’s questions.
  • Well-paved roads help us travel more quickly, but even rough tracks can be useful if you have the right vehicle – you need to know what to expect when you’re planning a journey; similarly, high-quality data is best, but lower quality data can be useful if you are aware of its limitations when drawing conclusions.
  • To avoid danger, we rely on engineers to use good practices to build and maintain roads, bridges and tunnels and on road users obeying the rules of the road; similarly, we rely on data custodians and data users to collect, maintain, use and share data in trustworthy ways.

Open and trustworthy

Like our road infrastructure, for our data infrastructure to generate value it has to be both as open as possible and trustworthy.

Data is more useful when more people can access and use it. It is most useful when it can be joined together. Data that is inaccessible – or where access takes so long it is rendered irrelevant – is of limited utility.

At the same time, greater access and linkage – particularly with personal data – can increase the potential for harmful impacts. The result of unethical, inequitable and untransparent use of data goes beyond direct impacts on affected individuals: it can undermine trust more widely, causing people to withdraw consent.

This ultimately affects the quality and representativeness of the data we have, the data we need to understand our populations, to meet their needs, and to innovate.

As the OSR’s review highlights, there is still much to do to increase both data’s openness and its trustworthiness. We need better technical guidance and approaches, through data trusts perhaps, but we also need to upskill data stewards so they can understand and weigh risks and benefits, quickly and well.

We are still learning how to share and join up data in open and trustworthy ways. Being open and transparent about the decisions we make as we use and share data can build trust and speed up this learning, so we can all benefit from data.

Joining Up Data for Better Statistics

To speak to people involved in linking Government datasets is to enter a world that at times seems so ludicrous as to be Kafkaesque. Stories abound of Departments putting up arcane barriers to sharing their data with other parts of Government; of a request from one public sector body being treated as a Freedom of Information request by another; and of researchers who have to wait so long to get access to data that their research funding runs out before they can even start work.

Our report, Joining Up Data for Better Statistics, published today, was informed by these experiences and more.

The tragedy is that it doesn’t have to be this way. We encountered excellent cases where data are shared to provide new and powerful insights – for example, on where to put defibrillators to save the most lives; how to target energy efficiency programmes to reduce fuel poverty; which university courses lead to higher earnings after graduation. These sorts of insight are only possible through joining up data from different sources. The examples show the value that comes from linking up data sets.

This points to a gap between what’s possible in terms of valuable insights, especially now the Digital Economy Act creates new legal gateways for sharing and linking data, and the patchy results on the ground.

It leads us to conclude that value is being squandered because data linkage is too hard and too rare.

We want to turn this on its head, and make data linkage much less frustrating. We point to six outcomes that we see as essential to support high quality linkage and analysis, with robust safeguards to maintain privacy, carried out by trustworthy organisations including the Office for National Statistics (ONS) and government Departments. The six outcomes are that:

  • Government demonstrates its trustworthiness to share and link data through robust data safeguarding and clear public communication
  • Data sharing and linkage help to answer society’s important questions
  • Data sharing decisions are ethical, timely, proportionate and transparent
  • Project proposal assessments are robust, efficient and transparent
  • Data are documented adequately, quality assessed and continuously improved
  • Analysts have the skills and resources needed to carry out high-quality data linkage and analysis

The report seeks to make things better. The six outcomes are the underpinnings of this. The report supports them with recommendations designed to help foster this new, better environment for trustworthy data linkage. The good news is that there is a strong coalition of organisations and leaders wanting to take this forward both inside and outside Government. This includes the National Statistician and his team at ONS, strong data linkage networks in Scotland, Wales and Northern Ireland, and new bodies like the Centre for Data Ethics and Innovation, UK Research and Innovation and the Ada Lovelace Institute. Alongside this blog we’re publishing a blog from Jeni Tennison, CEO of the Open Data Institute, which shows the strong support for this agenda outside Government.

We want statistical experts in Government, and those who lead their organisations, to achieve the six outcomes. When they do so, they will ensure that opportunities are no longer squandered. And the brilliant and valuable examples we highlight will no longer be the exception: analysts will be empowered to see data linkage as a core part of their toolkit for delivering insights.