Data Quality

An analyst’s job is never done

‘Don’t trust the data. If you’ve found something interesting, something has probably gone wrong!’ Maybe you’ve been there too? It was a key lesson I learnt as a junior researcher. It partly reflected my skills as an analyst at the time – the mistakes could well have been mine! But, not entirely.

You see I was working with cancer registration and deaths data which on occasion could show odd patterns due to changes in disease classifications, diagnosis developments or reporting practices. Take a close look and you could spot the step changes when a classification change occurred. Harder to spot might be the impact of a new treatment or screening programme. But sometimes there were errors too – including the very human error of using the wrong population base for rates.

I was reminded of this experience when Sir Ian Diamond, the National Statistician, spoke to the Health and Social Care Select Committee in May. He said (Q34):

“One of the things about good statisticians is that they are always just a little sceptical of the data. I was privileged to teach many great people in my life as an academic and I always said, “Do not trust the data. Look for errors.””

Sage advice from an advisor to SAGE!

The thing with quality is that the analyst’s job is never done. It is a moving target. In our Quality Assurance of Administrative Data guidance, we emphasise the importance of understanding where the data come from, how and why they were collected. But this information isn’t static – systems and policies may alter. And data sources will change as a result.

Being alert for this variation is an ongoing, everyday task. It includes building relationships with others in the data journey, to share insight and understanding about the data and to keep a current view about the data source. As Sir Ian went on to point out in his evidence, it should involve triangulating against other sources of data.

OSR recently completed a review of quality assurance in HMRC, at the agency’s invitation. It was a fascinating insight into the operation of the organisation and the challenges it faces. We used a range of questions to help inform our understanding through meetings with analytical teams. They told us that they found the questions helpful and asked if we would share them to help with their own quality assurance. So, we produced an annex in the report with those questions.

And we have now reproduced the questions in a guide, as prompts to help all statistics producers think about their data and about quality under these headings:

Understanding the production process
Tools used during the production process
Receiving and understanding input data
Quality assurance
Version control and documentation
Issues with the statistics

The guide also signposts to a wealth of excellent guidance on quality on the GSS website. The GSS Best Practice and Impact Division (BPI) supports everyone in the Government Statistical Service in meeting the quality requirements of the Code and improving government statistics. BPI provides a range of helpful guidance and training.

Quality Statistics in Government guidance is primarily intended for producers of statistics who need to ensure that their products meet expectations for statistical quality. It is an introduction to quality and brings together the principles of statistical quality with practical advice in one place. You will find helpful information about quality assurance of methods and data and how to design processes that are efficient, transparent and reduce the risk of mistakes. Reproducible Analytical Pipelines (RAP) and the benefits of making our analysis reproducible is also discussed. The guidance complements the Quality Statistics in Government training offered by the GSS Quality Centre.
Communicating quality, uncertainty and change guidance is intended for producers of official statistics who need to write about and communicate effectively information about quality, uncertainty and change. It can be applied to all sources of statistics, including surveys, censuses, administrative and commercial data, as well as estimates derived from a combination of these. There is also a Communicating quality, uncertainty and change training.
The GSS Quality Centre has developed a guidance which includes top tips to improve the QA of ad-hoc analysis across the GSS. Moreover, the team runs the Quality Assurance of Administrative Data (QAAD) workshop in which users can get an overview of the QAAD toolkit and how to apply it to administrative sources.
There is also a GSS Quality strategy in place which aims to improve statistical quality across the Government Statistical Service (GSS) to produce statistics that serve the public good.

Check out our quality question guide and let us know how you get on by emailing me at penny.babb@statistics.gov.uk – we would welcome hearing about your experiences. We are always on the look-out for some good examples of practice that we can feature on the Online Code.

Big Data Analytics

Last week I attended the Big Data Analytics Conference on Data Quality at the Francis Crick Institute.

It was a great event in an amazing space. It covered a huge range of applications, ranging from looking at data to identify election fraud through to the use of Big Data in medicine, via the creation of apps to disrupt the financial services market. And I spoke about what Big Data could learn from official statistics.

It’s easy to have two cynical reactions to a Big Data event. First, that Big Data isn’t a thing at all, but just a buzz word designed to make not-very-new things sound new; and second, that the Government is miles behind everyone else.

I think these reactions miss the point. This event showed that Big Data is as much as a cross-domain community as it is a set of specific techniques. And it’s clear from the event that this community is vibrant, committed and growing. And as I tried to show in my slides, there’s lots to learn from how large data sets are used in Government – about the pitfalls of neglecting data quality and also the innovations that emerge when Government analysts link multiple datasets, as they have done in Department for Education recently to look at ordinary working families.

Here are my slides [Big Data Presentation]. They capture a simple message: that focusing on the eternal principles of trustworthiness, quality and value is still the starting point of all good work with data and all good presentation of statistics.

So thanks to the organisers and speakers – it was a great event.

Ed Humpherson

Health statistics

In the last few weeks, we’ve made three comments on health statistics – one in England, about leaks of accident and emergency data; one in Scotland, on statistics on delayed discharges; and one on analysis at the UK level. They all show the importance of improving the public value of statistics.

On accident and emergency statistics, I wrote to the heads of key NHS bodies in England to express concern about recent leaks of data on performance.

Leaks of management information are the antithesis of what the Office for Statistics Regulation stands for: public confidence in trustworthy, high quality and high value information.

It’s really hard to be confident about the quality of leaked information because it almost always lacks context, description, or any guidance to users. On value, leaked information usually relates to a question of public interest, but it’s not in itself valuable, in the sense it’s not clear how it relates to other information on the same topic. Its separated, isolated nature undermines its value. And it’s hard for leaked information to demonstrate that it is trustworthy, because the anonymous nature of the “producer” of the information (the person who leaked it) means that motives can be ambiguous.

But leaks can highlight areas where there is concern about the public availability of information. And that was the constructive point of my letter: the NHS bodies could look into reducing the risk of leaks. One way of doing this would be to reduce the time lag between the collection of the information on accident and emergency performance, and its publication as official statistics. This lag is currently around 6 weeks – 6 weeks during which the performance information circulates around the health system but is not available publicly. Shorten this lag, I argue, and the risk of disorderly release of information may also reduce.

The comments on Scotland relate to the comparability of statistics across the UK. When NHS Scotland’s Information Services Division published its statistics on delayed discharge from NHS hospitals for February, the Cabinet Secretary for Health and Sport in the Scottish Government noted that these figures compared positively to the equivalent statistics in England.

This is of course an entirely reasonable thing for an elected representative to do – to comment on comparative performance. The problem was that the information ISD provided to users in their publication on how to interpret the Scottish statistics in the UK context was missing – it wasn’t clear that Scotland figures are compiled on a different basis to the England figures. So the comparison is not on a like for like basis. The difference wasn’t stated alongside the equivalent statistics for England either. This clarification has now been provided by ISD, and NHS England have agreed to make clearer the differences between the figures in their own publication.

For us, it’s really important that there is better comparability of statistics across the UK. While there are differences in health policy that will lead to different metrics and areas of focus, it’s quite clear that there is public interest in looking at some issues – like delayed discharge – across the four UK health systems.

In this situation, good statistics should help people make sound comparisons. Yet, with health and care being a devolved matter, there are some constraints on the comparability of statistics across England, Wales, Scotland, and Northern Ireland. And, to the untrained eye it is difficult for users to know what is or is not comparable – with delayed discharge data as a prime example. This is why we really welcome the recently published comparative work, led by Scottish Government, where statisticians have created a much more accessible picture of health care quality across the UK, pulling together data on acute care, avoidable hospital admissions, patient safety, and life expectancy/healthy life expectancy across all 4 UK countries.

Both these cases – the leaks and comparability – illustrate a broader point.

Health statistics in the UK should be much better. They should be more valuable; more coherent; in some cases more timely; and more comparable. If statistics do not allow society to get a clear picture in good time of what is going on, then they are failing to provide public value.