Data standards, quality and curation
The full value of data can only be unlocked if they are provided in a format that meets certain standards. Data that are used to produce official statistics should already have been through quality assurance processes, had an assessment of their suitability, and had any known biases identified as part of the original statistics production process. However, for the onward sharing to be effective, data should also be:
- Documented and supplied with metadata so that users can understand the provenance, properties and potential of datasets (including linked datasets) and individual data items and can replicate key analyses and variables. Disclosure control steps should also be clearly documented so users can understand how they impact the data, and data subjects can be reassured that their data are being handled safely. Documentation and metadata should be available for users to access before they apply for data unless restrictions are absolutely necessary (for example, for data safeguarding reasons).
- Consistent – the data made available should remain consistent with the standards applied at the time the data were collected. The data should also be consistent with the data used to produce any published estimates – this requires effective version control, good communication between statistics production teams and data supplier teams, and effective mechanisms to update users when changes are made to underlying data after they have been supplied.
- Linkable with other data sources – this will require the safe retention of identifiers in standardised formats and a demonstrable public good case for doing so under the General Data Protection Regulation.
- Timely – long delays between the publication of statistics and the provision of data diminish the value of the data in the same way that long delays between collection and publication can.
- Curated – data are now increasingly curated to support onward use by researchers, rather than simply supplied as a by-product of the statistical production process. Data curation is a very clear demonstration of a statistics producer’s commitment to increasing the value of data. It involves working with groups of users to identify their needs and developing datasets to meet them. This might involve linking more than one source together, which sometimes requires acting as a broker to work with other data suppliers to source suitable data.
- Trackable – enabling the onward tracking of a dataset’s use can help to demonstrate the impact of those data and how they have served the public good. Examples of how this can be done include providing links to published pieces of work in accessible formats, including any lay summaries provided when applications to use the data were originally submitted. Formal mechanisms to track datasets via digital object identifiers (DOIs) are not commonplace at present, but systems are being developed to enable this. Users can also be encouraged to fully cite the data they have used in any published work.
- Reproducible – the need for research outputs to be reproducible is recognised as an important way to maintain research integrity. Reproducibility aligns with the Code of Practice’s requirements for transparent processes around review and correction, helps to assure quality, and ensures that flawed analyses do not cause harm and undermine the value of data and statistics. To support this principle, data suppliers can place expectations on data users to provide documentation of their data preparation and analysis code. Platforms such as GitHub can help to host code. An archiving policy is also necessary to ensure users can access previous versions for reproducibility purposes. Reproducibility is more challenging when data are provided via secure settings, but models have been developed elsewhere to address this (for example by CASCAD in France).
- Explorable – the provisions that need to be in place to safeguard data will often mean that analysts cannot access any data until they have specified their research questions and gained the necessary approvals for their work and themselves. However, sometimes the process of developing research questions requires access to data beforehand to fully understand what can and can’t be done. Good metadata and documentation can help here, but they are no substitute for using the data directly. Synthetic data could provide a technical solution here. We encourage data providers to identify ways to support exploratory analyses, including work looking at the potential for synthetic data to achieve this, and at ways to enable an ongoing dialogue with users to support their use.
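To illustrate the linkability point above, the sketch below shows one common pattern for safely retaining identifiers in a standardised format: normalising each identifier and replacing it with a keyed hash, so the same person receives the same token across datasets without the raw identifier being stored. This is a minimal illustration only – the identifier format, key handling and function names are assumptions, and in practice pseudonymisation would be performed by a trusted party under an agreed governance framework.

```python
import hashlib
import hmac

def standardise_id(raw: str) -> str:
    """Normalise an identifier: upper-case and strip spaces/separators,
    so the same person's identifier matches across differently formatted sources."""
    return "".join(ch for ch in raw.upper() if ch.isalnum())

def pseudonymise(identifier: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): the same identifier always yields the same
    token, enabling linkage, but the raw identifier cannot be recovered
    without the key."""
    return hmac.new(key, standardise_id(identifier).encode(), hashlib.sha256).hexdigest()

# Placeholder key – in practice held by a trusted third party, never by analysts.
key = b"example-linkage-key"

token_a = pseudonymise("AB 12 34 56 C", key)  # as formatted in dataset A
token_b = pseudonymise("ab123456c", key)      # same person, dataset B formatting
```

Because both datasets carry the same token, records can be linked in a safe setting while the underlying identifier stays protected.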
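One practical way to support the reproducibility expectation above is for an analysis to record exactly which data version and environment it used. The sketch below builds a small manifest containing a SHA-256 fingerprint of the data file and the Python version; the field names and the `analysis.py` script name are hypothetical, not part of any formal standard.

```python
import hashlib
import json
import platform

def fingerprint(data_bytes: bytes) -> str:
    """SHA-256 digest that uniquely identifies the exact data version analysed;
    a later reproduction attempt can check its data against this value."""
    return hashlib.sha256(data_bytes).hexdigest()

# Stand-in for the bytes of the supplied data file.
data = b"id,value\n1,10\n2,12\n"

manifest = {
    "data_sha256": fingerprint(data),
    "python": platform.python_version(),
    "script": "analysis.py",  # hypothetical analysis script name
}

# A manifest like this can be archived alongside the analysis code,
# for example in the same GitHub repository.
manifest_json = json.dumps(manifest, sort_keys=True)
```

If the fingerprint of a re-supplied dataset differs from the recorded one, the reproducer knows immediately that the underlying data have changed rather than the analysis.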
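The suggestion above that synthetic data could support exploratory analysis can be illustrated very simply: one naive approach samples each column independently from its observed marginal distribution, so column frequencies look roughly realistic while no row corresponds to a real individual. The records and categories below are invented for illustration, and real synthetic data generation uses far more sophisticated methods with formal disclosure risk assessment.

```python
import random

# Invented example records: (sex, age band).
real = [("F", "0-15"), ("F", "16-64"), ("M", "16-64"), ("M", "65+"), ("F", "16-64")]

def synthesise(records, n, seed=0):
    """Draw synthetic rows by sampling each column independently from its
    observed values. This preserves approximate marginal frequencies but
    deliberately breaks any link between a synthetic row and a real person
    (and also discards correlations between columns)."""
    rng = random.Random(seed)
    columns = list(zip(*records))  # one tuple of values per column
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

synthetic = synthesise(real, 100)
```

Even a crude synthetic file like this lets analysts test what variables exist and draft their code before applying for access to the real data.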
All these requirements can be supported by data providers working collectively with each other to develop common standards and share best practice. The Working Group for Safe Data Access Professionals, Administrative Data Research UK and Health Data Research UK networks provide these opportunities.
Reproducible refers to the ability to achieve the same research outputs from the same data using the same methods and code. The broader term, replicable, refers to the ability to achieve the same research outputs from equivalent or similar data using the same method (but different code).