DPM data inputs
5. In analysing the quality of the data, the ONS follows the guidelines of the European Statistical System, which set out the dimensions of data quality: relevance (meeting user needs), timeliness, coherence with other sources and internal comparability, accuracy and reliability, output quality and accessibility. Each of the DPM data inputs is assessed against those criteria (ONS 25/05/2023, 29/12/2023, 23/02/2024). Further, the development of Quantitative Quality Indicators (QQI) to quantify the quality of the administrative data is exemplary and could provide the DPM with key information about data quality, which is crucial for the outputs, as I explain later. However, I noted that the terms related to accuracy and bias are not always clearly described. For instance, more technical reports (Law et al. 2022) differentiate between bias (a systematic error) and accuracy (variance, or uncertainty, around an estimator), and clear guidelines have been created for assessing the quality of population estimates (ONS 2023c), whereas other documents (e.g., ONS 29/12/2023, 23/02/2024) tend to use a broader definition of accuracy that encompasses both bias and uncertainty. I thus recommend that the documentation distinguish clearly between bias and accuracy (i.e. uncertainty), especially for the purposes of informing the DPM about the quality of the data.
6. The inputs to the DPM are population stocks and flows. Stocks include a baseline population estimated from the 2011 Census for 30 June 2011 (i.e., the 2011 Census-based MYE) and snapshots of population counts captured in the administrative sources on a given day. Flows are counts of births and deaths (natural change) as well as international, cross-border and internal migration, usually between 1 July and 30 June of a given year. The descriptions of the data inputs are provided within the documentation of the model (e.g., Elliott & Blackwell 2023, ONS 14/07/2022) and through the websites documenting the developments of the model (e.g., ONS 27/06/2023b). However, the details of their production are somewhat difficult to navigate because of the extensive cross-referencing to previous documents/releases, and potentially also because the versions of the ABPEs that preceded the deployment of the DPM in July 2022 were rebranded as the Statistical Population Dataset, currently used as an input to the DPM (ONS 27/06/2023a). The ONS confirmed that a methods guide and an interactive R package containing each model specification will be published. This should enable stakeholders to better understand the details behind the production of the ABPEs.
Stocks
7. MYEs based on the 2011 Census are used as a baseline input to the DPM. The MYEs that follow a census are adjusted for errors in the components of population change in a process called rebasing and reconciling (details available in ONS 30/11/2017). The 2021 MYE stock estimates derived from the (rebased) 2021 Census have been used as an input to the model, as well as a gold-standard benchmark for assessing the ABPEs produced by the DPM without using them (ONS 28/02/2023a). The rebased MYEs for 2012 to 2020 were not used in the model. There is a general consensus that the rebased Census estimates are the best population measures available (e.g., ONS 18/12/2023), but the 2021 Census may have some limitations because it was carried out during the COVID-19 pandemic (ONS 23/11/2023c). The Patient Register (in 2012-2015) and the Personal Demographics Service (PDS), which is based on registrations with a GP, were also used as stocks (ONS 18/12/2023, Elliott & Blackwell 2023).
8. MYEs used as input to the DPM are subject to statistical uncertainty, which reflects their accuracy. This uncertainty is usually expressed in terms of confidence or credible intervals. The 2011 Census base had a confidence interval for the population of England and Wales of +/- 0.148 per cent (ONS 2012). This uncertainty is derived from the Census Coverage Survey (CCS). MYEs are built on this Census base and accumulate uncertainty the further the year is from the census, which is referred to as intercensal drift (ONS 25/05/2023, 29/12/2023, 23/02/2024; see also Point 20). Most of the uncertainty in population estimates further from the census year comes from international and internal migration (ONS 27/07/2020).
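The accumulation of uncertainty described above can be sketched numerically. The following is a minimal illustration, not the ONS method: it assumes that the census-base interval is a symmetric 95% normal interval, that each year's error in the components of change is independent with a constant standard deviation, and it uses a round base population of 56 million and an annual flow error of 30,000 persons, both of which are hypothetical values chosen only for illustration.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def drift_half_widths(base_population, base_rel_half_width_pct, annual_flow_sd, years):
    """Return 95% half-widths (in persons) for 0..years years after the census.

    Assumption: annual errors are independent with a constant standard
    deviation, so variances add up linearly over time.
    """
    base_sd = base_population * base_rel_half_width_pct / 100 / Z95
    return [
        Z95 * math.sqrt(base_sd**2 + t * annual_flow_sd**2)
        for t in range(years + 1)
    ]


# Illustrative inputs only: round base population, hypothetical flow error.
widths = drift_half_widths(56_000_000, 0.148, annual_flow_sd=30_000, years=10)
for year, half_width in zip(range(2011, 2022), widths):
    print(year, round(half_width))
```

Under these assumptions the half-width in the census year equals the census-base interval and grows with each additional year of accumulated flow error, which is the pattern referred to as intercensal drift.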
9. Population stocks used as inputs to the DPM also include the Statistical Population Dataset (SPD), referred to in the most recent ONS documentation as version v4.1 (ONS 18/12/2023b; various previous versions have been used in the preceding publications by the ONS). As mentioned earlier, the SPD was previously referred to as ABPE but was rebranded to reflect its status as a DPM input (ONS 27/06/2023a). The SPD has the ambitious goal of providing an approximate measure of the resident population derived from a variety of administrative data sources. Since the SPD is calculated separately for each year, the risk of the drift observed in the MYEs should be lower. The SPD is one of the key inputs to the DPM. The data sources used to create the SPD are well documented (ONS 03/03/2023). The comparison of the SPD (v4.0) for 2021 with the 2021 Census estimates showed that, while the age-sex and area (LA) profiles are generally similar, there are some considerable differences that will require further research (ONS 28/02/2023a).
10. The “backbone” of the SPD is a Demographic Index (DI), which is a dataset containing linked data on individuals in administrative sources (ONS 2022). The DI records are then included in the SPD if they meet set inclusion criteria based on the activity of individuals recorded in the data (through interacting with the administrative systems). However, it is acknowledged that the linkage process for the DI creation and the filtering of the SPD may overlap (ONS 2022), which may introduce error into the data. Furthermore, an exercise linking the DI records with the Census Coverage Survey (CCS) showed that only a small proportion of the CCS respondents (less than 1%) were not linked to the DI (ONS 01/03/2023). However, the analysis of the linked data showed that young males, those in London, and those not born in the UK or not speaking English as their first language were more likely to be missing from the DI. As acknowledged by the ONS (2023a), more work is needed to improve the linkage methods and to better understand the quality of the linkage that underlies the DI and its potential impact on the biases and uncertainty that can be propagated into the SPD and, subsequently, the DPM.
11. The main problem with the SPD is overcoverage (e.g., double counting of individuals in the admin data or the inclusion of individuals who are not usual residents but appear active in the admin data), which seems to be more difficult to handle in estimation processes than undercoverage (i.e., not including some of the usual residents in the data; see Law et al. 2022, 2023). Various strategies and methods have been proposed to improve the quality of the SPD in terms of coverage. The adjustment for over- and undercoverage takes place in the creation of the SPDs (Law et al. 2023) and is also implemented in the DPM via model parameters (Elliott & Blackwell 2023). This is one of the key aspects of the model, as I discuss later (Point 25).
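To make the two coverage errors defined above concrete, the sketch below relates an SPD count to the number of usual residents it implies. It is a simple accounting identity rather than the adjustment used by the ONS or the DPM: records that are correctly in the SPD equal both the SPD count net of overcoverage and the resident population net of undercoverage. The function name, the two groups and all counts and rates are hypothetical.

```python
def implied_population(spd_count, over_rate, under_rate):
    """Usual residents implied by an SPD count.

    over_rate:  share of SPD records that are incorrectly included.
    under_rate: share of usual residents that are missing from the SPD.
    Correctly included records satisfy
        spd_count * (1 - over_rate) == population * (1 - under_rate).
    """
    return spd_count * (1 - over_rate) / (1 - under_rate)


# Hypothetical groups: (SPD count, overcoverage rate, undercoverage rate).
groups = {
    "young adults": (1000, 0.20, 0.00),
    "older adults": (800, 0.00, 0.20),
}

implied = {name: implied_population(*vals) for name, vals in groups.items()}
spd_total = sum(vals[0] for vals in groups.values())
implied_total = sum(implied.values())

for name, (spd_count, _, _) in groups.items():
    print(name, "SPD:", spd_count, "implied residents:", round(implied[name]))
print("total SPD:", spd_total, "total implied:", round(implied_total))
```

In this constructed example the totals agree even though each group is misestimated by a fifth or more, which is why a single net coverage figure can hide errors in detailed characteristics.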
12. The problem of under- and overcoverage in the SPD was analysed in an exercise linking the SPD (v4.0) for 2021 to the 2021 Census and the Census Coverage Survey (CCS; ONS 28/02/2023c). It was found that 7.3% of those in the 2021 Census and CCS were incorrectly excluded from the SPD v4.0 (i.e., undercoverage), while 8.6% of those in the SPD were incorrectly included (i.e., overcoverage). While the differences in under- and overcoverage may cancel out at aggregate levels, similarly to what has been shown by Champion’s (2024) analysis of the UPC, care needs to be taken because these two issues may affect populations with varying characteristics (age, sex) in different areas, as demonstrated by the example of Harrow LA, where differences in incorrect inclusions and exclusions across age groups were found (ONS 28/02/2023c). In another comparison of the SPD v4.0 (ONS 27/06/2023a), the difference between the coverage-adjusted SPD and the 2021 Census-based MYE was found to be nearly 4% at the national level, and even larger relative differences were found for detailed characteristics. I thus suggest that, in future developments of the DPM, coverage parameters are constructed in a way that reflects the separate issues of under- and overcoverage. This would potentially reduce bias in detailed population characteristics, especially for areas with higher population churn or young working-age populations. Further, in the current version of the DPM, the model coverage parameters are created by using the 2011 and 2021 Census-based MYEs and linear interpolation (ONS 28/02/2023a). I advise that if the SPDs are corrected by using, e.g., 2021 Census data (benchmark) and used as an input to the DPM, then the same benchmark should not be used to inform coverage parameters in the DPM. Otherwise there is a risk of the DPM over-correcting for the coverage issues; this would also violate the assumption of not using data twice within Bayesian inference (Gelman 2008, Robert & Ntzoufras 2012). This has been considered by the ONS in their scoping of a data source to be used as a coverage benchmark (Law et al. 2022: Figure 1).
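The concern about using the same benchmark twice can be illustrated with a toy conjugate normal model. This is only a sketch of the general principle and not the DPM: the prior, the benchmark value and its standard deviation are hypothetical, and `normal_update` is a helper written for this illustration.

```python
import math


def normal_update(prior_mean, prior_sd, obs, obs_sd):
    """Posterior (mean, sd) for a normal prior and one normal observation."""
    prior_prec = 1 / prior_sd**2
    obs_prec = 1 / obs_sd**2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * obs) / post_prec
    return post_mean, math.sqrt(1 / post_prec)


# Hypothetical prior belief about a coverage error and one benchmark value.
prior = (0.0, 1.0)
benchmark, benchmark_sd = 1.0, 1.0

once = normal_update(*prior, benchmark, benchmark_sd)
# Feeding the same benchmark in again treats it as new, independent evidence.
twice = normal_update(*once, benchmark, benchmark_sd)

print("benchmark used once:  mean %.3f, sd %.3f" % once)
print("benchmark used twice: mean %.3f, sd %.3f" % twice)
```

When the benchmark enters twice, the posterior moves further towards it and its standard deviation shrinks, although no new information has been added. This is the sense in which reusing the benchmark risks over-correction and overstated certainty.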
Flows
13. Data on births and deaths in England and Wales are sourced from the Civil Registration System administered by the ONS and are of very high quality compared with migration data, despite minor delays in reporting (ONS 23/02/2024, 29/12/2023). In the DPM, births and deaths are considered error-free and the only uncertainty related to them comes from the fertility and mortality rates through the population at risk in the denominator (Elliott & Blackwell 2023), which is a reasonable assumption.
14. The production of international migration data (long-term international migration, or LTIM) has been going through changes since the COVID-19 pandemic, when the main source of information, the International Passenger Survey (IPS), was suspended. The new methods rely, to a much greater extent, on administrative sources, such as Home Office Border and Immigration Data (HOBID, linked visa and travel data), the Registration of Population Interactions Database (RAPID), which builds upon National Insurance Numbers, and statistics from the Higher Education Statistics Agency (HESA). The IPS is still used to produce estimates of migration of British nationals (ONS 03/05/2024), yet new methods for using administrative data are being developed. Also, a new way of including asylum seekers in the admin-based migration statistics had to be developed.
15. Historically, international migration was subject to high uncertainty, especially for detailed characteristics such as age, sex, country of origin and the LA where migrants reside. When relying on the IPS, the estimates also referred to a definition of intended migration (i.e. whether a person arriving in, or departing from, the UK intended to stay in, or outside, the country for more than 12 months), whereas the new sources permit, in principle, estimation of actual migration. This complies with the UN definition of an international migrant*. This, however, delays the provision of statistics, as persons need to stay in the UK as usual residents for 12 months before they are recorded as migrants in the database. This limitation is overcome by developing and providing provisional migration estimates, which is a sound strategy that can satisfy stakeholders at the expense of potential corrections to the provisional estimates once the official data arrive. For this purpose, advanced and novel methods are being developed and applied by the ONS (ONS 16/04/2021).
*A long-term migrant is a person who moves to a country other than that of his or her usual residence for a period of at least a year (12 months), so that the country of destination effectively becomes his or her new country of usual residence. From the perspective of the country of departure, the person will be a long-term emigrant and from that of the country of arrival, the person will be a long-term immigrant (UNdata | glossary).
16. As mentioned in the previous paragraphs, the IPS estimates were subject to sampling error and were produced with measures of uncertainty. The theoretical foundations for the measures of uncertainty of the admin-based migration estimates (ABMEs) are still under development (ONS 01/06/2023), but the hope is that, because of the reliance on the admin data, the uncertainty of the estimates can be reduced. However, the currently provided measures demonstrate substantial uncertainty for selected flows. For instance, the 95% uncertainty interval (based on adjustments and modelling) for EU national immigration in 2022 was (112,800; 195,600), whereas for emigration it was (151,000; 270,600) (ONS 01/06/2023; Table 5). The widths of these intervals suggest that the net migration of EU nationals could be either positive or negative, if we assume there is no correlation between the two flows. Further, the uncertainty of the migration of British nationals is relatively lower, despite these estimates relying on the IPS, which would be expected to yield more uncertain estimates. Given the importance of international migration for understanding the uncertainty of the ABPEs and the impact of population components on it, it is indeed crucial to develop and provide reliable measures of uncertainty for international migration.
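The statement about the sign of EU net migration can be checked with a rough calculation from the two intervals quoted above. The sketch assumes that both intervals are symmetric normal 95% intervals and that the two flows are uncorrelated. The interval midpoints are used as stand-ins for the central estimates, which are not quoted here, so the result is indicative only and not an ONS figure.

```python
import math


def net_interval(immigration, emigration):
    """Approximate interval for net migration from two (lower, upper) intervals.

    Assumptions: both intervals are symmetric, share the same confidence
    level, and the flows are uncorrelated, so half-widths add in quadrature.
    """
    mid_in = (immigration[0] + immigration[1]) / 2
    mid_out = (emigration[0] + emigration[1]) / 2
    half_in = (immigration[1] - immigration[0]) / 2
    half_out = (emigration[1] - emigration[0]) / 2
    net_mid = mid_in - mid_out
    net_half = math.hypot(half_in, half_out)
    return net_mid - net_half, net_mid + net_half


# Intervals for EU nationals in 2022 as quoted in the text.
lower, upper = net_interval((112_800, 195_600), (151_000, 270_600))
print(round(lower), round(upper))
```

Under these assumptions the interval for net migration spans zero, which is consistent with the observation that the net flow could be either positive or negative. A positive correlation between the flows would narrow the interval and a negative one would widen it.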
17. The development of migration statistics includes a rigorous process of quality assurance (QA) at all stages of data production (ONS 25/05/2023). One of the QA aspects is comparison with different sources. To aid the process of developing new methodology, the ONS could consider comparing their migration estimates, e.g. emigration of British nationals, with mirror statistics from other countries that are considered to have high-quality migration statistics (such as Sweden and other Nordic countries; cf. De Beer et al. 2010; Dańko et al. 2024). This could help to test the use of admin data and to assess the methods for producing provisional estimates.
18. Internal migration and cross-border migration are derived from the Personal Demographics Service (PDS), which is based on registrations with a GP (ONS 23/11/2023b). The data are adjusted by using HESA data on persons who move into or leave higher education and are slow to update their health registration. The cross-border moves are further agreed with the National Records of Scotland and the Northern Ireland Statistics and Research Agency. The adjustment using the HESA data potentially removes an important bias in the internal migration data, especially for young working-age males. However, the quality of the data is of concern, as demonstrated through the linkage of the DI to the 2021 Census and CCS (ONS 01/03/2023, Figure 6). This analysis showed that the PDS has shortcomings in terms of having a matching local authority on Census-CCS records (for some age groups this matching was achieved for 75% of records). Also, the internal migration data rely on annual snapshots from the PDS with adjustments based on weekly updates. When coupled with the fact that not everyone registers with a new GP when moving, this may lead to uncertainty in internal migration, especially when movements are unstable, such as during the COVID-19 pandemic restrictions.