Ed Humpherson to RSS: Review of statistical models used for grading 2020 exam results

Dear Professor Ashby and Ms Witherspoon,

REVIEW OF STATISTICAL MODELS USED FOR GRADING 2020 EXAM RESULTS

Thank you both for your letter of 14 August highlighting concerns about the statistical models used to determine exam grades across the UK this year. You have requested that the Office for Statistics Regulation (OSR) conduct a review of the statistical models put in place by the qualifications regulators in the UK, and of the processes by which they were put in place.

Despite the changes in approach to awarding exam grades announced since your letter, we consider there is still value in a review. OSR therefore plans to undertake a review focused on the process of developing the statistical models. Our review will consider the extent to which the organisations developing the models complied with the principles set out in the Code of Practice for Statistics.

There are many areas of the approach to awarding exam grades this year that may warrant review and it is likely that other organisations will commission or carry out reviews. We are conscious that too many reviews could be unhelpful and will seek to minimise overlap between our review and others. We will try to minimise the burden of our review on organisations involved in awarding exam grades and will contribute our findings to other relevant reviews where appropriate.

Our review will seek to highlight learning from the challenges faced in these unprecedented circumstances. We will not review the implications of the model for individual results, or take any view on the most appropriate way to award exam grades in the absence of exams.

We plan to publish the findings from our review in September.

Yours sincerely,

 

Ed Humpherson

Director General for Regulation

 


Professor Deborah Ashby OBE and Sharon Witherspoon MBE, RSS to Ed Humpherson

Dear Ed,

We are writing as President and as Vice-President (Education and Statistical Literacy) of the Royal Statistical Society (RSS) to ask formally that the Office for Statistics Regulation (OSR) conduct a measured review of the statistical models that qualifications regulators across the UK put in place to make a statistical adjustment of 2020 predicted exam results, in the absence of national examinations due to Covid-19, and of the process by which they did so.

The RSS has of course seen your statement of 12 August that OSR would not undertake a review “of the implications of the model for individual results”. We want to be clear that the RSS is not seeking that. Given where we are in all four nations of the United Kingdom, and the tight timetable for appeals and so on, we do not believe that would serve either individuals’ interests or the wider public good. The issues of individual grades, appeals and admissions to higher and further education will now be dealt with in other ways.

But the RSS does believe it is essential that a formal review is carried out to address two issues which we have raised since our earliest engagement with Ofqual; we believe our concerns about these have now been vindicated. Before we summarise them, it may be appropriate for us to say why we believe that OSR does indeed have a responsibility to consider this matter.

We believe that a review is essential to address the issue of the extent to which the qualifications regulators did indeed adhere to their obligation to serve the public good. As you have acknowledged, they have a duty to act in accordance with the principles set out in the Code of Practice for Statistics, to achieve ‘trustworthiness, quality and value’. We ask in particular whether the models and processes adopted by the qualification regulators did in fact achieve quality and trustworthiness. This is linked to the two issues the RSS has raised since April.

Quality

From the outset, the RSS has expressed sympathy with the challenges of establishing an effective grade-estimation procedure, not least in the time-scale required. We raised various technical questions that we believed any such estimation process would have to consider, given the data available, in order to pass any quality threshold. These can be summarised as:

• The likelihood of systematic upward bias (in a statistical sense) in teacher-assessed grades

• The uncertainty that was likely to attach to rankings, especially for middle-ranked students

• The variability in performance by ‘exam centres’ (primarily schools and colleges) and whether these were stable enough to bear the statistical weight put upon them.

These are of course complex issues, and we had no privileged access to data when we raised them. We were clear from the outset that we could understand one concern: that simply awarding teacher-assessed grades would be biased upwards and might be unfair not only to different year cohorts of students but also to those (including higher and further education providers and employers) who would use grades in their decision-making. But we noted the importance of considering uncertainty in rankings, especially for those not at the top or bottom of exam centre rankings, and of considering whether variability in exam centres’ grade distributions (not just their averages or medians but the shape of the distribution) might be a problem, and might also differ between types of school or college. This might, in turn, suggest a variety of possible uses of individual-level data about prior achievements, not just to assess exam-centre performance but to inform statistical adjustments in a context of uncertainty.
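To illustrate the concern about ranking uncertainty, the short simulation below is our own illustrative sketch, not any regulator’s method: when noisy assessments are converted into a rank order, students near the middle of the ranking typically end up with larger rank errors than those near the top or bottom. The sample size, noise level and random seed are arbitrary assumptions chosen purely for illustration.

```python
# Illustrative sketch only (not any regulator's method): with noisy
# assessments, rank errors are typically largest for middle-ranked students.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_draws, noise_sd = 30, 2000, 0.5   # arbitrary illustrative values

true_ability = np.sort(rng.normal(size=n_students))  # student i has true rank i
mean_rank_error = np.zeros(n_students)

for _ in range(n_draws):
    observed = true_ability + rng.normal(scale=noise_sd, size=n_students)
    observed_rank = observed.argsort().argsort()      # 0 = lowest observed score
    mean_rank_error += np.abs(observed_rank - np.arange(n_students))

mean_rank_error /= n_draws
for i in (0, n_students // 2, n_students - 1):
    print(f"true rank {i:2d}: mean absolute rank error {mean_rank_error[i]:.2f}")
```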

We should note that we also understood why the qualification bodies could not produce a detailed algorithm before exam centres submitted their data (teacher-assessed grades and rankings). This is so not only for the reasons that you note in your statement of 12 August, but also because there would not have been the empirical data to hand to test and compare various possible models of statistical adjustment.

However, even after our initial examination of the algorithm used by Ofqual (now published), it is not clear to us that these issues were taken into account, or that they could not or should not have been. Taking them into account might have resulted in statistically-adjusted grades that gave more weight to individual students’ performance and allowed more clearly for a degree of uncertainty. This might have been at the expense of a somewhat larger uplift relative to historic trends in exam performance but, as we discuss below, it bears directly on the issue of ‘trustworthiness’ and the relative lack of transparency in the qualifications regulators’ approach.

Trustworthiness

One issue underpinning trustworthiness of statistics is their quality and accuracy, which is why we have summarised some of our technical concerns. But another element in trustworthiness is the transparency with which the statistics have been set out and considered, and the extent to which they meet public need. On this ground too, we have concerns about the approach of the qualification regulators, which we have expressed increasingly clearly throughout the months since the decision was taken to cancel exams.

First, your statement of 12 August mentions Ofqual’s technical advisory group. The RSS welcomed the formation of such a group (though its existence and membership were announced only after the initial, and main, Ofqual consultation). We did, however, have concerns that there were not enough independent external members (those who were neither government employees nor current or former employees of the qualification regulators). In a letter to Ofqual (and subsequent emails) we suggested that the RSS could nominate two distinguished Fellows with relevant statistical expertise. We eventually heard from Ofqual that they could consider these two Fellows, but only under a non-disclosure agreement that gave us real concern. We understood that members of such a group should not give a running commentary in any way, nor divulge any confidential information about exam centres, schools, or the different models being tested – and we wrote back clearly to Ofqual to this effect. But the proposed confidentiality agreement would, on our reading, have precluded these Fellows (who were suggested precisely because of their relevant statistical expertise and lack of ties to qualification regulators or exam-awarding bodies) from commenting in any way on the final choice of model for some years after this year’s results were released. We set out our concerns about the terms of the proposed non-disclosure agreement and restated our willingness to help if a more suitable agreement could be reached. In the end, we did not receive an official response to those questions, and our offer to help was not taken up.

We believe this calls into question one element of the transparency of the process adopted by the qualifications regulators. We would note, too, that we are not alone in this. It was only after we failed to hear back from Ofqual that we prepared our submission to the House of Commons Education Select Committee. Again, we restated our view that these were complex issues, that difficult judgements would have to be made, that we had offered to help, and that some degree of transparency about the trade-offs and judgements made in the selection of the final model would be essential to public confidence in, and the ‘trustworthiness’ of, the final statistical model chosen for grade adjustment. In its report, the Select Committee cited our evidence and itself called for greater transparency.

In the end, the only information about the statistical adjustment that was released before Scottish exam results were announced was a general, verbal description of the model Ofqual proposed to use. There were no statistical details and no clear discussion of the trade-offs or judgements involved. The citation of evidence was, as Guy Nason has shown, thin and, in our view, inadequate. There was no real clarity that the proposed statistical adjustment model privileged keeping within a percentage point or two of prior national grade distributions; that it treated the rankings of individual students as sacrosanct, with no measurement error; or that it relied on individual students’ prior achievements only as part of judging the historical results achieved by particular exam centres (results themselves treated as relatively fixed for most types of centre), rather than as information for individuals’ statistically-adjusted grades. The information did not set out the planned approach with what we believe would be the minimum requirements for real ‘transparency’: the proposed statistical approach and the options considered; the evidence backing that up, including about uncertainty and variability; and a clear justification of the admittedly difficult choices and trade-offs that would have to be made.
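To make concrete what treating rankings as sacrosanct against a fixed centre-level grade distribution implies, the sketch below is a deliberately simplified caricature of our own, not Ofqual’s published algorithm; the centre size, grade shares and student identifiers are invented for illustration. Its point is simply that, under such a scheme, any error in the submitted ranking maps directly into a grade, with no allowance for uncertainty about individual students.

```python
# Deliberately simplified caricature, NOT Ofqual's published algorithm:
# grades follow solely from the submitted rank order and a fixed historical
# grade distribution for the centre, so any error in the ranking maps
# directly into a grade with no allowance for uncertainty.
import numpy as np

def allocate_grades(ranking, historical_shares, grades=("A", "B", "C", "D", "U")):
    """ranking: student ids, best first; historical_shares: grade proportions."""
    n = len(ranking)
    counts = np.round(np.asarray(historical_shares) * n).astype(int)
    counts[-1] = n - counts[:-1].sum()    # force the counts to sum to n
    allocation, position = {}, 0
    for grade, count in zip(grades, counts):
        for student in ranking[position:position + count]:
            allocation[student] = grade
        position += count
    return allocation

# Hypothetical centre of ten students with invented historical grade shares.
print(allocate_grades([f"s{i}" for i in range(1, 11)], [0.2, 0.3, 0.3, 0.1, 0.1]))
```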

At that stage, we do not believe that any ‘gaming’ of the system could have occurred. We are aware of the time pressure under which the qualifications regulators were operating. But we believe that they could and should have set out alternative models, with a clearer indication of their advantages and disadvantages and, more importantly, the judgements that underpinned their choice, for wider discussion.

That we were not alone in our concerns about the information released by Ofqual at that time is supported by the large volume of queries the RSS fielded, mainly from specialist education journalists, about what the model would actually be, even before the statistically-adjusted exam results were released in Scotland.

We would stress that none of these observations are a product of hindsight on the part of the RSS – we have been consistent in setting out our statistical concerns and our observations about the need for more transparency.

It may be helpful to end with a statement about why the principle of transparency matters in underpinning the trustworthiness of statistics.

The use of statistics for the public good is based only partly on technical statistical issues. Some statistics are technically bad, wrong or worse than others because of the way that data are gathered, or the statistical modelling that takes place. But in many cases, statistics or statistical models are inadequate for the weight being put on them in decision-making, or embed various other judgements that need to be made clear. In this case, there are issues about how much ‘grade inflation’ to allow, and how to be ‘fair’ to individual students whose rankings may be uncertain or who are in exam centres whose performance may be less fixed over time than the modelling seems to rely upon. So while we continue to have concerns about various technical decisions made by the qualification regulators, we also believe that a more open discussion, well before individual results were announced, would have resulted in more trust in, and more trustworthy, statistical choices, in part because there would have been greater understanding of the underlying principles being applied and more detailed justification of them. That is particularly important given that judgements about what is ‘fair’ have featured so widely in Ofqual’s statements and in other commentary.

‘Fairness’ is not, of course, a statistical concept. Different and reasonable people will have different judgements about what is ‘fair’, both in general and on this particular issue. But real transparency would have enabled a deeper, earlier and better public discussion of the technical issues we have raised, and would have allowed that discussion to be divorced from the ‘strong interest’ you mention in your statement of 12 August. This is not because those interests are wrong, or unimportant. But a statistical procedure should be capable of being judged ‘fair’ or ‘reasonable’ in advance of its being used, and before it is known which individuals may be affected. This is one reason the RSS puts so much weight on transparency in its Data Manifesto. We do not believe that the development of the statistical adjustment methodology has been transparent enough to meet our concerns about statistical quality or the need for greater involvement of knowledgeable external experts. We are sure that it has not been sufficiently transparent to meet the aim of being trustworthy in the broader sense.

These issues would be worthy of consideration even if we could be sure that the cancellation of exams were a one-off occurrence. But none of us can be certain that the UK will not face similar issues in future. We can, however, be sure that the broader question of transparency in the use of algorithms by public bodies, and its importance to the quality and trustworthiness of statistics, will recur in other areas. An OSR review would seem essential both to address the questions that have arisen this year and to set a benchmark to ensure they do not happen in the future – in this domain or in others.

We look forward to your reply.

Yours sincerely,

Professor Deborah Ashby OBE FMedSci
President of the Royal Statistical Society

Sharon Witherspoon MBE FAcSS
Vice-President of the Royal Statistical Society, Education and Statistical Literacy

 


Update – Concerns regarding the Teaching Excellence and Student Outcomes Framework (TEF)

Dear Professor Ashby and Professor Nason,

TEF holding reply to Professor Deborah Ashby and Professor Guy Nason – RSS

Thank you for your letter of 5 March 2019. I appreciate you taking the time to share your concerns about the Teaching Excellence and Student Outcomes Framework (TEF).

I have asked my team to look into the matters that you have raised. I know that Mark Pont from my team has already been in touch with Professor Nason to hear more about the issues. We will also speak with Dame Shirley Pearce, as well as with the ONS methodologists who have been commissioned by the Department for Education to examine the methodology for the Independent Review.

I will write again with my conclusions but, in the meantime, please do feel free to contact us if you have any queries.

Yours sincerely,

Ed Humpherson,
Director General for Regulation

 


Concerns regarding the Teaching Excellence and Student Outcomes Framework (TEF)

Dear Ed,

We are writing to you on behalf of the Royal Statistical Society (RSS) to express our serious concerns about the Teaching Excellence and Student Outcomes Framework (TEF) produced by the Department for Education/Office for Students (DfE/OfS). The TEF is in large part a statistical artefact, and we are concerned that it does not meet the standards of trustworthiness, quality and value that the public might expect. Indeed, the statistical issues are so major that, in our view, the TEF is likely to mislead the public and, in particular, mislead students who use TEF to inform their university choices.

The RSS has written to you before about the TEF and, in doing so, enclosed the key points from our previous consultation responses to the provider-level and subject-level TEF exercises. We are not confident that all of our statistical points have been adequately addressed in relation to the TEF.

We are grateful for your efforts to foster communication between stakeholders, such as the RSS, DfE and OfS. Furthermore, we have welcomed the opportunity to participate in “The Independent Review of the TEF” that is being led by Dame Shirley Pearce. We duly met with Dame Shirley and her team in London in January and explained our concerns. The attached document is our submission to the Independent Review and highlights some of the key problems with the TEF. As you will see, we believe there are several areas where the TEF either does not adhere to, or transgresses, the UK Statistics Authority’s Code of Practice, and we have explicitly referred to these in the document we are sending to the Independent Review.

We would particularly draw your attention to two key issues.

1. Transparency and Reproducibility (Section D)

As far as we can discover, there is no complete, transparent description of how the TEF awards are made – especially of the process by which statistical information and flags are provided to the TEF panels. Partial descriptions exist, together with some spreadsheets. However, several of the assumptions in the TEF ‘recipe’ that we can see cannot be properly evaluated because of this lack of transparency. If the TEF is to continue, we would argue that it must be made fully transparent and easy to check. The whole analysis pipeline should be published in specific detail, making fully clear the methods and software that were used, along with as much of the data as can be revealed and a proper, detailed explanation of how it all works. If there are reasons why some lower-level data should not be published, then these should be set out and clearly explained.

Our belief is that full transparency would help all concerned. If, once transparent, everything adheres to best practice and can be validated, this could gradually help establish the trustworthiness of the TEF.

2. The multiple hypothesis testing problem (Section C(vi))

The TEF process produces flags, the collection of which is used to inform the process of determining the final TEF award. The flags are produced by assessing the size of Z-scores and comparing them to a ‘standard’ critical value. However, the TEF computes a large number of Z-scores, which is equivalent to conducting a multiple hypothesis test. In such cases, the Z-scores should not be compared to a ‘standard’ critical value but, typically, to one that is much larger (there are various methods for setting it, such as the Bonferroni correction or false discovery rate control). Using a single-test critical value instead of a multiple-test value is a serious statistical mistake, which will result in far too many indicators being spuriously flagged. Hence, the RSS believes that all the TEF awards made so far have been based on seriously flawed inputs and that, because of this, all TEF awards made to date are invalid.
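As an illustration of the scale of the problem, the short sketch below is our own; the number of Z-scores and the significance level are invented for illustration and are not the actual TEF metrics. When several hundred null Z-scores are compared to the usual single-test critical value, roughly five per cent are flagged spuriously, whereas a Bonferroni-corrected critical value flags almost none.

```python
# Our own minimal illustration (invented numbers, not actual TEF metrics):
# with many Z-scores under the null, a single-test critical value flags
# roughly alpha * n_tests indicators spuriously; a Bonferroni-corrected
# critical value flags almost none.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n_tests = 0.05, 500
z_scores = rng.normal(size=n_tests)                     # no real effects at all

single_crit = norm.ppf(1 - alpha / 2)                   # about 1.96, correct for ONE test
bonferroni_crit = norm.ppf(1 - alpha / (2 * n_tests))   # much larger critical value

print("spurious flags, single-test cut-off:", int((np.abs(z_scores) > single_crit).sum()))
print("spurious flags, Bonferroni cut-off: ", int((np.abs(z_scores) > bonferroni_crit).sum()))
```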

It is hard to discern the exact situation due to the transparency problems mentioned above. However, during our listening session, as part of Dame Shirley’s Independent Review, we questioned members from the government department on this point and they confirmed that they were not using methods that appropriately controlled the size of the multiple hypothesis test.

Based on the above, and our written submission to Dame Shirley’s review, we would ask the Office for Statistics Regulation to consider the validity of the TEF, and to rule on whether TEF does actually provide the public with information which is trustworthy, of high quality and value.

Yours sincerely,

Deborah Ashby,
President, RSS

Guy Nason,
Vice President, RSS

Annex:  RSS Evidence to the TEF consultation, February 2019

 
