Guidance for Models: Trustworthiness, Quality and Value

Part 2: Developing and using a model that serves the public good

When you have successfully planned your approach and are confident in your model’s aims you should consider how best to develop and use that model to serve the public good.

While developing your model, collaboration and communication are key to demonstrating Trustworthiness, Quality and Value and you should consider these themes when thinking about the following questions:

Is there data of suitable quality?

Data are an essential part of any model, and before model building the focus should be on the quality of the underlying data and transparency around it. Getting data to a suitable quality and format is often the most time-consuming step in model development but it shouldn’t be rushed.

Models are only as good as the data that are fed into them, and in the case of machine learning the model builds itself almost entirely on those data: where there is bias in the data the model will learn it, and where data are missing the model will assume the gap has meaning (rather than treating it as simply missing). Assuring the quality of the data is therefore vital in any case, but particularly so for these types of models.

Transparency around quality is just as important as assuring it. All parties need to be aware of how the data will be used and the level of quality needed, and where there are limitations, inequalities or biases these should be made clear. This transparency helps users to make informed decisions about the data’s use.

Checklist

It is known where the data came from, how it was collected and any limitations it has.
Data suppliers and operational staff know how the data will be used and the level of quality required.
Methods used to clean data have been made clear.
Any limitations and inequalities that exist in the data are clearly communicated and their implications discussed.
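
As an illustration of how limitations in the data might be surfaced before model building, the following is a minimal sketch that reports the share of missing values per field and each group's share of the records. The field names, records and function names are hypothetical and chosen only for the example; real checks would depend on the data and context.

```python
from collections import Counter


def missingness_report(records, fields):
    """Return the share of records with a missing (None or empty) value per field."""
    total = len(records)
    report = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        report[field] = missing / total if total else 0.0
    return report


def group_shares(records, group_field):
    """Return each group's share of the records, for comparison with known population shares."""
    counts = Counter(r.get(group_field) for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Hypothetical example records (not real data)
records = [
    {"region": "North", "income": 21000},
    {"region": "North", "income": None},
    {"region": "South", "income": 30500},
    {"region": "South", "income": 28000},
]

missing = missingness_report(records, ["region", "income"])
shares = group_shares(records, "region")
print(missing)
print(shares)
```

A report like this does not assure quality on its own, but it gives data suppliers and users something concrete to discuss when documenting limitations.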

What is the right type of model?

Choosing a suitable model is also important as different models are appropriate depending on the nature of the data, the type of problem and level of explainability needed. All these things should be considered and the thought processes behind the choice communicated clearly for those who wish to understand the rationale.

To find technical guidance on what model is applicable for your data and problem please consult ‘A guide to using AI in the public sector’ particularly the section named ‘Assessing if artificial intelligence is the right solution’.

Public acceptability is related to how explainable your model is and the level of explainability should be appropriate for the context and purpose for which it will be used. Ask yourself the following questions:

Will there be an expectation that users will want to know how the decision was reached or statistic produced?
Will it impact decisions made about them?
Will the outcome of the model provide such a public good that there is public acceptance that the model cannot be explained?

The last point refers to the fact that some of the best performing models can also be the least explainable, so it may be relevant to discuss with users whether performance or explainability is the higher priority.

What is the difference between explainability and interpretability?

Explainability is being able to work through the model at every stage and understand how it has come to deliver its output. Interpretability focuses more on the assurances that can be made that the model does what you think it does.

Full explainability of a model is easier for traditional statistical models such as a linear regression but often more difficult when you have a model that is building itself based on the data (e.g. machine learning models). Machine learning techniques identify relationships and patterns in data at scale that are not easily detectable by humans, which can make it difficult to describe how a machine learning model has reached a decision. With that said, you should always be able to communicate assumptions built into the model, any known biases and the uncertainty inherent in its outputs.

For some models, the sophistication of the model’s learned behaviour means that it may be impossible for you to understand how an outcome has been reached. These models are also known as ‘black boxes’ or opaque models. Black box models do not allow for transparency of model decisions. Without appropriate steps being taken to ensure they are interpretable, black box models could damage the trustworthiness of the statistics produced or decisions made, and you should be sure that a simpler, more explainable model could not have been used instead.

If a model is deemed to be difficult to explain, then it should be interpretable to meet the transparency requirements of TQV. This means having stringent quality assurance processes that can satisfy those who are accountable for the model that even though it can’t be fully explained it is still behaving in an expected way given its inputs. Some examples of these quality assurance processes can be found in ‘How will model quality and performance be measured?’
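
One possible way to evidence that a model is behaving in an expected way given its inputs is to vary one input at a time and check that the output moves in the expected direction. The sketch below is illustrative only: `toy_energy_rating` is a hypothetical stand-in for a trained model, and the feature names and expected directions are assumptions made for the example.

```python
def check_direction(model, baseline, feature, values, expect="increasing"):
    """Vary one input (values given in ascending order) while holding the others fixed.

    Returns the (value, output) pairs at which the output moved against the
    expected direction; an empty list means no violation was found.
    """
    outputs = []
    for v in values:
        row = dict(baseline)
        row[feature] = v
        outputs.append((v, model(row)))

    violations = []
    for (_, y0), (v1, y1) in zip(outputs, outputs[1:]):
        if expect == "increasing" and y1 < y0:
            violations.append((v1, y1))
        if expect == "decreasing" and y1 > y0:
            violations.append((v1, y1))
    return violations


def toy_energy_rating(home):
    """Hypothetical stand-in for a trained model: rating between 0 and 100."""
    score = 50 + 0.2 * home["insulation_mm"] - 0.5 * home["age_years"]
    return max(0.0, min(100.0, score))


baseline = {"insulation_mm": 100, "age_years": 30}

# Expectation (an assumption for this example): older homes should not rate higher,
# and better insulated homes should not rate lower.
age_violations = check_direction(
    toy_energy_rating, baseline, "age_years", [0, 10, 20, 40, 80], expect="decreasing"
)
insulation_violations = check_direction(
    toy_energy_rating, baseline, "insulation_mm", [0, 50, 100, 200], expect="increasing"
)
print(age_violations, insulation_violations)
```

Checks like this do not explain how the model reaches its output, but any violations they find give those accountable for the model specific cases to investigate.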

Checklist

Reasons for model selection have been made clear.
Any limitations and inequalities that exist in the model design are clearly communicated and their implications discussed.
The users’ needs for model explainability are known.
It is known how the model is reaching its outcome or decision, and the result is reproducible.

If the model cannot be fully explained, the model is interpretable and fully quality assured (see ‘How will model quality and performance be measured?’).
It is clear how changes to the inputs affect the outputs.
It is clear to users that the model cannot be fully explained and reasons have been given as to why.

Pillars in Practice: Explainability vs Interpretability

Let’s say you have two hypothetical models in development:

Model 1: Uses data about household energy use to give each household a rating of energy efficiency which will provide them with advice on how to increase it.

Model 2: Uses personal health data to give a person a health rating which will affect their access to some health services.

For Model 1, you speak to the households affected. Some ask how the results will be used, and when you explain that they will only be used for giving advice they seem happy to accept the model without explanation. However, your senior manager still needs to know the model is fit for purpose and isn’t going to affect public trust in your organisation (a brand new home should not be expecting a low rating!). You decide to use a machine learning model but ensure that one of the ways you test the model is by changing the inputs many times and checking the output is as expected (interpretability).

For Model 2, you speak to members of the public and find out very quickly that this is a sensitive topic. You receive lots of questions around what data are involved and how the model will come to its conclusion. You know your model will need to be fully explainable to all types of audiences in addition to all quality assurance processes to gain public trust.

Are there opportunities for collaboration?

Collaboration can take many forms such as asking advice from domain and modelling experts to approaching teams who may have built something similar before.

You should collaborate with experts in both the type of model being used or developed and the subject matter which the data concern. You could do this by setting up a steering group of subject matter experts or running a workshop to gain feedback on specific aspects of your model’s design. This is to ensure any new insights drawn from the model are aligned with the experts’ understanding and the right type of model is deployed for the type of problem. It should also help you identify potential errors or bias that might already be built into a model’s design while also offering opportunity for external and independent challenge.

Exposing your model in this way could also offer the opportunity to re-use or re-purpose models that already exist. There is lots of work being done on modelling and data access across government and beyond that will enable easier data and method sharing. It is possible someone else has already provided a solution to your problem or modelled something similar. The Central Digital and Data Office (CDDO) has recently launched the Transparency Data Standard, which collects information on how government uses algorithmic tools. The Integrated Data Service (IDS), launching at the end of 2022, aims to streamline data access and linkage for researchers. There are also government data science forums which can be used to get advice and track current projects.

Checklist

Methods have been exposed to a wide professional audience to ensure appropriateness and opportunity for independent challenge.
Similar models have been sought and/or considered to avoid duplication of effort.

Is the model clear and accessible?

This can be split into two main considerations:

Can those not involved in model development understand how it works?

Effort should be made to communicate with users in a way that is meaningful to them.

Explanatory material should exist and make clear why the model was designed, for what purpose it should be used and for what purpose it should not be used. It should also make clear the involvement of expert steering groups or relevant stakeholders in the development of the model. A full technical explanation, targeted at a more expert group, should always be produced for those who wish to understand the technical detail. However, technical documents may not be appropriate for all users and therefore should not be the only method of communication. Consider making use of analogies and visual aids to communicate the model process while still providing access to the essential information. You should also ensure any code is fully annotated so a user can follow what it is doing. Best practice in this area is to use the principles of Reproducible Analytical Pipelines (RAP), which increase reproducibility and auditability and so help build trust in outputs.

When using a model in place of another process, communicate the strengths and limitations of the usual process so that all audiences can understand the usual level of uncertainty. Please consult the Aqua Book chapter 5 ‘The importance and implications of uncertainty’ for more information.

It is always good practice to make explanatory material usable with as many assistive technologies as possible.

Checklist

The documentation is sufficient to allow all types of users to understand the model and statistics or data produced.

Has access to all been considered?

Equality of access means access that is available to all. With regard to models, it can be considered under four categories:

Data used in the model
Outputs generated by a model
The code used to build the model
Supporting documentation

The default should be that all the above are made available to all users. However, we understand that the appropriateness of this can vary greatly depending on the context. If any of the above is deemed not appropriate for open and equal access, this should be made clear and an explanation given.

Open access platforms such as GitHub are recommended to make your model available and open to feedback and improvement. Platforms such as this make your model findable, accessible, inter-operable, and re-usable and give the option of public access to increase transparency of your approach.

You should also provide contact information for users who need to raise concerns or ask questions regarding accessibility.

Checklist

The model code, data and documentation have been made available and accessible to all (where appropriate).
If the model code and/or outputs cannot be made available and accessible to all, an explanation has been given as to why.

How will model quality and performance be measured?

Quality assurances provide evidence that the model is fit for purpose and generating outputs that can be trusted but this doesn’t mean that they will be. It may be tempting to think that once the technical processes required to quality assure a model (e.g. producing a performance metric) are completed the model will automatically be trusted by its users but technical assurances alone ignore the social elements. It is the combination of the technical and social elements that help ensure trust.

What this means in practice is that model performance and testing is carried out in a way that takes full account of its intended use and impact. For models that have a direct impact on individuals or groups this means testing the model for those groups and individuals specifically and checking whether any bias exists towards groups. The testing data should be previously unseen by the model and the quality criteria used to determine performance should be clear and chosen to reflect the problem at hand.
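
As a minimal sketch of testing performance for specific groups, the example below computes a simple quality criterion (accuracy) separately for each group on held-out data and reports the largest gap between groups. The labels, predictions and group names are hypothetical, and accuracy is used only as an example; the criterion should be chosen to reflect the problem at hand.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy per group, computed on data the model has not seen before."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def largest_gap(scores):
    """Return the difference between the best and worst performing groups."""
    return max(scores.values()) - min(scores.values())


# Hypothetical held-out labels, model predictions and group membership
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

scores = accuracy_by_group(y_true, y_pred, groups)
gap = largest_gap(scores)
print(scores, gap)
```

A large gap does not by itself show that the model is biased, but it identifies where further investigation and human scrutiny are needed.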

With that said, the social impacts of a model can only really be considered by humans. Human scrutiny is important, and it is advised that you take a diverse sub-set of individual outputs and manually check the route the model has taken to its output. There should be provision built-in for outputs/decisions made by the model to be challenged and time should be given to this process before a model’s output is finalised. This final process builds trust by promising exposure of the model and allowing time for corrections and/or explanation before any significant impact is realised.
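
One way to draw a diverse sub-set of outputs for manual checking is to sample a fixed number from each category of interest rather than sampling at random from all outputs. The sketch below assumes each output is a record carrying a field to stratify on; the field name `band` and the records are hypothetical.

```python
import random
from collections import defaultdict


def sample_for_review(outputs, strata_key, per_stratum, seed=0):
    """Draw up to `per_stratum` outputs from each stratum for manual review.

    A fixed seed makes the selection reproducible so it can be audited later.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for output in outputs:
        by_stratum[output[strata_key]].append(output)

    sample = []
    for stratum in sorted(by_stratum):
        items = by_stratum[stratum]
        k = min(per_stratum, len(items))  # small strata contribute all their items
        sample.extend(rng.sample(items, k))
    return sample


# Hypothetical model outputs: six in the "low" band and two in the "high" band
outputs = [{"id": i, "band": "low" if i < 6 else "high"} for i in range(8)]

review = sample_for_review(outputs, "band", per_stratum=3)
print([(o["id"], o["band"]) for o in review])
```

Sampling per stratum means that rarer kinds of output, which a purely random sample might miss, are still seen by a human reviewer.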

Please also consult the Aqua Book chapter 6 ‘Verification and validation’ for further information.

Checklist

Quality criteria used are clear and suitable to test model performance.
The quality assurance process has been fully explained (including data used).
The model has been assessed against the groupings that will be affected by the output and those of interest to the public.
There is sufficient human oversight to consider the risks and impacts of the outputs from a social perspective.
It is clear how the output from the model can be challenged and time is allowed for this process to happen.

Pillars in Practice: Making time to test and challenge

In models which feed into or make decisions relating to many entities it is very difficult to test every outcome and even harder to foresee every impact. Take the following hypothetical scenarios:

A model which feeds into a UK wide statistical output (e.g. inflation) and that output feeds into 4 other statistical outputs.
A model that gives a score for a child’s reading age in England. This score affects a child’s class placement which impacts educational outcome.

For both scenarios, even if quality assurance was thorough, the impact on every user of the statistics and on every child will only be fully apparent after roll-out. This is because 1) there are too many output entities to test and 2) there are too many interdependencies to test. It is recommended that in both scenarios a trial phase is implemented to allow the impacts on every user to be fully considered and any challenge to the model to be heard.
