Adapting model governance is the key to robust machine learning implementation in credit decisions
Authors: Mark Somers, Fabrizio Russo, Torgunn Ringsjø
Artificial Intelligence, and within it the subset of techniques commonly referred to as Machine Learning (ML), is being explored and adopted across industries at an increasing rate. Regulators are therefore forming their views on the implications of a change that many consider significant, as reflected recently in the speech given by James Proudman, Executive Director of UK Deposit Takers Supervision, at the FCA Conference on Governance in Banking on 4 June 2019. The trend is global, as the Algorithmic Accountability Act (a bill presented in the US Senate in April) confirms. Whilst it is healthy to worry about whether models are driving informed and fair decisions, the risks brought about by ML are not entirely new. That said, model risk may become more material and more difficult to fix as automated decisions become increasingly widespread and the underlying data collection more pervasive and unstructured.
The risks of biased models are real, but their causes are often misunderstood. In the example that James Proudman provided, the firm’s model discriminated against female applicants. They were a minority in the development sample and, presumably, the available legitimate controlling characteristics (qualifications, aptitudes, etc.) could not account for this. Instead, the model picked up on two features: the word “women’s” in the CV being analysed, and attendance at an all-women college. The model inferred, without human supervision, that these characteristics made candidates less suitable for roles. This is intuitively unfair and, perhaps more importantly, using a proscribed sensitive data item such as gender is illegal in many jurisdictions.
This example neatly illustrates the difference between risks arising from the use of more data and those driven by poor inference of the modelled relationships. The first of the two features listed above can be seen as a feature “words in CV = women”. Traditional models typically cannot use such granular data, so they could not have discriminated on that variable. The problem is a consequence of applying advanced techniques (Natural Language Processing in this case), but it remains fundamentally a data issue: at a minimum, proscribed information (or elements that link directly to proscribed information) should have been removed prior to modelling. Achieving this effectively in unstructured and granular data remains a challenge, which highlights the need to devise more robust controls over proscribed data exclusions when using large unstructured datasets.
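As a rough illustration of a proscribed-data control on free text, the sketch below masks gender-revealing words before a CV reaches any model and records what was found for audit. The term list, the function names and the mask string are hypothetical choices for this example, not something taken from the case Proudman described. A word list of this kind only catches direct references; it would not catch indirect links such as the name of an all-women college.

```python
import re

# Hypothetical and deliberately incomplete list of gender-revealing terms.
PROSCRIBED_TERMS = ["women's", "womens", "women", "men's", "mens", "female", "male"]


def build_pattern(terms):
    # Put longer terms first so that "women's" is matched before "women".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Word boundaries stop "male" matching inside "female" or "men" inside "women".
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def redact(text, terms=PROSCRIBED_TERMS, mask="[REDACTED]"):
    """Return the masked text and the list of proscribed terms that were found."""
    # Normalise typographic apostrophes so both apostrophe styles match.
    normalised = text.replace("\u2019", "'")
    pattern = build_pattern(terms)
    hits = [match.group(0).lower() for match in pattern.finditer(normalised)]
    return pattern.sub(mask, normalised), hits


if __name__ == "__main__":
    cleaned, found = redact("Captain of the women\u2019s chess club")
    print(cleaned)
    print(found)
```

Logging the hits alongside the masked text gives reviewers a record of how often proscribed information appears in the raw data, which is useful evidence when assessing whether the exclusion control is working.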
The second feature illustrates an issue driven not by how advanced the technique is but by the replication of unfair biases present in the development data. Using an “education institution” variable within that same dataset, but through a traditional regression model, might have attributed a lower score to the all-women schools in much the same way as the most advanced technique. Identifying these situations requires good model interpretability, suitable model governance, and human controls and reviews. Two clear challenges here are the size and the complexity of the model solutions available under the ML umbrella. On complexity, interpretability techniques and measures have been proposed and continue to be studied. At the same time, a solid understanding of the structure of the algorithms helps to suggest simple and intuitive measures that quantify the impact of particular characteristics on the modelled outcome. On size, we have found that an interactive way of exploring the relationships and effects of variables helps to facilitate model review and sign-off by stakeholders. Education through visualisation has proven to be a compelling way to engage, speed up understanding and gain buy-in.
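One measure that quantifies the impact of each characteristic on a single modelled outcome is the Shapley value from cooperative game theory. The sketch below computes it by exact enumeration for a toy scoring function. It is an illustration under stated assumptions, not the algorithm of any particular library: the scoring function, the feature names and the choice of an all-zero baseline applicant are all hypothetical, and exact enumeration is only practical for a handful of features because the number of subsets grows exponentially.

```python
from itertools import combinations
from math import factorial


def shapley_values(score, applicant, baseline):
    """Exact Shapley contributions of each feature to score(applicant) - score(baseline)."""
    features = list(applicant)
    n = len(features)

    def value(subset):
        # Features in `subset` take the applicant's values; the rest stay at baseline.
        mixed = {f: (applicant[f] if f in subset else baseline[f]) for f in features}
        return score(mixed)

    contributions = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {feature}) - value(members))
        contributions[feature] = total
    return contributions


def toy_score(x):
    # Hypothetical score with one interaction term between "a" and "c".
    return 2 * x["a"] + 3 * x["b"] + x["a"] * x["c"]


if __name__ == "__main__":
    applicant = {"a": 1, "b": 1, "c": 2}
    baseline = {"a": 0, "b": 0, "c": 0}
    print(shapley_values(toy_score, applicant, baseline))
```

The contributions always sum to the difference between the applicant's score and the baseline score, and the effect of an interaction is shared between the features involved. That consistency is what makes this kind of attribution usable as a list of "key factors" for a reviewer.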
A third issue likely to arise is the interactions formed by ML models. Even if the “education institution” and “words in CV” variables are controlled so that they do not bias predictions, the model may isolate pockets with a combination of characteristics that predominantly belongs to a protected group, essentially creating a proxy for the bias you are trying to avoid. Even if there is no evidence of this in the development sample, such behaviour might emerge post deployment as the environment changes. As a result, it is critical in ML applications to monitor outcomes regularly and to use human review of sample decisions to detect inappropriate outcomes in critical applications. Again, setting up a process and adequate tools to facilitate a swift yet informed review is key to making ML viable.
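A minimal sketch of such outcome monitoring is shown below. It compares approval rates between groups after deployment and flags any group whose rate falls well below that of a reference group, so that a sample of its decisions can be sent for human review. It assumes the protected attribute is held separately for monitoring only and is never a model input. The function names, the group labels and the 0.8 threshold are placeholders chosen for illustration, not recommended values.

```python
from collections import defaultdict


def approval_rates(decisions):
    """decisions: iterable of (group_label, approved) pairs, approved being True/False."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number approved, number decided]
    for group, approved in decisions:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {group: approved / total for group, (approved, total) in counts.items()}


def flag_groups(decisions, reference_group, min_ratio=0.8):
    """Return groups whose approval rate relative to the reference is below min_ratio."""
    rates = approval_rates(decisions)
    reference_rate = rates[reference_group]
    flagged = {}
    for group, rate in rates.items():
        if group == reference_group or reference_rate == 0:
            continue
        ratio = rate / reference_rate
        if ratio < min_ratio:
            flagged[group] = ratio
    return flagged


if __name__ == "__main__":
    # Hypothetical monitoring window: 10 decisions for each of two groups.
    window = (
        [("A", True)] * 8 + [("A", False)] * 2
        + [("B", True)] * 5 + [("B", False)] * 5
    )
    print(approval_rates(window))
    print(flag_groups(window, reference_group="A"))
```

A flag raised by a check like this is a prompt for review, not proof of unfairness: a difference in approval rates can reflect legitimate differences in risk, which is why the human review of sample decisions remains part of the process.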
Applying ML techniques to a problem that has not been modelled before, or with data that has not previously been used with traditional inference/statistical techniques, is likely to be more prone to unintended biases than applying ML to a subject like credit risk assessment, where the same data has been used for decades and model-driven decisioning is the norm. That said, care must be taken to understand the outputs and the relationships established. Awareness of the three causes of bias is a good starting point: unstructured data that has not been appropriately analysed, unintended univariate modelled relationships, and variable interactions. Simple ways to reduce the risk of these occurring are to vet and understand new data, to analyse interactions and, critically, not to forget inference.
To quote Proudman, “it is a prudential regulator’s job to be gloomy and to focus on the risks”; at the same time, it’s the role of the innovator to convince the public that the new solution is better than the old, all things considered.
To protect data subjects from inappropriate decision models, it is important to enhance model governance to cope with the additional risks highlighted above. Augmented approaches to controlling these risks should include:
· A high-level statement of what is meant by a fair ML decision and of the controls expected to deliver it
· Requirements for the extraction of “key factors” from ML decision processes (e.g. LIME, SHAP or other more robust techniques that use game theory concepts of partitioning contributions consistently)
· Ongoing comparison to human decisions and reasoning
· Review of swap sets between ML models and traditional models to investigate reasons for improved performance
· Maintenance of a library of governance reviews of the fairness of decision criteria (e.g. not granting credit because of a poor previous payment history is likely to be fair, whereas not granting credit based on gender is unlikely to be fair). This ensures that accountability for reviewing the fairness of criteria is recorded.
· Use of external and “white hat” validation as an enhancement to internal independent model review
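The swap-set review in the list above can be sketched as follows. A swap set is taken here to mean the applicants on whom the two models disagree: those the ML model accepts but the traditional model declines ("swap-ins"), and the reverse ("swap-outs"). The function name, the applicant identifiers and the decision dictionaries are hypothetical and serve only to show the mechanics.

```python
def swap_sets(ml_decisions, traditional_decisions):
    """Both arguments map applicant id -> True (accept) or False (decline).

    Returns (swap_in, swap_out) as sorted lists of applicant ids, restricted
    to applicants scored by both models.
    """
    common = ml_decisions.keys() & traditional_decisions.keys()
    swap_in = sorted(
        i for i in common if ml_decisions[i] and not traditional_decisions[i]
    )
    swap_out = sorted(
        i for i in common if traditional_decisions[i] and not ml_decisions[i]
    )
    return swap_in, swap_out


if __name__ == "__main__":
    # Hypothetical decisions from the two models on the same four applicants.
    ml = {"a1": True, "a2": False, "a3": True, "a4": False}
    traditional = {"a1": True, "a2": True, "a3": False, "a4": False}
    swap_in, swap_out = swap_sets(ml, traditional)
    print("Accepted only by the ML model:", swap_in)
    print("Accepted only by the traditional model:", swap_out)
```

Because any performance gain from the ML model comes from these two groups, profiling them by characteristic and reviewing a sample of individual cases is a direct way to check that the improvement rests on reasons a reviewer would accept as fair.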
In summary, the increased power of ML to drill down and capture nuances in the data that traditional techniques cannot requires greater awareness of the risks and possible causes of bias. As Proudman points out: “Boards should reflect on the range of skill sets and controls that are required to mitigate these risks both at senior level and throughout the organisation”. Having the right framework in place to identify and address unintended bias is therefore key to the successful incorporation of these technological advancements. The framework needs robust processes but, fundamentally, it also needs people and education around modelling phenomena, with the objective of understanding rather than ‘just’ increasing a performance metric by a few points. With more advanced techniques, performance evaluation needs to be more holistic and to allow key information to be extracted for decision makers to act upon. Ultimately, this has always been a requirement when models are used in decisioning; the only difference with ML models is that they should offer improved insight into the underlying drivers of individual decisions and into what can and should affect them going forward.