Loading…

Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study

Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can undere...

Full description

Saved in:
Bibliographic Details
Published in:The Lancet. Digital health 2022-05, Vol.4 (5), p.e351-e358
Main Authors: Oakden-Rayner, Lauren, Gale, William, Bonham, Thomas A, Lungren, Matthew P, Carneiro, Gustavo, Bradley, Andrew P, Palmer, Lyle J
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems. We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures in frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour. In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988–0·999) compared with an AUC of 0·969 (0·960–0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931–1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease). The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions. None.
ISSN:2589-7500
2589-7500
DOI:10.1016/S2589-7500(22)00004-8