Project AIR - commercial AI for bone age prediction on hand XR

This leaderboard shows the results of Project AIR, an objective comparison of commercially available AI algorithms for radiology. Here we evaluate AI solutions for predicting bone age from hand radiographs collected at seven hospitals in the Netherlands.

The full paper, published in Radiology, with all details can be found here: https://doi.org/10.1148/radiol.230981


Abstract:

Background: Multiple commercial artificial intelligence (AI) products exist for assessing radiographs; however, comparable performance data for these algorithms are limited.

Purpose: To perform an independent stand-alone validation of commercially available AI products for bone age prediction on hand radiographs and lung nodule detection on chest radiographs.

Materials and Methods: This retrospective study was carried out as part of Project AIR. Nine out of 17 eligible AI products were validated on data from seven Dutch hospitals. For bone age prediction, the root mean squared error (RMSE) and Pearson correlation coefficient were computed. The reference standard was set by three to five expert readers. For lung nodule detection, the area under the receiver operating characteristic curve (AUC) was computed. The reference was set by a chest radiologist based on CT. Randomized subsets of hand (n=95) and chest radiographs (n=140) were read by 14 and 17 human readers, respectively, with varying experience.

Results: Two bone age prediction algorithms were tested on the hand radiographs (from January 2017 to January 2022) of 326 patients (mean age, 10 years ± 4 [SD]; 153 males) and correlated strongly with the reference standard (r=0.99, P<.001 for both). No difference in RMSE (years) was observed between algorithms (0.63 [95% CI: 0.58, 0.69] and 0.57 [95% CI: 0.52, 0.61]) and readers (0.68 [95% CI: 0.64, 0.73]). Seven lung nodule detection algorithms were validated on the chest radiographs (January 2012 to May 2022) of 386 patients (mean age, 64 years ± 11 [SD]; 223 males). Compared to readers (mean AUC, 0.81 [95% CI: 0.77, 0.85]), four algorithms performed better (AUC range, 0.86–0.93 [95% CI: 0.82, 0.96]; P range, <.001–.04).

Conclusion: Compared to human readers, four AI algorithms for detecting lung nodules on chest radiographs showed improved performance whereas the remaining algorithms tested showed no evidence of a difference in performance.
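
Illustration: The two bone age metrics named in the abstract, RMSE and the Pearson correlation coefficient, can be sketched as below. This is a minimal sketch using the textbook definitions, not the evaluation code used in the study. The function names and the age values are hypothetical and chosen only for illustration.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error, in the same unit as the inputs (here: years)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical bone ages in years (NOT data from the study).
reference_age = [5.0, 8.5, 11.0, 14.0]   # e.g. an expert reference standard
predicted_age = [5.5, 8.0, 11.5, 13.5]   # e.g. an algorithm's output

print("RMSE (years):", rmse(predicted_age, reference_age))
print("Pearson r:", pearson_r(predicted_age, reference_age))
```

A lower RMSE means predictions lie closer to the reference standard in years, while r measures only how linearly the two sets of values move together, so a systematically biased algorithm can still reach a high r.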