Systematic review

Fairness and generalisability in deep learning of retinopathy of prematurity screening algorithms: a literature review

Abstract

Background Retinopathy of prematurity (ROP) is a vasoproliferative disease responsible for blindness in more than 30 000 children worldwide. Its diagnosis and treatment are challenging due to the lack of specialists, divergent diagnostic concordance and variation in classification standards. While artificial intelligence (AI) can address the shortage of professionals and provide more cost-effective management, its development needs fairness, generalisability and bias controls prior to deployment to avoid producing harmful unpredictable results. This review aims to compare the characteristics, fairness and generalisability efforts of AI and ROP studies.

Methods Our review yielded 220 articles, of which 18 were included after full-text assessment. The articles were classified into ROP severity grading, plus disease detection, detection of treatment-requiring ROP, ROP prediction and detection of retinal zones.

Results All the articles’ authors and included patients are from middle-income and high-income countries, with no representation of low-income countries or of South America, Australia or Africa.

Code is available for two articles and available on request for one, while data are not available for any article. Most studies (88.9%) use the same retinal camera. In three articles, patients’ sex was described, but none applied a bias control in their models.

Conclusion The reviewed articles included 180 228 images and reported good metrics, but fairness, generalisability and bias control remained limited. Reproducibility is also a critical limitation, with few articles sharing codes and none sharing data. Fair and generalisable ROP and AI studies are needed that include diverse datasets, data and code sharing, collaborative research, and bias control to avoid unpredictable and harmful deployments.

What is already known on this topic

  • Retinopathy of prematurity is the most common avoidable cause of childhood blindness, a burden felt mainly in low-income and middle-income countries due to inadequate preterm and neonatal care and a lack of specialists to diagnose and treat patients.

  • Teleophthalmology has been applied in retinopathy of prematurity screening, creating an opportunity for algorithm development and artificial intelligence applications.

What this study adds

  • This study shows that although studies apply artificial intelligence to retinopathy of prematurity, fairness and generalisability efforts in this field are limited.

How this study might affect research, practice or policy

  • Fairness, generalisability and bias controls are fundamental for adequate artificial intelligence implementation. Artificial intelligence for retinopathy of prematurity screening is promising, but more efforts are needed to avoid unpredictable and harmful results.

Background

Prematurity is defined as birth before 37 weeks of pregnancy, classified as extremely preterm (<28 weeks), very preterm (28–32 weeks) or moderate to late preterm (32–37 weeks).1 There are an estimated 15 million live preterm deliveries each year, primarily in low-income and middle-income countries (LMICs). While the number of preterm births is increasing each year globally, so too are survival rates into adulthood, in large part due to improvements in neonatal intensive care facilities and technology (especially in LMICs). However, with many regions still lacking access to such advancements, maternal and fetal sequelae of prematurity (including retinopathy of prematurity (ROP)) are becoming increasingly consequential.1–3

ROP is a proliferative retinal vasculopathy and one of the most common avoidable causes of childhood blindness. Low gestational age, low birth weight and supplemental oxygen at birth are major risk factors for ROP, which leads to more than 30 000 children losing vision annually.4 5 Recognition of pertinent screening periods and timely diagnosis and management are challenging due to the lack of available paediatric ophthalmic specialists. Even when specialists are available, variation in classification standards, equipment, examination technique and treatment thresholds leads to divergent diagnostic concordance even among experts.6 7

Artificial intelligence (AI) algorithms use inputted data to mathematically generate clinical predictions, and the extensive use of ancillary imaging in ophthalmology makes them especially pertinent in aiding diagnosis and informing management for conditions such as ROP.8 9 AI has to date been widely applied in the detection of ophthalmological conditions, including diabetic retinopathy, age-related macular degeneration, glaucoma and ROP, and has been shown to perform on par with or better than human clinicians.10–12 Through minimising the bias inherent to AI, and external validation of algorithms to optimise for consistency and replicability of predictions, the potential for AI to mitigate the consequences of relative specialist scarcity and provide cost-effective diagnosis and decision-making is powerful. However, currently, there are no commercial AI tools clinically approved for ROP screening.13

The application of teleophthalmology in ROP diagnosis and management in particular (applied to address underserved remote areas) has created an opportunity to collect rich volumes of ophthalmic imaging data, which can be used secondarily to further AI development in this field.7

Prioritising generalisability, fairness and reproducibility when developing AI algorithms is essential to promote non-discriminatory models. By way of definition, ‘generalisability’ is the ability to provide accurate predictions in a new sample of patients not included in the original training population,14 ‘fairness’ is the assurance that AI systems are not biased in their predictions for subpopulations,15 and ‘reproducibility’ is the system’s capacity to replicate its accuracy in patients not included in the development.14 Code and data sharing are crucial components to facilitate generalisable and reproducible research and validation studies. They also enable an understanding of how models can be adapted and applied to the heterogeneous patient populations globally who stand most to benefit from advancements in AI in ophthalmology.11 12 16

The risk of biased algorithms is a prominent concern in the development and implementation of safe AI and must be addressed to avoid perpetuating existing healthcare disparities. Because of the nature of the patient population in ROP screening and management, medicolegal aspects are also especially crucial before AI can be implemented safely in this space.17

Here, we review ROP studies that implement AI techniques, compare datasets and algorithms characteristics, and analyse efforts to ensure fairness, generalisability and reproducibility of findings.

Methods

A literature search was conducted using the PubMed, EMBASE and MEDLINE databases. The search strategy used a combination of the key terms ‘ROP’ and ‘AI’ (search strategy detailed in online supplemental file 1).

Two authors (LFN and LZR) assessed articles found in the above search. First, we screened all articles and excluded non-human studies and those written in a language other than English, Portuguese or Spanish. Next, a second screening process evaluated article titles and excluded non-relevant articles (see below for a definition of ‘relevant’), reviews, clinical cases and comments. Finally, a third screening consisted of full-text analysis and excluded non-relevant articles and those not available online.

The final cohort of articles deemed relevant included those mentioning AI that applied computer vision algorithms to ROP. For these, we compared the following variables: articles’ objectives, retinal camera, model preprocessing techniques, applied neural network and performance, data and code availability, authors’ nationality, and cohort demographics and nationality.

Results

The search strategy initially identified 220 articles, with 29 deemed eligible for full-text analysis (figure 1). After the full-text analysis, 18 articles were included in the final review (online supplemental file 2).

Figure 1

Article selection flow chart.

General characteristics

All articles were published between 2016 and 2022, with the number of publications increasing each year (figure 2). According to the model objective, the articles were classified into ROP severity grading algorithms (7 articles—38.9%), plus disease detection algorithms (6 articles—33.3%), detecting patients that need treatment (3 articles—16.7%), ROP prediction (1 article—5.5%) and automated detection of retinal zones (1 article—5.5%).

Figure 2

Number of publications per year.

Regarding authors’ representation, the authors of eight articles were from China (44.4%), six from the USA (33.3%) and four from India (22.2%). In 13 articles (72.2%), all authors were from a single country, and in 5 (27.8%), they formed an international collaboration group (figure 3).

Figure 3

Map with authors’ distribution.

In 2 of the 18 articles (11.1%), the code is publicly available.18 19 A further single article (5.5%) has made the code available on request.20 The datasets used for development and validation processes are not available in any of the reviewed articles. None of the articles report a bias control analysis, such as a comparison of metrics across different demographic groups and datasets.

Images and cameras

The most commonly used imaging systems in ROP and AI datasets were the RetCam II, III and Shuttle cameras (Natus Medical, Pleasanton, California, USA), used in 16 of 18 articles (88.9%). One article (5.5%) also used the 3nethra Neo camera (Forus Health, Bangalore, India), and two (11.1%) did not specify the retinal imaging hardware.20 21 In three articles (16.7%), only good-quality images were included,22–24 one article (5.5%) included both good-quality and bad-quality images,20 and in the others, no quality control was described.

Dataset

A total of 180 228 images were included in the 18 reviewed articles. The number of individual patients included was not described. In three articles (16.7%),22–24 the sex of the included patients was described, but race/ethnicity information was not available in any article. The countries most represented by the data were China (seven articles) and India (four articles). There was no representation of countries in Africa, South America or Australia (figure 4).

Figure 4

Map with studies’ population distribution.

Preprocessing

In computer vision algorithms, preprocessing is a fundamental preparatory step in data harmonisation before final model development. Of the articles reviewed, the image preprocessing stages consisted of applying a mask over the retinal image, image resizing, colour normalisation, vessel segmentation, image enhancement, illumination adjustment and image augmentation techniques (flipping and rotation). In seven articles (38.9%), details of the preprocessing techniques applied were not described.23–28
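As a purely illustrative sketch (not drawn from any reviewed article), the steps named above can be expressed in a few lines of NumPy: a circular mask over the fundus field of view, a nearest-neighbour resize, per-channel colour normalisation and flip/rotation augmentation. All function names and parameters here are hypothetical.

```python
import numpy as np

def preprocess_fundus(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Toy preprocessing pipeline for an HxWx3 fundus image:
    circular mask, nearest-neighbour resize, per-channel normalisation."""
    h, w, _ = img.shape
    # Mask out the corners outside the circular fundus field of view.
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= (min(h, w) / 2) ** 2
    img = img * mask[..., None]
    # Nearest-neighbour resize to a square model input.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    img = img[rows][:, cols].astype(np.float32)
    # Per-channel colour normalisation to zero mean, unit variance.
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True) + 1e-6
    return (img - mean) / std

def augment(img: np.ndarray) -> list:
    """Simple augmentation: horizontal and vertical flips, 90-degree rotation."""
    return [img, img[:, ::-1], img[::-1, :], np.rot90(img)]
```

In practice, the reviewed articles implement these stages with varying tools and parameters; the sketch only makes the sequence of operations concrete.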

ROP severity grading algorithms

In seven articles (38.9%), the main objective was to automate the grading of ROP severity through ultra-wide colour retinal fundus photos (RetCam and non-specified camera).19 22 23 25–27 29 The grading algorithms included 93 383 images and applied the convolutional neural networks (CNNs) Visual Geometry Group, Inception, Residual Network and DenseNet, as well as a support vector machine classifier. The reported metrics consisted of accuracy ranging from 91.9% to 99%, with sensitivity from 88.4% to 96.6%, specificity from 92.3% to 99.3% and area under the curve (AUC) from 0.92 to 0.99.

Plus and preplus disease detection algorithms

Plus disease is one of the most critical features of ROP, indicating severe treatment-requiring ROP. Plus disease in ROP is characterised by arteriolar tortuosity and venous dilation in ≥2 quadrants of the posterior pole, and ‘preplus’ disease describes the aforementioned posterior pole vascular anomalies not fulfilling plus criteria.30 31 In six articles, the main objective was to detect plus or preplus disease in retinal fundus photos through extracted vessel analysis from ultra-wide retinal fundus photos (RetCam and non-specified camera).30 32–36

The plus detection algorithms included 17 176 images and applied a U-Net and a modified U-Net CNN, with reported metrics of accuracy ranging from 72.3% to 94%, sensitivity from 92.4% to 95%, specificity from 92.4% to 94% and AUC from 0.88 to 0.98.

Detecting patients that need treatment

In three articles, the main objective was to detect ROP patients who require treatment. Two articles included RetCam colour fundus photos and one used RetCam fluorescein angiography exams.18 28 37 The articles included 59 636 images from 3254 patients and applied a Gridding Residual Network and a U-Net model, with reported metrics of AUC ranging from 0.91 to 0.99. The article by Campbell et al reported 100% sensitivity and 78% specificity,37 and in the others, the sensitivity/specificity was not recorded.

ROP prediction

In one article, the objective was to predict the occurrence and severity of ROP using deep learning in a prospective dataset.24 The article included 7033 RetCam images from 725 patients and applied a ResNet CNN, with a reported accuracy of 68%, sensitivity of 100%, specificity of 46.6% and AUC of 0.87.

Automated detection of retinal zones

According to the International Classification of ROP, the retina is divided into three anatomical zones according to the optic disc, macula and ora serrata distances.31 Zone I is the circle centred on the optic disc with a radius of twice the distance from the optic disc to the centre of the macula; zone II is the area from the edge of zone I to a circle with a radius equal to the distance from the optic disc to the nasal ora serrata; and zone III is the residual temporal area outside zone II.38

Determining the zones is important to classify the ROP stage, determine follow-up frequency and estimate the risk of ROP sequelae.

In one article, the objective was to automate the detection of retinal zones in retinal fundus photographs; the study included 3000 images (RetCam and 3nethra) and applied a U-Net CNN, with a reported accuracy of 98% in detecting retinal zones.20 However, this article does not report sensitivity, specificity or AUC metrics.

Discussion

The social morbidity of ROP is becoming increasingly relevant as global preterm birth and survival to adulthood continue to rise. However, consistent ROP diagnosis and treatment are fraught with challenges. This is in part due to a lack of expert availability, and in part because of the variation in examination technique, findings and treatment thresholds intrinsic to the process of human clinicians making imaging-based clinical diagnoses and decisions.5 AI and deep learning algorithms have the potential to ameliorate these challenges in ROP screening, detection and management, especially in remote areas and LMICs.6

Here, we find that most ROP articles employ AI techniques to grade ROP severity, detect plus disease, predict future ROP and identify patients requiring treatment. While metrics indicate promising results, we found that generalisability and fairness efforts are extremely limited in all ROP and AI articles. To ensure representativeness, more ethnicity/race and country diversity is needed in model development. Algorithm bias assessment, which is necessary to promote fairness, is missing in all of the models.

Representativeness is needed in AI research; Coyner et al demonstrated worse ROP screening algorithm metrics when models are applied to a distinct population.21 Among the articles included in this review, the study population came from 13 countries, with most participants from China and India. There was no representation from South America, Africa or Australia/New Zealand.

The National Institutes of Health encourages sex and race/ethnicity description in clinical studies to assess diverse representation in biomedical research.39 In the reviewed articles, race and ethnicity labels are absent, and the patients’ sex is available in only three articles. None of the articles reported performance metrics disaggregated according to demographics, and no bias control assessment was performed. Race and ethnicity reports enable bias assessment in model development and fairness analyses.
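A minimal sketch of such a disaggregated assessment, assuming binary labels and predictions with a demographic group recorded per patient (illustrative only, not a method from any reviewed article): sensitivity and specificity are computed per subgroup, and the largest between-group sensitivity gap serves as a crude fairness signal.

```python
def disaggregated_metrics(y_true, y_pred, groups):
    """Per-subgroup sensitivity and specificity, plus the largest
    between-group sensitivity gap as a crude fairness check.
    y_true/y_pred are 0/1 lists; groups holds one label per patient."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        tn = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 0)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        out[g] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    sens = [m["sensitivity"] for m in out.values()]
    out["max_sensitivity_gap"] = max(sens) - min(sens)
    return out
```

A large gap between subgroups would flag exactly the kind of hidden bias that aggregate accuracy, sensitivity and AUC figures cannot reveal.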

Aggressive posterior ROP is a severe form of the disease, characterised by prominent plus disease, with worse treatment outcomes. None of the included studies focused on aggressive posterior ROP as the target or included group.

In most studies, the retinal fundus photographs came from the ultrawide RetCam camera, which costs approximately US$100 000 and is rarely available in LMICs.40 Images from the 3nethra Neo, a more affordable ultrawide retinal camera,41 were included in only one study. More affordable cameras, such as smartphone-based cameras, are already applied in ROP screening but not in AI models.40 42 43 Better data collection and image quality assessment frameworks, in addition to prospective and validation studies, are essential to enable AI-assisted screening programmes in LMICs.

Of the reviewed articles, two shared code and queries,18 19 but none made data readily available to share. Publicly available datasets and shared code repositories are important to promote reproducibility in AI research.

Model generalisability in machine learning research is ideal but likely not feasible because of dataset shifts across place and time.44 AI models should not perpetuate or magnify existing biases in diagnosis and treatment. In this review of ROP and AI articles, limited representation, biased datasets and the lack of bias control assessments threaten successful implementation.

More diverse, representative and fair datasets, generalisable models, prospective studies, and collaborative efforts are needed before real-world deployment. These are particularly challenging in LMICs. Judging AI readiness from the published literature, the feasibility of algorithm deployment in a clinical setting remains a promise at this time.

Conclusion

Distinct modelling approaches have been applied in ROP and AI research to grade ROP severity, detect plus disease, identify treatment-warranted cases, predict outcomes and delimit retinal zones. Although 180 228 images were included in the reviewed studies, most used the same ultra-wide retinal camera and lacked demographic information and bias control.

The articles showed good reported metrics, but fairness and generalisability remained limited in all AI and ROP articles. Reproducibility is also a critical limitation, with few articles sharing code and none making images or data publicly available. To avoid perpetuating global healthcare inequalities and to ensure access to such technologies for those who stand most to benefit from them, fair and generalisable studies are needed that include diverse datasets, data and code sharing, collaborative research, and bias control to avoid unpredictable and harmful deployments.