Retina

Explainable artificial intelligence model for the detection of geographic atrophy using colour retinal photographs

Abstract

Objective To develop and validate an explainable artificial intelligence (AI) model for detecting geographic atrophy (GA) via colour retinal photographs.

Methods and analysis We conducted a prospective study in which colour fundus images were collected from healthy individuals and patients with retinal diseases using an automated imaging system. All images were categorised by two experienced retinologists into three classes: healthy, GA and other retinal diseases. Simultaneously, an explainable learning model using class activation mapping techniques categorised each image into one of the three classes. The AI system’s performance was then compared with manual evaluations.

Results A total of 540 colour retinal photographs were collected. Data were divided such that 300 images (100 from each class) were used to train the AI model, 120 for validation and 120 for performance testing. In distinguishing between GA and healthy eyes, the model demonstrated a sensitivity of 100%, specificity of 97.5% and an overall diagnostic accuracy of 98.4%. Performance metrics such as the areas under the receiver operating characteristic (AUC-ROC, 0.988) and precision-recall (AUC-PR, 0.952) curves reinforced the model’s robust performance. When differentiating GA from other retinal conditions, the model maintained a diagnostic accuracy of 96.8%, a precision of 90.9% and a recall of 100%, leading to an F1-score of 0.952. The AUC-ROC and AUC-PR scores were 0.975 and 0.909, respectively.

Conclusions Our explainable AI model exhibits excellent performance in detecting GA using colour retinal images. With its high sensitivity, specificity and overall diagnostic accuracy, the AI model stands as a powerful tool for the automated diagnosis of GA.

What is already known on this topic

  • Geographic atrophy (GA) is expected to increase in prevalence in the upcoming years. Prompt detection and monitoring of GA are crucial to optimising treatment benefits and minimising vision loss. Colour fundus photography (CFP) is a widely accessible and affordable method for GA screening and surveillance. Artificial intelligence (AI) shows potential for automating GA diagnosis via retinal image analysis, but previous AI models that used CFP suffered from limited sensitivity and lack of explainability.

What this study adds

  • This study introduces an AI model that is both highly accurate and explainable for detecting GA using CFP images. The model surpasses previous CFP-based AI methods with a sensitivity and specificity of over 97%. Class activation mapping techniques are used, which visually explain the AI model’s decision-making process, enhancing transparency.

How this study might affect research, practice or policy

  • The explainable AI model offers automated GA screening using readily accessible CFP imaging. Due to its high performance and explainability, the model can support clinical validation and promote the adoption of AI for GA diagnosis. Through optimising early GA detection, this AI method can increase patient access to innovative treatments, potentially preserving vision.

Introduction

Geographic atrophy (GA), an advanced stage of age-related macular degeneration, is a significant global health concern, being a leading cause of legal blindness. The condition affects approximately 5 million individuals, and projections suggest that this number could rise to 10 million cases by 2040.1 2 The challenge in managing GA lies in its poorly understood aetiology and pathogenesis. GA is characterised by the progressive death of retinal pigment epithelium (RPE) and photoreceptor cells, as well as choriocapillaris loss. These alterations result in clearly delineated regions observable on retinal imaging and, if the central foveal area is implicated, they can result in significant vision loss.3 4

The recent approval of pegcetacoplan injection (SYFOVRE, Apellis Pharmaceuticals), an antagonist of C3 complement, has opened new avenues for GA therapy. By targeting the complement system, this treatment slows disease progression.5 6 This therapeutic breakthrough underscores the critical need for early diagnosis and regular monitoring of patients with GA to maximise the benefits of such treatments. To facilitate accurate and timely detection of GA, the use of advanced imaging techniques, such as optical coherence tomography (OCT), fundus autofluorescence (FAF) and colour fundus photography (CFP), is crucial. These imaging modalities offer detailed information on retinal structure and function, providing in-depth insights that enable healthcare professionals to identify GA and monitor its progression more effectively. In CFP, GA is depicted as a sharply defined, typically circular area displaying either partial or complete depigmentation of the RPE, often revealing the large choroidal blood vessels underneath. FAF amplifies the accuracy in pinpointing GA lesions and their boundaries due to the notable contrast between atrophic and non-atrophic regions, aiding in a more precise outlining and segmentation of GA lesions. OCT, as a three-dimensional imaging technique, offers advantages over two-dimensional methods (CFP and FAF), facilitating a thorough examination of atrophy and a quantitative evaluation of the involvement of specific retinal layers. It plays a pivotal role in identifying complete RPE and outer retinal atrophy, characterised by a hypertransmissive zone exceeding 250 µm, RPE disruption over 250 µm, photoreceptor degeneration, and absence of signs of RPE tears.7–11

The rapid progress in artificial intelligence (AI) technology has sparked growing interest in its application for prompt diagnosis and management of ophthalmic diseases, including GA screening and monitoring. AI algorithms have the capacity to analyse vast quantities of data from various imaging modalities, enabling the detection of subtle retinal changes that may be overlooked by human observers.12 By learning patterns and features of GA from extensive image databases, AI models can apply this knowledge to new images for automated GA diagnosis.13 However, while deep learning models employing FAF and OCT have shown remarkable performance in identifying GA when compared with CFP, their practicality and availability in various healthcare settings remain limited.14–16 In contrast, CFP stands out as a more widespread, accessible and cost-effective technique for GA screening and monitoring. CFP is a simpler method that captures images of the retina using a fundus camera, which is generally more affordable and portable than FAF and OCT equipment. This makes CFP a more viable option for healthcare facilities with limited resources or those located in remote areas. Moreover, the operation of CFP typically requires less specialised training than FAF and OCT, making it more accessible to a broader range of healthcare professionals. As a result, CFP can be more easily integrated into GA screening programmes, ensuring that a larger population has access to early detection and monitoring services.17 18

The development of explainable AI (XAI) algorithms is essential for ensuring the reliability and safety of medical decision-making processes, as they allow interpretation and explanation of AI decisions. By offering a transparent understanding of the decision-making process, XAI can foster trust between patients and healthcare providers, particularly in fields like ophthalmology, where early detection and treatment of eye diseases such as GA are paramount. XAI algorithms can analyse retinal images to identify disease-specific features and patterns, while providing a clear explanation of the diagnostic process, which not only improves diagnostic accuracy but also streamlines decision-making, enabling faster and more efficient treatment.19–21

Given the importance of such advancements, our study aimed to develop an XAI model that can accurately and reliably identify GA from colour fundus images. This model is designed to provide a cost-effective, accessible and explainable tool for early GA detection.

Materials and methods

Study population

Participants in this prospective study were recruited from individuals attending their regular yearly appointments at the Istituto Europeo di Microchirurgia Oculare (IEMO) in Udine, Italy. Eligible individuals were invited to join the research project. To be included in the study, patients were required to fulfil the following criteria: a minimum age of 18 years and a spherical equivalent ranging from −6 to +6 dioptres. Informed consent was obtained from each participant in written form prior to their enrolment in the study. All procedures were conducted in accordance with the Declaration of Helsinki and were approved by the IEMO review board.

Imaging collection

Each patient underwent retinal imaging using a fully automated, white LED confocal scanner (Eidon, CenterVue Spa, a company of iCare Finland Oy; Vantaa, Finland). This device employed a slit confocal technique, automatically capturing 60-degree, 14-megapixel colour retinal images through a non-mydriatic pupil using a broad-spectrum white LED (440–650 nm) as the light source. Colour fundus images, centred on the foveal midpoint, were acquired for both eyes of each participant under their natural pupil size. A single image was captured for each eye of every participant involved in the study. A technician obtained the images, ensuring they were of gradable quality. Images in which the eye could not be properly assessed, such as those that were blurred or defocused due to severe cataracts or keratitis, were excluded from further analysis. All fundus images were extracted as JPG files. To ensure privacy and confidentiality, all fundus photographs were subsequently anonymised and then uploaded to the AI system for further analysis and evaluation.

Imaging processing

Reference standards were established by randomly allocating each image to two retinal specialists (VS and DV) with 5–10 years of post-certification experience in a tertiary hospital. The image labelling was considered finalised only after both experts reached a consensus; if agreement could not be achieved, a third expert (PL) with over 10 years of post-certification experience in a tertiary hospital made the final decision. Each image was classified into one of three categories: healthy eye (category 1), eyes with GA (category 2) and eyes with any retinal conditions other than GA (category 3).

Explainable learning architecture

The proposed approach incorporates a framework consisting of two principal elements: (1) a feature extractor for classification purposes and (2) a class activation map (CAM) module employed to elucidate the interpreted outcomes (figure 1).

Figure 1 | Explainable learning architecture. ReLU, rectified linear unit.

The feature extraction process relies on a deep convolutional neural network (CNN), specifically the Efficientnet_b2 model, renowned for its high efficiency and compact design, pretrained on the substantial ImageNet data set. This CNN was employed to deduce a compact array of representative, low-level features from the input images, subsequently used for image classification. As the task at hand involves ophthalmological image classification, we strategically removed the final few layers of the network, retaining only the first six convolutional blocks for the two-class setting and the first five for the three-class setting. This practice is customary when transferring knowledge from one domain to another: the deeper the layers in a neural network architecture, the more domain-specific the extracted features become, thereby complicating their effective application in a new domain.
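
To make this transfer-learning step concrete, the sketch below shows how such a truncated backbone could be assembled with PyTorch/torchvision. It is illustrative only: the exact block indexing, the pooled classification head, the dropout placement and the input resolution are assumptions, not a record of the implementation used in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

# EfficientNet-B2 pretrained on ImageNet; torchvision exposes the
# convolutional stages as model.features (stem + 7 MBConv stages + head).
backbone = models.efficientnet_b2(
    weights=models.EfficientNet_B2_Weights.IMAGENET1K_V1)

N_BLOCKS = 6   # 6 for the two-class setting, 5 for three classes
N_CLASSES = 2  # healthy vs geographic atrophy

# Keep only the first N_BLOCKS blocks; deeper layers encode features
# too ImageNet-specific to transfer well to fundus images.
feature_extractor = nn.Sequential(
    *list(backbone.features.children())[:N_BLOCKS])

classifier = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling over feature maps
    nn.Flatten(),
    nn.Dropout(p=0.9),         # aggressive dropout, as described in Methods
    nn.LazyLinear(N_CLASSES),  # infers the truncated backbone's channel count
)

model = nn.Sequential(feature_extractor, classifier)

# Sanity check with a dummy batch (B2's default input size is 288x288).
logits = model(torch.randn(1, 3, 288, 288))
print(logits.shape)  # torch.Size([1, 2])
```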

The second component of the proposed framework is the GradCAM, a CAM module. This module leverages the spatial information conserved through the convolutional layers of the feature extraction network to generate a heatmap, emphasising the regions of the original image that significantly contributed to predicting the output class. The heatmap generation process involves selecting a convolutional layer from the feature extractor (generally the final one, as it optimally balances high-level semantics and detailed spatial information) and performing a weighted average of the produced feature maps. GradCAM’s distinctive characteristic lies in how this average is calculated—the weights are based on the gradients of each feature map.
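
The following is a minimal PyTorch sketch of this GradCAM computation; the hook-based capture of activations and gradients, and the choice of target layer, are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """GradCAM sketch: weight each feature map of target_layer by the
    spatially averaged gradient of the class score w.r.t. that map."""
    acts, grads = [], []
    fh = target_layer.register_forward_hook(
        lambda mod, inp, out: acts.append(out))
    bh = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.append(gout[0]))
    try:
        scores = model(image)            # forward pass; hook captures maps
        model.zero_grad()
        scores[0, class_idx].backward()  # gradients for the target class
    finally:
        fh.remove(); bh.remove()

    a, g = acts[0], grads[0]                    # both shaped (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)  # gradient-based weights
    cam = F.relu((weights * a).sum(dim=1))      # weighted average + ReLU
    cam = cam / (cam.max() + 1e-8)              # normalise to [0, 1]
    # Upsample to input resolution so the heatmap overlays the photograph.
    return F.interpolate(cam.unsqueeze(1), size=image.shape[2:],
                         mode="bilinear", align_corners=False)[0, 0]
```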

In our study, each image was examined in random order and assigned to one of the three categories by the explainable learning model reliant on CAM techniques, developed by the Department of Mathematics, Computer Science and Physics at the University of Udine.

Training, validation and testing data

The training, validation and testing data sets were created by randomly splitting the total data set into three parts. The training data set contained 60% of the total images, the validation data set contained 20% and the testing data set also contained 20%. This split was chosen to ensure that the model had a sufficient amount of data for learning, while also allowing for robust validation and testing. To prevent overfitting, we employed two regularisation techniques—early stopping and dropout. Early stopping was implemented by monitoring the validation loss during training and halting training if the validation loss did not improve after 20 consecutive epochs. In addition, we applied a dropout layer with a rate of 90%.
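
For illustration, the data split and the early stopping criterion could be implemented as follows. This is a minimal Python/scikit-learn sketch: the stand-in labels and the training and validation callables are hypothetical placeholders, not our actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels: three balanced classes of 180 images each
# (0 = healthy, 1 = geographic atrophy, 2 = other retinal disease).
labels = np.repeat([0, 1, 2], 180)
idx = np.arange(len(labels))

# 60/20/20 split, stratified so each subset keeps the class balance.
train_idx, rest = train_test_split(idx, test_size=0.4,
                                   stratify=labels, random_state=0)
val_idx, test_idx = train_test_split(rest, test_size=0.5,
                                     stratify=labels[rest], random_state=0)

def fit_with_early_stopping(train_epoch, validate,
                            max_epochs=500, patience=20):
    """Halt training once validation loss has not improved for
    `patience` (here 20) consecutive epochs."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best:
            best, wait = val_loss, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                break                  # 20 stagnant epochs: stop training
    return best
```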

Performance metrics and statistical analysis

The entire test set was evaluated for accuracy, precision and recall in detecting the presence of GA. Additionally, key performance metrics such as sensitivity, specificity and F1 score were also calculated. We assessed the performance of our algorithm for both two-class (healthy eyes and eyes with GA) and three-class distinctions (eyes with GA, healthy eyes and eyes with retinal diseases other than GA).

Accuracy serves as a measure of the system’s proficiency in delivering correct predictions, while precision assesses the ability of the system to accurately classify positive cases. Recall, on the other hand, evaluates the system’s competence in recognising all instances of positive cases. In the context of medical applications, particularly in ocular disease screening, higher values in these metrics signify superior performance of the software.
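
As a concrete illustration, all of these metrics can be derived from a confusion matrix, as in the following scikit-learn sketch; the toy labels are illustrative only and unrelated to our results.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Illustrative two-class test labels (1 = GA, 0 = healthy).
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # identical to recall for the positive class
specificity = tn / (tn + fp)

print(f"accuracy    {accuracy_score(y_true, y_pred):.3f}")
print(f"precision   {precision_score(y_true, y_pred):.3f}")
print(f"recall      {recall_score(y_true, y_pred):.3f}")   # = sensitivity
print(f"specificity {specificity:.3f}")
print(f"F1          {f1_score(y_true, y_pred):.3f}")
```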

The performance of the AI system was evaluated using key statistical measures—specifically the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve. These analyses serve as robust methodologies to assess the prediction accuracy and the balance between sensitivity and specificity.

The proficiency of our deep learning algorithms in discerning GA was determined by scrutinising the areas under the ROC (AUC-ROC) and PR (AUC-PR) curves. These evaluation metrics provide a comprehensive perspective on the effectiveness and accuracy of our computational approach in detecting GA. The AUC-PR is presented as the average precision value. Perfect agreement with the gold standard (human grading) would be indicated by an area under the curve equal to 1; the closer the ROC curve converges towards the top left-hand corner (and the PR curve towards the top right-hand corner), the more accurate the AI-based system is. Comparisons between AUC-ROCs and between AUC-PRs were made using the DeLong test.22 Statistical analyses were performed using MedCalc, V.15.0 (MedCalc Software, Ostend, Belgium) and the SPSS statistical package, V.25 (IBM Corp, Armonk, New York, USA).
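
A minimal sketch of these curve-based metrics is shown below (scikit-learn; the scores are illustrative only). Note that average_precision_score corresponds to the average precision summary used here, whereas the DeLong comparison between AUCs requires a dedicated implementation and is not shown.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             roc_curve, precision_recall_curve)

# Illustrative ground truth (1 = GA) and model probabilities per image.
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.92, 0.85, 0.10, 0.55, 0.78, 0.07, 0.20, 0.95])

auc_roc = roc_auc_score(y_true, y_score)           # area under the ROC curve
auc_pr = average_precision_score(y_true, y_score)  # average precision (AUC-PR)

fpr, tpr, _ = roc_curve(y_true, y_score)           # points of the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(f"AUC-ROC {auc_roc:.3f}   AUC-PR {auc_pr:.3f}")
```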

Results

A total of 540 colour retinal photographs were collected and incorporated into the study, representing 540 eyes divided into three categories: 180 healthy eyes, 180 eyes with GA and 180 eyes with retinal diseases other than GA. For each category, 100 images were designated for training the learning model, 40 for validation and 40 for performance testing following the training phase. Among the 180 fundus photographs depicting retinal diseases other than GA, the diagnoses included retinal vascular occlusions (n=18), epiretinal membranes (n=33), central serous chorioretinopathy (n=28), diabetic retinopathy (n=47) and choroidal neovascularisation (n=54). Further demographic characteristics of the population can be found in table 1.

Table 1 | Patients’ demographic characteristics

The performance metrics for detecting the presence of GA among the entire test set are detailed in table 2 and figure 2. Confusion matrices illustrating our model’s performance are presented in online supplemental table 1.

Table 2 | Performance metrics of the explainable artificial intelligence model

Figure 2 | Graphical representation of the explainable artificial intelligence model’s performance metrics. Two classes: healthy, geographic atrophy; three classes: healthy, geographic atrophy and other retinal diseases.

In the identification of GA within the two categories of interest, our model demonstrated a sensitivity of 100% (95% CI: 83.2% to 100%) and a specificity of 97.5% (95% CI: 86.8% to 99.9%). Further, the model showed a diagnostic accuracy of 98.4%, complemented by an AUC-ROC of 0.988 (95% CI: 0.918 to 1). The F1 score, the harmonic mean of precision and recall, was 0.976, and the AUC-PR corresponded to an average precision of 0.952 (95% CI: 0.719 to 0.994). In the three-class setting, the model achieved a diagnostic accuracy of 96.8%, a precision of 90.9% and a recall of 100%, yielding an F1-score of 0.952. It demonstrated a sensitivity of 100% (95% CI: 83.2% to 100%) and a specificity of 95% (95% CI: 83.1% to 99.4%). The AUC-ROC was 0.975 (95% CI: 0.897 to 0.998), and the AUC-PR reached an average precision of 0.909 (95% CI: 0.685 to 0.979), further supporting the model’s diagnostic capabilities. The difference between the two AUC-ROCs was 0.0125 (95% CI: −0.0120 to 0.0370; p=0.3173) and between the two AUC-PRs was 0.04329 (p=0.4232). Training was halted by the early stopping criterion once the validation loss plateaued; the widening gap between training and validation loss indicated that overfitting was beginning to occur, but early stopping terminated training before it progressed, as evidenced by the model’s strong performance on the independent test set. Figure 3 presents examples of the CAM visualising the prediction process executed by the trained model. As illustrated, the CAM for each test image showed marked activation in the posterior pole, particularly in retinal images of category 2, which corresponds to those exhibiting GA.

Figure 3 | Examples of a class activation map for prediction of healthy eyes (category 1), eyes with geographic atrophy (category 2) and eyes with retinal diseases other than geographic atrophy (category 3) by the trained model using the test data set.

Discussion

In the wake of pegcetacoplan’s recent approval for GA treatment and ongoing investigations into novel therapeutic strategies, the significance of early GA detection has become markedly clear.5 6 Early detection and timely intervention are crucial in preserving visual function and enhancing the quality of life for those affected by GA. As the need for effective GA screening escalates, it is worth noting that current methodologies such as FAF and OCT exhibit commendable efficacy in detecting GA.14 FAF, in particular, provides a more detailed depiction of GA as it captures lesions and hyperfluorescent regions distinctly, enabling a superior visual representation of the retina in patients with GA. The pronounced contrast between atrophic and non-atrophic regions in FAF images allows for a more precise delineation and segmentation of GA lesions compared with CFP images, boosting the identification accuracy and reproducibility for both human and AI algorithms.9 10 23 Ramsey and colleagues’ study corroborates this, as they reported superior accuracy (±SD) with FAF images (0.75±0.16) compared with CFP images (0.42±0.25) when using the same algorithm. They inferred that the differences in accuracy were primarily due to the presentation of GA features in these distinct imaging modalities.24 This highlights the advantages of using FAF over CFP for GA detection, given its superior accuracy and reliability.

However, FAF and OCT methodologies, although effective, come with significant costs and require highly skilled personnel to manage the equipment and interpret the results. This limitation could impede widespread adoption, particularly in resource-limited settings or areas lacking specialised healthcare professionals.17 18 In contrast, CFP emerges as a more affordable, accessible and user-friendly alternative. Historically, CFP has been the gold standard for imaging GA and the principal tool for measuring GA lesions in clinical trials.11 25 It is the primary modality employed in large-scale epidemiological studies and disease classification systems. GA lesions in CFP appear as retinal depigmentation, which enhances visibility of the underlying choroid. However, due to media opacities and low contrast between atrophic and non-atrophic areas, CFP’s depiction of GA features is limited, making GA lesion and boundary detection challenging. Consequently, CFP has been associated with subpar performance in identifying GA, and due to image quality constraints, it is often not suitable for automated or semi-automated detection algorithms.11 This is further evidenced by the mixed results reported in the literature, emphasising the need for alternative imaging techniques or improved AI algorithms for more accurate and reliable GA detection using CFP images.14

In contrast with previous studies, our results successfully demonstrate that the implementation of the present XAI model can significantly enhance GA detection using CFP. This underscores the potential of our AI-based strategy to overcome traditional CFP-based model limitations, serving as a viable alternative for more accurate and reliable GA detection in a variety of healthcare environments. Using a unique explainable learning model, our approach achieved high diagnostic accuracy for identifying GA in a diverse data set of colour retinal images.26 27

Remarkably, our model demonstrated commendable performance by employing a pretrained model anchored in ImageNet and then refining this model using a mere 300 images for training. This approach allowed us to exceed the performance exhibited in Keenan’s research, showcasing the effectiveness of our method despite the lean data set.28 This superior performance and improved transparency, ascribed to the explainable nature of our AI system, promote greater understanding and trust in its predictive abilities. Specifically, this model does not merely function as a ‘black box’, but rather provides a transparent process that explains its diagnostic conclusions. The deep learning model, based on the Efficientnet_b2 architecture pretrained on the ImageNet data set, demonstrated remarkable diagnostic accuracy in identifying GA from retinal photographs. These results surpassed expectations and were found to be robust in a multicategory classification scenario involving other retinal diseases. Moreover, using colour retinal photographs, a more accessible and cost-effective imaging modality compared with OCT or FAF, supports the model’s applicability across a variety of healthcare settings.

In addition to the accuracy achieved, a noteworthy feature of our study is the use of GradCAM, a CAM module, to produce heatmaps. These heatmaps offer an invaluable tool for identifying the specific regions in the retinal image that were most influential in the model’s predictions. This visualisation can potentially bridge the gap between the AI’s decision-making process and the human clinician’s understanding, fostering trust and facilitating more effective communication. The ‘explainability’ of the AI model presents a unique advantage, particularly in ophthalmology where clear interpretation and explanation of diagnostic decisions can impact patient care significantly.

By offering interpretability and transparency in the decision-making process, XAI enhances trust in AI systems, enabling clinicians to make better-informed decisions and validate AI-generated outcomes. XAI also addresses regulatory compliance and ethical concerns by making AI decision-making processes more understandable, auditable, fair and accountable. Additionally, XAI facilitates error detection and model refinement, ultimately improving the accuracy and performance of AI models in complex domains such as medical imaging and screening.27

The use of the GradCAM mechanism provides a path to uncover the model’s internal logic and potentially identify novel features or patterns that contribute to the detection of GA. Such insights could enhance our understanding of the disease’s pathology, leading to better diagnostic and treatment strategies in the future.

Though our study achieved promising results, it is essential to address its limitations. A potential limitation of our AI model is its reliance on images captured using a single fundus camera, which may impact the generalisability of the results to images acquired from different cameras or imaging systems. Furthermore, the study was conducted at a single centre and employed a relatively homogeneous population which could introduce potential biases and restrict the model’s applicability to a more diverse range of patients with varying demographics and clinical characteristics.

However, we view these limitations as potential strengths of our approach. The specificity of our AI model to a single imaging system and a particular population illustrates that focusing on a distinct imaging device and a specific patient demographic can deliver high performance, even with smaller training data sets. Whereas many AI models are trained on expansive image databases from multiple fundus cameras and across a broad spectrum of pathologies, our study underscores the potential advantages of developing bespoke AI solutions specific to individual diagnostic tools and certain retinal pathologies. This approach could significantly boost the precision and efficacy of AI-assisted detection and diagnosis in ophthalmology.

In reflection on the limitations of our study, the exclusion of individuals with high myopia is acknowledged. Moreover, in this study, patients with concurrent GA and other retinal conditions were not included. These criteria, while aiding in the uniformity of our study population, may not mirror the diversity of patients encountered in real-world clinical settings. In practice, the inclusion of patients with high myopia or those with concurrent GA and other retinal conditions could introduce additional variability, given the distinct retinal changes often associated with these conditions. Such inclusion could potentially impact the AI model’s performance, necessitating further refinements to accurately identify and classify GA amidst the backdrop of other retinal alterations. Future studies should consider these patient demographics to ensure broader applicability and validation of the AI model in a more inclusive and diverse patient population.

One statistical limitation of this study is the potential for overfitting due to the relatively small sample size. By training with early stopping (halting once the validation loss failed to improve for 20 consecutive epochs), we were able to maximise the use of the limited training data while avoiding overfitting. The addition of aggressive dropout with a rate of 90% during training also regularised the model to improve generalisation. While these techniques help reduce overfitting, testing the model on larger and more diverse data sets remains important future work to fully confirm its applicability across different demographics and imaging systems. Compared with previous studies that used deep learning models for GA detection, our study stands out due to the explainability of our AI model. This transparency not only fosters trust in the model’s predictions but also provides valuable insights into the features that the model considers important for GA detection.

In conclusion, our study presents an effective XAI model capable of accurately diagnosing GA from colour retinal photographs, thus demonstrating the potential of AI in the field of ophthalmology. By employing a transparent decision-making process, our model enhances trust and improves understanding, contributing to the potential widespread adoption of AI technology in clinical settings. As such, the results from our study suggest that AI may become a powerful tool in screening campaigns to diagnose GA promptly and accurately, enabling timely intervention and improved patient outcomes.