Objective To evaluate the accuracy of convolutional neural networks technique (CNN) in detecting keratoconus using colour-coded corneal maps obtained by a Scheimpflug camera.
Design Multicentre retrospective study.
Methods and analysis We included the images of keratoconic and healthy volunteers’ eyes provided by three centres: Royal Liverpool University Hospital (Liverpool, UK), Sedaghat Eye Clinic (Mashhad, Iran) and The New Zealand National Eye Center (New Zealand). Corneal tomography scans were used to train and test CNN models, which included healthy controls. Keratoconic scans were classified according to the Amsler-Krumeich classification. Keratoconic scans from Iran were used as an independent testing set. Four maps were considered for each scan: axial map, anterior and posterior elevation map, and pachymetry map.
Results A CNN model detected keratoconus versus health eyes with an accuracy of 0.9785 on the testing set, considering all four maps concatenated. Considering each map independently, the accuracy was 0.9283 for axial map, 0.9642 for thickness map, 0.9642 for the front elevation map and 0.9749 for the back elevation map. The accuracy of models in recognising between healthy controls and stage 1 was 0.90, between stages 1 and 2 was 0.9032, and between stages 2 and 3 was 0.8537 using the concatenated map.
Conclusion CNN provides excellent detection performance for keratoconus and accurately grades different severities of disease using the colour-coded maps obtained by the Scheimpflug camera. CNN has the potential to be further developed, validated and adopted for screening and management of keratoconus.
Data availability statement
No data are available.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Statistics from Altmetric.com
What is already known about this subject?
Early and accurate detection of keratoconus provides opportunities to address risk factors and offer treatments to potentially slow its progression.
What are the new findings?
The proposed method can automatically analyse Scheimpflug tomography scans and accurately stage the severity of keratoconus.
How might these results change the focus of research or clinical practice?
The results show that automatic analysis has the potential to allow for a faster and more accurate evaluation of keratoconus.
Keratoconus is a non-inflammatory degeneration that leads to progressive corneal thinning, myopia, irregular astigmatism and scarring, resulting in debilitating disease that may significantly affect a patients’ quality of life.1 The early and accurate detection of keratoconus provides opportunities to reduce risk factors and offer treatments to potentially slow its progression.2
Detecting keratoconus in its early stages, however, can be difficult, as patients may have normal visual acuity and an unremarkable slit-lamp examination. The diagnosis often requires an assessment of the topography and/or tomography of the cornea to reveal subtle changes in corneal morphology.3–6 The evaluation of corneal topography and tomography may itself be challenging and open to misinterpretation.5 Several indices have been proposed to facilitate differentiation between keratoconus and normal eyes, such as the zone of increasing corneal power, inferior–superior corneal power asymmetry, steepest radial axes, posterior and anterior ectasia and corneal pachymetry.7–9
Utilisation of artificial intelligence (AI) to evaluate and detect early stages of various diseases10–14 at much faster rates than physicians has seen a sharp rise in recent years.15–17 In particular, AI techniques developed from artificial neural networks, deep learning (DL)-based algorithms show great promise in extracting features and learning patterns from complex data.18 Indeed, in eyecare, DL techniques have helped physicians in detection of anatomical structures and/or lesions,19–22 diagnosis of various diseases23 and provided prognostic insights.24 25 Most studies are, however, primarily focused on relatively common diseases such as diabetic retinopathy, age-related macular degeneration and glaucoma.26
The sensitivity and specificity of machine learning for the detection of keratoconus has been evaluated in several studies.27–40 Among current neural networks, the one which may be more suitable for the evaluation of the keratoconus is the convolutional neural networks technique (CNN), which is one of the main methods of recognising and classifying images through their colour-code pattern.41 CNN, therefore, can be applied to corneal topographic colour maps. Four studies have applied CNNs to patients with keratoconus,27–29 39 but with relatively small numbers, limiting it potential application. We, therefore, aimed to develop a CNN model to automatically detect keratoconus using standard colour-coded corneal maps and to provide a code that could be used in the clinic. This would be beneficial as colour-coded maps provide large amounts of information and clinicians are intrinsically more familiar with interpreting colour-coded maps in comparison to complex topographic indexes.
Material and methods
We included tomographic images of keratoconic and healthy eyes provided by three centres: The Royal Liverpool University Hospital (UK); Sedaghat Eye Clinic, Mashhad (Iran) and The New Zealand National Eye Center (New Zealand). Patients with all stages of keratoconus were included. Images were obtained between January 2013 and January 2020. Scans were obtained using the standard 25 scans setting without pupil dilation, under scotopic conditions by trained technicians using the Pentacam HR (Oculus Optikgeräte, Wetzlar, Germany). Patients were asked to fixate on the target light and asked to blink completely just before each measurement to allow for adequate tear film coverage over the corneal surface. The examiner checked each scan and its quality before recording it and only scans of acceptable quality, reporting OK at quality score section, were included. For the control group subjects with a Belin/Ambrósio Enhanced Ectasia total deviation index (BAD-D) of less than 1.6 SD from normative values (indicating the absence of ectasia) were included.42 43
Four different maps were investigated: axial, anterior elevation, posterior elevation and pachymetry. All of the four different maps used the absolute or standard colour scale, meaning that the maps display a fixed range of value selected in the settings of the tomographer, regardless of the map selected. All scans were categorised into four stages (online supplemental figure 1) according to Amsler-Krumeich grading.44 45
Five different classification tasks were considered: (1) healthy and keratoconus, (2) healthy versus stage 1, (3) stage 1 vs 2, (4) Stage 2 vs 3 and (5) 5-class classification between healthy and each stage of keratoconus. In order to tackle the problem of an unbalanced label, and accurately classify healthy and keratoconic eyes, the weights of each class were balanced by using the inverse class frequency for the training. For example, the weights for healthy eyes (n=82) and keratoconus eyes (n=1033) were set as (82+1033)/1033 and (82+1033)/82, respectively.
In this work, different classification strategies were developed. Considering that each of the four maps contained numerous parameters that could be used to detect keratoconus and in order to preserve the maximum possible level of information for the prediction, we trained four models with one for each type of map images mentioned above, and a fifth model which used a concatenation of 4 map images (axial, corneal thickness, front and back elevation) as input. During the prediction, the output of each CNN model was the predicted class label of the input map images so that each of the trained models would give a predicted class label based on its own opinion.
We also made a sixth model by using the majority voting strategy from the above mentioned five models for each given task. For this model, a final prediction was made on the class of the input samples through a voting strategy as each of the five models will produce a prediction. More specifically, for each image sample the categorical labels from the above five different models were pooled together the argument of the maxima (abbreviated argmax) operation was applied to determine the best prediction (online supplemental figure 2). Without first performing the argmax operation, the predicted results would be directly influenced by the addition of the probability distribution.
In this work, the whole Liverpool (UK) and New Zealand (NZ) datasets were randomly split into training (80%) and testing (20%) images (table 1). Twenty per cent of the training set were randomly selected as the validation set during the model training process. The entirety of the Mashahd (IRAN) dataset was kept as an independent testing set and not involved at any stage of the training phase.
Maps from the right eye were flipped along the centre vertical axis. Maps from the left eye remained unchanged. Four individual maps were cropped using an in-house programme in Matlab version 2019b (Mathworks, Natick, USA). In brief, all of the white pads around the four individual maps and the colour bars were removed respectively and then the four cropped individual maps were resized into 224 by 224 as the final inputs for the model learning. For the purpose of learning, the four individual maps together with the four resized individual maps were concatenated back in the original order.
Neural network architecture
We adopted a VGG16 model46 for this study, but in order to prevent overfitting, the number of connected weights of the top layer were reduced. Everything before the flattened layer was the same as with the standard VGG16. After the flattened layer for the 2-class task and in order to fully connect the (FC) layer with 128 outputs with a rectified linear unit (ReLU), an FC layer with 64 outputs with ReLU and an FC layer with two outputs (the final output layer representing two classes) was used with a SoftMax activation function. After the flattened layer for the 5-class task, the order was an FC layer with 128 outputs, a ReLU dropout layer with probability 0.5, FC layer with 64 outputs with ReLU dropout layer with probability 0.5, FC layer with five outputs (the final output layer representing five classes) were used with a softmax activation function. Please see online supplemental figure 2 for more information.
Saliency maps46 and Gradient-weighted Class Activation Mapping (Grad-CAM)47 were developed for visualisation of the learning, because the outputs of saliency maps and Grad-CAM were the same shape as the input image and could provide some intuition of attention. For saliency maps, the gradient of output with respect to input image from our models were computed. These gradients can highlight certain regions which contribute the most. Higher gradient values mean more contribution towards the output. In Grad-CAM, the computed gradients of output with respect to input image, produce a coarse localisation map or heat map highlighting the important regions in the image to show where the network has been focusing on. This provides a bit more explanation on the black-box characteristic of DL.
We implemented our classification models by using Python V.3.7, Keras V.2.3.1 and Tensorflow V.1.14 as back-ends. All of the models were trained by using the Adam optimiser. We searched the optimal learning rate from the set of (10–3, 10–4, 10–5, 10–6) for each model based on the validation set. Binary cross-entropy was the loss function used for the 2-class classification models, and the categorical cross-entropy loss function for the 5-class classification models. The batch size was set to 20. All the training processes were performed on a TESLA V100 GPUs and all the testing experiments were conducted using a 2080Ti GPU.
In this work, accuracy, sensitivity, specificity and area under the receiver operating characteristic curve (AUC) were used. In brief, Accuracy (ACC)=(TP+TN)/(TP +TN + FP+FN); Sensitivity: TP/(TP+FN); Specificity: TN/(TN+FP), where TP is a true positive, FP a false positive, TN a true negative and FN a false negative. AUC is valued between 0 and 1, where 1 means perfect classification. AUC will be more informative when the number of samples in different classes are imbalanced. De Long’s method was introduced to construct confidence intervals for accuracy, sensitivity, specificity and AUC, by bootstrapping with 2000 samples to calculate 95% CI.48
A total of 1926 corneal tomography scans were obtained comprising of 134 healthy controls and 1702 patients with keratoconus. A total of 1394 scans were collected from UK/NZ including 102 healthy, 199 stage 1, 264 stage 2, 144 stage 3 and 685 stage 4. An additional 532 keratoconus scans (32 healthy, 83 stage 1, 161 stage 2, 64 stage 3, and 192 stage 4) collected from IRAN between February 2017 to January 2020 were used a the validation set.
We will first present the results on the testing datasets of the UK/NZ datasets for the five different classification tasks, and then present the results on the IRAN dataset as the external validation set. Due to the imbalance of the classification problems, we will focus on the AUC results.
Healthy (n=20) vs keratoconus (n=259)
All the models using the thickness map, front elevation map, the back elevation map, the concatenated four maps and majority voting performed well, all with AUC higher than 0.80 (table 2). The model using the concatenated four maps as the input achieved the best AUC of 0.9423 (95% CI 0.8773 to 0.9942), followed by the model using the back elevation map (0.9173, 95% CI 0.8453 to 0.9793) and the majority voting model (0.8942, 95% CI 0.8115 to 0.9641).
Healthy (n=20) vs stage 1 (n=40)
The majority voting model achieved the best AUC of 0.90 (95% CI 0.8242 to 0.9653), followed by the model using concatenated 4 maps 0.8875 (95% CI 0.8103 to 0.9556) (table 3).
Stage 1 (n=40) vs stage 2 (n=53)
All models except the axial and front elevation maps performed well with AUCs higher than 0.88. The model using the back elevation map had the best performance with an AUC of 0.9153 (95% CI 0.8618 to 0.9592), followed by the majority voting model 0.9092 (95% CI 0.8557 to 0.9569). The difference, however, was marginal (table 4).
Stage 2 (n=53) vs stage 3 (n=29)
The model using the concatenated four maps achieved the best AUC of 0.8165 (95% CI 0.7398 to 0.8908) (table 5).
The majority voting model achieved the best AUC of 0.888 (95% CI 0.8677 to 0.907), followed by the concatenated 4 maps model 0.8835 (95% CI 0.8641 to 0.9029) (online supplemental table 1). The confusion matrices of the above models are shown in online supplemental table 2.
From the above results of different classification tasks on the testing sets, it is apparent that the top three models are always concatenated map model, major voting model and the model using back elevation maps. The differences in performance between the aforementioned models, however, were marginal. The concatenated map model performed consistently across all classification tasks when compared with the voting model but importantly required almost 80% less computational resources. We, therefore, selected only the model using concatenated four maps for the validation testing of the IRAN dataset.
Results on the external IRAN validation dataset
In the IRAN dataset, there were 532 images comprising: Healthy (n=32); stage 1 (n=83); stage 2 (n=161); stage 3 (n=64) and stage 4 (n=192). As all the images were from patients classified as keratoconus, we were not able to evaluate the sensitivity and specificity of this model on this dataset.
The AUC was 0.9737 (95% CI 0.9653 to 0.9821) for the classification of healthy (n=32) vs keratoconus (n=500). The AUC was 0.7304 (95% CI 0.6817 to 0.7783) for the classification of healthy (n=32) vs stage 1 (n=83). The AUC was 0.8988 (95% CI 0.8649 to 0.9305) for the classification of stage 1 (n=83) vs stage 2 (n=161). The AUC was 0.8285 (95% CI 0.7793 to 0.8754) for the classification of stage 2 (n=161) vs stage 3 (n=64). The AUC was 0.859 (95% CI 0.8438 to 0.8749) for the 5-class classification. The confusion matrices of the above models are shown in online supplemental table 2.
Visualisation of the learnt features
We also studied the area of interest (focus) of the models during its learning process. Saliency maps and Grad-CAM were developed to obtain heat maps to show where the network highlighted the important regions in the image. The following are some good examples (online supplemental figure 3).
Our findings suggest that accurate automated detection of keratoconus and its evolution are possible using a CNN. When provided with all four maps (axial, corneal thickness, front and back elevation) the model is automatically able to discern between keratoconus and healthy eyes with an accuracy of 99.07% (97.57%–99.43%). Moreover, it can detect different stages of keratoconus with an accuracy of 93.12% (86.75%–93.98%). We postulate that the CNN model has better accuracy than the current gold standard of human interpretation. The use of Amsler-Krumeich classification of keratoconus in four stages is related to its worldwide use in daily practice.
We believe that DL will aid in screening and in staging of keratoconus in a clinical setting, because the precise detection of early keratoconus is still challenging in daily practice.
The current gold standard of keratoconus diagnosis and staging relies on human interpretation of biomicroscopy features and tomography scans, and has previously been shown to be limited by poor reproducibility.5 49 The most commonly used parameter to determine keratoconus progression has been maximum keratometry (Kmax).50–52 Kmax is a single point reading representing the maximum curvature typically taken from the axial or sagittal anterior corneal curvature map. Kmax has numerous limitations as a single point reading is a poor descriptor of the cone morphology, a change in cone morphology (eg, a nipple cone progressing to a globular cone) can sometimes be associated with a reduction in Kmax, single point readings tend to have poor reproducibility, changes in Kmax do not correlate to changes in visual function and Kmax is limited to the anterior corneal surface, ignoring the posterior cornea, thereby having no ability to detect early or subclinical disease or early disease progression.53–57
The ability of an algorithm to detect keratoconus is based on an operationalisable distinction between a normal and ectatic cornea. A Global Consensus Panel in 2015 was able to agree on a definition of keratoconus in terms of abnormal posterior ectasia, abnormal corneal thickness distribution and clinical non-inflammatory corneal thinning.45 Various indices have been evaluated with regard to their ability to discriminate an ectatic from a normal cornea. Among these, the Smolek/Klyce and the Klyce/Maeda (KCI) have been shown to possess a good specificity and sensitivity in distinguishing between keratoconus and healthy eyes.42 The Tomographic and Biomechanical Index uses AI to combine Scheimpflug tomography and corneal biomechanical parameters to optimise ectasia detection with good sensitivity and specifity.58 This index has been shown to be even more accurate than Corneal Biomechanical Index.59
Regarding the detection of keratoconus progression, however, the Global Consensus Panel noted that specific quantitative data were lacking and, moreover, would most likely be device specific. Determinants for assessing keratoconus progression have been reviewed by Duncan et al.60 They concluded that this multitude of suggested progression parameters highlights the need for a new or standardised method to document progression.61–65
The use of colour-coded maps for DL provides more complete information with the global status of the cornea, instead of using topographic and/or tomographic numeric indices as done in the past.32 34 35 37 38 40 66–69 Numeric values can easily exemplify corneal shape, but they fail to represent the spatial gradients and distributions of corneal curvature, elevation, refractive power and thickness.
However, is not possible to demine the superiority of the CNN over a single numerical index, as the Kmax, in view of the overall learning process, which required four maps: axial, anterior elevation, posterior elevation and pachymetry. Furthermore, for the evaluation of keratoconus, a single parameter is not sufficient.45
In this study, the images of four colour-coded maps (axial, corneal thickness, front and back elevation) were used for DL, instead of topographic and tomographic numeric indices. The reason of our choice is based on the capability of colour-coded maps to hold a larger amount of corneal information than these numeric values for this learning. The maps were obtained via tomography Scheimpflug imaging which has an advantage over Placido disk-based corneal topography as it is able to evaluate both the anterior and posterior surfaces of the cornea. Evalaution of the posterior corneal surface is essential as both curvature and elevation of the posterior corneal surface have to be considered for the detection of early keratoconus.70–72
A multiplicity of machine-learning techniques such as neural network, support vector machine, decision tree, unsupervised machine learning, custom neural network, feedforward neural network and CNN have been used in previous studies but only in four studies has a combination of colour-coded maps and CNN been used used.27–29 39 The reason for opting for CNN over other machine learning methods in this study was based on the ability of the CNN to directly extract the morphological characteristics from the obtained images without preliminary learning, subsequently providing higher classification precision, especially in the field of image recognition. This study is different from the aforementioned ones as Lavric and Valentin28 did not use real clinical data, Zéboulon et al27 focused on refractive surgery; Kamiya et al39 used a no tomography device (AS-OCT), and Kuo et al29 had a smaller sample size.
Our study has a number of limitations. First, the number of eyes is still modest in nature and there are only a small number of eyes in some specific groups and this may create a classification model bias. A potential strategy for this is to use generative adversarial networks to synthesise images from a small number of real images but such models would require further external validation. Second, other risk factors of keratoconus were not included in the used prediction models and such factors will be useful for further refinement of the prediction performance. Future studies incorporating such risk factors such as family history, atopy or ethnicity,73–75 may improve the overall function of the model.
Moreover, given the multicentre nature of the study a number of technicians were used to obtain the scans but the Scheimpflug tomographer has previously been reported to have good intra and interobserver repeatability in healthy patients76 and those with keratoconus.77 Finally, the adoption of more recent CNN models and tricks (eg, attention and customised loss functions) can potentially further enhance the performance of the model.
In summary, our results demonstrate that AI models provide excellent detection performance for keratoconus and can accurately grade different severities of disease and therefore have the potential to be further developed, validated and adopted for screening and management of keratoconus. Clinical implication of automated detection and screening are of considerable importance in view to their ability to provide diagnosis in shorter time, increasing in this way the patient care. Indeed, it could be deployed particularly in regions with a high burden of disease so that the CNN model may have the potential to provide earlier diagnosis of keratoconus, improve access to treatments such as corneal cross-linking and potentially reduce preventable visual loss. A larger external validation study with another study population including healthy controls is required to confirm this study’s preliminary findings.
Data availability statement
No data are available.
The study received Institutional Review Board approval (A02719) and was conducted accordingly to the ethical standards set in the 1964 Declaration of Helsinki, as revised in 2000.
Contributors VR, YZ and SK: conceptualisation, methodology, investigation, visualisation, review and editing. XC and JZ: analysis, data curation, writing. KCI, DB, DR, AG, CNJM, YZ, M-RS, HM-M and MZ: data collection, writing
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.