Introduction
Keratoconus (KCN) is an eye condition characterised by the thinning and bulging of the cornea, the clear, dome-shaped surface that covers the front of the eye.1 This condition leads to visual distortions, sensitivity to light and other visual problems.2 KCN typically starts during the teenage years or early adulthood and progresses slowly over several years. Although the precise cause of KCN is not fully understood, it is believed to result from a combination of genetic and environmental factors.3 Treatment for KCN may include eyeglasses or contact lenses to improve vision, or more advanced options such as corneal cross-linking, which uses ultraviolet (UV) light and a special solution to strengthen the cornea, or corneal transplant surgery.4 To detect KCN, it is important to have a comprehensive eye exam performed by an eye doctor or ophthalmologist. The exam may involve several tests to evaluate the shape, thickness and curvature of the cornea. Common tests used to detect KCN include the following:
Corneal topography: This test uses a special instrument called a corneal topographer to create a detailed map of the cornea’s surface. It can detect any irregularities or distortions in the cornea’s shape that may indicate KCN.5
Keratometry: This test measures the curvature of the cornea by shining a light on the eye and measuring how it reflects off the cornea.6
Pachymetry: This test measures the thickness of the cornea. In KCN, the cornea is often thinner than normal.7
Visual acuity testing: This test measures the sharpness of vision using an eye chart.8
Beyond these tests, the eye doctor may perform further examinations, such as corneal tomography or corneal biomechanical testing, to confirm the diagnosis and assess the severity of the disease. The corneal topography test creates a three-dimensional representation of the cornea’s surface, providing detailed information about its shape, curvature and thickness, as well as any irregularities or distortions that may be present. The resulting colour-coded map highlights areas of the cornea that are flatter or steeper than normal, which can indicate conditions such as KCN or astigmatism. In particular, the four refractive maps commonly used for KCN detection are the sagittal, corneal thickness (CT), elevation front and elevation back maps.9 However, manually diagnosing KCN from the output maps provided by Pentacam and optical coherence tomography (OCT) devices is a tedious and time-consuming task that requires a high level of expertise. Therefore, automated and accurate methods for KCN diagnosis are needed to improve the efficiency and reliability of the diagnostic process and facilitate early detection and treatment.10
Building a pretrained model for image classification is challenging and requires careful consideration of the architecture and pretraining method to effectively capture the relevant features and patterns in the input images. There is often a trade-off between the complexity of the model and its ability to generalise to new data, so selecting an architecture and pretraining method that balances the two is crucial.11 One issue with feature extraction techniques is that a single extractor may not capture all of the relevant information in the input data, which can limit the performance of downstream tasks. Feature fusion addresses this by combining the outputs of multiple feature extraction techniques or modalities into a single representation that captures complementary information from each source.12
The vision transformer (ViT) is one example of a transformer-based approach to image classification that has shown promising results. The ViT processes an image by dividing it into a sequence of non-overlapping patches and feeding them through a transformer encoder network. The patches are first linearly embedded into a sequence of feature vectors, which the transformer encoder converts into a sequence of context-aware representations. The final representation of the image is obtained by applying a mean pooling operation to this sequence.13 This design allows the ViT to capture global contextual information from the image while preserving the spatial relationships between the patches. In addition, the ViT can be fine-tuned on a specific image classification task by training the network end-to-end with a standard cross-entropy loss, which further optimises its ability to capture the relevant features and patterns in the input images.
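To make this pipeline concrete, the following is a minimal sketch of the ViT forward pass described above, written in PyTorch; the patch size, embedding dimension, encoder depth and class count are illustrative assumptions rather than the configuration used in this study.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    # Illustrative hyperparameters, not the values used in this study
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8, n_classes=4):
        super().__init__()
        # Linear patch embedding implemented as a strided convolution
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                             # x: (B, 3, H, W)
        z = self.embed(x).flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        z = self.encoder(z + self.pos)                # context-aware patch representations
        return self.head(z.mean(dim=1))               # mean pooling over the sequence

logits = MiniViT()(torch.randn(2, 3, 224, 224))       # shape: (2, 4)
```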
The main objective of this study is to develop an image classification system that leverages the strengths of multiple pretrained models and a transformer architecture to achieve state-of-the-art performance. The approach involves extracting features from three pretrained models, combining them using a feature fusion technique, and using the fused features as input to a ViT. The system will be fine-tuned on the task at hand and evaluated on standard image classification benchmarks. The goal is to demonstrate that this approach can significantly improve the accuracy of image classification models and provide insights into the benefits of using multimodel feature fusion and ViTs for image classification tasks.
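As an illustration of the fusion step, the sketch below concatenates the penultimate-layer feature vectors of three ImageNet-pretrained torchvision backbones into a single representation. The choice of VGG16, ResNet50 and DenseNet121 and the use of simple concatenation are assumptions made for illustration, not necessarily the backbones or fusion rule adopted in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

backbones = [
    models.vgg16(weights=models.VGG16_Weights.DEFAULT),
    models.resnet50(weights=models.ResNet50_Weights.DEFAULT),
    models.densenet121(weights=models.DenseNet121_Weights.DEFAULT),
]
# Drop each model's classification head so it outputs a feature vector
backbones[0].classifier = nn.Identity()   # VGG16 -> 25088-d (after flatten)
backbones[1].fc = nn.Identity()           # ResNet50 -> 2048-d
backbones[2].classifier = nn.Identity()   # DenseNet121 -> 1024-d

@torch.no_grad()
def fused_features(x):
    # Concatenate the three feature vectors into one fused representation
    return torch.cat([m.eval()(x) for m in backbones], dim=1)

x = torch.randn(1, 3, 224, 224)
print(fused_features(x).shape)            # torch.Size([1, 28160])
```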
The paper’s structure can be outlined as follows: In section 2, we provide an overview of the related research. Section 3 is divided into three subsections that elaborate on the materials and methods used in the study: a description of the dataset, an explanation of the proposed fusion and ViT approach, and a discussion of transfer learning. In section 4, we present the experimental results. Finally, in section 5, we summarise our conclusions and outline potential future directions for research.
Related work
Several studies have applied transfer learning to classify KCN. These studies14–24 have used different pretrained convolutional neural network (CNN) models such as VGG16, VGG19, InceptionV3, ResNet152, InceptionResNetV2, SqueezeNet, AlexNet, ShuffleNet and MobileNetv2. In Al-Timemy et al,14 the authors introduced a method called ensemble of deep transfer learning (EDTL) to detect KCN, which aggregates the output probabilities of five classifiers and makes the final decision based on the fusion of those probabilities. The proposed method achieved an accuracy of 98.3%. In Al-Timemy et al,15 an EfficientNet-b0-based transfer learning model was proposed for KCN detection, achieving 97.7% accuracy for two classes and 84.4% for three classes. In another study,16 transfer learning applied to the VGG16, InceptionV3 and ResNet152 pretrained models achieved 93.1%, 93.1% and 95.8% accuracy, respectively. In Lavric et al,17 the objective was to reduce diagnostic errors and facilitate treatment using a CNN; the study achieved an accuracy of 99.33% on a dataset of 3000 images based on one map, with two classes being classified. In Kamiya et al,18 a transfer learning approach applied to the ResNet18 model achieved 99.1% accuracy. In Al-Timemy et al,19 a deep learning model using the Xception and InceptionResNetV2 architectures was proposed for the early detection of clinical KCN. By fusing the extracted features, the model was able to effectively detect subclinical forms of KCN, achieving an accuracy range of 97%–100% and an AUC of 0.99 for distinguishing normal eyes from eyes with subclinical and established KCN. In addition, the model was validated on an independent dataset, obtaining AUCs ranging from 0.91 to 0.92 and an accuracy range of 88%–92%. In Xie et al,20 the authors achieved an accuracy of 94.7% using the InceptionResNetV2 method on a dataset of images based on four maps, with five classes being classified. In Aatila et al,21 the authors applied transfer learning with six pretrained CNN models (VGG16, InceptionV3, Xception, DenseNet201, MobileNet and EfficientNetB0) and fine-tuned them on a dataset of 2924 topographic images with three classes; DenseNet201 (99.31%) and VGG16 (98.51%) achieved the highest accuracies. In Otuna-Hernández et al,22 a CNN model with corneal profile, dioptre, pachymetry and ART as inputs was proposed to diagnose KCN. The results showed sensitivities/specificities of 83.3%/92.59% for the normal category, 80.64%/93.75% for the initial category, 92.59%/96.15% for the medium category and 92.59%/98.68% for the severe category. In Fassbind et al,23 CNNs were used to predict the most common corneal diseases from a dataset of 1940 cornea scans, achieving 95.45% accuracy for two classes and 93.52% for five classes. In Elsawy et al,24 the objective was the early detection of corneal diseases; the study used three models, AlexNet, VGG16 and VGG19, and achieved high accuracy rates of 99.12%, 99.96% and 99.93%, respectively, on a dataset of 413 eyes, with four classes being classified based on four maps.
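Most of the studies above follow the same transfer-learning recipe: take an ImageNet-pretrained backbone and retrain a replaced classification head on corneal maps. A minimal sketch of this recipe using torchvision’s EfficientNet-b0 (one of the backbones cited above) is given below; freezing the entire feature extractor and using three output classes are illustrative choices, not details taken from any of the cited papers.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)

# Freeze the pretrained feature extractor (an illustrative choice;
# some studies instead fine-tune all layers)
for p in model.parameters():
    p.requires_grad = False

# EfficientNet-b0's classifier is Sequential(Dropout, Linear(1280, 1000));
# replace the final layer with a head for the KCN classes (here, three)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 3)
```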
Table 1 summarises these studies, which used datasets with varying numbers of images and classes and reported different levels of accuracy. This study presents an approach to improve image classification performance by combining features extracted from three pretrained models using a feature fusion technique. The fused features are then used as input to a ViT, which is fine-tuned on the task at hand, as sketched below. This approach leverages the strengths of multiple pretrained models and a powerful transformer architecture to achieve state-of-the-art performance on image classification tasks.
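The fine-tuning itself is a standard supervised loop with a cross-entropy loss. In the hedged sketch below, a linear classifier over a fused 28160-dimensional feature vector (matching the fusion sketch earlier) and a synthetic batch stand in for the actual fused-feature ViT and corneal-map dataset.

```python
import torch
import torch.nn as nn

# Stand-ins: a linear head over the fused 28160-d vector and synthetic
# data in place of the fused-feature ViT and real corneal maps
model = nn.Linear(28160, 3)
x = torch.randn(32, 28160)
y = torch.randint(0, 3, (32,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):                 # a few illustrative optimisation steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)      # standard cross-entropy loss
    loss.backward()
    optimizer.step()
```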
The main contributions of this paper include:
The study offers a novel approach to image classification that combines the strengths of multiple pretrained models and a transformer architecture. This approach contributes to the field of computer vision by providing a new way to improve the accuracy of image classification models.
By leveraging the features extracted from multiple pretrained models and fusing them using a feature fusion technique, the study is able to capture a more comprehensive representation of the input image, leading to improved accuracy. This contribution demonstrates the benefits of multimodel feature fusion for image classification tasks.
By using a transformer architecture, the study is able to effectively model the complex relationships between the input features, allowing for more fine-grained classification decisions. This contribution highlights the potential benefits of using ViTs for image classification tasks.
The study provides insights into the benefits of multimodel feature fusion and ViTs for image classification tasks through extensive experiments on standard image classification benchmarks. This contribution demonstrates the superior performance of the approach compared with state-of-the-art baseline models, further contributing to the field of computer vision.