Original Research

Keratoconus disease classification with multimodel fusion and vision transformer: a pretrained model approach

Abstract

Objective Our objective is to develop a novel keratoconus image classification system that leverages multiple pretrained models and a transformer architecture to achieve state-of-the-art performance in detecting keratoconus.

Methods and analysis Three pretrained models were used to extract features from the input images. These models have been trained on large datasets and have demonstrated strong performance in various computer vision tasks.

The extracted features from the three pretrained models were fused using a feature fusion technique. This fusion aimed to combine the strengths of each model and capture a more comprehensive representation of the input images. The fused features were then used as input to a vision transformer, a powerful architecture that has shown excellent performance in image classification tasks. The vision transformer learnt to classify the input images as either indicative of keratoconus or not.

The proposed method was applied to the Shahroud Cohort Eye collection and keratoconus detection dataset. The performance of the model was evaluated using standard evaluation metrics such as accuracy, precision, recall and F1 score.

Results The proposed model achieved higher accuracy than each of the pretrained models used individually, reaching 96.06% on the Shahroud cohort dataset and 96.20% on a publicly available keratoconus dataset.

Conclusion The findings of this study suggest that the proposed approach can significantly improve the accuracy of image classification models for keratoconus detection. This approach can serve as an effective decision support system alongside physicians, aiding in the diagnosis of keratoconus and potentially reducing the need for invasive procedures such as corneal transplantation in severe cases.

What is already known on this topic

  • Detecting keratoconus is important because it is a progressive eye disease that can lead to significant visual impairment and may require corneal transplantation in severe cases. This study aimed to develop a novel keratoconus image classification system by leveraging the strengths of multiple pretrained models and a transformer architecture to achieve state-of-the-art performance.

What this study adds

  • To enhance the efficacy of keratoconus detection, the authors propose a novel keratoconus classification algorithm that integrates pretrained InceptionResNetV2, VGG16 and EfficientNetB0 as feature extractors. Additionally, a vision transformer architecture is employed to further enhance classification ability.

How this study might affect research, practice or policy

  • Our designed and implemented models are effective in diagnosing keratoconus and can serve as a decision support system alongside physicians to aid in disease diagnosis.

Introduction

Keratoconus (KCN) is an eye condition characterised by the thinning and bulging of the cornea, which is the clear, dome-shaped surface that covers the front of the eye.1 This condition leads to visual distortions, sensitivity to light and other visual problems.2 KCN typically starts during the teenage years or early adulthood and progresses slowly over a period of several years. Although the precise cause of KCN is not fully understood, it is believed to result from a combination of genetic and environmental factors.3 Treatment for KCN may include eyeglasses or contact lenses to improve vision, or more advanced treatments such as corneal cross-linking, which uses ultraviolet (UV) light and a special solution to strengthen the cornea, or corneal transplant surgery.4 To detect KCN, it is important to have a comprehensive eye exam performed by an eye doctor or ophthalmologist. The exam may involve several tests to evaluate the shape, thickness and curvature of the cornea. Here are some common tests that are used to detect KCN:

  • Corneal topography: This test uses a special instrument called a corneal topographer to create a detailed map of the cornea’s surface. It can detect any irregularities or distortions in the cornea’s shape that may indicate KCN.5

  • Keratometry: This test measures the curvature of the cornea by shining a light on the eye and measuring how it reflects off the cornea.6

  • Pachymetry: This test measures the thickness of the cornea. In KCN, the cornea is often thinner than normal.7

  • Visual acuity testing: This test measures the sharpness of vision using an eye chart.8

In addition to these tests, the eye doctor may also perform additional tests, such as corneal tomography or corneal biomechanical testing, to confirm the diagnosis and assess the severity of the condition. The corneal topography test creates a three-dimensional representation of the cornea’s surface, providing detailed information about its shape, curvature and thickness, as well as any irregularities or distortions that may be present. The resulting colour-coded map highlights areas of the cornea that are flatter or steeper than normal, which can indicate conditions such as KCN or astigmatism. In particular, the four refractive maps commonly used for KCN detection are the sagittal, corneal thickness (CT), elevation front and elevation back maps.9 However, manually diagnosing KCN from the output maps provided by Pentacam and optical coherence tomography (OCT) devices is a tedious and time-consuming task that requires a high level of expertise. Therefore, automated and accurate methods for KCN diagnosis are needed to improve the efficiency and reliability of the diagnostic process and facilitate early detection and treatment.10

Building a pretrained model for image classification is challenging and requires careful consideration of the architecture and pretraining method to effectively capture the relevant features and patterns in the input images. There is often a trade-off between the complexity of the model and its ability to generalise to new data, so selecting a suitable architecture and pretraining method that balances both is crucial.11 One issue with relying on a single feature extraction technique is that it may not capture all of the relevant information in the input data, which can limit the performance of downstream tasks. Feature fusion can therefore be a useful technique for improving downstream performance by combining the outputs of multiple feature extraction techniques or modalities into a single representation that captures complementary information from each source.12

The vision transformer (ViT) is one example of a transformer-based approach for image classification that has shown promising results. The ViT processes images by dividing them into a sequence of non-overlapping patches and feeding them through a transformer encoder network. The patches are first linearly embedded into a sequence of feature vectors, which are then fed into the transformer encoder to produce a sequence of context-aware representations. The final representation of the image is obtained by applying a mean pooling operation to this sequence.13 This approach allows the ViT to capture global contextual information from the image while preserving spatial relationships between the patches. Additionally, the ViT can be fine-tuned on a specific image classification task by training the network end-to-end with a standard cross-entropy loss function; fine-tuning in this way improves its performance on the specific task and further optimises its ability to capture the relevant features and patterns in the input images.
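As a concrete illustration of this patch-based pipeline, the following minimal TensorFlow sketch (not the authors' implementation; the 16×16 patch size and 64-dimensional embedding are arbitrary choices) splits an image into non-overlapping patches, linearly embeds them, adds a position embedding and mean-pools the token sequence into a single image representation. The transformer encoder blocks themselves are omitted for brevity.

```python
import tensorflow as tf

# Illustrative only: split an image into non-overlapping patches, embed them,
# and mean-pool the (encoder-ready) token sequence into a single representation.
patch_size = 16
embed_dim = 64

images = tf.random.uniform((1, 224, 224, 3))          # dummy batch of one image

# Extract 16x16 patches -> (1, 14, 14, 16*16*3), then flatten to a sequence
patches = tf.image.extract_patches(
    images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)
seq = tf.reshape(patches, (1, -1, patch_size * patch_size * 3))  # (1, 196, 768)

# Linear patch embedding plus a position embedding
tokens = tf.keras.layers.Dense(embed_dim)(seq)
positions = tf.range(start=0, limit=tokens.shape[1], delta=1)
tokens = tokens + tf.keras.layers.Embedding(tokens.shape[1], embed_dim)(positions)

# ...transformer encoder blocks would process `tokens` here...

# Mean pooling over the patch sequence gives the final image representation
image_repr = tf.reduce_mean(tokens, axis=1)            # (1, embed_dim)
```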

The main objective of this study is to develop an image classification system that leverages the strengths of multiple pretrained models and a transformer architecture to achieve state-of-the-art performance. The approach involves extracting features from three pretrained models, combining them using a feature fusion technique, and using the fused features as input to a ViT. The system will be fine-tuned on the task at hand and evaluated on standard image classification benchmarks. The goal is to demonstrate that this approach can significantly improve the accuracy of image classification models and provide insights into the benefits of using multimodel feature fusion and ViTs for image classification tasks.

The paper’s structure can be outlined as follows: In section 2, we provide an overview of the related research. Section 3 is divided into three subsections that elaborate on the materials and methods used in the study. These include a description of the dataset, an explanation of the proposed fusion and ViT approach, and a discussion of transformer learning. In section 4, we present the experimental results. Finally, in section 5, we summarise our conclusions and present potential future directions for research.

Related work

Some research has been done based on transfer learning for classifying KCN. These studies14–24 have used different pretrained convolutional neural network (CNN) models such as VGG16, VGG19, InceptionV3, ResNet152, InceptionResNetV2, SqueezeNet, AlexNet, ShuffleNet and MobileNetV2. In Al-Timemy et al,14 the authors introduced a method called ensemble of deep transfer learning (EDTL) to detect KCN. They suggested a PI classifier and an EDTL technique that aggregates the output probabilities of five classifiers to make a decision based on the fusion of probabilities; the proposed method resulted in an accuracy of 98.3%. In Al-Timemy et al,15 an EfficientNet-B0-based transfer learning model was proposed for KCN detection, achieving 97.7% accuracy for two classes and 84.4% for three classes. In another study,16 a transfer learning approach applied to the VGG16, InceptionV3 and ResNet152 pretrained models achieved 93.1%, 93.1% and 95.8% accuracy, respectively. In Lavric et al,17 the objective was to reduce diagnostic errors and facilitate treatment using a CNN method; the study achieved an accuracy of 99.33% on a dataset of 3000 images based on 1 map, with 2 classes being classified. In Kamiya et al,18 a transfer learning approach was applied to the ResNet18 model and achieved 99.1% accuracy. In Al-Timemy et al,19 a deep learning model using Xception and InceptionResNetV2 architectures was proposed for the early detection of clinical KCN. By fusing the extracted features, the model was able to effectively detect subclinical forms of KCN with high accuracy, achieving an accuracy range of 97%–100% and an area under the curve (AUC) of 0.99 for distinguishing normal eyes from eyes with subclinical and established KCN. In addition, the model was validated using an independent dataset, obtaining AUCs ranging from 0.91 to 0.92 and an accuracy range of 88%–92%. In Xie et al,20 the authors achieved an accuracy of 94.7% using the InceptionResNetV2 method; the dataset used in the study consisted of images and four maps, with five classes being classified. In Aatila et al,21 the authors applied transfer learning using six pretrained CNN models (VGG16, InceptionV3, Xception, DenseNet201, MobileNet and EfficientNetB0) and fine-tuned them on a dataset of 2924 topographic images with three classes; VGG16 and DenseNet201 achieved the highest accuracies of 98.51% and 99.31%, respectively. In Otuna-Hernández et al,22 a CNN model with corneal profile, dioptre, pachymetry and ART as inputs was used to diagnose KCN. The results showed a sensitivity of 83.3% and specificity of 92.59% for the normal category, sensitivity of 80.64% and specificity of 93.75% for the initial category, sensitivity of 92.59% and specificity of 96.15% for the medium category, and sensitivity of 92.59% and specificity of 98.68% for the severe category. In Fassbind et al,23 the objective was to predict the most common corneal diseases using CNNs; the model was evaluated on a dataset of 1940 cornea scans and achieved 95.45% accuracy for two classes and 93.52% accuracy for five classes. In Elsawy et al,24 the objective was the early detection of corneal diseases. The study used three methods, AlexNet, VGG16 and VGG19, and achieved high accuracy rates of 99.12%, 99.96% and 99.93%, respectively, on a dataset of 413 eyes, with 4 classes being classified based on 4 maps.

Table 1 summarises these previous studies, which used datasets with varying numbers of images and classes and reported different levels of accuracy. This study presents an approach to improve image classification performance by combining features extracted from three pretrained models using a feature fusion technique. The fused features are then used as input to a ViT, which is fine-tuned on the task at hand. This approach leverages the strengths of multiple pretrained models and a powerful transformer architecture to achieve state-of-the-art performance on image classification tasks.

Table 1
The summary of the previous works related to keratoconus diagnosis

The main contributions of this paper include:

  1. The study offers a novel approach to image classification that combines the strengths of multiple pretrained models and a transformer architecture. This approach contributes to the field of computer vision by providing a new way to improve the accuracy of image classification models.

  2. By leveraging the features extracted from multiple pretrained models and fusing them using a feature fusion technique, the study is able to capture a more comprehensive representation of the input image, leading to improved accuracy. This contribution demonstrates the benefits of multimodel feature fusion for image classification tasks.

  3. By using a transformer architecture, the project is able to effectively model the complex relationships between the input features, allowing for more fine-grained classification decisions. This contribution highlights the potential benefits of using ViTs for image classification tasks.

  4. The study provides insights into the benefits of multimodel feature fusion and ViTs for image classification tasks through extensive experiments on standard image classification benchmarks. This contribution demonstrates the superior performance of the approach compared with state-of-the-art baseline models, further contributing to the field of computer vision.

Materials and methods

Data collection

The data used in this study were obtained from the Shahroud Cohort Eye dataset,25 which comprises a total of 92 images available for both eyes. Out of these images, 83 are sourced from healthy individuals, while the remaining 9 are from individuals afflicted with KCN, a corneal disease. In this study, the CT map was used for diagnosing the disease. Owing to the limited volume of data available, data augmentation techniques26 were employed to augment the dataset.

As per the findings of a previous study,14 the diagnosis of KCN can be influenced by the scaling and translation of topographic maps. It is therefore crucial to consider the potential effects of scaling, translation and rotation when analysing CT maps for the diagnosis of KCN, and preprocessing techniques must be selected carefully to preserve the accuracy and reliability of the diagnostic process. In consultation with a physician, rotation was excluded due to its impact on the skew angle, whereas vertical flipping around the y-axis was deemed appropriate for data augmentation. These choices, made in alignment with the physician’s opinion, significantly contribute to improving the precision and dependability of KCN diagnosis using CT maps. The justification for excluding rotation and using vertical flipping around the y-axis for CT maps is as follows:

  • Rotation angle variation: excluding rotation ensures a constant rotation angle, minimising unnecessary variations in CT maps.

  • Simplified analysis: excluding rotation simplifies the analysis process, allowing for a more focused examination of other crucial factors in KCN diagnosis.

  • Avoidance of unnecessary changes: excluding rotation prevents the introduction of additional complexity and variations to CT maps, thereby preserving the integrity of the data and enhancing the performance of the diagnostic algorithm.
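The augmentation strategy described above can be sketched as follows. This is an illustrative interpretation rather than the authors' code: it reads "vertical flipping around the y-axis" as mirroring the CT map about its vertical axis (a left-right flip), with rotation deliberately excluded, and `train_ds` is an assumed placeholder for a tf.data pipeline of (image, label) pairs.

```python
import tensorflow as tf

# Flip each CT map around the vertical (y) axis; rotation is intentionally not applied.
def augment_ct_map(image, label):
    flipped = tf.image.flip_left_right(image)  # mirror about the vertical axis
    return flipped, label

# Append the flipped copies to the original training set
# (`train_ds` is assumed to be a tf.data.Dataset of (image, label) pairs).
def augment_dataset(train_ds):
    flipped_ds = train_ds.map(augment_ct_map)
    return train_ds.concatenate(flipped_ds)
```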

Proposed method

This research paper presents a novel approach for classifying KCN disease using a hybrid model that combines transfer learning models and ViTs. The proposed method consists of six stages, which are illustrated in figure 1.

Figure 1

The structure of the proposed model. The figure illustrates the architecture of the proposed keratoconus image classification system. The model consists of three main components: feature extraction, feature fusion and a vision transformer. In the feature extraction step, three pretrained models are employed to extract features from the input images. These models have been trained on large datasets and have shown strong performance in various computer vision tasks. The extracted features from the three pretrained models are then fused using a feature fusion technique. This fusion aims to combine the strengths of each model and capture a more comprehensive representation of the input images. The fused features are used as the input to a vision transformer, which is a powerful architecture known for its excellent performance in image classification tasks. The vision transformer learns to classify the input images as indicative of keratoconus or not. The figure highlights the flow of information through the proposed model, demonstrating how the extracted features are fused and used in the vision transformer for accurate keratoconus detection.

  1. Load the dataset into the system: This is the initial stage where the dataset is loaded into the system.

  2. In this stage, the dataset was divided into two subsets, namely a training set and a testing set, with an 80:20 ratio. To improve the performance of the network, we also considered different split percentages between the pretrained networks and the ViT network. The purpose of this step is to use one subset for training the model and the other for testing the model’s performance. This ensures that the model is evaluated on data that it has not seen during training, providing a more accurate assessment of its generalisation ability.

  3. The study used various data augmentation techniques to enhance the available dataset and addressed the issue of imbalanced data by implementing oversampling. However, rotation was excluded, and only vertical flipping around the y-axis was considered appropriate for data augmentation. These findings, in alignment with the physician’s opinion, significantly contribute to improving the precision and dependability of KCN diagnosis using CT maps.

  4. To extract features from the images in the training set, we loaded seven pretrained models that have been trained on large image datasets such as ImageNet, including InceptionResNetV2,27 VGG16,28 EfficientNetB0,29 ResNet50,30 InceptionV3,31 DenseNet20132 and MobileNet.33 However, for our study, we evaluated the performance of each model individually, ultimately selecting the top three models with the highest accuracy for feature extraction. Specifically, we selected InceptionResNetV2, VGG16 and EfficientNetB0, as they consistently outperformed the other models on our dataset.

A: InceptionResNetV2

This model combines two well-known architectures, ‘Inception’ and ‘ResNet’. The Inception architecture is designed to improve the efficiency of feature extraction from images and optimise the utilisation of computational resources. The ResNet architecture, on the other hand, addresses the issue of vanishing gradients in neural networks through the use of skip connections. The selection of InceptionResNetV2 may stem from the combination of these two architectures and their unique abilities to extract informative and interpretable features. InceptionResNetV2 is a transfer learning model that consists of 164 layers, including residual blocks, Inception modules and a global average pooling layer. The default input size for InceptionResNetV2 is 299×299×3 pixels. The final layer of the InceptionResNetV2 model produces a tensor with a size of 8×8×2048.

B: VGG16

This model is one of the most renowned architectures for image feature extraction. It comprises multiple convolutional layers and fully connected layers, enabling the extraction of complex and hierarchical features from images. The choice of VGG16 may be attributed to its capability to extract deep and detailed features required for KCN detection in images. VGG16 is another transfer learning model that consists of 16 layers, including convolutional and pooling layers. When using the VGG16 model with an input size of 224×224×3, the output size is 7×7×512. However, to obtain an 8×8 output from a 299×299 input image, the image must first be resized to 256×256×3, which yields an output of 8×8×512.

C: EfficientNetB0

This model belongs to the EfficientNet family of architectures, which leverage advanced techniques such as depth scaling, width scaling and resolution scaling to optimise the trade-off between accuracy and model complexity. EfficientNetB0 has demonstrated high efficiency and excellent performance in computer vision tasks, making it a viable choice. EfficientNet-B0 is a scalable transfer learning model that consists of convolutional layers, depthwise-separable convolutional layers, and a global average pooling layer. The default input size for EfficientNet-B0 is 224×224×3 pixels, which produces a 7×7 output feature map. However, to obtain an 8×8 output from a 299×299 input image, the image must first be resized to 256×256×3, which yields an output of 8×8×1280.

Therefore, by resizing the input images appropriately (299×299×3 for InceptionResNetV2 and 256×256×3 for VGG16 and EfficientNet-B0) and fine-tuning the models, we obtain an output of 8×8×Y for all models, where Y is 2048 for InceptionResNetV2, 512 for VGG16 and 1280 for EfficientNet-B0.
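A minimal sketch of this feature-extraction step, assuming the standard Keras `applications` backbones with their classification heads removed, is given below; the preprocessing calls and input sizes follow the resizing rules described above, and the helper name `extract_features` is illustrative rather than the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import applications

# Load each backbone without its classification head so that it outputs a
# spatial feature map (8x8 under the input sizes used here).
irv2 = applications.InceptionResNetV2(include_top=False, weights="imagenet",
                                      input_shape=(299, 299, 3))
vgg16 = applications.VGG16(include_top=False, weights="imagenet",
                           input_shape=(256, 256, 3))
effb0 = applications.EfficientNetB0(include_top=False, weights="imagenet",
                                    input_shape=(256, 256, 3))

def extract_features(images_299):
    """images_299: batch of images already resized to 299x299x3."""
    f1 = irv2(applications.inception_resnet_v2.preprocess_input(images_299))
    images_256 = tf.image.resize(images_299, (256, 256))
    f2 = vgg16(applications.vgg16.preprocess_input(images_256))
    f3 = effb0(applications.efficientnet.preprocess_input(images_256))
    return f1, f2, f3   # each map is 8x8 spatially; channel counts differ per backbone
```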

To implement a feature fusion technique34 called concatenation, we can extract features from three pretrained models and concatenate them into a single feature vector for each image in the training set. Let’s assume that we have extracted feature vectors F1, F2 and F3 from the three pretrained models. We can concatenate these feature vectors into a single vector, which can be represented as follows:

Concatenated feature vector: F_concat = (F1, F2, F3).

This concatenated feature vector can be used as input to a classification model, training a new model that classifies images based on the combined features extracted from the three pretrained models. The resulting concatenated feature map has a shape of (H, W, C1+C2+C3), where H and W are the height and width of the feature map, and C1, C2 and C3 are the numbers of channels in the feature maps from the three pretrained models (InceptionResNetV2, VGG16 and EfficientNetB0). The output feature tensors of the three models are concatenated along the channel dimension (axis=3) to obtain a combined feature tensor of size 8×8×3840.
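The channel-wise concatenation itself is a single operation; a short sketch, reusing the hypothetical `extract_features` helper from the previous example, could look like this:

```python
import tensorflow as tf

# Concatenate the three backbone feature maps along the channel axis.
# f1, f2 and f3 are the 8x8 feature maps returned by extract_features above.
def fuse_features(f1, f2, f3):
    fused = tf.concat([f1, f2, f3], axis=-1)   # shape: (batch, 8, 8, C1 + C2 + C3)
    return fused
```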

This stage involves using the extracted features as input to a ViT.

A: vision transformer

The ViT is a transformer-based architecture for image analysis introduced by Google in 2020.13 The ViT applies a transformer model, similar to those used for natural language processing, to images. The input to the ViT is a sequence of image patches, which are passed through a series of transformer layers to learn global and local features. The output of the final transformer layer is fed into a multilayer perceptron (MLP) to produce a classification result.

B: feature fusion for ViT input

The proposed model for deep feature extraction and classification of keratoconus images comprises three primary stages, namely feature fusion, deep feature extraction using the ViT model, and classification using an MLP head. In the feature extraction stage, the pretrained models InceptionResNetV2, VGG16 and EfficientNet-B0 are employed to extract initial features from KCN images. The concatenated feature vector F_concat is then input into the ViT model for further processing; the ViT performs patch embedding, position embedding and transformer encoding, followed by a classification head. In the deep feature extraction stage of the ViT model, the initial features are partitioned into patches and flattened into 1D vectors: the 8×8 feature map is divided into four 4×4 patches, which are fed into the ViT. In the classification stage, the deep feature vectors obtained from the transformer encoding step are input to an MLP head for classification. The MLP head comprises two hidden layers with 256 and 64 neurons, respectively, each with a ReLU activation function, and an output layer with 1 neuron for the binary keratoconus/normal decision. The MLP head outputs the predicted class probability for the input image. The implementation was done using the TensorFlow, einops, imblearn and seaborn libraries in Python on Google Colab. We use the rearrange method from the einops library to split the dimensions of the input data into smaller dimensions and bring them into the form of patches for the ViT.
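A compact sketch of this fused-feature ViT stage is shown below. It uses the hyperparameters reported in the implementation details (patch size 4, projection dimension 64, key dimension 16, 4 attention heads, 8 encoder layers and a 256-64-1 MLP head) together with the einops `rearrange` call mentioned above, but the exact layer wiring is an assumption rather than the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers
from einops import rearrange

PATCH, PROJ_DIM, KEY_DIM, HEADS, LAYERS = 4, 64, 16, 4, 8


class AddPositionEmbedding(layers.Layer):
    """Learnable position embedding added to the patch tokens."""
    def build(self, input_shape):
        self.pos = self.add_weight(name="pos_embed",
                                   shape=(1, input_shape[1], input_shape[2]),
                                   initializer="random_normal", trainable=True)

    def call(self, tokens):
        return tokens + self.pos


def build_vit_classifier(fused_channels):
    inputs = tf.keras.Input(shape=(8, 8, fused_channels))   # fused feature map

    # Split the 8x8 map into four non-overlapping 4x4 patches and flatten each
    patches = layers.Lambda(
        lambda t: rearrange(t, "b (h p1) (w p2) c -> b (h w) (p1 p2 c)",
                            p1=PATCH, p2=PATCH))(inputs)

    # Patch embedding followed by a learnable position embedding
    x = layers.Dense(PROJ_DIM)(patches)
    x = AddPositionEmbedding()(x)

    # Transformer encoder blocks
    for _ in range(LAYERS):
        attn_in = layers.LayerNormalization()(x)
        attn = layers.MultiHeadAttention(num_heads=HEADS, key_dim=KEY_DIM)(attn_in, attn_in)
        x = layers.Add()([x, attn])
        mlp_in = layers.LayerNormalization()(x)
        mlp = layers.Dense(PROJ_DIM * 2, activation="gelu")(mlp_in)
        mlp = layers.Dense(PROJ_DIM)(mlp)
        x = layers.Add()([x, mlp])

    # MLP head: 256 -> 64 -> 1 (sigmoid for the binary KCN/normal decision)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

Calling `build_vit_classifier(3840)` would construct the classifier for the 8×8×3840 fused tensor described earlier.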

In the implemented model, the following functions and hyperparameters were used (a training-configuration sketch follows the list):

  1. Image augmentation: Vertical flipping was applied during data augmentation.

  2. Oversampling technique was employed to balance the class distribution.

  3. The dataset was divided into train, validation and test sets. The train and validation sets were allocated 80% of the data, while the test set received 20% for the proposed model. The stages of training the pretrained neural networks and the ViT neural network involve using different data partitions.

  4. During training, the following parameters were set: batch size: 32; the EarlyStopping method was used to stop training based on a specified metric; optimiser: Adam; learning rate: 0.001.

  5. Feature fusion was performed using the concatenation method.

  6. The MLP head consisted of two hidden layers with the following sizes: first hidden layer: 256 units, second hidden layer: 64 units, activation function: rectified linear unit.

  7. The ViT model was used with the following specifications: patch size: 4, projected dimension: 64, key dimension: 16, number of attention heads: 4, number of layers: 8. These configurations were employed to train the model and achieve the desired results.
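A training-configuration sketch consistent with the list above is given below; the oversampler choice (`RandomOverSampler` from imblearn), the early-stopping patience and the epoch budget are assumptions, and `fused_features` is assumed to be a NumPy array of fused 8×8 feature maps with matching binary `labels`.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

def train(model, fused_features, labels):
    n, h, w, c = fused_features.shape
    flat = fused_features.reshape(n, -1)

    # 80/20 split, then oversample the minority (KCN) class on the training portion
    X_train, X_test, y_train, y_test = train_test_split(
        flat, labels, test_size=0.2, stratify=labels, random_state=42)
    X_train, y_train = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

    X_train = X_train.reshape(-1, h, w, c)
    X_test = X_test.reshape(-1, h, w, c)

    # Adam optimiser with learning rate 0.001, batch size 32, early stopping
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                  restore_best_weights=True)
    model.fit(X_train, y_train, validation_split=0.2, batch_size=32,
              epochs=100, callbacks=[early_stop])
    return model.evaluate(X_test, y_test)
```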

Results and discussion

To validate the effectiveness of the proposed approach, we used a k-fold cross-validation scheme. The dataset is divided into k parts; for each experiment, one part is used for testing while the remaining (k−1) parts are used for training, and this process is repeated k times. In this study, we performed fivefold cross-validation. We used seven pretrained models, but ultimately selected the three models with the highest accuracy for feature extraction, namely VGG16, InceptionResNetV2 and EfficientNet-B0. We evaluated the performance of each model individually and in various combinations and found that these three models consistently outperformed the others on our dataset. We then used a feature fusion technique to combine the extracted features from these three models and fed the fused features into a ViT. This approach allowed us to effectively leverage the strengths of the pretrained models with the highest accuracy, resulting in improved performance compared with using a single pretrained model.
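The fold bookkeeping for this scheme can be sketched as follows; a simple logistic-regression classifier stands in for the full fusion-plus-ViT pipeline so that the example is self-contained, whereas in the study each fold trains and tests the proposed model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(features, labels, k=5):
    """features: 2D array of per-image feature vectors; labels: binary array."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    fold_acc = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(features, labels), start=1):
        clf = LogisticRegression(max_iter=1000)        # stand-in classifier
        clf.fit(features[train_idx], labels[train_idx])
        acc = accuracy_score(labels[test_idx], clf.predict(features[test_idx]))
        fold_acc.append(acc)
        print(f"fold {fold}: accuracy = {acc:.4f}")
    return float(np.mean(fold_acc))                    # mean accuracy over the k folds
```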

To evaluate35 the performance of a binary classification model for detecting KCN disease, we can use the following classification metrics (a short computation sketch follows the list):

  • Accuracy: It measures the proportion of correct predictions made by the model. It is calculated as (TP+TN)/(TP+TN + FP+FN), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.

  • Precision: It measures how many of the predicted positive cases are actually positive. It is calculated as TP/(TP+FP).

  • Recall: It measures how many of the actual positive cases are correctly predicted as positive. It is calculated as TP/(TP+FN).

  • F1-score: It is the harmonic mean of precision and recall. It is calculated as 2×precision×recall/(precision+recall).
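The sketch below computes the four metrics from example true and predicted labels using scikit-learn; the label vectors are illustrative only.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Example labels: 1 denotes KCN, 0 denotes normal
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1-score :", f1_score(y_true, y_pred))          # 2 * precision * recall / (precision + recall)
```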

These metrics can be calculated using the true labels and predicted labels from the classification model. In the context of KCN disease classification, the true labels represent the actual disease status of the patients, and the predicted labels represent the disease status predicted by the model. We conducted experiments using three different pretrained models to classify KCN in our dataset; these models used only the initial features mentioned previously as part of the feature extraction process. To assess the effectiveness of using the ViT for classification, we also replaced the ViT with an MLP. In figure 2, we plot the accuracy per epoch and loss per epoch; as evident from the figure, the model did not suffer from overfitting across epochs.

Figure 2

(A) Loss and accuracy per epoch for MLP-based classification. (B) Loss and accuracy per epoch for ViT-based classification. The figure depicts the performance of two different classification models over multiple epochs: (A) presents the loss and accuracy values for the MLP-based classification model, and (B) presents the loss and accuracy per epoch for the ViT-based classification model. MLP, multilayer perceptron; ViT, vision transformer.

The proposed image classification system leverages the strengths of multiple pretrained models, including EfficientNet-B0, InceptionResNetV2 and VGG16, and a transformer architecture to achieve state-of-the-art performance. The approach involves extracting features from these pretrained models, combining them using a feature fusion technique and using the fused features as input to a ViT. To compare the performance of the proposed model with the pretrained models, we computed the accuracy, precision, recall and F1-score measures for each model on the Shahroud cohort dataset.

The results show that the proposed model outperforms the pretrained models in terms of accuracy, precision, recall and F1-score. Specifically, the proposed model achieves an accuracy of 96.06%, a precision of 85.54%, a recall of 90.65% and an F1-score of 88.21%, while the pretrained EfficientNet-B0 achieves an average accuracy of 85.72%, an average precision of 82.88%, an average recall of 84.36% and an average F1-score of 83.54%. The pretrained InceptionResNetV2 achieves an average accuracy of 88.60%, an average precision of 82.58%, an average recall of 82.34% and an average F1-score of 82.46%, and the pretrained VGG16 achieves an average accuracy of 84.28%, an average precision of 69.24%, an average recall of 78.25% and an average F1-score of 73.46%. Therefore, our results suggest that leveraging the strengths of multiple pretrained models and a transformer architecture through the proposed approach can significantly improve the accuracy of image classification models compared with using a single pretrained model.

An evaluation was also conducted to assess the effectiveness of the proposed model on an additional dataset15 (https://drive.google.com/drive/folders/1GR9Tp7GWGY_0nI5sm8GdJ4V6qlV4vZ2?usp=sharing). Our model was evaluated using this publicly available dataset, which consists of seven Pentacam maps. Among these maps, we considered only the CT map as the input to the model, and the model’s output was examined for two classes: normal and diseased. We also trained our model with different pretrained backbones; on this dataset, the highest accuracy was achieved by the MobileNet, DenseNet201 and ResNet50 models. The proposed model was therefore fine-tuned on these three networks, and the extracted features were combined and fed into a ViT for classification. To improve the results, the learning rate was adjusted to 0.0001. The results of the proposed model on the public dataset are presented in figure 4. The results obtained using the pretrained networks, the ViT and k-fold cross-validation on the public dataset are as follows: an accuracy of 96.20%, a precision of 94.04%, a recall of 93.43% and an F1-score of 93.74%.

Figure 4

Comparing the efficiency of the proposed model with other models: results for a fivefold cross-validation methodology on the publicly available dataset. (A) Accuracy per fold, (B) precision per fold, (C) recall per fold and (D) F1-score per fold. The figure showcases the comparison of the proposed model’s efficiency with other models using a fivefold cross-validation methodology on a publicly available dataset: (A) presents the accuracy values per fold, providing insight into the overall classification performance of the model; (B) presents the precision values per fold, indicating the model’s ability to correctly identify positive instances; (C) presents the recall values per fold, representing the model’s capability to accurately capture positive instances; and (D) presents the F1-score values per fold, which assesses the model’s balance between precision and recall.

The results in figures 3 and 4 indicate that the input dataset plays a crucial role in selecting the pretrained models for feature extraction. To achieve better generalisation, it is important to choose an appropriate pretrained model for feature extraction and then feed the extracted features into a ViT for classification. Figure 3 demonstrates that the proposed model outperforms the pretrained models, such as EfficientNet-B0, InceptionResNetV2 and VGG16, in terms of accuracy, precision, recall and F1-score. However, in figure 4, it is shown that the effectiveness of the proposed model greatly depends on the choice of the input dataset and the pretrained models used for feature extraction. By fine-tuning the proposed model on different pretrained models, such as MobileNet, DenseNet201 and ResNet50, and combining the extracted features, the highest accuracy is achieved. Finally, the combined features are fed into a ViT for classification. Therefore, it can be concluded that selecting an appropriate pretrained model for feature extraction is crucial for better generalisation and performance of the model when using a ViT for classification.

Figure 3

Comparing the efficiency of the proposed model with other models. Results for a fivefold cross-validation methodology on the Shahroud cohort dataset. (A) Accuracy per fold, (B) precision per fold, (C) recall per fold and (D) Fscore per fold. It illustrates the comparison of the proposed model’s efficiency with other models using the fivefold cross-validation methodology on the Shahroud cohort dataset. (A) The accuracy values per fold, showcasing the model’s overall classification performance. (B) The precision values per fold, indicating the model’s ability to correctly identify positive instances. (C) The recall values per fold, representing the model’s capability to accurately capture positive instances. Finally, (D) the Fscore values per fold, which assesses the model’s balance between precision and recall.

Conclusion

Our study provides insights into the benefits of using multimodel feature fusion and ViTs for image classification tasks. We demonstrated that the fused features from multiple pretrained models were able to capture complementary information, and the ViT was able to effectively leverage this information to achieve improved performance. Our results also show that the ViT-Large model is a powerful tool for image classification tasks, especially when combined with multiple pretrained models. In future work, the model can be trained using other maps from the Pentacam device as input, as well as employing other transfer learning techniques for disease detection. This model could also be applied to the detection of other diseases.