Discussion
In this study, we developed a DL model based on the Vision Transformer to grade DM according to OCT-based morphological features. We achieved an accuracy of 82.00%, an F1 score of 83.11% and an AUC of 0.96. The model's accuracy on this novel grading system is promising and suggests that it could support preliminary screening to identify patients with serious disease. Because this classification may better predict the treatment outcome and visual prognosis of DM in the future, our model could help ophthalmologists develop personalised treatment plans for patients with DM.
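For reference, the three metrics reported above can be computed for a four-stage grading task as in the following minimal NumPy sketch. The averaging scheme behind the reported F1 score and AUC is not stated in this section, so macro averaging and one-vs-rest AUC are assumptions here, as are the integer stage encoding, the function names and the toy labels. Per-class recall is included because it is the quantity that exposes weaker performance on an individual stage.

```python
import numpy as np

# Hypothetical integer encoding of the four OCT-based stages (0..3).
STAGES = ["early DME", "advanced DME", "severe DME", "atrophic maculopathy"]


def confusion_matrix(y_true, y_pred, n_classes=len(STAGES)):
    """Count predictions; rows are true stages, columns are predicted stages."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def accuracy(cm):
    """Fraction of images whose predicted stage equals the true stage."""
    return np.trace(cm) / cm.sum()


def per_class_recall(cm):
    """Recall (sensitivity) for each stage; 0 where a stage has no images."""
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)
    return np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)


def macro_f1(cm):
    """Unweighted mean of per-stage F1 scores (macro averaging is an assumption)."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = per_class_recall(cm)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()


def binary_auc(scores, is_positive):
    """AUC as P(random positive scores higher than random negative); ties count 0.5."""
    pos = scores[is_positive]
    neg = scores[~is_positive]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def macro_ovr_auc(y_true, probs):
    """Mean one-vs-rest AUC over stages; probs has shape (n_images, n_stages)."""
    y_true = np.asarray(y_true)
    return float(np.mean([binary_auc(probs[:, k], y_true == k) for k in range(probs.shape[1])]))


# Toy example with eight hypothetical images (two per stage).
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3]
cm = confusion_matrix(y_true, y_pred)
print(accuracy(cm))  # 0.875
print(per_class_recall(cm))
print(macro_f1(cm))
```

With macro averaging every stage contributes equally, so a stage with few images affects the F1 score as much as the larger stages do.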
The four stages of DM reflect the severity of the disease. Early DME usually corresponds to a short duration of hyperglycaemia,16 so patients can often maintain good vision if their blood glucose is well controlled. The state of the EZ/ELM differs between advanced and severe DME. In advanced DME, the EZ/ELM may be damaged but remains visible, and the inner retinal layers are usually recognisable. In severe DME, the inner retinal layers and/or the EZ/ELM are mostly destroyed and undetectable. These two groups of patients may differ markedly in treatment response and visual prognosis and should therefore be distinguished.16 Patients with advanced DME should be treated promptly. Anti-VEGF treatment may prevent progression to the next stage, with recovery of the ELM and/or EZ and a return of CST to normal values. Once the disease has progressed to severe DME, however, the oedema may be difficult to resolve despite active treatment, and the disease may eventually progress to the atrophic stage. Macular atrophy, characterised by complete EZ/ELM destruction and DRIL, usually results from long-standing macular oedema and carries a poor visual outcome.16
Hence, this novel grading system can help ophthalmologists predict the prognosis of patients with DM in clinical practice, and personalised therapeutic strategies could be tailored to the OCT grade. In the first two stages especially, good glycaemic control and timely treatment are important to promote recovery and prevent progression to the more severe stages. For these patients, early screening and long-term follow-up can help preserve vision. However, detecting and grading DM currently requires expertise and is time-consuming. An intelligent system that grades DM according to this new classification would therefore be particularly valuable in supporting clinical decision-making.
Continuing advances in DL have created more opportunities for automatic diagnosis and classification of diseases. Numerous studies have shown that DL can detect DME with expert-level performance. For instance, Alqudah19 proposed an SD-OCT-based multiclass classification model for four types of retinal disease (age-related macular degeneration, choroidal neovascularisation, DME and drusen) as well as normal cases. The proposed CNN architecture with a softmax classifier correctly identified 99.17% of DME cases overall. Zhang et al20 proposed a multiscale DL model consisting of two parts, a self-enhancement model and a disease detection model, which achieved 94.5% accuracy in identifying DME. They also showed that the model was better able to recognise low-quality medical images. Wu et al21 trained a DL model with the Visual Geometry Group 16 (VGG16) network as the backbone to detect three OCT morphologies of DME (DRT, CME and SRD), with accuracies of 93.0%, 95.1% and 98.8%, respectively. Together, these studies indicate that DL is feasible and promising for diagnosing DME. However, DL models that automatically apply this OCT-based grading of DM are still lacking. Moreover, most of the above studies were based on CNNs, whose ability to extract image features has been widely verified. By contrast, there is still little research on the Vision Transformer, which has been reported to outperform CNNs on image classification problems.22
In the current study, we trained a DL model with the Vision Transformer as the backbone to apply this novel grading to OCT images. Proposed in 2020, the Vision Transformer is a recent image classification model that is regarded as one of the best currently available and has outperformed traditional CNN models.22 It does not rely on convolutional layers; instead, it is built entirely on the transformer architecture, which extracts features differently from a CNN.23 A previous study showed that the Vision Transformer recognised OCT images better than CNN models and traditional machine learning algorithms.23 When compared on the same test set with four CNN models (VGG16, ResNet50, DenseNet121 and EfficientNet), the Vision Transformer achieved the highest classification accuracy, 99.69%. Both VGG16 and the Vision Transformer also classified a single image faster than the other CNN models.23 Our results were slightly less impressive than those of previous studies that used other DL architectures to detect DME, or its OCT patterns, on OCT images. However, grading is a harder task: it can be more complicated and challenging than distinguishing DME from other retinal diseases or simply detecting the overall patterns of DME, because the differences in features between OCT grades are less obvious and the lesions are more subtle.
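The patch-based input that distinguishes the Vision Transformer from a CNN can be illustrated with a minimal NumPy sketch. It assumes the 224 × 224 input, 16 × 16 patches and 768-dimensional embeddings of the original ViT-Base/16 configuration, and a greyscale OCT B-scan replicated to three channels; these settings, the function names patchify and embed_patches, and the random weights are hypothetical and are not reported as the configuration used in this study.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, cls_token, pos_embedding):
    """Linearly project patches, prepend a [CLS] token and add position embeddings."""
    tokens = patches @ weight  # (num_patches, dim)
    tokens = np.vstack([cls_token[None, :], tokens])  # (num_patches + 1, dim)
    return tokens + pos_embedding


# Hypothetical B-scan: one greyscale channel replicated to three channels.
rng = np.random.default_rng(0)
grey = rng.random((224, 224, 1))
image = np.repeat(grey, 3, axis=2)
patches = patchify(image)
dim = 768
weight = rng.normal(scale=0.02, size=(patches.shape[1], dim))
cls_token = np.zeros(dim)
pos_embedding = rng.normal(scale=0.02, size=(patches.shape[0] + 1, dim))
tokens = embed_patches(patches, weight, cls_token, pos_embedding)
print(patches.shape, tokens.shape)  # (196, 768) (197, 768)
```

Unlike a convolution, nothing in this step relates neighbouring patches to one another; spatial relationships are learned later by the transformer's self-attention layers.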
To our knowledge, this is the first study to use DL to grade the severity of DM on OCT images according to this novel classification, and the first to use the Vision Transformer to detect DM. As noted above, this classification may better predict the treatment outcome and visual prognosis of DM in the future and help ophthalmologists develop precise treatment plans. In addition, because the Vision Transformer captures global information through its self-attention mechanism and is less biased towards local texture features, it is more robust to noise than the CNNs commonly used in past studies. Combining these advantages, our model is promising for analysing OCT images of DM and other retinal diseases. Our model performed slightly worse in predicting severe DME, possibly because there were fewer images of this stage than of the others, and because patients at this stage almost always have poor vision, which can worsen image quality. Even so, our model could support preliminary screening to identify patients with serious disease. These patients would then need further examination for an accurate diagnosis and timely treatment to prevent further deterioration. Overall, the results achieved by our DL model are promising and encouraging.
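The global-context property mentioned above can be made concrete with a minimal single-head scaled dot-product self-attention sketch in NumPy. The token count (196 patches plus a [CLS] token), the small dimensions and the random weights are hypothetical; a real Vision Transformer layer also uses multiple heads, layer normalisation, an MLP block and residual connections, all omitted here. The sketch only shows that every token's attention weights span all tokens in the image; it does not demonstrate the robustness to noise described above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the updated tokens (n_tokens, d_head) and the (n_tokens, n_tokens) weights.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Each row scores one token against every token in the image, near or far.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 197, 64, 16  # 196 patches + [CLS]; small dims for illustration
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (197, 16) (197, 197)
```

By contrast, a single convolutional layer only combines pixels within its kernel, so a CNN needs many stacked layers before distant regions of a B-scan can influence each other.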
Although our model shows great potential, this study has several limitations. First, all OCT images were obtained from the Optovue RTVue imaging system, so the model needs further validation on images from other OCT devices. Second, we trained the model only for classification; future studies could train models to predict treatment outcomes based on this new grading system. Finally, our data came from a single eye centre; OCT images from future multicentre studies could be used to improve the model.
In conclusion, our Vision Transformer-based DL model achieved relatively high accuracy in detecting the different OCT-based stages of DM. This DM grading model could reduce the workload of clinical ophthalmologists and provide a reference for personalised therapeutic strategies. These results highlight the potential of AI to shorten clinical diagnosis, assist clinical decision-making and improve treatment outcomes in the future.