Discussion
To the best of our knowledge, this is the first head-to-head comparative cross-sectional study evaluating the performance of ChatGPT-4 and Google Bard in generating cataract surgery patient education material. ChatGPT-4-generated responses had significantly better PEMAT-P understandability scores than Google Bard, particularly in the domains of ‘cataract surgery and preparation for surgery’ and ‘recovery after cataract surgery’, with comparable results for the ‘condition’ domain. With regard to PEMAT-P actionability scores, no statistically significant difference was found between ChatGPT-4- and Google Bard-generated responses, either overall or in each individual domain. These findings indicate that patients from different backgrounds and with differing levels of health knowledge are more likely to understand ChatGPT-4-generated patient education material on cataract surgery than Google Bard-generated material. This is despite the PEMAT-P tool taking into account the inclusion and quality of any visual aids in the patient education material, with Google Bard regularly including images while ChatGPT-4 did not. However, when it comes to patients being able to identify what they can do based on this material, no difference was found between the two LLMs.
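For readers unfamiliar with the PEMAT-P, domain scores are derived by rating each applicable item as ‘agree’ or ‘disagree’ and expressing the agreed items as a percentage of the applicable items. The minimal sketch below illustrates that calculation; the example ratings are hypothetical and are not taken from our data.

```python
def pemat_domain_score(item_ratings):
    """PEMAT-P domain score as a percentage.

    item_ratings: one entry per PEMAT-P item in the domain, coded as
    1 ("agree"), 0 ("disagree") or None (not applicable).
    Returns None if no item is applicable.
    """
    applicable = [r for r in item_ratings if r is not None]
    return 100 * sum(applicable) / len(applicable) if applicable else None

# Hypothetical example: 11 applicable understandability items, 9 rated "agree"
print(round(pemat_domain_score([1] * 9 + [0] * 2 + [None] * 2), 1))  # 81.8
```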
Of note, a concern raised over the past few months when comparing these two LLMs is the September 2021 knowledge cut-off date of earlier ChatGPT models.24 However, since its integration into Bing in February 2023, ChatGPT has been able to browse the internet in real time to provide up-to-date information, similar to Google Bard.25 Therefore, in their current forms, both LLMs are able to provide updated and contextually relevant information that reflects real-world updates and developments in cataract surgery.
A potential issue with the integration of LLMs into clinical practice is the accuracy and safety of the information provided. LLMs are highly dependent on their training data, and since they were trained on a variety of resources, including unverified internet-based content, inaccuracies could arise if the training data are incorrect, leading to patient harm. As mentioned above, the relevance and accuracy of each chatbot response were assessed as part of the ‘understandability’ PEMAT-P domain. A separate subjective screen conducted by each of the four graders during scoring did not identify any dangerous information (defined as any incorrect or inaccurate information that could lead to patient harm). This provides reassurance regarding the safety of both ChatGPT-4 and Google Bard in generating responses to cataract surgery FAQs. However, given the possibility of incorrect information, rigorous validation should still be carried out before LLMs are deployed into day-to-day clinical practice.
Another way that chatbots can generate misleading or harmful information is through a phenomenon known as ‘AI hallucination’, in which responses sound confident despite being nonsensical or unfaithful to the training data.26 This differs from the scenario above, in which the incorrect information provided is consistent with the training data. Reported types of hallucination include factual errors or inaccuracies, logical fallacies and confabulations (adding irrelevant details on top of a correct answer).27 28 However, owing to the lack of unified, established terminology, there is extensive inconsistency in the definition and use of the term ‘hallucination’.29 Because of these inconsistencies, we did not set out to measure the frequency of this phenomenon. Nevertheless, in one of its output responses, Google Bard included an image of cervical dilation instead of pupillary dilation, even though the accompanying text discussed pupillary dilation. This can be considered a hallucination.
To minimise these ‘hallucinations’, ‘prompt engineering’ was used in this study, as advised by the UK National Cyber Security Centre.30 ‘Prompt engineering’ enables optimisation and fine-tuning of LLMs for specific functions through the creation of effective inputs. In our study, the preceding phrase ‘please provide patient education material to the following question at a fifth-grade reading level’ was designed to promote inclusivity of patients with poorer literacy levels by stimulating the generation of patient education material at the recommended fifth-grade reading level. ChatGPT-4 showed higher fidelity to this ‘prompt engineering’ instruction than Google Bard, with a significantly better mean Flesch-Kincaid Grade level of 5.75. This was also seen across all three question domains. Of note, although both chatbots used medical jargon only sparingly, with clear explanations when it was used, Google Bard presented longer and more detailed answers, which could have influenced this score. ‘Prompt engineering’ is a useful tool that would enable healthcare providers to use LLMs effectively and safely while allowing them to craft the tone, format and delivery of AI-generated patient education material and minimise hallucinations.
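As an illustration of how such ‘prompt engineering’ could be applied programmatically, the sketch below prepends the study’s fifth-grade reading-level instruction to a patient question and checks the reading grade of the reply. It is purely illustrative: the study queried the chatbots’ web interfaces, and the openai and textstat libraries, the ‘gpt-4’ model name and the example question are assumptions rather than elements of the study itself.

```python
# Illustrative sketch only: prepends the study's fifth-grade reading-level
# instruction to a patient question and reports the Flesch-Kincaid Grade level
# of the reply. The openai/textstat libraries and the "gpt-4" model name are
# assumptions; the study itself used the chatbots' web interfaces.
from openai import OpenAI   # pip install openai
import textstat             # pip install textstat

PROMPT_PREFIX = ("Please provide patient education material to the following "
                 "question at a fifth-grade reading level: ")

def ask_with_reading_level(question: str) -> tuple[str, float]:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_PREFIX + question}],
    )
    answer = response.choices[0].message.content
    # Flesch-Kincaid Grade level of the generated material (target of ~5)
    return answer, textstat.flesch_kincaid_grade(answer)

if __name__ == "__main__":
    text, grade = ask_with_reading_level("What is a cataract?")  # hypothetical FAQ
    print(f"Flesch-Kincaid Grade level: {grade:.1f}")
    print(text)
```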
Although LLMs have the potential to bring about significant changes, it is important to approach them with great caution, particularly in the vital context of patient care. One issue with LLM-generated material is the inability to fact-check the presented information, as responses are not accompanied by references. Biased responses arising from biased training data may therefore be inadvertently generated without being detected. AI chatbots also lack the ability to assume responsibility or adhere to ethical or moral limitations, restricting their current role to that of ‘assistive tools’.
Our study has a number of limitations. First, while the PEMAT-P scoring system has undergone extensive testing and validation, it cannot ensure that material scoring highly will be effective with a given patient population, as the PEMAT-P contents might not fully reflect patients’ perspectives. In our study, the patient education material was graded by ophthalmologists with good knowledge and understanding of cataract surgery, and it is possible that patients would grade the material differently. Furthermore, since ‘prompt engineering’ requesting patient education material at a fifth-grade reading level was used in this study, the results might differ from what patients would obtain from LLM chatbots in the real world. The next step would ideally be a follow-up study testing AI-generated patient education material on cataract surgery in the real world, with patient participation. Another limitation is that PEMAT-P scoring can vary depending on the reviewer’s interpretation. We attempted to mitigate this subjectivity through the use of four independent reviewers, with results showing statistically significant fair to excellent inter-rater reliability. As mentioned previously, blinding was not feasible in this study, as the PEMAT-P assesses the visual layout as part of its scoring system. Considering these limitations, we believe that this methodology is well suited to its intended purpose. Finally, the understandability and actionability of the educational materials were measured at a single time point. Longitudinal comparative studies should therefore be conducted to determine improvements, especially since LLM technology is constantly evolving, as evidenced by the imminent arrival of ChatGPT-5.31
This study provides a strong proof of concept for future deployment of AI in ophthalmology and offers valuable guidance to both patients and healthcare providers in selecting between the two main AI chatbots as sources of educational content on cataract surgery. ChatGPT-4 outperformed Google Bard in terms of overall PEMAT-P understandability ratings and adherence to the prompt engineering instruction. No statistically significant difference was found in the PEMAT-P actionability scores, and no dangerous information was identified for either LLM chatbot. As mentioned above, the next step would ideally be a follow-up study testing AI-generated patient education material on cataract surgery in the real world, with patient participation, alongside longitudinal comparative studies. In particular, it will be important to measure the impact of LLMs on patient education initiatives and to understand their impact on the clinical pathway even before medical consultation. For example, future work should assess whether LLMs improve patient satisfaction or instead cause more preoperative anxiety, and whether they reduce patient visits or consultation time. Studies assessing the accuracy and consistency of AI-generated patient education material, along with the rate of hallucination generation, should also be conducted. Furthermore, it is crucial to assess cataract surgery patient education material produced in different languages in order to measure the worldwide impact of this LLM technology.