Original Research

Artificial intelligence chatbots as sources of patient education material for cataract surgery: ChatGPT-4 versus Google Bard

Abstract

Objective To conduct a head-to-head comparative analysis of cataract surgery patient education material generated by Chat Generative Pre-trained Transformer (ChatGPT-4) and Google Bard.

Methods and analysis 98 frequently asked questions on cataract surgery in English were taken in November 2023 from 5 trustworthy online patient information resources. 59 of these were curated (20 augmented for clarity and 39 duplicates excluded) and categorised into 3 domains: condition (n=15), preparation for surgery (n=21) and recovery after surgery (n=23). They were formulated into input prompts with ‘prompt engineering’. Using the Patient Education Materials Assessment Tool-Printable (PEMAT-P) Auto-Scoring Form, four ophthalmologists independently graded ChatGPT-4 and Google Bard responses. The readability of responses was evaluated using a Flesch-Kincaid calculator. Responses were also subjectively examined for any inaccurate or harmful information.

Results Google Bard had a higher mean overall Flesch-Kincaid Grade Level (8.02) compared with ChatGPT-4 (5.75) (p<0.001), a difference also noted across all three domains. ChatGPT-4 had a higher overall PEMAT-P understandability score (85.8%) in comparison to Google Bard (80.9%) (p<0.001), which was also noted in the ‘preparation for cataract surgery’ (85.2% vs 75.7%; p<0.001) and ‘recovery after cataract surgery’ (86.5% vs 82.3%; p=0.004) domains. There was no statistically significant difference in overall (42.5% vs 44.2%; p=0.344) or individual domain actionability scores (p>0.10). None of the generated material contained dangerous information.

Conclusion In comparison to Google Bard, ChatGPT-4 fared better overall, scoring higher on the PEMAT-P understandability scale and exhibiting more faithfulness to the prompt engineering instruction. Since input prompts might vary from real-world patient searches, follow-up studies with patient participation are required.

What is already known on this topic

  • Artificial intelligence (AI) has the potential to transform ophthalmology in a number of ways; disease-based deep learning algorithms are now being used to assist in the diagnosis and evaluation of a range of ophthalmic conditions.

  • Patients undergoing cataract surgery will be increasingly exposed to AI-generated patient education materials, as these supersede traditional sources.

What this study adds

  • To the best of our knowledge, this is the first time that cataract surgery patient education material generated by Chat Generative Pre-trained Transformer (ChatGPT-4) and Google Bard, its primary competitor, have been compared side by side.

  • ChatGPT-4 fared better overall, scoring higher on understandability metrics and fidelity to the prompt engineering instruction. Patients with different backgrounds and degrees of health literacy are likely to comprehend cataract surgery patient education material provided by ChatGPT-4 more readily than those generated by Google Bard.

How this study might affect research, practice or policy

  • Both ChatGPT-4 and Google Bard exhibited a good baseline safety profile for generating responses to cataract surgery frequently asked questions. However, rigorous validations should still be carried out before prematurely deploying large language models into day-to-day clinical practice, due to the possibility of incorrect information.

Introduction

Artificial intelligence (AI) has advanced significantly since its inception in 1956.1 With an uptake of 100 million users within the first 2 months of its launch in November 2022, the large language model (LLM) ChatGPT (Chat Generative Pre-trained Transformer) (Open AI, San Francisco, California, USA) set off seismic waves around the world.2 This was shortly followed by the release of the Google Bard chatbot (Alphabet, Mountain View, California, USA) in March 2023.3

These generative AI LLMs were initially pretrained in an unsupervised manner on massive text corpora, including books, articles and other online sources, totalling billions of words. This was followed by optimisation for different downstream tasks using only a small number of examples, a process referred to as ‘few-shot learning’. With the use of statistical word prediction (informed by context and prior words), this architecture enables the processing and transformation of the entire input and context into meaningful human-like text.4

These general-purpose LLMs are continually proving to be powerful tools for language generation and, owing to their remarkable adaptability and practicality, have permeated all fields, including healthcare. They are continuously evolving, both through their inherent natural language processing and through software updates, with OpenAI unveiling its latest version, ChatGPT-4, in March 2023.5 In comparison to its predecessor, ChatGPT-4 is ‘more reliable, creative and able to handle many more nuanced instructions’, while also performing better in academic and specialised fields.5 6 In fact, ChatGPT-4 outperformed both ChatGPT-3.5 and other LLMs specifically fine-tuned on medical knowledge (Pathways Language Model and Large Language Model Meta AI 2) in US Medical Licensing Exam and Fellowship of the Royal College of Ophthalmologists Part 2 mock examinations, highlighting its potential as a valuable tool for clinical support and medical education.7 8 Google Bard has also undergone numerous software upgrades, with the Gemini Pro LLM being the latest addition in December 2023.9

AI has the capacity to profoundly transform the field of ophthalmology in numerous ways.4 Disease-based deep learning algorithms in ophthalmology are already being used to aid diagnosis and assessment of retinal diseases, glaucoma, cataract, corneal diseases and many others.10–13 However, AI’s utility in ophthalmology goes beyond simply aiding diagnosis or assessment. It also possesses the ability to transform the manner in which patients receive information and knowledge about their condition or recommended procedure/s.

With more than 20 million procedures performed annually worldwide, cataract surgery is one of the most common surgeries.14 LLM chatbots are being increasingly used by patients and the general public as an alternative source of patient information to printed patient leaflets. It remains unclear whether these are reliable resources in the context of cataract surgery.

This study aims to conduct the first direct comparison of patient education material on cataract surgery produced by ChatGPT (version GPT-4) and its primary competitor, Google Bard. By examining the understandability and actionability of these new information sources, we aim to provide additional reassurance and confidence to healthcare providers as well as patients when using LLM-based patient information.

Methods

In this cross-sectional study, conducted in November 2023, ChatGPT-4 and Google Bard responses to frequently asked questions (FAQs) about cataract surgery in English were compared. An initial 98 FAQs were compiled from the following five reliable, online sources of patient information: the Moorfields Eye Hospital cataract surgery leaflet, the Royal College of Ophthalmologists and the Royal National Institute of Blind People patient information leaflet ‘Understanding Cataracts’, the UK National Health Service patient information webpage on cataracts, the National Eye Institute patient information webpage on cataracts, and the ‘Patient’ (UK registered trademark) patient information webpage on cataracts.15–19

Figure 1 illustrates the question curation flow chart: 39 duplicate questions were excluded and 20 questions were augmented to ensure that the information was clear and comprehensive.

Figure 1

Question curation flow chart. ChatGPT, Chat Generative Pre-trained Transformer.

A total of 59 remaining questions were divided into three domains: condition (n=15), cataract surgery and preparation for surgery (n=21) and recovery after cataract surgery (n=23). They were then used as input prompts for ChatGPT-4 and Google Bard on 15 November 2023 and 16 November 2023. The statement ‘please provide patient education material to the following question at a fifth-grade reading level’ was used for ‘prompt engineering’. This statement was followed by one of the 59 questions, and each resulting prompt was entered into the ChatGPT-4 and Google Bard user interfaces. Examples of this process are provided in figure 2. The decision to use a fifth-grade reading level was based on the premise that patient education material should be prepared at a reading level appropriate for sixth grade or lower in order to ensure maximum comprehension and conformity.20 Prior to inputting each new question, the ‘New Chat’ feature was used in ChatGPT-4 and Google Bard. This was done in a private browser with a cleared cache to avoid any data leakage or utilisation of previous question prompts and responses, the purpose being to simulate real-world patient enquiries.
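As an illustration of this prompting workflow, the short Python sketch below shows how the prompt engineering prefix could be combined with a curated question programmatically; the study itself entered each prompt manually through the chatbot web interfaces, and the example question shown is hypothetical.

```python
# Illustrative sketch only: the study entered prompts manually via the
# ChatGPT-4 and Google Bard web interfaces. This simply shows how the
# prompt engineering prefix was combined with each curated question.

PROMPT_PREFIX = (
    "please provide patient education material to the following question "
    "at a fifth-grade reading level"
)

def build_prompt(question: str) -> str:
    """Prepend the prompt engineering instruction to a single FAQ."""
    return f"{PROMPT_PREFIX}: {question}"

# Hypothetical example question from the 'condition' domain
print(build_prompt("What is a cataract?"))
```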

Figure 2

A comparison of the large language model chatbot user interfaces, with an example question prompt: (A) ChatGPT-4; (B) Google Bard. ChatGPT, Chat Generative Pre-trained Transformer.

For each question prompt, the initial ChatGPT-4 and Google Bard output responses were used. To determine the readability of the generated responses and evaluate conformity to the prompt engineering instruction for both ChatGPT-4 and Google Bard, the Flesch-Kincaid Grade Level was calculated using a Flesch-Kincaid calculator.21 The Flesch-Kincaid Grade Level indicates the level of education required to comprehend a specific text. Accessory preceding or concluding sentences in responses, such as ‘Here’s some patient education material about conditions that can cause symptoms similar to cataracts, written at a fifth-grade reading level’ and ‘This material is designed to be easy to understand and engaging for someone at a fifth-grade reading level’, affect the Flesch-Kincaid score and were not present in every response; they were therefore removed for standardisation.
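For reference, the Flesch-Kincaid Grade Level is a standard readability index derived from average sentence length and average syllables per word. The sketch below, which assumes the open-source textstat Python package rather than the online calculator used in the study, shows how such a score could be computed on a response once the accessory sentences have been removed; the response text is hypothetical.

```python
# Sketch assuming the open-source 'textstat' package (pip install textstat);
# the study used an online Flesch-Kincaid calculator, not this library.
import textstat

# Hypothetical chatbot response with accessory sentences already removed
response_text = (
    "Cataract surgery is a short operation. The doctor removes the cloudy "
    "lens in your eye and puts in a clear new one."
)

# Flesch-Kincaid Grade Level is approximately
# 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
grade_level = textstat.flesch_kincaid_grade(response_text)
print(f"Flesch-Kincaid Grade Level: {grade_level:.2f}")
```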

Four ophthalmologists (two registrars/residents and two consultants) independently graded the responses to each of the 59 questions using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) Auto-Scoring Form,22 to obtain understandability and actionability scores for the patient education material. The PEMAT-P is a validated instrument that uses 26 binary questions to assess and compare the understandability and actionability of patient education materials. Higher scores, expressed as percentages, indicate superior performance. Within this tool, understandability is characterised as the ability of individuals from different backgrounds and with differing levels of health literacy to comprehend and articulate essential messages. Actionability, on the other hand, refers to their capacity to determine actionable steps based on the provided information.23 Blinding was not feasible in this study, as the PEMAT-P assesses the visual layout as part of its scoring system; to assess how each chatbot formatted its answers visually, a screenshot of each response was required to preserve the layout as presented. Considering these constraints, we believe that this methodology is well suited to its intended purpose. The relevance and accuracy of each chatbot response were also assessed as part of the ‘understandability’ domain, through the specific binary question ‘the material does not include information or content that distracts from its purpose’. As a secondary measure, these ophthalmologists also evaluated the generated responses for any inaccurate or hazardous information.
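To illustrate how PEMAT-P percentage scores are derived from the binary item ratings, a minimal sketch is given below. It assumes the standard PEMAT scoring convention of agree = 1 and disagree = 0, with items rated not applicable excluded from the denominator; the item ratings shown are hypothetical and truncated, not study data.

```python
# Minimal sketch of PEMAT-P percentage scoring: agree = 1, disagree = 0,
# and items rated not applicable (None here) are excluded from the total.
# The ratings below are hypothetical and truncated, not taken from the study.
from typing import Optional, Sequence


def pemat_score(item_ratings: Sequence[Optional[int]]) -> float:
    """Return one rater's PEMAT percentage score for a single response."""
    applicable = [rating for rating in item_ratings if rating is not None]
    return 100 * sum(applicable) / len(applicable)


understandability_items = [1, 1, 0, 1, 1, None, 1, 1, 1, 0, 1, 1, 1]
actionability_items = [1, 0, 0, None, 1, 0, 1]

print(f"Understandability: {pemat_score(understandability_items):.1f}%")
print(f"Actionability: {pemat_score(actionability_items):.1f}%")
```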

All data were analysed using IBM SPSS Statistics for Windows, V.24.0 (IBM; released 2016). The normality of the data was assessed using the Shapiro-Wilk test. Inter-rater reliability for PEMAT-P scoring was assessed using an intraclass correlation coefficient (two-way mixed-effects model, absolute agreement, multiple raters). Non-parametric related samples were analysed using the Wilcoxon signed-rank test. A p value <0.05 was deemed statistically significant.
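As a rough illustration of the paired comparisons described above, the sketch below re-creates the Shapiro-Wilk and Wilcoxon signed-rank tests in SciPy rather than SPSS; the score arrays are hypothetical placeholders, and the intraclass correlation coefficient would require a dedicated routine (for example, pingouin's intraclass_corr), which is not shown.

```python
# Illustrative re-creation of the paired tests in SciPy rather than SPSS.
# The score arrays are hypothetical placeholders, not the study data.
import numpy as np
from scipy import stats

chatgpt_scores = np.array([85.0, 88.0, 82.0, 90.0, 86.0, 84.0])
bard_scores = np.array([80.0, 83.0, 81.0, 84.0, 79.0, 82.0])

# Shapiro-Wilk test of normality on the paired differences
shapiro_stat, shapiro_p = stats.shapiro(chatgpt_scores - bard_scores)

# Wilcoxon signed-rank test for non-parametric related samples
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(chatgpt_scores, bard_scores)

print(f"Shapiro-Wilk: W={shapiro_stat:.3f}, p={shapiro_p:.3f}")
print(f"Wilcoxon signed-rank: W={wilcoxon_stat:.1f}, p={wilcoxon_p:.3f}")
# Inter-rater reliability (two-way mixed effects, absolute agreement) would
# need a dedicated ICC routine, e.g. pingouin.intraclass_corr (not shown).
```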

Patients and public involvement

Patients or the public were not involved in this study.

Results

Flesch-Kincaid Grade Levels

ChatGPT-4 Flesch-Kincaid Grade Levels ranged from 2.80 to 8.70, while those for Google Bard ranged from 5.00 to 12.90. Google Bard had a significantly higher mean overall Flesch-Kincaid Grade Level (8.02) compared with ChatGPT-4 (5.75) (z=−6.15, p<0.001). This was also noted in all three domains as shown in table 1.

Table 1
Comparison of ChatGPT-4 and Google Bard mean Flesch-Kincaid Grade Levels in each domain

PEMAT-P understandability and actionability scores

PEMAT-P understandability scores ranged from 53% to 94%, while actionability scores ranged from 0% to 75%. The online supplemental table displays the individual mean PEMAT-P understandability and actionability scores of ChatGPT-4- and Google Bard-generated responses.

ChatGPT-4 had significantly higher overall PEMAT-P understandability scores in comparison to Google Bard (z=−4.4, p<0.001), while there was no statistically significant difference in overall actionability scores (z=−0.95, p=0.344). Table 2 presents a comparison of ChatGPT-4 and Google Bard PEMAT-P understandability and actionability mean scores per domain.

Table 2
Comparison of ChatGPT-4 and Google Bard PEMAT-P understandability and actionability mean scores per domain

Discussion

To the best of our knowledge, this is the first head-to-head comparative cross-sectional study evaluating the performance of ChatGPT-4 and Google Bard in generating cataract surgery patient education material. ChatGPT-4-generated responses had significantly better PEMAT-P understandability scores in comparison to Google Bard, particularly in the ‘cataract surgery and preparation for surgery’ and ‘recovery after cataract surgery’ domains, with comparable results for the ‘condition’ domain. With regard to PEMAT-P actionability scores, no statistically significant difference was found between ChatGPT-4- and Google Bard-generated responses, either overall or in each individual domain. These findings indicate that patients from different backgrounds and with differing levels of health knowledge are more likely to understand ChatGPT-4-generated patient education material on cataract surgery than Google Bard-generated material. This is in spite of the PEMAT-P tool considering the inclusion and quality of any visual aids in the patient education material, with Google Bard regularly including images while ChatGPT-4 did not. However, when it comes to patients being able to identify what they can do based on this material, no difference was found between the two LLMs.

Of note, a concern that was raised over the past few months in the comparison of these two LLMs is the September 2021 knowledge cut-off date that earlier ChatGPT models had.24 However, since its integration into Bing in February 2023, ChatGPT is now able to browse the internet in real-time in order to provide up-to-date information, similar to Google Bard.25 Therefore, in their current forms, both LLMs are able to provide updated and contextually relevant information which reflects real-world updates and developments in cataract surgery.

A potential issue with the integration of LLMs into clinical practice is the accuracy and safety of information. LLMs are highly dependent on their training data, and since they were trained on a variety of resources, including unverified internet-based content, inaccuracies could arise if the training data are incorrect, leading to patient harm. As mentioned above, the relevance and accuracy of each chatbot response were assessed as part of the ‘understandability’ PEMAT-P domain. A separate subjective screen conducted by each of the four graders during scoring did not identify any dangerous information (defined as any incorrect or inaccurate information that could lead to patient harm). This supports a good baseline safety profile for both ChatGPT-4 and Google Bard in generating responses to cataract surgery FAQs. However, rigorous validations should still be carried out before prematurely deploying LLMs into day-to-day clinical practice, due to the possibility of incorrect information.

Another way that chatbots can generate misleading or harmful information is through a phenomenon known as ‘AI hallucination’—responses that sound confident despite being nonsensical or unfaithful to the training data.26 This differs from the above, in which case the incorrect information provided is in keeping with the training data. Reported types of hallucinations include factual errors or inaccuracies, logical fallacies and confabulations (adding irrelevant details on top of a correct answer).27 28 However, due to the lack of unified, established terminology, there is extensive inconsistency in the definition and use of the term ‘hallucination’.29 Due to these inconsistencies, we did not set out to measure the frequency of this phenomenon. However, in one of its output responses, Google Bard included an image of cervical dilation instead of pupillary dilation, even though the text was discussing pupillary dilation. This can be considered to be a hallucination.

To minimise these ‘hallucinations’, ‘prompt engineering’ was used in this study, as advised by the UK National Cyber Security Centre.30 ‘Prompt engineering’ enables LLM outputs to be optimised and tailored for specific functions through the creation of effective inputs. In our study, the preceding phrase ‘please provide patient education material to the following question at a fifth-grade reading level’ was designed to promote inclusivity of patients with poorer literacy levels, by stimulating the generation of patient education material at the recommended fifth-grade reading level. ChatGPT-4 showed higher fidelity to this ‘prompt engineering’ instruction in comparison to Google Bard, with a significantly better mean Flesch-Kincaid Grade Level of 5.75. This was also seen across all three question domains. Of note, although both chatbots used medical jargon sparingly, with clear explanations when it was used, Google Bard presented longer and more detailed answers, which could have influenced this score. ‘Prompt engineering’ is a useful tool that would enable healthcare providers to use LLMs effectively and safely, allowing them to specifically craft the tone, format and delivery of AI-generated patient education material while minimising hallucinations.

Although LLMs have the potential to bring about significant changes, it is important to approach them with great caution, particularly in the vital context of patient care. An issue with LLM-generated material is the inability to fact-check presented information, as responses are not accompanied by references. Biased responses may, therefore, be inadvertently generated due to biased training data. AI chatbots also lack the ability to assume responsibility or adhere to ethical or moral limitations, hence restricting their current functionality to simply ‘assistive tools’.

Our study has a number of limitations. First, while the PEMAT-P scoring system has undergone extensive testing and is validated, it cannot ensure that a material that scores highly would be effective with a given patient population, as the PEMAT-P contents might not fully reflect patients’ perspectives. In our study, the patient education material was graded by ophthalmologists with good knowledge and understanding of cataract surgery; it is possible that patients would grade the material differently. Furthermore, since ‘prompt engineering’ for patient education material to be at a fifth-grade reading level was used in this study, the results might differ from patient search results on LLM chatbots in the real world. The next step would ideally be a follow-up study testing AI-generated patient education material on cataract surgery in the real world, with patient participation. Another limitation is that PEMAT-P scoring can vary depending on the reviewer’s interpretation. We attempted to minimise this subjectivity through the use of four independent reviewers, with results showing statistically significant fair to excellent inter-rater reliability scores. As mentioned previously, blinding was not feasible in this study, as the PEMAT-P assesses the visual layout as part of its scoring system. Considering these limitations, we believe that this methodology is well suited to its intended purpose. Finally, the understandability and actionability of the educational materials were measured at one distinct time point. Longitudinal comparative studies should therefore be conducted to determine improvements, especially since LLM technology is constantly evolving, as evidenced by the imminent arrival of ChatGPT-5.31

This study provides a strong proof of concept for future deployment of AI in ophthalmology and offers valuable guidance to both patients and healthcare providers in selecting between the two main AI chatbots as sources of educational content on cataract surgery. ChatGPT-4 outperformed Google Bard in terms of overall PEMAT-P understandability ratings and adherence to the prompt engineering instruction. No statistically significant difference was found in the PEMAT-P actionability scores, and no dangerous information was identified for either LLM chatbot. As mentioned above, the next step would ideally be a follow-up study testing AI-generated patient education material on cataract surgery in the real world, with patient participation, alongside longitudinal comparative studies. In particular, it will be important to measure the impact of LLMs on patient education initiatives and to understand their impact on the clinical pathway even before medical consultation. For example, future work should assess whether LLMs improve patient satisfaction or instead cause more preoperative anxiety, and whether they reduce patient visits or consultation time. Studies assessing the accuracy and consistency, along with the hallucination generation rate, of AI-generated patient education material should also be conducted. Furthermore, it is crucial to assess the cataract surgery patient education material produced in different languages in order to measure the worldwide impact of this LLM technology.