Are your patients diagnosing themselves with AI?

By Naveed Saleh, MD, MS | Fact-checked by Barbara Bekiesz
Published October 18, 2023

Key Takeaways

  • An increasing number of patients are using AI chatbots to self-diagnose.

  • AI chatbots are remarkably accurate in suggesting a diagnosis among a list of differentials, but human physicians are still better.

  • It’s only a matter of time before a major healthcare institution offers a chatbot to help patients diagnose their condition. The advent of such technology will require physicians to work closely with patients to mitigate risk and maximize clinical benefit.

Physicians have long known that patients are wont to research their symptoms on the internet before coming into the office. Wisely, physicians typically discourage this practice.

Nowadays, however, patients no longer rely only on a Google search for diagnostic information, which often lacks context or draws from unreliable sources. They also look to OpenAI’s ChatGPT, the latest version of Microsoft’s search engine Bing (which is built on OpenAI’s software), and Google’s Med-PaLM.

Anecdotally, physicians who have cared for patients using ChatGPT are impressed with the technology and see great potential. But along with this potential come concerns about the pitfalls of such innovation.

State of technology

As background to a study on AI diagnostic models, Harvard researchers writing in medRxiv explain the features of AI systems currently utilized in medicine.[]

Various AI applications in healthcare, they note, are highly accurate and improve care across a broad range of clinical domains, including safety, quality, and diagnosis.

To date, however, most of these systems are trained on specific tasks such as segmentation using a single data modality (eg, detecting solitary pulmonary nodules based on a database of chest CT scans).

Such systems depend on human input to establish the ground truth in the dataset, and establishing that ground truth for interpretation is time- and labor-intensive.

“This single-task, single-model approach means that enormous effort is required to create an algorithm for a new task since new data must be acquired and labeled before the AI can learn to complete the task,” the authors write. “Additionally, this process may only result in systems that will function optimally at the institution or with the dataset where it was trained. This limits broader deployment of AI models in health care.” 

Foundation models may be the key to overcoming the limitations of single-task, single-model approaches. Presently, general-purpose, self-supervised foundation models are being used outside of healthcare. These don’t need labeled data in the conventional sense and can complete novel tasks they weren’t originally trained to perform.

One example of this is the Generative Pre-Trained Transformer 3 (GPT-3), whose sole training task is to predict the next word in a sequence. Trained on a large collection of unstructured text derived from the internet, GPT-3 is a massive self-supervised model and one of the largest AI models in existence.

It contains more than 175 billion model parameters and was trained on nearly 570 gigabytes of data from the Common Crawl dataset, a vast repository of unstructured text gathered from across the internet. Intriguingly, GPT-3 can converse with human users, translate among different languages, and answer questions, despite never being specifically trained on these tasks.
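
To make “predicting the next word” concrete, here is a minimal, illustrative sketch that generates text one token at a time using the openly available GPT-2 model via the Hugging Face transformers library. GPT-2 stands in for GPT-3 purely for illustration (GPT-3 itself is accessible only through OpenAI’s API), and the prompt and model choice are assumptions, not anything used in the research discussed here.

```python
# Illustrative only: GPT-2 stands in for GPT-3, which is not openly downloadable.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical prompt, chosen only to illustrate next-word prediction.
prompt = "Fever, cough, and shortness of breath are most suggestive of"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate 20 tokens, one at a time, always appending the single most likely next token.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits        # scores for every vocabulary token
        next_id = logits[0, -1].argmax()        # most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Each pass through the loop scores every token in the vocabulary and appends the most likely one, which is, in scaled-down form, the same next-word objective the Harvard authors describe.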

ChatGPT is the newest iteration of GPT-3 and has been a regular topic in the media. One benefit of ChatGPT is its simple user interface. Researchers, however, cannot examine it in a systematic fashion, and the newer platform gives the user less control over its parameters.

Related: ChatGPT: A pocket-sized mentor or a useless AI gadget? Doctors debate its role in medicine

How well does GPT-3 diagnose?

In their study, the Harvard researchers assessed GPT-3’s diagnostic and triage accuracy using 48 validated case vignettes of both common and serious conditions, such as viral illnesses and heart attack, respectively. They compared its performance on diagnosis and triage with that of a lay audience and of practicing physicians.

The researchers found that, based on the clinical vignettes presented, GPT-3 included the correct diagnosis in its list of three differential diagnoses 88% of the time, vs 54% for a lay audience (P < 0.001) and 96% for physicians (P = 0.0354).
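
As an aside on what that 88% figure measures, the toy sketch below computes a top-3 accuracy of the same flavor: a vignette counts as correct if the true diagnosis appears anywhere in the three-item differential. The diagnoses and differentials here are invented placeholders, not the study’s actual vignettes.

```python
# Hypothetical data: each entry pairs a vignette's true diagnosis with a
# model-generated three-item differential. Placeholders only, not study cases.
results = [
    ("acute appendicitis", ["gastroenteritis", "acute appendicitis", "ovarian cyst"]),
    ("myocardial infarction", ["myocardial infarction", "GERD", "panic attack"]),
    ("viral upper respiratory infection", ["influenza", "strep pharyngitis", "allergic rhinitis"]),
]

# Top-3 accuracy: fraction of vignettes whose true diagnosis appears in the differential.
hits = sum(truth in differential for truth, differential in results)
top3_accuracy = hits / len(results)
print(f"Top-3 accuracy: {top3_accuracy:.0%}")   # 2 of 3 correct -> 67%
```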

As for triage, GPT-3 triaged about as well as lay individuals but much worse than physicians.

Related: Family physicians, here's what ChatGPT thinks you do all day

Benefits and drawbacks of AI-assisted diagnosis

Experts writing about AI in Scientific American highlight various potential benefits of chatbots in diagnosis.[] Notably, chatbots are more accurate than online symptom checkers (SCs), with online SCs leading to accurate self-diagnosis only 51% of the time, per the research. 

The following are some perceived benefits of using chatbots to diagnose:

  • They are easier to use than online symptom checkers because the patient simply describes their experience, rather than fitting it into a program that computes the statistical likelihood of a disease.

  • Chatbots can ask patients who are self-diagnosing follow-up questions, much like an HCP.

  • Chatbots can augment the diagnosing capability of a physician.

Issues with large language model chatbots used in diagnosis include the following:

  • They are prone to giving misinformation, as their ability to predict the next word in a sequence is based on the online text they were trained on. This text can give equal weight to, for example, information published by the CDC and information posted on Facebook.

  • They are susceptible to disinformation from bad actors who flood the internet with content intentionally meant to mislead.

  • Chatbots can “hallucinate” not only new information but also completely new sources. 

  • Patients could be intolerant of or uncomfortable with their physicians using chatbots to help diagnose—especially with more complex conditions.

  • Chatbots could recapitulate human biases because they are trained on human-generated clinical data, which is prone to bias. For instance, women are less likely to be prescribed pain medications than men, and Black patients are more likely to be diagnosed with schizophrenia than White patients. AI trained on such data could pick up and perpetuate these patterns.

In a scoping review published in the Journal of Medical Internet Research, authors assessed AI- or algorithm-based self-diagnosing apps, as well as other tools that laypeople have access to in primary care or nonclinical settings.[]

“On the basis of the literature, SCs can partially outsource and improve preliminary diagnoses and enhance diagnostic decision-making. Furthermore, they could become tools for ‘appropriate triage advice’ by the patients themselves and a ‘first line support for advice and guidance’ to laypersons,” they wrote.

Elaborating on the practical implications, the authors observed that during the COVID-19 pandemic, self-triage tools had the potential to improve triage efficiency and quickly connect patients with the appropriate care venue, thus preventing unnecessary emergency department and urgent care visits.

“As the use of SCs by laypersons will potentially increase,” they wrote, “GPs need awareness and understanding of patients’ possible SC use to reflect their own concerns. It is important that GPs have knowledge about existing SCs to adequately assess the information provided by patients.” 

What this means for you

Experts predict that a major medical center will soon introduce to the public an AI chatbot that helps diagnose disease. It remains to be seen how such information will be monetized, how patient data will be protected, and who will be responsible if a chatbot puts a patient’s health at risk. Physicians will need to work closely with patients and the AI technology to ensure patient safety.

The rollout of AI in patient healthcare, whether as a self-diagnostic tool or an adjunct to physician diagnosis, should be slow, given the need for further research. Any use of AI-assisted diagnosis should serve as a prelude to evaluation by a physician.

Read Next: AI, AI, Oh! Oncologists face the rise of machines in cancer care