Section 2 - Navigating the Frontier: Early-Phase Risks of Implementing Language Models in Healthcare

As healthcare technology professionals, we are keenly aware of the potential of large language models (LLMs) and generative artificial intelligence (AI) to revolutionize how we deliver care and manage our workflows. LLMs are AI models trained on large sets of text data and designed to process and produce text. Generative AI uses machine learning algorithms to produce novel, original content. There has been a flurry of activity in the development of LLMs and generative AI designed specifically for clinicians and patients, with the promise of increased efficiency, reduced clinician burden and burnout, unrivaled decision support, and improved patient engagement. This technology is advancing rapidly, and while the applications of LLMs and generative AI in healthcare are still in their infancy, it is important to consider the limitations and potential risks associated with their use.

In this article, we present a framework for thinking about the applications of LLMs and generative AI. The framework assists with evaluating use cases and assessing limitations and risks before considering adoption in clinical practice. We also discuss the spectrum of use cases for LLMs and generative AI, ranging from low-risk classification tasks to high-risk tasks such as diagnosis and treatment.

Machine learning-based predictive analytics have long been used in healthcare settings and have generally received significant scrutiny and evaluation prior to implementation. LLMs and generative AI require the same degree of scrutiny and validation given the risk of inaccuracies and hallucinations. However, generative AI makes it far simpler for an untrained novice to develop a “model” for solving a clinical problem. While predictive analytics have remained in the realm of data scientists and those with deep knowledge of how algorithms are developed and their attendant risks and benefits, the same is not true for generative AI. As a result, it is important to maintain guardrails - including undertaking traditional validation exercises, developing a process to flag high-risk errors in a clinical setting, and always keeping a “human in the loop” - when implementing clinical use cases to mitigate risks from these inaccuracies. Conversely, in the many non-clinical healthcare scenarios in which LLMs and generative AI can be applied, there may be a higher tolerance for riskier models that can drive automation of work.
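As one illustration of the guardrail concept, the sketch below shows what a pre-deployment validation harness might look like: it runs any candidate classification function over a human-labeled validation set, reports accuracy, and sets aside high-risk disagreements for clinician review. The function and label names are hypothetical; a real implementation would define its own taxonomy, thresholds, and review workflow.

```python
from typing import Callable

# Illustrative labels only; an implementing organization would define
# its own taxonomy and decide which categories count as high risk.
HIGH_RISK_LABELS = {"urgent_symptom", "medication_safety"}

def validate(classify: Callable[[str], str],
             labeled_examples: list[tuple[str, str]]) -> dict:
    """Run a candidate classifier over a human-labeled validation set."""
    correct, flagged = 0, []
    for text, gold_label in labeled_examples:
        predicted = classify(text)
        if predicted == gold_label:
            correct += 1
        elif gold_label in HIGH_RISK_LABELS or predicted in HIGH_RISK_LABELS:
            # High-risk disagreements are routed to a clinician reviewer
            # rather than silently counted as ordinary errors.
            flagged.append((text, gold_label, predicted))
    return {"accuracy": correct / max(len(labeled_examples), 1),
            "high_risk_errors": flagged}
```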

LLMs and generative AI can generally be trusted when asked to analyze data presented to them in “reductive” tasks, in which a large amount of text is reduced or transformed into a more concise, easily readable form (e.g., summarization, categorization, information extraction).

However, these tools are less reliable when asked to generate new text or make inferences based on their underlying knowledge base. Although these models are pretrained on trillions of tokens, that corpus still represents only a subset of human knowledge. The intent of pretraining is not to store facts with any fidelity, but rather to discover and encode probabilistic rules about human language. It is therefore high risk to rely on information that may (or may not) be encoded in the LLM, especially in healthcare, a domain in which veracity and verifiability are critical.

The following is a set of ways LLM and generative AI tools might be used in clinical settings, ordered from low to high risk of inaccuracy. This framework, specific to LLMs, should be considered alongside more traditional measures of risk and value when implementing such tools. Use cases with high value might justify a higher risk, provided the safeguards and validation discussed above remain in place.

Classification 

  • Categorize incoming patient messages to allow for better routing and triage. For example, “I started taking the new high blood pressure medicine, and now I am feeling dizzy when I stand up” might be mapped to key terms such as hypertension, antihypertensive medication, hypotension, and lightheadedness.
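As a rough sketch of how such a categorization task might be wired up, the example below uses the OpenAI Python SDK as one possible interface; the model name, category list, and prompt wording are illustrative assumptions rather than a validated clinical configuration, and any message containing PHI would need appropriate safeguards before leaving the organization.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM API works

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative category list; a real deployment would use its own taxonomy.
CATEGORIES = ["hypertension", "antihypertensive medication",
              "hypotension", "lightheadedness"]

def categorize_message(message: str) -> list[str]:
    """Ask the model to tag a patient message with zero or more categories."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,        # favor deterministic output for routing
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You tag patient portal messages. Return JSON like "
                        f'{{"categories": [...]}} using only: {CATEGORIES}'},
            {"role": "user", "content": message},
        ],
    )
    return json.loads(response.choices[0].message.content)["categories"]

print(categorize_message(
    "I started taking the new high blood pressure medicine, "
    "and now I am feeling dizzy when I stand up"))
```

Constraining the model to a fixed category list and requesting JSON output makes the response easy to validate before it drives routing.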

Intent analysis 

  • Query patients about what they would like to accomplish and direct them to the right tool. For example, “I am scheduled for an appointment with Dr. Jones for tomorrow at 2 PM that I need to change” would direct the patient to the scheduling function of their patient portal to cancel and reschedule the appointment.
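A similar pattern works for intent analysis. In the hypothetical sketch below, the model only names the intent, while a deterministic routing table decides where the patient goes, and anything unrecognized falls back to human triage; intent names, routes, and the model are assumptions.

```python
import json
from openai import OpenAI  # same SDK assumption as above

client = OpenAI()

# Illustrative routing table: intent name -> destination tool or queue.
INTENTS = {"reschedule_appointment": "portal_scheduling",
           "refill_request": "pharmacy_queue",
           "billing_question": "billing_office"}

def route(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Classify the patient's intent. Return JSON "
                        f'{{"intent": ...}} using only: {list(INTENTS)}'},
            {"role": "user", "content": message},
        ],
    )
    intent = json.loads(response.choices[0].message.content)["intent"]
    return INTENTS.get(intent, "human_triage")  # unknown intents go to a person

print(route("I am scheduled with Dr. Jones tomorrow at 2 PM and need to change it"))
```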

Information extraction 

  • Determine the estimated date of discharge from a clinical note. For example, assimilating outstanding care orders, severity-of-illness scores, patient location (ICU, acute care), and socioeconomic factors (transportation, home assistance) to generate a percent readiness for discharge by a given date.
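A sketch of the extraction step appears below, again with an illustrative schema and model. Note that the LLM is used only to pull stated facts out of free text; the discharge-readiness percentage itself is better computed by a separate, validated predictive model rather than asked of the LLM.

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative field schema; a real schema would be defined clinically.
FIELDS = ["outstanding_orders", "severity_of_illness",
          "patient_location", "transportation_barrier"]

def extract_discharge_factors(note: str) -> dict:
    """Pull discharge-readiness factors out of a free-text clinical note."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract these fields from the note as JSON, using "
                        f"null when a field is not stated: {FIELDS}"},
            {"role": "user", "content": note},
        ],
    )
    # The structured output can then feed a validated readiness model.
    return json.loads(response.choices[0].message.content)
```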

Summarization of a single document 

  • Summarize a lengthy note to its key points. For example, consuming a long-winded, poorly written history of present illness (HPI) note section and generating a concise, efficient, easily consumable HPI.
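A minimal summarization sketch might look like the following, with the prompt explicitly instructing the model to use only facts present in the source text, since fabricated additions are the main hazard in this task; the sentence limit and prompt wording are assumptions.

```python
from openai import OpenAI

client = OpenAI()

def summarize_hpi(hpi_text: str, max_sentences: int = 4) -> str:
    """Condense a long HPI section; the model is told not to add facts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Rewrite this HPI in at most {max_sentences} "
                        "sentences. Use only facts present in the text; "
                        "do not add, infer, or embellish."},
            {"role": "user", "content": hpi_text},
        ],
    )
    return response.choices[0].message.content
```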

Summarization of multiple documents: pulling information from multiple parts of the electronic health record to create one or more unified documents.

  • Summarize a hospitalization for discharge, handoff, or shift change.
  • Generate a patient story.
  • Generate disease-specific summaries (e.g., a heart failure summary).
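For multi-document summarization, one simple pattern is to label each source before concatenation so that statements in the summary can be traced back for verification, as in the sketch below. The document labels, model, and prompt are illustrative, and long hospitalizations may exceed the model's context window, requiring chunking.

```python
from openai import OpenAI

client = OpenAI()

def summarize_hospitalization(documents: dict[str, str]) -> str:
    """documents maps a source label (e.g., 'admission note') to its text."""
    # Label each source so the summary can be traced back for verification.
    corpus = "\n\n".join(f"=== {label} ===\n{text}"
                         for label, text in documents.items())
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Summarize this hospitalization for a handoff. "
                        "Cite the source section for each statement and "
                        "use only facts present in the documents."},
            {"role": "user", "content": corpus},
        ],
    )
    return response.choices[0].message.content
```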

Translation 

  • Translate an after-visit summary into the patient’s preferred language.

Explanation 

  • “Translate” medical terminology in clinical documents into patient-friendly language at distinct reading levels (e.g., a progress note, laboratory result, pathology result, radiology report, or patient bill).
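A sketch of reading-level translation follows. The grade-level target and prompt wording are assumptions; the instruction to preserve every fact without adding advice is what keeps this task on the reductive end of the spectrum.

```python
from openai import OpenAI

client = OpenAI()

def explain_for_patient(clinical_text: str, grade_level: int = 6) -> str:
    """Rewrite clinical text in plain language at a target reading level."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Rewrite for a patient at a grade-{grade_level} "
                        "reading level. Keep every fact; change only the "
                        "wording. Do not add advice or interpretation."},
            {"role": "user", "content": clinical_text},
        ],
    )
    return response.choices[0].message.content
```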

Documentation 

  • Generate a discharge summary. 

(ambient) Text generation 

  • Generate a clinical note from a transcription of a conversation between clinicians and/or patients. 
  • Extract data from a patient-clinician encounter to create an assessment and plan.
  • Extract data from a patient-clinician encounter to create coded diagnoses, CPT codes, and billing charges.
  • Extract data from a patient encounter note to suggest improved documentation to justify the selected diagnoses and charges. 
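An ambient-documentation sketch is shown below. The essential safeguard is that the output is an explicitly labeled draft that must be reviewed, edited, and signed by the clinician before it enters the chart; the SOAP format, model, and prompt are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def draft_note_from_transcript(transcript: str) -> str:
    """Draft a SOAP-format note from a visit transcript.

    The output is a DRAFT only; it must be reviewed, edited, and signed
    by the clinician before it enters the chart.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Draft a SOAP note from this visit transcript. "
                        "Include only information stated in the transcript; "
                        "write 'not discussed' for missing sections."},
            {"role": "user", "content": transcript},
        ],
    )
    return ("*** DRAFT - REQUIRES CLINICIAN REVIEW ***\n"
            + response.choices[0].message.content)
```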

(ambient) Conversational interactions 

  • Use voice to prescribe a medication or place an order. 

Decision Support 

  • Find health care gaps and suggest remediation. For example, a patient with a diagnosis of type 2 diabetes who has not had an HbA1c lab test or nutritional counseling.
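One way to keep decision support on the lower-risk end is to derive the care gaps themselves from coded data with deterministic rules, and use the LLM only for the reductive task of drafting patient-facing wording for clinician review, as in the hypothetical sketch below.

```python
from openai import OpenAI

client = OpenAI()

def find_diabetes_care_gaps(problem_list: set[str],
                            completed: set[str]) -> list[str]:
    """Deterministic rule check: care gaps come from coded data, not the LLM."""
    gaps = []
    if "type 2 diabetes" in problem_list:
        for required in ("hba1c_test", "nutritional_counseling"):
            if required not in completed:
                gaps.append(required)
    return gaps

def draft_outreach(gaps: list[str]) -> str:
    """The LLM only drafts patient-facing wording, pending clinician review."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Draft a short, friendly patient message asking them "
                        "to schedule the following overdue items."},
            {"role": "user", "content": ", ".join(gaps)},
        ],
    )
    return response.choices[0].message.content
```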

Care Plans 

  • Evaluate all patient data to determine a patient-specific care plan. For example, extracting ED visit frequency, hospitalizations, work absence, prescribed medications, and pulmonary function testing to generate an Asthma Action Plan for a patient with a diagnosis of asthma.

Patient Medical Advice 

  • Leverage autoresponders. For example, a patient calls or uses their patient portal to leave a message saying they are concerned about a serum sodium level that returned at 134 mEq/L and is flagged as abnormal. The autoresponder would generate an explanation of this laboratory result.
  • Direct chat inquiries to the appropriate care setting. For example, “I have surgery at 8 AM today, but I am lost... Can you give me directions to the hospital?”

Diagnosis & Treatment 

  • Recommend specific missed diagnoses. For example, suggesting the diagnosis of “anemia” be added to the problem list or patient encounter note after extracting a laboratory hemoglobin value of 6.5 g/dL.
  • Recommend missing standard-of-care treatment. For example, a patient with a diagnosis of eczema who does not have a corticosteroid cream on their medication profile.
  • Recommend specific treatment for a given condition. For example, if the patient has a new diagnosis of streptococcal pharyngitis on their problem list, the EHR would recommend that the patient start a 10-day course of amoxicillin, use warm saline gargles and acetaminophen for pain, and notify their provider if they develop difficulty swallowing or breathing, fever, or respiratory distress.

It is clear that over time, many healthcare tasks will be automated, and even more will be augmented by AI. However, with few exceptions, the applications of LLMs and generative AI in healthcare right now are still in the feasibility and safety evaluation phase. As healthcare technology professionals, it is important to understand the risks associated with the use of these tools and to be aware of the evaluation that should be conducted before adoption in clinical practice. 

In section three of this series, the team will cover how generative AI, and specifically LLMs, can aid in documentation.

The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.