    As large language models (LLMs) move deeper into clinical workflows, it is increasingly critical for clinicians and health systems to be confident that these models work as intended. LLMs are the engine that powers generative AI solutions for services like clinical decision support and ambient listening. And assessing the value of those solutions means finding ways to evaluate LLMs for their intended use.

    As part of his Master of Health Care Innovation course, Using Data for Transformation, Daniel Shenfeld, PhD, a Penn researcher and principal at Manganese Health Data Solutions, explains how AI researchers evaluate LLMs. And he suggests a framework that can help others get started doing the same.

    Daniel Shenfeld, PhD, discusses considerations for evaluating health care LLMs.

    The Challenge

    When evaluating LLMs, Shenfeld explains, there are two core challenges:

    1. LLMs output text, not numbers, so we must establish a clear understanding of what it means for text to be correct.
    2. When we ask an LLM the same question multiple times, we may get similar meaning but completely different words. Therefore, we must be able to evaluate content across variations in language (a short illustration follows this list).
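
    To make the second challenge concrete, here is a minimal sketch in plain Python (the example answers are invented) showing that surface-level similarity metrics rate two clinically equivalent answers as quite different, which is why evaluation has to work at the level of meaning rather than exact wording:

        import re
        from difflib import SequenceMatcher

        def words(text: str) -> set[str]:
            """Lowercased word tokens, ignoring punctuation."""
            return set(re.findall(r"[a-z0-9]+", text.lower()))

        # Two answers with essentially the same clinical meaning but different wording.
        answer_a = "Metformin is usually the first-line medication for type 2 diabetes."
        answer_b = "For most patients with type 2 diabetes, treatment typically starts with metformin."

        # Word overlap (Jaccard) and character-level similarity both land well below 1.0,
        # even though a clinician would likely grade the two answers the same.
        overlap = len(words(answer_a) & words(answer_b)) / len(words(answer_a) | words(answer_b))
        char_ratio = SequenceMatcher(None, answer_a.lower(), answer_b.lower()).ratio()

        print(f"word overlap: {overlap:.2f}")
        print(f"character similarity: {char_ratio:.2f}")

        # Surface metrics miss paraphrases, so evaluation needs meaning-aware methods:
        # embedding similarity, rubric grading, or expert review.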

    Currently, the standard of practice is manual evaluation. Humans ask LLMs questions formulated in different ways. Then they grade the answers using rubrics—scoring criteria that assess features like accuracy, clarity of communication, and quality of recommendations. This process is illuminating but very labor intensive. And it is not comprehensive. An LLM that is effective for answering questions about diabetes, for example, may not be equipped to offer advice about heart disease.
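
    As a rough illustration of what rubric-based grading can look like once human scores are collected (the criteria, the 1-to-5 scale, and the scores below are hypothetical, not taken from the course), here is a short sketch:

        from dataclasses import dataclass
        from statistics import mean

        @dataclass
        class RubricScore:
            """One grader's scores for a single LLM answer, on a hypothetical 1-5 scale."""
            accuracy: int
            clarity: int
            recommendation_quality: int

        # Invented scores from two human graders reviewing the same answer.
        grades = [
            RubricScore(accuracy=5, clarity=4, recommendation_quality=4),
            RubricScore(accuracy=4, clarity=4, recommendation_quality=3),
        ]

        # Average each criterion across graders so reviewers can see where the model is weak.
        report = {
            criterion: mean(getattr(g, criterion) for g in grades)
            for criterion in ("accuracy", "clarity", "recommendation_quality")
        }
        print(report)  # {'accuracy': 4.5, 'clarity': 4.0, 'recommendation_quality': 3.5}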

    Evaluation in Context

    Shenfeld also emphasizes that it is important to evaluate the performance of LLMs in the specific contexts and workflows where they will be implemented. An LLM almost always has a whole system built around it—prompt templates, a retrieval step, guardrails, and often a human in the loop.

    Diagram of the core model surrounded by the four components that guide or constrain it, shown clockwise: a prompt template, a retrieval step, guardrails, and a human in the loop.
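
    A minimal sketch of that surrounding system, with every component stubbed out so it runs without any model or API (the function names, prompt template, and guardrail rule are placeholder assumptions, not a specific product's design):

        from typing import Callable, Optional

        PROMPT_TEMPLATE = (
            "You are assisting a clinician. Using only the context below, answer the question.\n"
            "Context:\n{context}\n\nQuestion: {question}\n"
        )

        def retrieve(question: str) -> str:
            """Placeholder retrieval step; a real system would search guidelines or the chart."""
            return "Stub guideline excerpt retrieved for: " + question

        def call_model(prompt: str) -> str:
            """Placeholder for the core LLM call, stubbed so the sketch runs offline."""
            return "Draft answer based on the retrieved context."

        def passes_guardrails(answer: str) -> bool:
            """Placeholder guardrail; real checks might block unsupported claims or missing caveats."""
            return bool(answer.strip())

        def answer_with_system(question: str, clinician_review: Callable[[str], bool]) -> Optional[str]:
            """The unit to evaluate is this whole pipeline, not the bare model."""
            context = retrieve(question)                                          # retrieval step
            prompt = PROMPT_TEMPLATE.format(context=context, question=question)   # prompt template
            draft = call_model(prompt)                                            # core model
            if not passes_guardrails(draft):                                      # guardrails
                return None
            return draft if clinician_review(draft) else None                     # human in the loop

        # Example run with a reviewer who approves everything, just to show the flow end to end.
        print(answer_with_system("When is metformin first-line?", clinician_review=lambda draft: True))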

    Even if a model scores well on paper, we should still ask whether this whole system works reliably for a given task—whether lab tests and benchmarks actually translate to accurate outputs, clinician time saved, and improved outcomes for patients.

    When considering LLM-based products, we must recognize that no evaluation method is perfect. The standardized benchmarks that developers use to test their models can overestimate real-world performance. And even the most carefully constructed custom testing does not preclude the possibility of “hallucinations” and wrong information.

    Therefore, we should start from a position of skepticism. It is important to ask how the models are evaluated and how their performance is monitored. And it is important to assess new tools in context: testing them on task-specific workflows and constructing effective guardrails for the cases where the LLM is wrong.
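
    One deliberately simple example of such a guardrail (the dose-checking rule, the regular expression, and the sample text below are invented for illustration, not a recommended production control): flag any draft that introduces a medication dose that does not appear in the retrieved source, and route it to a human.

        import re

        def mentioned_doses(text: str) -> set[str]:
            """Pull out dose-like strings (e.g. '500 mg') so drafts can be cross-checked."""
            return set(re.findall(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml)\b", text.lower()))

        def needs_human_review(draft: str, source_context: str) -> bool:
            """Flag drafts that state a dose not present in the retrieved source material."""
            return bool(mentioned_doses(draft) - mentioned_doses(source_context))

        source = "Guideline excerpt: start metformin 500 mg once daily with meals."
        good_draft = "Suggest starting metformin 500 mg once daily, taken with meals."
        bad_draft = "Suggest starting metformin 850 mg twice daily."

        print(needs_human_review(good_draft, source))  # False: the dose appears in the source
        print(needs_human_review(bad_draft, source))   # True: 850 mg is not in the source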

    The skills to evaluate and implement technology responsibly are essential to effective and ethical health care innovation. Learn more about how to evaluate and implement generative AI and other machine learning models in Using Data for Transformation.