Early adopters built custom internal platforms and structured oversight around them. With vendor healthcare versions now emerging, the technology may become easier to deploy—even as ECRI warns about misuse.
By Alyx Arnett
Large language models (LLMs) are increasingly being integrated into hospital workflows. Some health systems have already built secure internal chatbots to support documentation, messaging, elements of decision support, and more.
At the same time, healthcare-specific LLM products are beginning to enter the market just as ECRI ranks AI chatbot misuse as the top health technology hazard for 2026. The combination highlights the challenge: The technology promises efficiency, but deploying it inside regulated clinical environments requires structure.
At Stanford Medicine Children’s Health, which has run a HIPAA-compliant internal LLM chatbot for more than a year and is now beginning to adopt OpenAI for Healthcare, Keith Morse, MD, pediatric hospital medicine physician and chief medical informatics officer, says the focus has to be on how these tools are rolled out.
“It would be massively risky if you just sprinkle large language models throughout a health system without any broader support system or training infrastructure,” says Morse. “But with careful planning and rollout, those risks can be mitigated.”
How Stanford Has Been Using LLMs
About a year and a half ago, Stanford Medicine Children’s Health launched its internal LLM, built on Microsoft Azure infrastructure and powered by OpenAI models. The tool is available primarily to nonclinical staff, while Stanford’s physician groups have access to a different tool provided via Stanford University.
Since launch, the chatbot has been used around 350,000 times, according to Morse. He co-authored a study of early usage and found that the most common request “by far” was what he describes as “spell check on steroids”: refining, editing, and modifying large chunks of text.1 “So, ‘Take this paragraph and make it more professional,’ or ‘Take these two pages and condense them into a one-paragraph summary,’” he says.
The study also revealed that half of all conversations came from just 5% of users. “That tells us that, for 5% of people, they have found massive value in these tools,” Morse says. The organization has reached out to those users to better understand what is working and how it can be shared more broadly.
The health system also has LLM tools available through its electronic health record vendor, Epic. One is an automated in-basket drafting tool, which Stanford recently piloted, publishing results on clinicians’ experience with it.2
“That is a tool for when a patient sends a message to their doctor via our portal, the large language model both reads the message and then is able to look at some information from the patient’s chart, and using both of those sources of information, generate a draft response for basically what the provider would want to send,” says Morse, who co-authored the study. “The provider had the opportunity to edit that message before sending it back.”
The 11-month study included 61 clinicians across pediatric and obstetric clinics, with the obstetric group included to provide an adult comparison. Pediatric clinicians used AI-generated drafts in about 13% of messages, compared with roughly 18% in obstetrics. At eight-week follow-up, pediatric clinicians reported a lower task load when responding to patient messages, a perceived time savings of approximately 1.2 minutes per message, and an overall positive likelihood of recommending the tool. Burnout levels did not change.
Boston Children’s Internal GPT
Boston Children’s Hospital rolled out its own internal LLM two years ago. “Internal GPT” is a secure, HIPAA-conscious interface running on Microsoft Azure that connects to OpenAI models through an API, giving staff a sanctioned alternative to public chatbots.
An early look at usage showed it being used across job types. “We saw about a third, a third, a third breakdown between administrative users, research-based users, and then clinicians,” says Sarah Lindenauer, senior director of innovation strategy and operations at Boston Children’s Hospital.
One of the first ways the team made the tool more useful was by connecting it to internal content staff already rely on but often struggle to find. The hospital integrated Internal GPT with the intranet and internal document libraries, which Lindenauer says can be difficult to navigate. Instead of searching through menus or “scroll[ing] through PDFs to find the answer,” staff can ask a question in plain language and get a response that points them back to the original source.
Over time, the hospital has also built more guided experiences, called assistants, inside the platform to help people who don’t know where to start. One assistant helps staff run qualitative analysis by walking them through uploading a dataset and choosing whether the system should generate themes or map data to existing categories. Another is built around meeting documentation. Lindenauer says staff can upload a Zoom transcript, for instance, and choose how the notes should be generated, including the length, tone, and whether action items are formatted in a table.
More clinically oriented workflows are still in development and depend on deeper system integrations. Boston Children’s is not yet live with an Epic electronic medical record integration, Lindenauer says, but teams are exploring use cases that would pull information from clinical systems and produce a draft a clinician can review. One example is drafting letters of medical necessity, which she describes as a time-consuming process that requires pulling data from the medical record, checking requirements, and often referencing external documentation before the provider finalizes the letter.
Boston Children’s Hospital is now also beginning to adopt OpenAI for Healthcare, while keeping its Internal GPT tool in place as it evaluates the transition.
In Radiology, LLMs Are Showing Value
Sonia Gupta, MD, chief medical officer of Optum Enterprise Imaging and a practicing radiologist, helped develop LLM-based tools almost six years ago and has been using them in her own clinical work for about three years.
The models can be used to draft radiology reports and impressions and to surface references for a particular diagnosis, Gupta says, with the references and recommendations based on current medical guidelines.
Another common use case, Gupta says, is documentation tasks. An LLM can listen to clinician-patient conversations and generate draft notes. LLMs are also being used to translate complex medical language into more understandable terms for patients, and Gupta says they can help draft responses to patient portal messages and to requests for additional information about imaging results.
From Gupta’s perspective, the main driver behind adoption is workload.
“We talk a lot about the cognitive load in healthcare,” she says. “Our physicians are seeing more patients than they used to. There’s more complexity. We’re using a lot of different systems. So anything that can help with efficiency is going to be very valuable.”
The Risk With AI Chatbots
ECRI’s ranking of AI chatbots as its top 2026 hazard was largely due to concern over their use without sufficient scrutiny of their output.
ECRI’s own testing found that LLMs can produce incorrect or risky guidance while still sounding authoritative.3 In one example from the report, ECRI asked chatbots whether it was acceptable to place an electrosurgical return electrode over a patient’s shoulder blade. Several models said yes, even though that placement can increase the risk of burns.
The report also describes chatbot answers that could steer users toward the wrong supplies, such as selecting an ultrasound gel that could interfere with a point-of-care ultrasound exam, or choosing isolation gowns that could increase exposure risk to hazardous pathogens.
Francisco Rodriguez Campos, PhD, principal project officer at ECRI, says the concern is that staff may accept an answer too quickly, or they may not realize how much the wording of the question changes the output.
Separately, a systematic review led by Felix Busch, MD, of the Department of Diagnostic and Interventional Radiology at the Technical University of Munich, found similar concerns.4 The study analyzed 89 studies published in 2022 and 2023 across 29 medical specialties and found that “[m]any LLMs are not optimized for medical use, lack transparency about data use, and can be difficult for some users to access. Additionally, the text they generate may sometimes be inaccurate, incomplete, or biased, raising safety concerns.”
The review found that output limitations such as incorrectness and incomprehensiveness were present in almost 90% of studies, and safety issues were reported in over 40%, “which is too frequent to ignore in patient-facing settings unless the task is tightly constrained and supported by workflow guardrails,” says Busch.
In a separate study, Busch and co-investigators analyzed LLMs for simplifying oncologic CT reports for patients with cancer.5 The simplified reports reduced median reading time from seven minutes to two and improved patient-reported comprehension. However, radiologist review identified factual errors in 6% of cases and clinically relevant omissions in 7%.
And while hallucinations are a known concern with LLMs, Busch says the error type that worries him more is the omission. “This is far harder to catch than a hallucination that looks obviously wrong,” he says. “If an LLM summarizes a radiology report but simply leaves out a finding of a pulmonary nodule, the resulting text remains coherent and grammatically perfect, giving the clinician or patient no trigger to suspect something is missing.”
Safer Implementation
Campos says the message for health systems isn’t to avoid LLMs, but to be deliberate about how they use them. “AI is here to stay,” Campos says. “You have to start implementing it in your institution, but you have to be very careful of where are you starting.”
He recommends starting with lower-risk uses, such as summarizing information already published on a hospital’s own website for patients, rather than higher-risk uses that involve access to clinical records or other sensitive data.
Busch says retrieval-augmented generation (RAG), an approach that grounds a model’s answers in vetted institutional content rather than letting it generate responses from memory alone, can make answers more reliable. “RAG can genuinely help,” he says. But he warns against treating it as a guarantee.
“Retrieval can fail silently,” Busch says. “The knowledge base can be incomplete or outdated, and the model can still blend retrieved text with prior generalizations in ways that introduce omissions or plausible inventions.”
He says safe implementation requires maintaining and updating the knowledge base, tracking what sources were retrieved, and monitoring outputs.
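As a rough illustration of the pattern Busch describes, the sketch below grounds answers in a small vetted knowledge base, records which source each answer drew on, and reports retrieval failure explicitly rather than improvising. The keyword-overlap retriever, the document store, and all names here are illustrative stand-ins, not a description of any system the hospitals in this article actually run.

```python
# Illustrative RAG-style lookup: answer only from a vetted knowledge base,
# track provenance, and surface retrieval failure instead of guessing.

def retrieve(query: str, knowledge_base: dict[str, str], min_overlap: int = 2):
    """Score each vetted document by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    hits = []
    for source_id, text in knowledge_base.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap >= min_overlap:
            hits.append((overlap, source_id, text))
    return sorted(hits, reverse=True)

def answer(query: str, knowledge_base: dict[str, str]) -> dict:
    hits = retrieve(query, knowledge_base)
    if not hits:
        # Fail loudly: no vetted source matched, so refuse rather than invent.
        return {"answer": None, "sources": [], "status": "no_vetted_source"}
    _, source_id, text = hits[0]
    # A production system would pass the retrieved passage to an LLM here;
    # this sketch simply returns the vetted text with its provenance.
    return {"answer": text, "sources": [source_id], "status": "ok"}
```

The point of the sketch is the logging and the refusal path: every response carries the IDs of the sources it was grounded in, which is what makes silent retrieval failure, and a stale or incomplete knowledge base, detectable during monitoring.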
Beyond the technical setup, Busch argues that hospitals need formal oversight. Larger systems may need AI governance committees to define liability, set rules for access, and serve as a point of contact for regulators.
When evaluating commercial LLM products, he says health systems should press vendors on data handling and change control: Is patient data retained or used for training? Can the model version be fixed? Are updates announced? Is there a way to test changes or roll them back?
Those answers, he says, determine whether a health system can manage privacy risk and respond when problems arise.
Campos stresses the need for ongoing monitoring, noting that new versions do not always perform better than older ones, and outputs should be reviewed over time—especially after updates.
“If you’re implementing any one of these models, just be very clear about what it would be accessing, what is your risk level, and I think something that is very important—educate your users,” says Campos.
A Blueprint for Rolling Out LLMs
At Boston Children’s Hospital and Stanford Medicine Children’s Health, LLM access has been paired from the start with governance and education.
Boston Children’s built an AI governance structure that defines what is allowed, what is not, and what new tools must demonstrate before they are used widely. Lindenauer says that includes validation requirements and ongoing monitoring to ensure tools continue to meet safety and performance expectations.
The hospital also invested early in AI literacy. “You have to know how to derive value from it,” Lindenauer says. “You have to know how to effectively prompt.” To lower the barrier, her team created a library of roughly 450 prompts that staff can copy and adapt.
Stanford developed its own training tracks. Morse says the organization created an internal “large language models 101” resource and hosts hands-on sessions it calls “prompt-a-thons,” where staff practice prompting techniques and compare approaches. “There is no substitute for experience,” Morse says. “You can’t do that in the abstract.”
Both health systems have been deliberate about which workflows come first. Morse points to Stanford’s use of LLMs largely for administrative tasks. “That’s the space that we see as appropriate for people to be developing their professional acumen around large language models before we even start talking about large language models to identify sepsis or direct care or suggest next lab values or those types of things,” he says.
Boston Children’s has been equally cautious as it considers uses closer to patient care.
“We’ve had to be very clear” that LLM outputs are informational, Lindenauer says. “All responsibility for any care decision…is the responsibility of the clinician.”
The Next Phase
Busch says two developments are shaping what comes next. One is the rise of multimodal foundation models that can process text alongside medical images or biosignals. “This opens up entirely new possibilities for diagnostic support,” he says. “On the other hand, this further limits explainability.”
The other is the improvement of open-source LLMs. As those models approach the capabilities of proprietary systems, Busch says hospitals may have more options to run models locally, keep data on premises, and tailor tools to their own environments.
At the same time, healthcare-specific LLM platforms are entering the market. Stanford Medicine Children’s Health and Boston Children’s Hospital are among early adopters of OpenAI for Healthcare, moving from internally built tools to vendor-supported systems. For Boston Children’s, Lindenauer says the appeal is practical. “Long term, it’s probably going to make more sense from a resource investment and pace of innovation perspective for us to partner,” she says.
Gupta expects adoption to continue expanding as health systems gain experience. “It’s been a gradual rollout, but momentum is starting to build,” she says.
Campos, however, says health systems should not confuse growing adoption with readiness for high-stakes decision-making. “There are a million applications that are more deeply ingrained within patient care,” he says. “We are not there yet, just because we realize, in that space, the risks start to outweigh the benefits.”
References
1. Black KC, Haberkorn WJ, Ma SP, et al. Uses of generative AI by non-clinician staff at an academic medical center. npj Health Syst. 2026;3(13).
2. Liang AS, Vedak S, Dussaq A, et al. Artificial intelligence-generated draft replies to patient messages in pediatrics. JAMIA Open. 2025;8(6).
3. ECRI. The misuse of AI chatbots in healthcare. Hazard #1—2026 Top 10 health technology hazards. 2026 Jan 21.
4. Busch F, Hoffmann L, Rueger C, et al. Current applications and challenges in large language models for patient care: a systematic review. Commun Med (Lond). 2025;5(1):26.
5. Prucker P, Bressem KK, Peeken J, et al. A prospective controlled trial of large language model-based simplification of oncologic CT reports for patients with cancer. Radiology. 2025;317(2):e251844.
Alyx Arnett is chief editor of 24×7. Questions or comments? Email [email protected].
ID 394599216 | Ai © BiancoBlue | Dreamstime.com