Evaluating search engines and large language models for answering health questions
npj | digital medicine
Article
Published in partnership with Seoul National University Bundang Hospital
https://doi.org/10.1038/s41746-025-01546-w
Marcos Fernández-Pichel, Juan C. Pichel & David E. Losada
Search engines (SEs) have traditionally been primary tools for information seeking, but the new large
language models (LLMs) are emerging as powerful alternatives, particularly for question-answering
tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented (RAG) variants in answering 150 health-related questions from the TREC Health
Misinformation (HM) Track. Results reveal that SEs correctly answer 50–70% of questions and are often hindered
by the many retrieved results that do not address the health question. LLMs deliver higher accuracy,
correctly answering about 80% of questions, though their performance is sensitive to input prompts.
RAG methods significantly enhance smaller LLMs’ effectiveness, improving accuracy by up to 30% by
integrating retrieval evidence.
Recent progress in natural language processing (NLP) has positioned
large language models (LLMs) as major players in numerous information access
tasks1–3. The release of ChatGPT in November 2022 was a game-changer globally, marking a significant milestone and revolutionizing
many sectors. One of the outstanding features of current LLMs is their
ability to generate coherent and human-like text, which has garnered
attention and excitement among practitioners, researchers, and the general public. This breakthrough has precipitated a transformative shift in
the orientation of information access research towards LLMs, their
potential applications, and the interconnection between LLMs and other
computer-based tools. The conversational paradigm has gained traction,
enabling more interactive and user-friendly search experiences4–7, and
many citizens now turn to LLM-based conversational AIs to
address many types of information needs. However, traditional
search still plays a crucial role in the generative AI era8. While LLMs may
support advanced information access and reasoning capabilities, there is
still a need to advance retrieval technologies. The role of traditional web
search engines in answering user-submitted queries is far from being
superseded. Web search is widely used to obtain health advice9, and effective
medical information retrieval has attracted the attention of the scientific
community over the years10. For example, Wang et al. evaluated several
search engines in terms of their usability and effectiveness for searching
for breast cancer information11. They found that the results highly overlapped among the four search engines tested, all of them providing rich
information about breast cancer. Zuccon and colleagues12 studied the
effectiveness of search engines for the so-called “diagnostic medical circumlocutory queries”, which are searches issued by individuals seeking
information about their health using casual descriptions of symptoms
rather than medical words.
The emergence and global adoption of advanced LLMs have sparked
the urgent need to explore and understand their capacities and knowledge
acquisition attributes. Some research studies have focused on the capabilities
of these models under specific language understanding and reasoning
benchmarks13–15. In particular, interest in assessing the correctness of
health-related AI-based completions has escalated. For example, Chervenak
et al. demonstrated ChatGPT’s abilities to answer fertility questions16, and
Duong and Solomon17 evaluated the effectiveness of LLMs in comparison to
humans when tasked with answering multiple-choice questions on human
genetics. Similarly, Holmes et al.18 conducted a comparative study of LLMs’
knowledge on the highly specialized subject of radiation oncology physics.
Recently, Elgedawy and colleagues tested the capacity of LLMs to query an
extensive volume of clinical records19, and Kim et al. assessed ChatGPT’s
accuracy in answering 57 epilepsy-related questions20. Kim and others21
examined the diagnostic accuracy of GPT-4 compared to mental health
professionals and other clinicians using clinical vignettes of obsessive-compulsive disorder (OCD). Other papers reported studies
on the role of LLMs for biomedical tasks22, patient-specific electronic health record (EHR) questions23,
and bariatric surgery topics24. Tang et al.25 evaluated LLMs’ ability to perform zero-shot medical summarization across six clinical domains. With a
broader perspective, other authors involved physicians in a thorough evaluation of the accuracy of ChatGPT in answering health queries26 or evaluated ChatGPT using the applied knowledge test (AKT) from the Royal
College of General Practitioners27. All of the aforementioned studies are
restricted to a single model, usually ChatGPT, and/or to a specific medical
area. Kusa et al.28 analyzed the impact of users’ beliefs and prompt formulations on completions related to health diagnoses. The study explored
the sensitivity of two GPT models to variations in the user’s context.
Caramancion29 explored users’ preferences between search engines and
LLMs. This analysis revealed interesting trends in user preferences, indicating a distinct tendency for participants to choose search engines for
straightforward, fact-based questions. In contrast, LLMs were more frequently favored for tasks needing detailed comprehension and language
processing. The results clearly showed that users prefer to search for medical
information using traditional search engines. Oeding and colleagues30
compared GPT-4’s and Google’s results for searches concerning the Latarjet
procedure (anterior shoulder instability). These authors discovered that
GPT-4 provided more information based on academic sources than Google
in response to the patient’s queries.
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain.
npj Digital Medicine | (2025)8:153
In the area of retrieval-augmented models for the health domain, Li
et al.31 fine-tuned Llama with medical conversations and injected medical
evidence from Wikipedia and other medical sources, while Koopman and
Zuccon32 evaluated ChatGPT’s capacity to answer health questions (based
solely on its internal knowledge or, alternatively, fed with offline retrieval
evidence). Xiong et al.33 proposed MedRAG, a system that indexes multiple
medical corpora and retrieves relevant evidence to ground different LLMs.
This study found that simpler models can reach the performance of GPT-4
when grounded with relevant medical information.
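To make the idea of grounding concrete, the following is a minimal sketch of how retrieved passages might be placed in front of a health question before it is sent to an LLM. It is an illustration only: the function name `build_rag_prompt`, the `max_passages` parameter, and the prompt wording are assumptions introduced here, not the templates used in any of the cited studies, and the passages are placeholders rather than real retrieval results.

```python
from typing import Sequence


def build_rag_prompt(question: str, passages: Sequence[str], max_passages: int = 3) -> str:
    """Build a prompt that places retrieved passages before the question.

    The template wording is hypothetical; it is not taken from the cited studies.
    """
    # Keep only the top-ranked passages (assumes `passages` is sorted by retrieval rank).
    selected = [p.strip() for p in passages[:max_passages]]
    # Number each passage so individual pieces of evidence can be told apart.
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return (
        "Use the evidence below to answer the health question.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Placeholder passages standing in for search-engine results.
    retrieved = ["First retrieved passage.", "Second retrieved passage."]
    print(build_rag_prompt("Hypothetical health question?", retrieved))
```

The resulting string would then be passed to whichever LLM is being evaluated; capping the number of passages is one simple way to keep the prompt within a smaller model's context window.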
In line with these developments, it is crucial to acknowledge the significance of accurate health information and there is a pressing need to
evaluate th (...truncated)