Evaluating search engines and large language models for answering health questions

npj Digital Medicine | Article
Published in partnership with Seoul National University Bundang Hospital
https://doi.org/10.1038/s41746-025-01546-w

Marcos Fernández-Pichel, Juan C. Pichel & David E. Losada

Search engines (SEs) have traditionally been primary tools for information seeking, but the new large language models (LLMs) are emerging as powerful alternatives, particularly for question-answering tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented (RAG) variants in answering 150 health-related questions from the TREC Health Misinformation (HM) Track. Results reveal SEs correctly answer 50–70% of questions, often hindered by many retrieval results not responding to the health question. LLMs deliver higher accuracy, correctly answering about 80% of questions, though their performance is sensitive to input prompts. RAG methods significantly enhance smaller LLMs’ effectiveness, improving accuracy by up to 30% by integrating retrieval evidence.

Recent progress in natural language processing (NLP) has positioned Large Language Models as major players in numerous Information Access tasks1–3. The release of ChatGPT in November 2022 has been a game changer globally, marking a significant milestone and revolutionizing many sectors. One of the outstanding features of current LLMs is their ability to generate coherent and human-like text, which has garnered attention and excitement among practitioners, researchers, and the general public. This breakthrough has precipitated a transformative shift in the orientation of information access research towards LLMs, their potential applications, and the interconnection between LLMs and other computer-based tools.
The conversational paradigm has gained traction, enabling more interactive and user-friendly search experiences4–7, and many people now turn to LLM-based conversational AIs for a wide range of information needs. However, traditional search still plays a crucial role in the generative AI era8. While LLMs may support advanced information access and reasoning capabilities, there is still a need to advance retrieval technologies. Traditional web search engines are far from being relegated in answering user-submitted queries. Web search is widely used to obtain health advice9, and effective medical information retrieval has attracted the attention of the scientific community over the years10. For example, Wang et al. evaluated several search engines in terms of their usability and effectiveness for searching for breast cancer information11. They found that the results highly overlapped among the four search engines tested, all of them providing rich information about breast cancer. Zuccon and colleagues12 studied the effectiveness of search engines for the so-called “diagnostic medical circumlocutory queries”, which are searches issued by individuals seeking information about their health using casual descriptions of symptoms rather than medical terms.

The emergence and global adoption of advanced LLMs have sparked the urgent need to explore and understand their capacities and knowledge acquisition attributes. Some research studies have focused on the capabilities of these models under specific language understanding and reasoning benchmarks13–15, and interest in assessing the correctness of health-related AI-based completions, in particular, has escalated. For example, Chervenak et al. demonstrated ChatGPT’s abilities to answer fertility questions16, and Duong and Solomon17 evaluated the effectiveness of LLMs in comparison to humans when tasked with answering multiple-choice questions on human genetics.
Similarly, Holmes et al.18 conducted a comparative study of LLMs’ knowledge on the highly specialized subject of radiation oncology physics. Recently, Elgedawy and colleagues tested the capacity of LLMs to query an extensive volume of clinical records19, and Kim et al. assessed ChatGPT’s accuracy in answering 57 epilepsy-related questions20. Kim and others21 examined the diagnostic accuracy of GPT-4 compared to mental health professionals and other clinicians employing clinical vignettes of obsessive-compulsive disorder (OCD). Other papers reported studies on the role of LLMs for biomedical tasks22, patient-specific EHR questions23, and bariatric surgery topics24. Tang et al.25 evaluated LLMs’ ability to perform zero-shot medical summarization across six clinical domains. With a broader perspective, other authors involved physicians in a thorough evaluation of the accuracy of ChatGPT in answering health queries26 or evaluated ChatGPT using the applied knowledge test (AKT) from the Royal College of General Practitioners27. All of the aforementioned studies are restricted to a single model, usually ChatGPT, and/or to a specific medical area.

Author affiliation: Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain. npj Digital Medicine (2025) 8:153.

Kusa et al.28 analyzed the impact of users’ beliefs and prompt formulations on completions related to health diagnoses. The study explored the sensitivity of two GPT models to variations in the user’s context. Caramancion29 explored users’ preferences between search engines and LLMs. This analysis revealed interesting trends in user preferences, indicating a distinct tendency for participants to choose search engines for straightforward, fact-based questions. In contrast, LLMs were more frequently favored for tasks needing detailed comprehension and language processing.
The results clearly showed that users prefer to search for medical information using traditional search engines. Oeding and colleagues30 compared GPT-4’s and Google’s results for searches concerning the Latarjet procedure (anterior shoulder instability). These authors discovered that GPT-4 provided more information based on academic sources than Google in response to the patient’s queries.

In the area of retrieval-augmented models for the health domain, Li et al.31 fine-tuned Llama with medical conversations and injected medical evidence from Wikipedia and other medical sources, while Koopman and Zuccon32 evaluated ChatGPT’s capacity to answer health questions (based solely on its internal knowledge or, alternatively, fed with offline retrieval evidence). Xiong et al.33 proposed MedRAG, a system that indexes multiple medical corpora and retrieves relevant evidence to ground different LLMs. This study found that simpler models can reach the performance of GPT-4 when grounded with relevant medical information.

In line with these developments, it is crucial to acknowledge the significance of accurate health information, and there is a pressing need to evaluate th (...truncated)