A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine

Hanguang Xiao (simenxiao1211@163.com), Feizhong Zhou (nykxo99@stu.cqut.edu.cn), Xingyue Liu (11923040215lxy@2019.cqut.edu.cn), Tianqi Liu (51222314121@stu.cqut.edu.cn), Zhipeng Li (zhipengli@stu.cqut.edu.cn), Xin Liu (xinliu@stu.cqut.edu.cn), and Xiaoxuan Huang (hxx@stu.cqut.edu.cn)
College of Artificial Intelligence, Chongqing University of Technology, Chongqing, China
(2024)
Abstract.

Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have garnered significant attention due to their powerful and general capabilities in understanding, reasoning, and generation, thereby offering new paradigms for the integration of artificial intelligence with medicine. This survey comprehensively overviews the development background and principles of LLMs and MLLMs, as well as explores their application scenarios, challenges, and future directions in medicine. Specifically, this survey begins by focusing on the paradigm shift, tracing the evolution from traditional models to LLMs and MLLMs, summarizing the model structures to provide detailed foundational knowledge. Subsequently, the survey details the entire process from constructing and evaluating to using LLMs and MLLMs with a clear logic. Following this, to emphasize the significant value of LLMs and MLLMs in healthcare, we survey and summarize 6 promising applications in healthcare. Finally, the survey discusses the challenges faced by medical LLMs and MLLMs and proposes a feasible approach and direction for the subsequent integration of artificial intelligence with medicine. Thus, this survey aims to provide researchers with a valuable and comprehensive reference guide from the perspectives of the background, principles, and clinical applications of LLMs and MLLMs.

Large Language Models, Multimodal Large Language Models, Medicine, Healthcare, Clinical Applications
copyright: acmlicensed; journal year: 2024; doi: XXXXXXX.XXXXXXX; journal: CSUR; ccs: Computing methodologies, Natural language processing; ccs: Applied computing, Health informatics; ccs: Computer systems organization, Neural networks

1. Introduction

Since the introduction of the Transformer (Vaswani et al., 2017), there has been a paradigm shift in the fields of Natural Language Processing (NLP) and Computer Vision (CV). The Transformer's strong parallel computing capability and self-attention mechanism enable the integration of vast amounts of training data, laying the foundation for the development of LLMs and MLLMs (Radford et al., 2018). To date, a series of Transformer-based LLMs and MLLMs have emerged (this survey primarily focuses on the vision-language modality), such as the PaLM series (Chowdhery et al., 2023; Anil et al., 2023), GPT series (Brown et al., 2020; Ouyang et al., 2022), and LLaMA series (Touvron et al., 2023a, b) among LLMs, as well as Gemini (Team et al., 2023), GPT-4 (Achiam et al., 2023), and Claude 3 (Anthropic, 2024) among MLLMs. Owing to their powerful capabilities in understanding, reasoning, and generation, they have achieved state-of-the-art results in various downstream tasks, including text generation, machine translation, and visual question answering (VQA). LLMs and MLLMs demonstrate increasingly powerful generalization abilities, with their impact extending to the medical domain and accelerating the integration of artificial intelligence and medicine (Thirunavukarasu et al., 2023; Thapa and Adhikari, 2023). In particular, Google's Med-PaLM 2 (Singhal et al., 2023b) achieved a score of 86.5 on the United States Medical Licensing Examination (USMLE) (Jin et al., 2021), reaching the level of medical experts (Zhou et al., 2023a) and further showcasing the enormous potential of LLMs in the medical field. In addition, more medical LLMs and MLLMs, such as ChatDoctor (Li et al., 2023e), LLaVA-Med (Li et al., 2024b), and XrayGLM (Wang et al., 2023a), represent new avenues that artificial intelligence provides for the medical field, offering potential solutions for medical report generation (Van Veen et al., 2023a, b; Wang et al., 2023c), clinical diagnosis (Wang et al., 2023h; Tu et al., 2024a; Shu et al., 2023), mental health services (Chen et al., 2023f; Liu et al., 2023a), and a range of other clinical applications.

Figure 1. The process of constructing and evaluating medical LLMs and MLLMs.

Despite the academic breakthroughs of LLMs and MLLMs in the medical field, hospitals still face several challenges in training their own medical LLMs and MLLMs and deploying them in practical clinical applications. First, training requires a substantial amount of medical data, which is often costly to acquire, necessitates annotation by medical experts, and raises concerns regarding data privacy (Zhang et al., 2024a), all of which complicate model development. Second, the immense parameter counts and computation of LLMs and MLLMs demand substantial computational resources for training and deployment (Moor et al., 2023a; Qiu et al., 2023), significantly raising the threshold for hospitals to adopt them. Third, unlike traditional deep learning models, medical LLMs and MLLMs are interactive generative models: beyond medical expertise, one must also consider their instruction-following ability (Zhang et al., 2023b; Ouyang et al., 2022; Liu et al., 2024a), safety, and ethical issues (Cui et al., 2023), which require additional training strategies to address. Fourth, because of their powerful general capabilities, LLMs and MLLMs no longer target single tasks like traditional models (He et al., 2023; Zhou et al., 2023b) and thus require more comprehensive evaluation: apart from measuring their accuracy on benchmark datasets, it is also crucial to evaluate aspects such as ethics, bias, and toxicity (Cui et al., 2023). Furthermore, the development of LLMs and MLLMs in the medical field is still at an early stage, with their application scenarios remaining unclear, and they face a series of challenges such as hallucinations (Umapathi et al., 2023; Rawte et al., 2023; Ji et al., 2023) and a lack of recency (Thirunavukarasu et al., 2023), which significantly hinder their practical clinical application.

Figure 2. The overall structure of the survey. Section 2 to Section 5 focus on principles; Section 6 and Section 7 focus on applications and impacts.

To address the aforementioned issues, this survey begins by examining the background of LLMs and MLLMs from the perspective of paradigm shifts. It then summarizes the mainstream architectures of current medical LLMs and MLLMs and catalogs the medical LLMs and MLLMs that currently exist. Following this, the survey collects medical-related datasets and elucidates the entire process of medical LLMs and MLLMs from construction to evaluation in a clear and logical manner, as shown in Fig. 1. To maximize the role of LLMs and MLLMs in clinical settings, the survey provides practical tips for using them. Furthermore, to emphasize the potentially significant impact of LLMs and MLLMs in medicine, this survey summarizes their applications in clinical medicine and analyzes their current limitations along with possible solutions.

Compared with related surveys, which tend to subsume MLLMs under LLMs and focus predominantly on LLMs, a detailed investigation of MLLMs has been lacking (Zhou et al., 2023a; He et al., 2023). Additionally, most related articles concentrate on the applications and impacts of LLMs in medicine while lacking in-depth discussion of the technical aspects (Thirunavukarasu et al., 2023; Thapa and Adhikari, 2023; Qiu et al., 2023; Omiye et al., 2023; Bhayana, 2024), such as datasets, model structures, and construction methods. In contrast, this survey not only covers the background and principles of LLMs and MLLMs but also discusses their applications and impacts in medicine, presenting a clear logical structure with substantive depth and breadth. In summary, our contributions can be summarized as follows:

  • We have investigated not only LLMs but also MLLMs in the medical domain, extensively summarizing the development background and structure of both. This furnishes medical professionals and researchers with detailed foundational knowledge for understanding LLMs and MLLMs.

  • We have elucidated the entire process of training, evaluating, and using LLMs and MLLMs in a clear and logical manner, including pre-training methods, fine-tuning methods, evaluation methods, and usage tips, along with relevant medical datasets. This furnishes medical professionals and researchers with a detailed instruction guide for constructing and utilizing medical LLMs and MLLMs.

  • We have summarized the applications of LLMs and MLLMs in medicine, along with the current limitations and potential solutions in clinical practice. This provides medical professionals and researchers with a valuable reference guide for subsequent application development.

Through the comprehensive details in this survey, we aim to accelerate the development of clinical products based on LLMs and MLLMs, further fostering the integration of artificial intelligence and the medical domain. The overall structure of this survey is depicted in Fig. 2: Section 2 provides an overview of the development background of LLMs and MLLMs. Section 3 introduces the model structures of existing LLMs and MLLMs and explains the differences among them. Section 4 summarizes the methods for constructing medical LLMs and MLLMs. Section 5 presents evaluation methods and usage tips for LLMs and MLLMs to fully leverage their potential. Section 6 explores possible applications of medical LLMs and MLLMs at the current stage. Section 7 discusses the challenges and limitations of LLMs and MLLMs in clinical applications, along with potential solutions. Finally, Section 8 concludes this survey. In summary, readers seeking to understand the professional knowledge and principles of medical LLMs and MLLMs are advised to read Section 2 to Section 5; those interested in current applications, challenges, and feasible future directions of LLMs and MLLMs in medicine are recommended to read Section 6 and Section 7.

2. Background of LLMs and MLLMs

In this section, we divide the entire development of the NLP field into four stages centered on paradigm shifts: (1) Supervised Learning; (2) Unsupervised Pre-training and Fine-tuning; (3) Unsupervised Pre-training and Prompt; (4) Text-only to Multimodal. We will review the development of LLMs and MLLMs in terms of the above four stages from Section 2.1 to Section 2.4. Additionally, recent research (Zhou et al., 2024) has demonstrated the influence of high-quality datasets on LLMs and MLLMs, so we will analyze the recent trend of transitioning from large-scale datasets to high-quality datasets in Section 2.5.

2.1. Supervised Learning

Supervised learning (Caruana and Niculescu-Mizil, 2006) is a common paradigm in machine learning, where the strategy involves optimizing the loss function as depicted below:

(1) \underset{\boldsymbol{\theta}}{\arg\min}\;\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}\big(f(\boldsymbol{x}_{i};\boldsymbol{\theta}),\,y_{i}\big)+\lambda\,\Omega(\boldsymbol{\theta})

where the first term represents the empirical risk and the second term represents the regularization term. Specifically, supervised learning entails training a model to learn the mapping f between input variables x and output variables y, aiming to minimize the discrepancy between f(x; θ) and y, where θ denotes the model parameters, x can be manually extracted features or raw text, and y serves as the supervision information, which can be category labels, text, or other forms.
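To make this concrete, below is a minimal PyTorch sketch of Equation (1), assuming a toy linear classifier, a cross-entropy loss, and an L2 regularizer; the model, data, and hyperparameters are placeholders for illustration rather than any specific system discussed in this survey.

```python
import torch
import torch.nn as nn

# Minimal sketch of Eq. (1): empirical risk plus an L2 regularization term.
model = nn.Linear(128, 10)               # f(x; theta): toy classifier over extracted features
criterion = nn.CrossEntropyLoss()        # loss L
lam = 1e-4                               # regularization weight lambda
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(x, y):
    optimizer.zero_grad()
    logits = model(x)                                        # f(x_i; theta)
    empirical_risk = criterion(logits, y)                    # (1/n) sum L(f(x_i; theta), y_i)
    reg = sum((p ** 2).sum() for p in model.parameters())    # Omega(theta) = ||theta||^2
    loss = empirical_risk + lam * reg
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one optimization step on a random mini-batch of 32 feature vectors.
x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
print(train_step(x, y))
```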

Before pre-training methods became prevalent, the supervised learning paradigm was the mainstream in the NLP domain. Early NLP relied heavily on feature engineering (Liu et al., 2023e), where researchers needed to extract and select features from datasets and then utilize these features to accomplish specific tasks such as text classification (Liu and Yu, 2005) and machine translation (Och et al., 2004). With the rise of deep learning (LeCun et al., 2015), models could be trained end-to-end, and the focus of research shifted from feature engineering to model architecture design, with models based on CNNs (Zhang and Wallace, 2015) and LSTMs (Sutskever et al., 2014) being prominent. During the era of supervised learning in NLP, the research focus thus moved from feature selection to model architecture design, namely, a transition from feature engineering to structure engineering.

2.2. Unsupervised Pre-training and Fine-tuning

Supervised learning relies on annotated datasets for training, which provide explicit standards for model optimization (Cunningham et al., 2008). However, for certain tasks, especially in medical domains, it may be challenging to acquire a sufficient amount of annotated data due to the scarcity of specialized annotators and the complexity of the annotation process (Zhang et al., 2024a). After the introduction of the Transformer (Vaswani et al., 2017) in 2017, the learning paradigm in NLP changed dramatically, with the supervised learning paradigm becoming increasingly marginalized (Liu et al., 2023e).

Based on the Transformer architecture, GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) achieved state-of-the-art results at the time by performing unsupervised pre-training on a large amount of unlabeled text, followed by supervised fine-tuning with an appropriate objective function for each downstream task. The proposal of GPT and BERT ushered in a new paradigm in NLP, i.e., unsupervised pre-training & fine-tuning (Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019). In this paradigm, leveraging the highly scalable nature of the Transformer, models are initially trained in an unsupervised manner on large-scale unlabeled data using tasks such as masked language modeling (MLM) or next sentence prediction (NSP) (detailed in Section 4.2), and subsequently adapted to target tasks using corresponding supervised objectives (Radford et al., 2018). The advantages of this paradigm are as follows: (1) the pre-training data can be drawn from any unlabeled text corpus, removing the requirement of supervised learning for sufficient annotated data (Zhou et al., 2023b); (2) training the model on large-scale unlabeled data allows it to learn more general and abstract language representations, enhancing its generalization ability; (3) during fine-tuning, only the objective function corresponding to the downstream task needs to be designed, without extensive task-specific architectural modifications, which facilitates the shift from structure engineering to objective engineering.
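As a concrete illustration of the MLM objective mentioned above, the following PyTorch sketch masks roughly 15% of the tokens in a batch and marks the remaining positions to be ignored by the loss. The vocabulary size, [MASK] id, and masking rate mimic BERT-style setups but are assumptions for illustration, and BERT's 80/10/10 replacement scheme is omitted for brevity.

```python
import torch

# Hypothetical token ids; MASK_ID=103 mimics BERT-style tokenizers but is illustrative only.
MASK_ID, VOCAB_SIZE, MASK_PROB = 103, 30522, 0.15

def mask_tokens(input_ids: torch.Tensor):
    """Simplified BERT-style masking: select ~15% of tokens as prediction targets."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < MASK_PROB
    labels[~mask] = -100                  # unmasked positions are ignored by the loss
    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID             # replace selected tokens with [MASK]
    return corrupted, labels

tokens = torch.randint(1000, VOCAB_SIZE, (2, 16))   # a toy batch of token ids
corrupted, labels = mask_tokens(tokens)
# A Transformer encoder would then be trained to recover the original ids at the
# masked positions; fine-tuning later swaps this objective for a task-specific head.
print(corrupted.shape, (labels != -100).sum().item())
```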

2.3. Unsupervised Pre-training and Prompt

Although models such as GPT and BERT achieved state-of-the-art results in downstream tasks like machine translation, sentiment analysis, and question answering (QA), they still require task-specific fine-tuning for different downstream tasks. To construct a general language model capable of handling various tasks without specific fine-tuning, Radford et al. (Radford et al., 2019) collected over 8 million documents from the internet, totaling 40 GB of text data containing examples from various domains and tasks, and trained GPT-2 on this dataset. GPT-2 achieved state-of-the-art results on 7 out of 8 language modeling benchmarks without any task-specific fine-tuning. Beyond language modeling, GPT-2 demonstrated the ability to perform various tasks in a zero-shot setting, confirming the substantial improvement in language model performance brought about by scaling both the model and the dataset.

Figure 3. Example of a few-shot demonstration using the English-French translation task.

To further improve the generalization capability of language models, Brown et al. (Brown et al., 2020) expanded the model size to 175B parameters based on the GPT-2 architecture and expanded the dataset even further, including filtered Common Crawl (Radford et al., 2019), two Internet-based book corpora, and the English Wikipedia. Through continuous scaling of both model and dataset size, the resulting GPT-3 exhibited a qualitative leap in capability, demonstrating powerful few-shot ability without fine-tuning. As depicted in Fig. 3, GPT-3 can accomplish unseen tasks solely based on provided task examples, sometimes even reaching the competitive level of previous state-of-the-art fine-tuned models. Thus, GPT-3 is often regarded as the beginning of LLMs (Thapa and Adhikari, 2023; Li et al., 2023a). The proposal of GPT-3 once again revolutionized the paradigm of NLP, shifting from unsupervised pre-training & fine-tuning to unsupervised pre-training & prompt (Liu et al., 2023e). Although such models are powerful enough for most NLP tasks, they are sensitive to user-supplied prompts, whose quality directly affects the quality of the model's response. This has given rise to research on prompts (White et al., 2023; Meskó, 2023; Wang et al., 2023e) and initiated the shift from objective engineering to prompt engineering.
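A few-shot prompt in the spirit of Fig. 3 can be assembled as a plain string; the sketch below builds such a prompt for English-French translation, with demonstrations chosen purely for illustration. The model then completes the final line in-context, without any parameter update.

```python
# Illustrative few-shot prompt construction; the demonstrations and query are
# invented examples, not drawn from GPT-3's evaluation data.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
query = "peppermint"

prompt = "Translate English to French:\n"
for en, fr in demonstrations:
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"        # the model is expected to continue with the translation

print(prompt)
```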

2.4. Text-only to Multimodal

Influenced by GPT-3, more researchers have delved into the research and development of LLMs, leading to a series of outstanding works such as GLM-130B (Zeng et al., 2022), PaLM (Chowdhery et al., 2023; Anil et al., 2023), and LLaMA (Touvron et al., 2023a, b). However, these LLMs are only capable of understanding text; although there were advances in multimodal work during this period, those models often required fine-tuning on new tasks (Zellers et al., 2021; Hendricks et al., 2021) or were unable to generate text (Radford et al., 2021; Li et al., 2021). Inspired by few-shot learners like GPT-3, Alayrac et al. (Alayrac et al., 2022) collected a large-scale multimodal dataset from the web, primarily consisting of image-text and video-text pairs, and used it to train an MLLM named Flamingo. Flamingo can directly adapt to visual tasks through simple few-shot learning without task-specific fine-tuning. Its powerful multimodal in-context learning and few-shot abilities establish it as the GPT-3 moment in the multimodal domain (Li et al., 2023a); thus we consider Flamingo the beginning of MLLMs (Zhang et al., 2024b). Subsequently, more prominent works emerged in the multimodal domain, such as BLIP-2 (Li et al., 2023d), LLaVA (Liu et al., 2024a), and MiniGPT-4 (Zhu et al., 2023), all of which share the common approach of adding a vision encoder to an LLM and using additional modules to connect the two, bridging the gap between modalities. These MLLMs leverage LLMs as cognitive engines, not only retaining the inherent capabilities of LLMs (Zhang et al., 2024b) but also providing powerful visual support, which points toward a possible direction for artificial general intelligence.

2.5. High-quality Data

One of the remarkable aspects contributing to the excellence of LLMs and MLLMs is their utilization of large-scale training data, enabling them to acquire universal representations transferable to nearly any language understanding or generation task (Zhou et al., 2024). However, the vast majority of this training data is sourced from the web, such as WebText (Radford et al., 2019) and Common Crawl, and such large amounts of web data inevitably contain toxicity and bias, which are carried over to LLMs and MLLMs (Moor et al., 2023a). To mitigate the negative impact of training on large-scale datasets and further enhance model performance, it is common to use a number of high-quality datasets to fine-tune the model.

For instance, InstructGPT (Ouyang et al., 2022) employs manually generated and curated high-quality datasets for supervised fine-tuning (SFT) and reinforcement learning (RL), enabling the model to produce outputs more aligned with user expectations and demands and thereby avoiding inaccurate, irrelevant, or harmful content. InstructBLIP (Dai et al., 2024) collects datasets in the instruction format to fine-tune the model, enhancing its ability to understand and follow user instructions and thereby improving zero-shot capability on new tasks. LLaVA uses GPT-4 to generate high-quality instruction-following data for instruction fine-tuning, bringing its multimodal capabilities closer to GPT-4. In particular, LIMA (Zhou et al., 2024), after fine-tuning LLaMA with only 1,000 meticulously curated prompts and responses using a standard supervised loss, surpassed Alpaca (Taori et al., 2023) and Bard in both human preference and GPT-4 preference scores. Ablation experiments on LIMA revealed that improving data quality yields greater benefits than increasing data quantity when the dataset is expanded without increasing prompt diversity (Zhou et al., 2024). Thus, it can be observed and predicted that data engineering is emerging as one of the new focal points of research.
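To make the instruction format concrete, the following is a hypothetical medical instruction-tuning record in the widely used Alpaca-style instruction/input/output layout; the content is invented for illustration and is not drawn from any dataset cited here.

```python
import json

# A single hypothetical instruction-tuning record; real datasets contain many
# such records, often generated or curated with models like ChatGPT or GPT-4.
example = {
    "instruction": "Answer the patient's question concisely and recommend "
                   "consulting a clinician when appropriate.",
    "input": "I have had a mild headache for two days. Should I be worried?",
    "output": "A short-lived mild headache is usually benign, but seek medical "
              "attention if it worsens, persists, or is accompanied by fever, "
              "vision changes, or neck stiffness.",
}
print(json.dumps(example, indent=2))
```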

In this section, we delineated the development background of LLMs, focusing on the transition from supervised learning and unsupervised pre-training & fine-tuning to unsupervised pre-training & prompt. Inspired by LLMs, the multimodal domain has grown rapidly, resulting in the emergence of MLLMs built upon LLM foundations. In particular, owing to the robust few-shot capabilities of GPT-3 and Flamingo, we regard them as the beginnings of LLMs and MLLMs, respectively. Given recent studies exploring the impact of high-quality datasets on LLMs and MLLMs, we predict that data engineering will become a focus of future research. Therefore, throughout this account of the developmental background of LLMs and MLLMs, we contend that the focus of their development has shifted from initial feature engineering to structure engineering, objective engineering, and, presently, prompt engineering and data engineering.

3. Structure of LLMs and MLLMs

Existing LLMs are all built on the Transformer architecture, which is an encoder-decoder framework. Consequently, these LLMs have evolved into three structures based on the Transformer (Zhou et al., 2023a; Yang et al., 2023a): (1) Encoder-only, represented by models such as BERT; (2) Decoder-only, represented by models such as the GPT series; (3) Encoder-Decoder, represented by models such as T5 (Raffel et al., 2020). Current MLLMs typically build on LLMs by adding a vision encoder for understanding visual information and a modality alignment module (Zhang et al., 2024b; Yin et al., 2023) between the vision encoder and the LLM to bridge the vision-text modality gap. To provide a comprehensive summary of existing medical LLMs and MLLMs, in this section we discuss the model architectures of medical LLMs and MLLMs separately. Specifically, in Section 3.1 we summarize medical LLMs based on the three aforementioned structures, and in Section 3.2 we discuss common vision encoders, LLM backbones, and modality alignment methods in medical MLLMs. For clarity, detailed information on existing medical LLMs and MLLMs is provided in Table 1 and Table 2.

3.1. Structure of LLMs

3.1.1. Encoder-only

Encoder-only language models (LMs) are composed of multiple Transformer encoder layers, among which BERT is the earliest and most representative encoder-only LM. Inspired by BERT, more encoder-only LMs have emerged, such as DeBERTa (He et al., 2020a), ALBERT (Lan et al., 2019), and RoBERTa (Liu et al., 2019). These encoder-only LMs typically employ the masked language modeling (MLM) task for pre-training, wherein random tokens in sentences are masked and the model is prompted to predict these masked tokens as accurately as possible. This pre-training task endows encoder-only LMs with remarkable natural language understanding capability, so researchers have also endeavored to develop encoder-only LMs in the medical domain (Lee et al., 2020; Ji et al., 2021; Meng et al., 2020; Gu et al., 2021). For instance, BioBERT (Lee et al., 2020) was pre-trained on biomedical corpora and achieved state-of-the-art results in biomedical named entity recognition, biomedical relation extraction, and biomedical QA tasks. MentalBERT (Ji et al., 2021), on the other hand, was trained on various datasets of mental disorders (such as depression, anxiety, and suicidal ideation) collected from popular social platforms like Reddit and Twitter, enabling the use of LMs in mental health research.

Despite the existence of numerous encoder-only LMs in the medical domain, strictly speaking, the aforementioned models belong to pre-trained language models (PLMs) (He et al., 2023; Wang et al., 2023f) rather than LLMs: most of them use BERT as the base model, employ the MLM task for pre-training, and are subsequently fine-tuned for various downstream tasks, and they lack the robust in-context learning (ICL) and few-shot capability demonstrated by models like GPT-3. Therefore, such PLMs will not be discussed further in subsequent sections.

3.1.2. Decoder-only

Decoder-only is currently the mainstream architecture for LLMs, constructed from multiple Transformer decoder layers. The earliest decoder-only LM was GPT; subsequently, GPT-3 ushered in a new era of LLMs, followed by several outstanding decoder-only works (Chowdhery et al., 2023; Anil et al., 2023; Ouyang et al., 2022; Touvron et al., 2023a, b). These decoder-only LLMs typically employ next token prediction (NTP) as the pre-training task: during training, the model is tasked with predicting the next token in the sequence given all preceding tokens, which endows decoder-only LLMs with excellent generative capability. Due to the remarkable performance of decoder-only LLMs like GPT-3 in general domains, researchers have also attempted to apply such powerful decoder-only LLMs to the medical domain. For instance, Med-PaLM 2 (Singhal et al., 2023b), derived from fine-tuning PaLM 2 (Anil et al., 2023) on medical datasets, achieved a score of 86.5 on the USMLE (Jin et al., 2021), reaching the level of medical experts. Some studies have expanded medical LLMs to other languages (Wang et al., 2023d; Zhang et al., 2023a; Chen et al., 2023e; Ye et al., 2023; Yang et al., 2024b) or extended them to traditional medicine (Yang et al., 2023d), further broadening the application scope and impact of LLMs in the medical domain.

Compared to encoder-only LMs, these decoder-only LLMs utilize NTP as the pre-training task, making them more proficient in text generation (Zhou et al., 2023b). Moreover, studies (Wang et al., 2022c; Dai et al., 2022) have demonstrated that decoder-only LLMs exhibit the best few-shot and zero-shot performance on various downstream tasks, which is one of the reasons why decoder-only has become the predominant architecture for LLMs at present. A minimal sketch of the NTP objective is given below.
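The following PyTorch sketch computes the NTP loss by shifting logits and targets by one position; the batch size, sequence length, vocabulary size, and random tensors are arbitrary placeholders standing in for the output of a decoder-only model.

```python
import torch
import torch.nn.functional as F

# Next token prediction: the target at position t is the token at position t+1.
batch, seq_len, vocab = 2, 16, 32000
logits = torch.randn(batch, seq_len, vocab)           # stand-in for model(input_ids)
input_ids = torch.randint(0, vocab, (batch, seq_len))

shift_logits = logits[:, :-1, :]                      # predictions for positions 0..T-2
shift_labels = input_ids[:, 1:]                       # targets are the next tokens
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab), shift_labels.reshape(-1)
)
print(loss.item())
```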

Table 1. Detailed information on existing medical LLMs.
Model Name Architecture (Sec.3.1) Base Model Para.(B) Data Source (Sec.4.1) Construction Method (Sec.4.2) Evaluation Method (Sec.5.1) Date
Med-PaLM (Singhal et al., 2023a) Decoder-Only Flan-Palm 540 MultiMedQA IFT AEM, Human 2022/12
ChatDoctor (Li et al., 2023e) Decoder-Only LLaMA 7 Alpaca-52K, HealthCareMagic-100k IFT AI 2023/03
DoctorGLM (Xiong et al., 2023) Encoder-Decoder ChatGLM-6B 6 ChatDoctor, HealthcareMagic, MedDialog, CMD. IFT Human 2023/04
Baize-Healthcare (Xu et al., 2023a) Decoder-Only LLaMA 7 Quora, MedQuAD SFT AI 2023/04
BenTsao (Wang et al., 2023d) Decoder-Only LLaMA 7 CMeKG SFT Human 2023/04
MedAlpaca (Han et al., 2023a) Decoder-Only LLaMA 7 / 13 Medical Meadow IFT AEM 2023/04
PMC-LLaMA (Wu et al., 2024a) Decoder-Only LLaMA 7 / 13 MedC-K, MedC-I CPT, IFT AEM 2023/04
Med-PaLM 2 (Singhal et al., 2023b) Decoder-Only PaLM 2 340 MultiMedQA IFT AEM, Human 2023/05
Clinical Camel (Toma et al., 2023) Decoder-Only LLaMA 2 13 / 70 ShareGPT, PubMed, MedQA SFT AEM 2023/05
HuatuoGPT (Zhang et al., 2023a) Decoder-Only BLOOMZ 7 Hybrid Data SFT, RLAIF AEM, Human, AI 2023/05
GatorTronGPT (Peng et al., 2023b) Decoder-Only GPT-3 5 / 20 Clinical Text from UF Health, Pile PT AEM 2023/06
ClinicalGPT (Wang et al., 2023g) Decoder-Only BLOOM 7 cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, MedDialog SFT, RLHF AEM 2023/06
Zhongjing (Yang et al., 2024b) Decoder-Only Ziya-LLaMA 13 CMtMedQA, ChatMed, CMeKG CPT, SFT, RLHF Human, AI 2023/08
Radiology-Llama2 (Liu et al., 2023c) Decoder-Only LLaMA 2-7b-chat 7 MIMIC-CXR, OpenI IFT AEM, Human 2023/08
MedChatZH (Tan et al., 2023) Decoder-Only Baichuan 7 Books, med-mix-2M CPT, IFT AEM, AI 2023/09
ChatCounselor (Liu et al., 2023a) Decoder-Only Vicuna 7 Psych8k IFT AI 2023/09
Qilin-Med (Ye et al., 2023) Decoder-Only Baichuan 7 ChiMed CPT, SFT, DPO AEM 2023/10
AlpaCare (Zhang et al., 2023d) Decoder-Only LLaMA 7 / 13 MedInstruct-52k IFT AI 2023/10
BianQue (Chen et al., 2023e) Encoder-Decoder ChatGLM 6 BianQueCorpus IFT AEM 2023/10
SoulChat (Chen et al., 2023f) Encoder-Decoder ChatGLM 6 SoulChatCorpus IFT AEM, Human 2023/11
TCM-GPT (Yang et al., 2023d) Decoder-Only BLOOM 7 TCM-Corpus-1B, TCM-EXAM, TCM-EHR CPT, SFT AEM 2023/11
MEDITRON (Chen et al., 2023a) Decoder-Only LLaMA 2 7 / 70 GAP-Replay, MedMCQA, PubMedQA, MedQA CPT, SFT AEM 2023/11
AMIE (Tu et al., 2024a) Decoder-Only PaLM 2 340 MedQA, MultiMedBench, MIMIC-III, RealWorld Dialogue IFT Human, AI 2024/01
  • 1 There are no encoder-only LLMs in the table because most encoder-only language models belong to PLMs, not LLMs.

  • 2 "CPT" means continuous pre-training, "IFT" means instruction fine-tuning, "SFT" means supervised fine-tuning, "RLHF" means reinforcement learning from human feedback, "RLAIF" means reinforcement learning from AI feedback, and "DPO" means direct preference optimization.

  • 3 "AEM" means automatic evaluation metrics.

3.1.3. Encoder-Decoder

Encoder-decoder LLMs directly utilize the original Transformer structure, consisting of a stack of Transformer encoders and decoders. The encoder processes the input sequence and outputs representations with contextual information, which the decoder uses for text generation (Zhou et al., 2023a). Representative encoder-decoder LLMs include UL2 (Tay et al., 2022), T5 (Raffel et al., 2020), and GLM (Du et al., 2021). Similar to the encoder-only and decoder-only architectures, encoder-decoder LLMs have also been extended to the medical domain. For example, SoulChat (Chen et al., 2023f) is fine-tuned from ChatGLM on the empathetic dialogue dataset SoulChatCorpus and demonstrates strong empathetic ability, guiding users to express themselves and providing rational advice in psychological counseling.

Although encoder-decoder LLMs combine the advantages of the encoder-only and decoder-only structures, balancing text understanding and generation, Wang et al. (Wang et al., 2022c) demonstrated that decoder-only LLMs perform best in zero-shot scenarios without any fine-tuning, while encoder-decoder LLMs require multitask fine-tuning on a certain amount of annotated data to achieve optimal performance. Given that the current LLM training paradigm is still unsupervised learning on large-scale corpora, the decoder-only architecture, which excels in zero-shot performance, can better exploit such unlabeled data. Therefore, decoder-only remains the mainstream architecture for LLMs at present.

3.2. Structure of MLLMs

As shown in Fig. 4, in this section we provide a detailed discussion of the three crucial modules of MLLMs: the Vision Encoder, the LLM Backbone, and the Modality Alignment Module. We treat the method of leveraging expert models to construct MLLMs as a prompt augmentation method (Shu et al., 2023) and discuss it alongside the other modality alignment approaches. To facilitate researchers in building their own medical MLLMs, we provide implementation choices for the three modules in Table 2.

Figure 4. The core modules and pipeline of MLLMs. On the far right are three types of modality alignment modules. We consider the method of leveraging expert models to construct MLLMs as a form of prompt augmentation method, categorized under the modality alignment modules for further explanation.

3.2.1. Vision Encoder

MLLMs are built on LLMs by adding a vision encoder, thereby endowing LLMs with visual capability. Specifically, the role of the vision encoder V is to encode the visual input I_x into visual features Z_x, as shown below:

(2) Z_{x} = V(I_{x})

There are various options for the vision encoder V. ResNet (He et al., 2016), a landmark work in CV that achieved state-of-the-art results in various downstream tasks at the time, serves as the vision encoder of the pioneering MLLM work Flamingo (Alayrac et al., 2022). In recent years, however, researchers have preferred Transformer-based models such as ViT (Dosovitskiy et al., 2020) over ResNet. For instance, Qilin-Med-VL (Liu et al., 2023d) utilized the original ViT as its vision encoder, while Med-PaLM M (Tu et al., 2024a) employed ViT-e (Chen et al., 2022) and ViT-22B (Dehghani et al., 2023). Chen et al. (Chen et al., 2023d) pointed out that vision models pre-trained with contrastive learning outperform those pre-trained with classification on various tasks, especially localization and visual-text understanding, when serving as vision encoders for MLLMs; thus, more MLLMs opt for vision models trained via contrastive learning. For example, LLaVA-Med (Li et al., 2024b) utilized CLIP ViT-L/14 (Radford et al., 2021) as its vision encoder, and XrayGLM (Wang et al., 2023a) used EVA-CLIP ViT-G/14 (Fang et al., 2023).

In summary, ResNet, as an excellent convolutional neural network, is a good choice for the vision encoder, but Transformer-based ViT models are more favored by researchers. Moreover, contrastive learning-based ViT models, such as CLIP-ViT and EVA-CLIP ViT, are typically superior to classification pre-trained ViT models when serving as vision encoders for MLLMs. Therefore, these ViT models trained via contrastive learning are currently the mainstream choice for vision encoder.
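As a minimal sketch of Equation (2), the snippet below encodes an image with a CLIP ViT-L/14 vision encoder via the Hugging Face transformers library; the checkpoint name and preprocessing follow common usage and are assumptions rather than the exact setup of any model in Table 2.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Z_x = V(I_x) with a contrastively pre-trained ViT as the vision encoder V.
model_name = "openai/clip-vit-large-patch14"   # assumed checkpoint name
processor = CLIPImageProcessor.from_pretrained(model_name)
encoder = CLIPVisionModel.from_pretrained(model_name)

image = Image.new("RGB", (224, 224))                     # stand-in for a medical image I_x
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    Z_x = encoder(pixel_values).last_hidden_state        # patch-level visual features
print(Z_x.shape)   # (1, 257, 1024): 16x16 patches plus a class token, hidden size 1024
```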

3.2.2. LLM Backbone

The LLM backbone is the core of the three key modules of MLLMs and the one with the largest number of parameters; it endows MLLMs with capabilities such as text interaction, ICL, and reasoning. The role of the LLM backbone in MLLMs is formalized below:

(3) R = L(H_{x}, T_{x})

where R denotes the response output of the LLM, L represents the LLM backbone, T_x denotes the embedded tokens of the text input, and H_x denotes visual representations that the LLM can understand. The specific meaning of H_x is explained in Equation (4).

Although powerful LLMs like ChatGPT and PaLM 2 have not been released as open-source models, there are still many excellent open-source LLMs in the community for researchers to choose from. Among these, LLaMA and LLaMA 2, developed by Meta, are the most popular open-source LLMs and often serve as the LLM backbone for MLLMs. Models fine-tuned from LLaMA, such as Alpaca and Vicuna (Chiang et al., 2023), are also common choices, with Vicuna-13B reported to achieve more than 90% of the quality of ChatGPT and Bard. Furthermore, Baichuan 2 (Yang et al., 2023e), as a general LLM, exhibits robust performance on medical tasks even without fine-tuning on specialized medical data; consequently, it is also a favorable choice for an LLM backbone.
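The following toy sketch illustrates Equation (3): aligned visual tokens H_x are prepended to the text embeddings T_x to form a single sequence that a decoder-only LLM backbone would consume. All dimensions are invented, and the LLM call itself is only indicated in a comment.

```python
import torch

# Toy illustration of Eq. (3); dimensions are placeholders, not from any cited model.
d_model, n_visual, n_text = 4096, 32, 64
H_x = torch.randn(1, n_visual, d_model)      # aligned visual representations
T_x = torch.randn(1, n_text, d_model)        # embedded text tokens

inputs_embeds = torch.cat([H_x, T_x], dim=1) # one multimodal input sequence
# In practice this sequence is fed to a decoder-only LLM, e.g.
#   R = llm(inputs_embeds=inputs_embeds)     # autoregressive text response
print(inputs_embeds.shape)                   # torch.Size([1, 96, 4096])
```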

Table 2. Detailed information on existing medical MLLMs.
Model Name Vision Encoder (Sec.3.2.1) LLM Backbone (Sec.3.2.2) Modality Alignment (Sec.3.2.3) Data Source (Sec.4.1) Evaluation Method (Sec.5.1) Date
ChatCAD (Wang et al., 2023h) Expert models ChatGPT Prompt Augmentation MIMIC-CXR, CheXpert AEM 2023/02
Visual Med-Alpaca (Shu et al., 2023) Expert models Med-Alpaca Prompt Augmentation ROCO, BigBIO / 2023/04
CLIP-ViT w/ GPT2 (Van Sonsbeek et al., 2023) CLIP-ViT GPT2-XL MLP Slake, PathVQA, OVQA AEM 2023/05
MedVInt (Zhang et al., 2023e) PMC-CLIP-ViT PMC-LLaMA MLP, Transformer PMC-VQA AEM 2023/05
MedBLIP (Chen et al., 2023b) EVA-CLIP-ViT BioMedLM Q-Former ADNI, NACC, OASIS AEM 2023/05
XrayGLM (Wang et al., 2023a) ViT-G ChatGLM Q-Former MIMIC-CXR, OpenI / 2023/05
PathAsst (Sun et al., 2023) CLIP-ViT Vicuna Linear PathCap, PathInstruct / 2023/05
ChatCAD+ (Zhao et al., 2023b) Expert models ChatGPT Prompt Augmentation CheXpert, MIMIC-CXR AEM 2023/05
LLaVA-Med (Li et al., 2024b) BioMed CLIP-ViT Vicuna, LLaMA Linear PMC-15M, VQA-RAD, SLAKE, PathVQA AEM, AI 2023/06
PCLmed (Yang et al., 2023c) EVA-CLIP-ViT ChatGLM Q-Former ImageCLEF 2023 caption prediction AEM 2023/06
OphGLM (Gao et al., 2023) Expert models ChatGLM Prompt Augmentation Web data, MedDialog AEM 2023/06
XrayGPT (Thawkar et al., 2023) MedCLIP-ViT Vicuna Linear MIMIC-CXR, OpenI AEM, AI 2023/06
Med-Flamingo (Moor et al., 2023b) CLIP-ViT LLaMA Cross-Attention Layers MTB, PMC-OA AEM, Human 2023/07
Med-PaLM M (Tu et al., 2024a) ViT-e, ViT-22B PaLM MLP MultiMedBench AEM, Human 2023/07
RadFM (Wu et al., 2023) 3D ViT MedLLaMA-13B Concat MedMD, RadMD AEM, Human 2023/08
R2GenGPT (Wang et al., 2023c) Swin Transformer LLaMA 2 Linear IU-Xray, MIMIC-CXR AEM 2023/09
Qilin-Med-VL (Liu et al., 2023d) ViT Chinese-LLaMA2-13B-Chat Linear ChiMed-VL / 2023/10
PeFoM-Med (He et al., 2024) EVA-CLIP-ViT LLaMA2-Chat Linear ROCO, VQA-RAD AEM, Human 2024/01

3.2.3. Modality Alignment

While adding a vision encoder to LLMs allows them to receive visual input, LLMs trained solely on text datasets are incapable of comprehending the output features Z_x from the vision encoder. Therefore, modality alignment is needed to convert Z_x into a format understandable by LLMs, as illustrated in Equation (4):

(4) H_{x} = f(Z_{x})

where f denotes the modality alignment method and H_x denotes visual representations understandable by LLMs. Modality alignment is crucial for MLLMs to understand visual information and significantly influences their multimodal capability. In the following, we introduce four existing modality alignment methods: Additional Cross-Attention Layers, Query-Based, Projection-Based, and Prompt Augmentation.

Additional Cross-Attention Layers were proposed in Flamingo, where dense cross-attention layers are interleaved into a frozen pre-trained LLM. The input to these cross-attention layers comes from the output of the vision encoder, which is usually first passed through a Perceiver Resampler (Jaegle et al., 2021) to reduce the computational complexity of vision-text cross-attention. Through these additional cross-attention layers, the LLM generates text responses conditioned on visual representations. Subsequent works based on Flamingo, such as Med-Flamingo (Moor et al., 2023b), also utilize cross-attention layers for modality alignment.

Query-Based method, which can be regarded as a multimodal perceiver (Song et al., 2023), involves extracting information from visual representations using a set of learnable query vectors. For example, Q-Former proposed in BLIP-2 (Li et al., 2023d) extracts visually relevant features from a frozen vision encoder to facilitate LLMs in generating text responses aligned with visual information. Based on this, Jian et al. (Jian et al., 2024) introduced P-Former, specifically trained for language data, bypassing the need for image-text pairs, thus offering a modality-agnostic and more flexible approach. Similarly influenced by BLIP-2, in the medical domain, Chen et al. (Chen et al., 2023b) proposed MedBLIP, extending this query mechanism to 3D medical images and text.

Projection-Based method can be considered a form of multimodal converter (Song et al., 2023), which is simpler compared to Query-Based method, as it involves mapping visual representations from the output of the vision encoder to the word embedding space using a simple projection layer, enabling LLMs to comprehend images. For instance, LLaVA-Med, Qilin-Med-VL, and XrayGPT (Thawkar et al., 2023) utilize a simple linear layer to map visual representations, while MedVIntTE (Zhang et al., 2023e) and LLaVA-1.5 (Liu et al., 2023b) use MLP for this purpose. These mapped visual representations, along with textual representations, serve as inputs to the LLM backbone.
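To make the projection-based approach concrete, the sketch below maps vision-encoder features into an assumed LLM embedding space with either a single linear layer or a small two-layer MLP; the dimensions are illustrative and do not correspond to any specific model above. In many of the cited systems, only such a projector (and sometimes the LLM) is updated during alignment training, while the vision encoder remains frozen.

```python
import torch
import torch.nn as nn

# Minimal sketch of Eq. (4) for projection-based alignment; dimensions are assumed.
vision_dim, llm_dim = 1024, 4096

linear_projector = nn.Linear(vision_dim, llm_dim)   # single linear layer variant

mlp_projector = nn.Sequential(                      # two-layer MLP variant
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

Z_x = torch.randn(1, 257, vision_dim)    # patch features from the vision encoder
H_x = mlp_projector(Z_x)                 # H_x = f(Z_x), now in the LLM embedding space
print(H_x.shape)                         # torch.Size([1, 257, 4096])
```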

Prompt Augmentation typically involves processing images with expert models and combining the results with prompt templates to convert them into plain text, which serves as the input prompt for LLMs, thereby linking visual information with text. For instance, VideoChat-Text (Li et al., 2023c) utilizes perception models to explicitly encode video information into textual descriptions. Specifically, it uses InternVideo (Wang et al., 2022b) to analyze the target actions, T5 to refine the descriptions for clarity, and Whisper (Radford et al., 2023) to further enrich the video descriptions. After generating detailed textual descriptions of the video, these descriptions are combined with prompt templates as input to LLMs. In the medical field, OphGLM (Gao et al., 2023) extracts information from fundus images using classification and segmentation models and integrates this information into structured text templates to form diagnostic reports, which serve as input to LLMs. Similarly, in ChatCAD (Wang et al., 2023h), X-ray images are first fed into trained computer-aided diagnosis (CAD) models to obtain outputs, which are then transformed into natural language using prompt templates and serve as input to LLMs. Compared to the Query-Based and Projection-Based methods, Prompt Augmentation leverages expert models and eliminates the need for additional modality alignment training, but its effectiveness depends on the performance of the expert models.
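The overall flow of prompt augmentation can be summarized in a few lines. In the sketch below, the CAD findings, the template wording, and the llm_generate call are hypothetical placeholders, not the actual outputs or templates of ChatCAD or OphGLM.

```python
# Hypothetical prompt-augmentation flow: expert-model outputs are verbalized
# through a template and handed to a text-only LLM.
cad_findings = {"cardiomegaly": 0.82, "pleural effusion": 0.10}  # placeholder CAD scores

findings_text = ", ".join(
    f"{name} (probability {p:.2f})" for name, p in cad_findings.items()
)
prompt = (
    "A chest X-ray was analyzed by a computer-aided diagnosis model.\n"
    f"Detected findings: {findings_text}.\n"
    "Please write a concise radiology impression for the referring physician."
)
# response = llm_generate(prompt)   # hypothetical call to any text-only LLM
print(prompt)
```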

Despite the differences among the above four approaches, their ideas are all text-centered, i.e., they exploit the unique property of text as a shared modality space, transforming visual information into the textual space so that LLMs can understand visual input (Tsai et al., 2024). Such methods not only achieve vision-text alignment but also enable broader modality alignment. For instance, OneLLM (Han et al., 2023c) employs a unified framework to map 8 modalities to the textual space, achieving alignment across multiple modalities. This also offers a feasible approach for developing more comprehensive medical MLLMs by utilizing medical data from more modalities, such as 3D CT, 2D X-ray, and 1D ECG data.

4. Construction of Medical LLMs and MLLMs

In Section 2 and Section 3, we provided a clear exposition of the development background and model architectures of LLMs and MLLMs. Building on this, to facilitate researchers and medical professionals in developing their own medical LLMs and MLLMs, this section summarizes the medical datasets available for training and elaborates on the methods for constructing medical LLMs and MLLMs.

4.1. Training Datasets

Foundation models (Zhou et al., 2023b; Li et al., 2023a), such as GPT-3, LLaMA, and PaLM, typically gather training data from various sources such as web pages, books, research papers, and code repositories to enhance their general capability. Similarly, in the medical domain, datasets take various forms, primarily including electronic health records (EHRs), scientific literature, QA, dialogue, medical knowledge bases, web data, medical image-text pairs, and high-quality data generated by AI models like ChatGPT or GPT-4. This section provides a brief overview of these types of medical datasets; more information on them can be found in Table 3.

Table 3. Summary of medical datasets for pre-training and fine-tuning.
Datasets Type Description AI Synthesis
MIMIC-III (Johnson et al., 2016) EHR Approximately 2M de-identified notes.
MIMIC-IV (Johnson et al., 2023) EHR About 300K patients, 430K admissions.
CPRD (Herrett et al., 2015) EHR Anonymized medical records for over 11.3M patients.
OpenI (Demner-Fushman et al., 2016) EHR & Multimodal 7,470 images and 3,955 reports.
PubMed Literature Over 34M citations and abstracts of biomedical literature, about 4.5B words.
PMC Literature Provides free full-text access to PubMed, about 13.5B words.
CORD-19 (Wang et al., 2020) Literature More than 140K papers, with more than 72K full text.
PubMedQA (Jin et al., 2019) QA 1K labeled, 612K unlabeled and 211.3K manually generated QA.
MedQA (USMLE) (Jin et al., 2021) QA 61,097 multiple-choice QA pairs.
MedMCQA (Pal et al., 2022) QA 194K multiple-choice QA pairs.
cMedQA2 (Zhang et al., 2018) QA 100K questions and 200k answers.
MultiMedQA (Singhal et al., 2023a) QA Includes six existing datasets and one new dataset.
MedQuAD (Ben Abacha and Demner-Fushman, 2019) QA 47,457 question-answer pairs from trusted medical sources.
Medical Meadow (Han et al., 2023a) QA Over 160K QA pairs.
Huatuo-26M (Li et al., 2023f) QA 26M QA pairs.
Psych8k (Liu et al., 2023a) QA 8,187 query-answer pairs.
PMC-VQA (Zhang et al., 2023e) QA & Multimodal Contains 149K images, 227K VQA pairs.
VQA-RAD (Lau et al., 2018) QA & Multimodal 315 radiology images and 3515 QA pairs generated by clinicians.
Slake (Liu et al., 2021) QA & Multimodal 642 radiology images and over 7000 diverse QA pairs.
PathVQA (He et al., 2020b) QA & Multimodal 4,998 pathology images with 32,799 QA pairs.
ChiMed-VL-Instruction (Liu et al., 2023d) QA & Multimodal 469,441 question-answer pairs.
MedC-I (Wu et al., 2024a) QA & Instructions 202M tokens.
CMtMedQA (Yang et al., 2024b) QA & Dialogue 70K multi-round conversation datasets from real doctor-patient conversations.
MedInstruct-52k (Zhang et al., 2023d) Instructions 52K instruction-response pairs generated by GPT-4.
ChiMed (Ye et al., 2023) Multiple Composed of various data such as QA, books, dialogues, etc.
GAP-REPLAY (Chen et al., 2023a) Multiple Includes data from clinical practice guidelines, abstracts, and original articles.
MedDialog (Zeng et al., 2020) Dialogue 3.4M Chinese conversations and 0.6 million English conversations.
HealthCareMagic-100k (Li et al., 2023e) Dialogue 100K authentic patient-doctor conversations.
GenMedGPT-5k (Li et al., 2023e) Dialogue 5K generated conversations between patients and physicians from ChatGPT.
UMLS (Bodenreider, 2004) Knowledge Base 2M entities for 900K concepts.
CMeKG (Byambasuren et al., 2019) Knowledge Base Chinese medical knowledge graph.
COMETA (Basaldella et al., 2020) Web Data consisting of 20K English biomedical entity mentions.
TCM-Corpus-1B (Yang et al., 2023d) Web Data 20GB dataset collected from Baidu Baike, Wikipedia and other sources.
MIMIC-CXR (Johnson et al., 2019) Multimodal 227,835 imaging studies for 65,379 patients.
ROCO (Pelka et al., 2018) Multimodal Contains more than 81K radiologic images, each with a corresponding title, keywords.
OpenPath (Huang et al., 2023a) Multimodal 208,414 pathology images paired with natural language descriptions.
MedICaT (Subramanian et al., 2020) Multimodal 160K images with captions and inline references.
CheXpert (Irvin et al., 2019) Multimodal 224,316 chest X-rays with reports.
PathCap (Sun et al., 2023) Multimodal 142K high quality pathology image-caption pairs.
MedMD (Wu et al., 2023) Multimodal 15.5M 2D scans, 180k 3D scans, with corresponding captions or diagnosis labels.
PMC-OA (Lin et al., 2023) Multimodal 1.6M image-caption pairs.
PMC-15M (Zhang et al., 2023f) Multimodal 15M figure-caption pairs from over 3M articles.
ChiMed-VL-Alignment (Liu et al., 2023d) Multimodal 580,014 images and context information or descriptions.
PathInstruct (Sun et al., 2023) Multimodal & Instructions 180K instruction-following data.
LLaVA-Med-Instruct (Li et al., 2024b) Multimodal & Instructions 600K image-text pairs and converted to instruction-following data.
  • 1 "EHR" means electronic health record; "QA" means question-answer; "Multiple" means that the dataset is a mixture of multiple types of data.

  • 2 "Instructions" denotes instruction-tuning or instruction-following data; see Section 4.2.2 or Fig. 5 for details.

  • 3 "AI Synthesis" indicates that generative AI such as ChatGPT and GPT-4 was used during the development of the dataset to assist in generating the data.

Electronic Health Records: EHRs contain personal health records, including basic information, summaries of major diseases and health problems, and primary health service records. The Medical Information Mart for Intensive Care III (MIMIC-III) (Johnson et al., 2016) is one of the largest, publicly available, and most commonly used EHR datasets, comprising approximately 2M de-identified notes covering 13 specialties such as cardiology, respiratory medicine, and radiology. MIMIC-III provides significant convenience for building medical LLMs, as demonstrated by works such as AMIE (Tu et al., 2024b) and GatorTron (Yang et al., 2022), both of which utilized it for training. Other commonly used EHR datasets include the Clinical Practice Research Datalink (CPRD) (Herrett et al., 2015) and MIMIC-IV (Johnson et al., 2023), the updated version of MIMIC-III.

Scientific Literature: Scientific literature, which contains accurate and authoritative medical knowledge, serves as one of the sources of medical datasets. PubMed is the most commonly used repository for biomedical and life science literature, providing access to major resources such as MEDLINE, PubMed Central (PMC), and NCBI Bookshelf, and indexing citations from over 34M biomedical articles. PubMed abstracts comprise approximately 4.5B words, making them a high-quality medical training corpus; PubMedQA (Jin et al., 2019) is an example of a biomedical QA dataset collected from PubMed abstracts. In addition to PubMed, PMC is a popular literature resource that provides free full-text access to PubMed, with full-text articles containing approximately 13.5B words. PubMed and PMC offer high-quality medical literature and are often used as sources for other datasets. For instance, PMC-OA (Lin et al., 2023), PMC-VQA (Zhang et al., 2023e), and PMC-15M (Zhang et al., 2023f) are three biomedical multimodal datasets extracted from PMC, significantly facilitating the development of medical LLMs (Toma et al., 2023; Wu et al., 2024a; Chen et al., 2023a) and MLLMs (Li et al., 2024b; Moor et al., 2023b).

Question-Answer: QA datasets come in two types: discriminative QA (Jin et al., 2021; Pal et al., 2022) and generative QA (Zhang et al., 2023e). Discriminative QA datasets mostly comprise multiple-choice questions, while generative QA involves open-ended questions. Typical QA datasets include PubMedQA (Jin et al., 2019), MedQA (Jin et al., 2021), PMC-VQA (Zhang et al., 2023e), and MultiMedQA (Singhal et al., 2023a), among which MultiMedQA is a comprehensive medical QA dataset containing 7 medical QA datasets, covering both multiple-choice and open-ended questions, intended to comprehensively evaluate the authenticity, helpfulness, accuracy, and potential harm of LLMs' responses. Because QA datasets not only contain specialized medical knowledge but are also concise and close to clinical QA scenarios, they are used not only as training datasets but also as benchmarks to test the medical capabilities of medical LLMs and MLLMs.

Dialogue: High-quality pre-training corpora such as EHRs and scientific literature can significantly enhance the medical performance of LLMs and MLLMs. However, these datasets only provide fundamental theoretical knowledge; training models solely on them may leave the models lacking interactive capability. Fine-tuning on dialogue data can enhance a model's ability to interact with patients and understand their queries and needs (Li et al., 2023e), so researchers are dedicated to constructing high-quality dialogue datasets. For instance, HealthCareMagic-100k (Li et al., 2023e) comprises approximately 100K authentic patient-doctor conversations collected from the online medical consultation website HealthCareMagic; both ChatDoctor (Li et al., 2023e) and DoctorGLM (Xiong et al., 2023) utilized this dataset for fine-tuning. To avoid the tedious process of collecting such real dialogue datasets, including large-scale filtering and de-duplication, researchers have also used ChatGPT or GPT-4 to simulate real dialogue scenarios and generate dialogue datasets. For example, GenMedGPT-5k (Li et al., 2023e) is a set of 5K doctor-patient dialogues generated by ChatGPT.

Medical Knowledge Bases: Medical knowledge bases, such as medical libraries, also contain medical data for model training. The Unified Medical Language System (UMLS) (Bodenreider, 2004) is one of the most popular: a large medical terminology system developed by the U.S. National Library of Medicine over more than 20 years, containing about 900K medical concepts and 2M medical entities. Furthermore, the Chinese medical knowledge graph (CMeKG) (Byambasuren et al., 2019) provides medical knowledge about diseases, drugs, and symptoms. Although such resources contain structured data that do not conform to the training data format, they can be processed into plain text using ChatGPT or GPT-4. For example, BenTsao (Wang et al., 2023d) utilized the OpenAI API to process CMeKG, generating 8K instruction data for SFT.

Web Data: General foundation models like LLaMA and GPT-3 extensively utilize web data for training. Similarly, in the medical domain, there exists a large amount of web medical data suitable for training, with Reddit, Twitter and Wikipedia being the sources of these data. For instance, TCM-Corpus-1B (Yang et al., 2023d) is a traditional medicine dataset collected from Baidu Baike and Wikipedia. After undergoing data cleaning processes, TCM-Corpus-1B contains approximately 20GB of textual information and serves as one of the training datasets for TCM-GPT (Yang et al., 2023d).

Multimodal Medical Image-Text Pairs: Medical image-text pairs are primarily utilized for training medical MLLMs. For instance, the PMC-OA dataset, as previously mentioned, consists of 1.65M medical image-text pairs collected from PMC and is employed for training models such as PMC-CLIP (Lin et al., 2023) and Med-Flamingo. PMC-VQA builds upon PMC-OA by leveraging ChatGPT to generate a large number of diverse and high-quality questions; after filtering, it culminates in 227K VQA pairs. PMC-15M, also derived from PMC articles, contains 15M figure-caption pairs, surpassing the scale of MIMIC-CXR (Johnson et al., 2019) by two orders of magnitude. Additionally, several other multimodal medical datasets, such as ChiMed-VL (Liu et al., 2023d), RadMD (Wu et al., 2023), and Open-I (Demner-Fushman et al., 2016), have contributed to the development of medical MLLMs.

AI-Generated Datasets: It has been demonstrated that fine-tuning models with a large quantity of high-quality synthetic data generated by ChatGPT can significantly enhance performance in downstream tasks (Tang et al., 2023). Similarly, efforts are underway in the medical domain to use powerful general models like ChatGPT or GPT-4 to generate medical data. These data encompass various formats, including dialogues, QA pairs, and instruction-tuning data (Zhang et al., 2023d; Li et al., 2024b), and the data modality is not limited to text but also includes multimodal data. For instance, Psych8k (Liu et al., 2023a) was created by converting 260 real-life counseling recordings into text and then using GPT-4 to extract query-answer pairs from this text; GPT-4 also generates a summary of important information for each conversation to provide more context and help the model produce better responses. LLaVA-Med-Instruct (Li et al., 2024b) is a biomedical multimodal instruction-following dataset generated by GPT-4 from image-text pairs in PMC-15M, which LLaVA-Med uses for fine-tuning to achieve SOTA results on multiple benchmarks.

4.2. Construction Methods

While a small portion of medical LLMs (Peng et al., 2023b) and MLLMs are trained from scratch using large-scale medical datasets, such an approach requires extensive computational resources, costs, and time, especially considering that medical MLLMs involve not only an LLM backbone but also additional components such as a vision encoder and an alignment module, making the training costs even higher. Therefore, the mainstream approach to constructing medical LLMs or MLLMs is to fine-tune a general foundation model on medical datasets. To provide a detailed overview of the entire construction process, this subsection first reviews the classic pre-training methods of general foundation models and then summarizes the fine-tuning methods for transferring general foundation models to the medical domain. Finally, considering the significant computational costs associated with pre-training, we additionally introduce the scaling law (Kaplan et al., 2020) to help researchers design and train LLMs and MLLMs more efficiently, avoiding unnecessary waste of computational resources.

4.2.1. Pre-Training Methods

For LLMs, all pre-training methods aim to equip the model with excellent abilities in comprehension, reasoning, generation, etc. For MLLMs, the goal of pre-training is to align vision features with text features (Caffagni et al., 2024) to bridge the gap between modalities. Next, we introduce the pre-training methods for LLMs and MLLMs respectively.

Masked Language Modeling: Masked language modeling (MLM) was first introduced in BERT, where the idea is to randomly mask a certain percentage of input tokens and then have the model predict these masked tokens. This training method enables the model to learn token-level representations and inherently makes the model bidirectional, as the representation of a masked token is learned from the surrounding words on both sides. Additionally, the MLM task is also applicable in the multimodal domain, where, given an image-text pair, a portion of the text tokens are randomly masked and the model is tasked with reconstructing them based on the image representations and the unmasked tokens.
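
As a concrete illustration, the sketch below shows a BERT-style masking routine in PyTorch. It assumes a generic tokenizer that exposes a [MASK] token id and a vocabulary size; the 80/10/10 corruption split follows the scheme described in the BERT paper, and all values are illustrative rather than tied to any model discussed in this survey.

```python
# A BERT-style masking routine, assuming a generic tokenizer with a [MASK] id and known
# vocabulary size; the 80/10/10 corruption split follows the scheme described in BERT.
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100                                   # loss only on masked positions
    # 80% of masked positions become [MASK].
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id
    # 10% become a random token; the remaining 10% are left unchanged.
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]
    return input_ids, labels

ids = torch.randint(0, 30522, (2, 16))                       # fake token ids
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522)
```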

Next Sentence Prediction: Next sentence prediction (NSP) was initially introduced in BERT, and the idea is to have the model predict whether two segments follow each other in the original text. This training method enables the model to learn sentence-level representations and understand the relationship between two sentences. Although experiments with BERT demonstrated the effectiveness of the NSP task for QA and natural language inference, Liu et al. (Liu et al., 2019) showed that removing NSP can slightly improve downstream performance, and the NSP task was gradually abandoned in the development of subsequent LLMs.

Next Token Prediction: Next token prediction (NTP) is the core task of the GPT series and is currently the mainstream pre-training task for LLMs. The idea of NTP is that the model predicts the next token based on the context of the input: given the input text, the model assigns probabilities to all tokens in the vocabulary and selects the token with the highest probability as the predicted output. Because NTP has been shown to be more efficient (He et al., 2023) and more helpful in improving a model's generative capacity, researchers prefer NTP over MLM as the pre-training task.
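
The following minimal PyTorch sketch illustrates the NTP objective: logits over the vocabulary are compared against the input shifted by one position using a cross-entropy loss. The tiny embedding-plus-linear model is a toy stand-in for a Transformer decoder, not an implementation of any model mentioned above.

```python
# A toy next-token-prediction step: the "model" (an embedding plus a linear head standing
# in for a Transformer decoder) scores the vocabulary, and cross-entropy compares the
# predictions with the input shifted by one position.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

tokens = torch.randint(0, vocab_size, (4, 32))      # (batch, sequence length)
logits = model(tokens[:, :-1])                      # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
loss.backward()
next_token = logits[:, -1].argmax(dim=-1)           # greedy choice of the next token
```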

Image-Text Matching: Image-text matching (ITM) is a binary classification task that requires the model to predict whether an image and a text match, aiming to force the model to learn fine-grained alignment between image and text representations (Li et al., 2023d). The key to this task is to fuse image features and text features into a single vector that can be conveniently fed into a classifier. To achieve this, BLIP-2 employs bidirectional self-attention masks in which all queries and text tokens can attend to each other, so that the output query embeddings integrate information from both the image and the text and are fed into the classifier to obtain the matching probability.

Image-Text Contrastive Learning: The purpose of image-text contrastive learning (ITCL) is to align image representations with text representations to maximize information interaction. Specifically, the main idea of ITCL is to feed multiple image-text pairs into a vision encoder and a text encoder, respectively, and then compute the similarity between the resulting visual and textual representations, with the goal of maximizing the similarity of paired positive samples and minimizing the similarity of the remaining unpaired negative samples (Radford et al., 2021). BLIP-2 introduces the ITCL task in Q-Former to force the queries to extract the visual representations most relevant to the text.
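
A minimal sketch of the symmetric InfoNCE loss used in CLIP-style ITCL is given below; it assumes that batched image and text features have already been produced by the two encoders, and the temperature value is illustrative.

```python
# A CLIP-style symmetric InfoNCE loss, assuming batched image and text features have
# already been produced by the vision and text encoders; the temperature is illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    image_feats = F.normalize(image_feats, dim=-1)           # cosine similarity via dot product
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))                   # matched pairs lie on the diagonal
    # Symmetric objective: image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```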

Image-Text Generation: Image-text generation (ITG) is a mainstream multimodal pre-training task whose core idea is to introduce images as contextual conditions for text generation on top of the NTP task. For instance, Flamingo utilizes cross-attention layers to pass visual representations from the Perceiver Resampler to the LLM, incorporating visual information into the NTP task. In BLIP-2, a multimodal causal self-attention mask is used to pass the query vectors carrying visual representations to the text tokens, enabling text generation conditioned on image information. Because the purpose of pre-training MLLMs is to achieve feature alignment, the ITG task in the pre-training phase usually only requires simple image descriptions. For example, during pre-training of LLaVA, the model is given an image and an instruction prompt and is required to provide a brief description of the image, with the predicted answer being the original caption.

4.2.2. Fine-Tuning Methods

Although large models pre-trained on large-scale datasets perform well in general domains, they do not perform as well in the medical field due to the lack of domain-specific knowledge. However, the cost of training medical LLMs or MLLMs from scratch is prohibitively high, so fine-tuning is a key technique for constructing such models (Li et al., 2024b; Singhal et al., 2023a; Gema et al., 2023). Next, we introduce several typical fine-tuning methods. In particular, since the emphasis of this subsection is on how to adapt general foundation models to the medical domain, we classify continuous pre-training (CPT) as a fine-tuning method.

Continuous Pre-Training: CPT (Wang et al., 2023f; Wu et al., 2024b) refers to the continuation of pre-training of pre-trained foundation models on medical datasets using methods such as NTP and ITG. Because these foundation models already exhibit good performance, CPT achieves satisfactory results with less data and training time. In the medical domain, classic models utilizing CPT include MEDITRON-70B (Chen et al., 2023a), which builds upon LLaMA 2 by employing a medical hybrid dataset consisting of Clinical Guidelines, PubMed papers and abstracts for CPT. In the multimodal domain, Med-Flamingo conducts CPT using an image-text interleaved dataset MTB and an image-text paired dataset PMC-OA on the basis of OpenFlamingo (Awadalla et al., 2023). Additionally, medical MLLMs such as LLaVA-Med, XrayGPT, XrayGLM, and Qilin-Med-VL undergo CPT on biomedical datasets to expand the vocabulary of aligned image-text tokens to the biomedical domain or inject biomedical knowledge into base models.

Figure 5. Example of instruction data.

Instruction Fine-Tuning: Although LLMs and MLLMs are able to understand and output biomedically relevant knowledge after CPT on biomedical datasets, these models often lack instruction-following capability (Li et al., 2024b; Wei et al., 2021) or exhibit uncontrolled behavior. The purpose of instruction fine-tuning (IFT) (Wei et al., 2021) is to enhance the ability of LLMs or MLLMs to follow various human instructions by fine-tuning them on instruction data. As illustrated in Fig. 5, instruction data consist of three key components: the instruction, the input, and the output, with the input being optional. These instruction data are usually generated by ChatGPT or GPT-4 from manually curated seed instruction data (Zhang et al., 2023d; Wang et al., 2022a) or instruction templates (Wei et al., 2021). Fine-tuning on such data can significantly improve the model's ability to comprehend and follow instructions, thereby enhancing zero-shot performance (Li et al., 2024b; Singhal et al., 2023a; Wei et al., 2021). For instance, Flan-PaLM, fine-tuned with instruction datasets, outperforms the baseline PaLM on MedQA, MedMCQA, and PubMedQA.
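
For illustration, a hypothetical medical instruction-tuning sample in the common instruction/input/output format might look as follows; the content and the prompt template are invented and not drawn from any dataset cited above.

```python
# A hypothetical medical instruction-tuning sample (instruction / optional input / output);
# the content and the prompt template are invented for illustration.
sample = {
    "instruction": "You are a medical assistant. Answer the patient's question concisely.",
    "input": "I have had a dry cough and mild fever for three days. What could it be?",
    "output": "These symptoms are consistent with a viral upper respiratory infection, "
              "but other causes should be ruled out; please consult a clinician if they worsen.",
}

# During fine-tuning the fields are typically concatenated into one prompt, and the loss
# is computed only on the output tokens.
prompt = (f"### Instruction:\n{sample['instruction']}\n\n"
          f"### Input:\n{sample['input']}\n\n### Response:\n")
```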

Supervised Fine-Tuning: Although LLMs and MLLMs can significantly improve their ability to follow user instructions after IFT, they may still generate useless, unsafe, or biased responses, as shown in Fig. 6. Therefore, it is necessary to perform SFT on high-quality datasets that are both useful and harmless. Here, it is essential to emphasize the distinctions among CPT, IFT, and SFT. CPT focuses on further training the foundation model on large-scale medical datasets to incorporate medical knowledge into the model. SFT and IFT are not strictly differentiated (He et al., 2023), but subtle differences between them can be found in several studies (Li et al., 2024b; Liu et al., 2024a; He et al., 2023; Yang et al., 2023e). IFT emphasizes fine-tuning the model on instruction data to strengthen its ability to follow user instructions; given the diversity of medical tasks and scenarios, each with different instructions, the instruction data should be versatile (Zhou et al., 2023a). SFT, on the other hand, emphasizes fine-tuning the model on high-quality human-annotated datasets to further enhance its professional capability and, most importantly, to drive the model to align with human preferences and ethical norms. In summary, CPT emphasizes injecting medical expertise into the model, IFT emphasizes strengthening the model's ability to follow instructions, and SFT focuses on aligning the model with human preferences and ethical norms.

Figure 6. Example of a model that has not undergone SFT producing unsafe responses.

Reinforcement Learning from Human Feedback: Reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Casper et al., 2023) is a method that further aligns the model's behavior with human preferences and instructions. Compared to the previous three fine-tuning methods, RLHF is more complex and can be divided into three steps (Ouyang et al., 2022; Casper et al., 2023; Stiennon et al., 2020): collecting human feedback, training the reward model, and policy optimization, as illustrated in Fig. 7. In the human feedback collection stage, the main task is to collect comparison data. Typically, a pre-trained model or a supervised baseline model is given a prompt and generates multiple outputs, which expert labelers then evaluate and annotate based on their relative quality (Stiennon et al., 2020); these prompts and annotated outputs constitute the comparison data. For instance, Zhongjing (Yang et al., 2024b) employed 6 medical graduate students or clinical doctors as labelers to rank the model's outputs on dimensions such as safety, professionalism, and fluency, forming a comparison dataset. In the reward model training phase, a reward model is trained on the collected comparison data; its output is a scalar reward that numerically reflects human preference. In the policy optimization phase, a new prompt is given to the model being optimized, the reward model outputs a scalar reward based on that model's response, and the model is fine-tuned on these scalar rewards through Proximal Policy Optimization (PPO). It is worth noting that the quality of the reward-model data is lower than that of the data used for SFT (He et al., 2023), so RLHF is usually performed after IFT and SFT (Touvron et al., 2023b; Casper et al., 2023); jumping directly from pre-training to RLHF and relying on these relatively low-quality data may not be sufficient to achieve the desired fine-tuning results.
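
The sketch below illustrates the pairwise ranking loss commonly used in the reward-model training step; `reward_model` is an assumed callable that maps a batch of token ids to one scalar reward per sequence, and the toy stand-in at the end exists only to make the snippet runnable.

```python
# A minimal sketch of the pairwise loss used to train the reward model in RLHF.
# `reward_model` is assumed to map a batch of token ids to one scalar reward per sequence.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # (batch,) rewards for preferred responses
    r_rejected = reward_model(rejected_ids)  # (batch,) rewards for dispreferred responses
    # Bradley-Terry style objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy stand-in reward model and fake token ids, only to make the snippet runnable.
dummy_rm = lambda ids: ids.float().mean(dim=-1)
loss = pairwise_reward_loss(dummy_rm, torch.randint(0, 100, (4, 12)),
                            torch.randint(0, 100, (4, 12)))
```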

Figure 7. Pipeline of RLHF. Left: the human feedback collection phase, where, given a prompt, the labeler ranks multiple responses from the model, and the prompt and labeled responses are collected. Middle: the reward model training phase, where a prompt and two responses are randomly sampled from the collected dataset and used to train the reward model. Right: the policy optimization phase, where, given a new prompt, the reward model outputs a scalar reward based on the model's response, which is then used for policy optimization.

Reinforcement Learning from AI Feedback: Reinforcement learning from AI feedback (RLAIF) can be seen as a cost-effective alternative to RLHF: its reward model learns from AI feedback without the need for human annotation (Bai et al., 2022). In the medical domain, Zhang et al. (Zhang et al., 2023a) sampled multiple responses from the model fine-tuned with IFT and SFT, used ChatGPT to rate the responses on informativeness, coherence, adherence to human preferences, and accuracy, and then used this rated comparison data to train a reward model. Training reward models through AI feedback in this way removes the need for manual data labeling in RLHF and significantly reduces labor costs.

Direct Preference Optimization: Although RLHF and RLAIF enable models to align with human preferences and ethical norms, they typically require fitting a reward model that reflects human preferences and then combining reinforcement learning to fine-tune LLMs and MLLMs, which is a complex and often unstable process. Direct preference optimization (DPO) (Rafailov et al., 2024) is a simpler and more efficient training paradigm for aligning human preferences, which skips fitting a reward model and optimizes the model directly using preference data. The core idea of DPO is to leverage an analytical mapping from the reward function to the optimal policy, converting the loss on the reward function into a loss on the policy, thereby skipping the explicit reward modeling step. For instance, Qilin-Med (Ye et al., 2023), after SFT, directly employs two publicly available preference datasets to optimize the model through DPO, ensuring stable and efficient training while aligning the model with human preferences.
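
A minimal sketch of the DPO objective is shown below, assuming the per-sequence log-probabilities of the chosen and rejected responses under the trained policy and the frozen reference model have already been computed; the beta value and the toy numbers are illustrative.

```python
# A minimal sketch of the DPO loss. Inputs are per-sequence log-probabilities of the
# preferred ("chosen") and dispreferred ("rejected") responses under the policy being
# trained and under a frozen reference model; beta controls the strength of the constraint.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are the log-ratios between the policy and the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # Maximize the margin between chosen and rejected without an explicit reward model.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

# Toy call with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```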

Parameter-Efficient Fine-Tuning: The aim of methods such as CPT, IFT, SFT, RLHF, RLAIF, and DPO introduced above is to transfer general foundation models to the medical domain while aligning them with user instructions and human preferences. Although the amount of training data required by these methods is much less than that required for pre-training the foundation model, full-parameter fine-tuning still incurs high computational cost and overhead. To alleviate this issue, a series of parameter-efficient fine-tuning (PEFT) methods have been proposed, which update only a small portion of model parameters while keeping the majority of pre-trained weights frozen, thereby reducing computational costs. Some mainstream PEFT approaches are described next.

Prefix-tuning (Li and Liang, 2021) adds learnable tokens to the input sequence as a prefix while freezing the other pre-trained parameters. Adapter-tuning (Houlsby et al., 2019; Hu et al., 2023) inserts small neural network modules into Transformer blocks and freezes the remaining pre-trained parameters during fine-tuning, training only the inserted modules. LoRA (Hu et al., 2021) employs low-rank matrices to approximate updates to the full-rank weight matrices, which not only has fewer trainable parameters and higher training throughput but also avoids the inference latency introduced by adapter-tuning. Prompt-tuning (Lester et al., 2021) is similar to prefix-tuning, but it only adds learnable tokens before the input tokens of the first Transformer layer. LayerNorm-tuning (Zhao et al., 2023a) adjusts only the LayerNorm weights within each attention block, significantly reducing trainable parameters; compared to LoRA, models using LayerNorm-tuning achieved an average performance improvement of over 20% across five multimodal tasks (Zhao et al., 2023a). P-tuning (Liu et al., 2023g) is also similar to prefix-tuning, but it incorporates learnable virtual tokens only in the input layer, and the insertion position is flexible rather than limited to a prefix. These PEFT methods focus on efficiently updating model parameters, while the previously discussed IFT, SFT, and related methods focus on effectively enhancing model performance; they do not conflict with each other. Typically, PEFT methods are combined with IFT, SFT, and similar methods to fine-tune models, achieving better performance under limited computational budgets.
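
As an illustration of the idea behind LoRA, the sketch below wraps a frozen linear layer with a trainable low-rank update; the rank and scaling values are illustrative, and the class is a simplified teaching example rather than a faithful reproduction of the original implementation.

```python
# A simplified LoRA wrapper around a frozen linear layer: the pre-trained weight is kept
# fixed and only the low-rank matrices A (r x d_in) and B (d_out x r) are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pre-trained weight and bias
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus trainable low-rank correction.
        return self.base(x) + (x @ self.lora_A.t() @ self.lora_B.t()) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))             # only lora_A and lora_B receive gradients
```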

4.2.3. Scaling Law

The scaling law (Kaplan et al., 2020), first proposed by OpenAI in 2020, is often regarded as the Moore's Law of the LLM era. It describes the relationship between a model's performance and three factors: model size, dataset size, and the amount of compute used for training. Specifically, the scaling law states that model performance improves smoothly as model size, data size, and training compute increase, and empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two (Kaplan et al., 2020). To achieve optimal performance, these three factors need to scale together, and research (Hoffmann et al., 2022) has demonstrated that model and dataset size should increase in proportion. By adhering to the scaling law, researchers can first train smaller-scale models and then extrapolate performance to larger models (Anil et al., 2023; Brown et al., 2020; Achiam et al., 2023; Yang et al., 2023e). For instance, OpenAI used the scaling law to predict and validate the final loss of GPT-4 using less than one ten-thousandth of its training compute. The scaling law thus reveals the relationship between performance and model size, dataset size, and training compute, helping researchers design and train large models more effectively and allocate resources sensibly.
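
The sketch below illustrates, with entirely made-up numbers, how a power-law curve of the form L(N) = a * N^(-alpha) + c could be fitted to small-model runs and extrapolated to a larger model; the assumed irreducible-loss term and the resulting prediction are for illustration only.

```python
# An illustrative extrapolation of a power-law scaling curve L(N) = a * N^(-alpha) + c.
# The model sizes, losses, and the assumed irreducible-loss term c are all made-up numbers.
import numpy as np

sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])        # hypothetical model sizes (parameters)
losses = np.array([3.90, 3.52, 3.18, 2.87, 2.60])  # hypothetical validation losses
c = 1.7                                            # assumed irreducible loss

# Fit log(L - c) = log(a) - alpha * log(N) with ordinary least squares.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses - c), 1)
alpha, a = -slope, np.exp(intercept)

predicted = a * (7e10) ** (-alpha) + c             # extrapolate to a 70B-parameter model
print(f"alpha = {alpha:.3f}, predicted loss at 70B parameters: {predicted:.2f}")
```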

5. Evaluation Methods and Usage Tips

With the emergence of the capabilities of medical LLMs and MLLMs, how to comprehensively evaluate their performance has become a key issue. Considering the various ethical and safety issues of generative models (Deshpande et al., 2023), there is an urgent need for more comprehensive benchmarks and evaluation methods that assess the various capabilities of medical LLMs and MLLMs beyond the quality of the generated text. Furthermore, researchers continue to explore the hidden capabilities of LLMs and MLLMs, such as using a series of prompting methods (Wei et al., 2022b; Wang et al., 2022d; Yao et al., 2024; Madaan et al., 2024; Zhou et al., 2022) to enhance model performance. To further assist researchers and medical practitioners in understanding the entire process of developing medical LLMs and MLLMs, we discuss the final and indispensable step, evaluating medical LLMs and MLLMs, in Section 5.1. Additionally, to help users unleash the deeper professional capabilities of medical LLMs and MLLMs and utilize them in clinical settings, we introduce some practical usage tips in Section 5.2.

5.1. Evaluation Methods

Due to the diversity of tasks and capabilities of medical LLMs and MLLMs, benchmark datasets (Jin et al., 2021; Zhang et al., 2023e; Li et al., 2023f) and evaluation methods for medical LLMs and MLLMs have become increasingly diverse. For discriminative tasks (including single and multiple-choice questions) (Jin et al., 2021, 2019), accuracy is commonly used to measure model performance. For generative tasks, automatic evaluation metrics (Papineni et al., 2002; Lin, 2004; Li et al., 2015) are often employed to evaluate aspects such as the accuracy, fluency, and diversity of the responses generated by the model. Nevertheless, this approach overlooks additional concerns in the medical domain, such as reliability, safety, and consistency with human values. Therefore, in addition to using automatic evaluation metrics to evaluate medical LLMs and MLLMs, researchers have also introduced human evaluation and AI evaluation. It is important to note that this section does not involve the introduction of benchmark datasets but emphasizes the three evaluation paradigms: automatic evaluation metrics, human evaluation, and AI evaluation.

5.1.1. Automatic Evaluation Metrics

For medical LLMs and MLLMs, accuracy is commonly utilized to evaluate performance on choice-question datasets such as MedQA (Jin et al., 2021) and MedMCQA (Pal et al., 2022). However, accuracy does not measure the generative capacity of medical LLMs and MLLMs, so the following metrics are needed for a comprehensive evaluation.

Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) evaluates the quality of generated text by computing the similarity of n-grams (sequences of consecutive words of length n) between the generated text and the reference text. Depending on the value of n, BLEU is divided into BLEU-1, BLEU-2, BLEU-3, and BLEU-4, which measure n-gram similarity at different lengths; e.g., BLEU-1 measures word-level precision, while BLEU-4 focuses more on the continuity of the text. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) includes ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. Similar to BLEU, ROUGE-N measures the n-gram similarity between the generated and reference text, but ROUGE-N computes n-gram recall whereas BLEU focuses more on precision. ROUGE-L measures the similarity between the generated and reference text via the length of their longest common subsequence, emphasizing textual coherence. ROUGE-W builds upon ROUGE-L by weighting the computation of common subsequences, assigning larger weight to correctly matched contiguous text. ROUGE-S is an extension of ROUGE-N that allows non-contiguous words in n-grams. Google BLEU (GLEU) (Wu et al., 2016) is a variant of BLEU that considers factors such as lexical overlap and word order between generated and reference text, better reflecting the fluency and naturalness of generated text. The Distinct-n (Li et al., 2015) metric measures the diversity of generated text by calculating the proportion of unique n-grams to total n-grams. CIDEr (Vedantam et al., 2015) is designed specifically to evaluate the quality of image captions; it considers both n-gram recall and precision and weights rare n-grams to assess whether the model captures key information when generating image descriptions. BERTScore (Zhang et al., 2019) utilizes pre-trained BERT contextual embeddings to compute similarity scores between each token in the generated sentence and each token in the reference sentence; compared to n-gram-based metrics, BERTScore better captures semantic similarity beyond exact lexical overlap.
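
As a usage illustration, the snippet below computes BLEU-4 and ROUGE-L for a toy radiology sentence pair, assuming the nltk and rouge-score Python packages are installed; the texts are invented examples.

```python
# Toy example of computing BLEU-4 and ROUGE-L, assuming `nltk` and `rouge-score` are installed.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the chest x-ray shows no acute cardiopulmonary abnormality"
candidate = "chest x-ray shows no acute abnormality"

# BLEU-4 (default weights) with smoothing, since short sentences often lack 4-gram overlap.
bleu4 = sentence_bleu([reference.split()], candidate.split(),
                      smoothing_function=SmoothingFunction().method1)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
print(f"BLEU-4: {bleu4:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```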

In the medical domain, most models, such as HuaTuoGPT, ClinicalGPT (Wang et al., 2023g), SoulChat, and BianQue (Chen et al., 2023e), utilize the aforementioned metrics to evaluate generative ability. Although these automatic evaluation metrics partially reflect the accuracy and fluency of model-generated text, they fail to capture the clinical quality of dialogue (Tu et al., 2024b) and cannot evaluate whether the generated text consistently aligns with human values; therefore, human evaluation is introduced.

5.1.2. Human Evaluation

Human evaluation is a crucial method for assessing the performance of medical LLMs and MLLMs, as it can consider aspects that automatic evaluation metrics may overlook. For instance, Tu et al. (Tu et al., 2024b) argued that metrics like BLEU and ROUGE fail to capture the clinical quality of medical consultations, and therefore invited 23 medical experts from the United States, the United Kingdom, and India to evaluate the model-generated responses in terms of accuracy, appropriateness, and comprehensiveness. Similarly, Yang et al. (Yang et al., 2024b) employed human experts to evaluate the safety, accuracy, and ethical implications of the model responses. Chen et al. (Chen et al., 2023f) asked evaluators to rate generated responses on content naturalness, level of empathy, helpfulness, and safety.

It is evident that human evaluation can encompass aspects such as safety and helpfulness, which are crucial for medical LLMs and MLLMs. However, although human evaluation can assess various capabilities of medical LLMs and MLLMs, it is inherently subjective due to the lack of standardized evaluation criteria among experts; additionally, hiring medical experts incurs extra costs, so AI evaluation is a feasible alternative to human evaluation.

5.1.3. AI Evaluation

Using a high-performing AI model that aligns with human values, such as ChatGPT or GPT-4, to evaluate the responses of medical LLMs and MLLMs is currently a dominant evaluation method (Wang et al., 2023b; Luo et al., 2023). Wang et al. (Wang et al., 2023b) conducted experiments on five natural language generation evaluation datasets, demonstrating that ChatGPT, as an evaluation tool, outperformed automatic evaluation metrics in most cases and was comparable to human evaluation. In the medical field, Li et al. (Li et al., 2024b) presented medical questions to GPT-4 and LLaVA-Med and then asked GPT-4 to rate the responses from both models on helpfulness, relevance, accuracy, and level of detail. Liu et al. (Liu et al., 2023a) prompted GPT-4 to judge whether responses from LLMs are acceptable and whether their tone resembles that of human counselors.
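
A hedged sketch of this GPT-4-as-judge setup is given below; the rubric wording, placeholder answers, and model name are illustrative, and an OpenAI-compatible client configured with an API key is assumed.

```python
# A sketch of GPT-4-as-judge scoring, assuming an OpenAI-compatible client and API key.
# The rubric wording, model name, and placeholder answers are illustrative only.
from openai import OpenAI

client = OpenAI()

question = "What are the typical symptoms of iron-deficiency anemia?"
answer_a = "..."   # response from the medical model under evaluation
answer_b = "..."   # response from a baseline model

judge_prompt = (
    "You are an impartial medical evaluator. Given a patient question and two assistant "
    "answers, rate each answer from 1 to 10 on helpfulness, relevance, accuracy, and level "
    "of detail, then briefly justify the scores.\n\n"
    f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": judge_prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```

Swapping the order of the two answers and averaging the scores is a simple way to reduce the position bias discussed below.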

Although AI evaluation offers scalability and reduces the need for human involvement, it still has limitations. Studies (Zheng et al., 2024a; Xu et al., 2023a) have shown that, as an evaluation tool, GPT-4 tends to prefer the first answer: when multiple answers are presented in sequence, GPT-4 often considers the first answer to be superior. Additionally, GPT-4 also favors longer answers and answers generated by itself (Liu et al., 2023a). Therefore, to address the issues associated with the aforementioned three paradigms, combining multiple evaluation methods may yield more reliable results. Moreover, leveraging reinforcement learning or other methods to train specialized LLMs or MLLMs that align with human judgment criteria as evaluation tools may overcome the limitations of AI evaluation.

5.2. Usage Tips

Research has found that simply adjusting the form and structure of the input can unlock deeper professional capabilities of medical LLMs (Nori et al., 2023b). Thus, a new research field has emerged, prompt engineering (Meskó, 2023; Heston and Khun, 2023), which aims to enhance the quality of model responses through efficient prompting strategies that require no further training and can be flexibly integrated into any medical LLM or MLLM. To help researchers and medical practitioners maximize the medical expertise of models when handling medical tasks, this subsection combines the ICL ability of LLMs and MLLMs with prompt engineering and summarizes seven commonly used and efficient usage tips, as shown in Fig. 8: zero-shot, few-shot, chain of thought, self-consistency, tree of thoughts, self-refine, and least-to-most, which are referred to as prompting methods in the field of LLMs and MLLMs.

Zero and Few-Shot Prompting: Zero-shot prompting is the simplest prompting strategy, aiming to instruct models on how to perform a task through a single instruction. Although zero-shot prompting is straightforward, requiring only a brief description of the task, the limited information in such instructions restricts how much of the model's capability can be exploited. Few-shot prompting builds upon zero-shot prompting by providing additional in-context instances as demonstrations, addressing the lack of information in zero-shot prompting. Through few-shot prompting, the model can learn by analogy from the demonstrations to accurately execute new tasks (Liévin et al., 2023), effectively improving performance across various tasks. It is worth noting that this few-shot capability emerges only when the model exceeds a certain scale and does not exist in smaller models (Wei et al., 2022a). The standard few-shot prompting strategy was introduced in GPT-3, and precisely because of GPT-3's powerful ICL and few-shot capability, we consider GPT-3 the beginning of LLMs.
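
The two illustrative prompts below contrast the zero-shot and few-shot formats on a made-up medical multiple-choice question; the demonstration in the few-shot prompt is what the model learns from in context.

```python
# Illustrative zero-shot and few-shot prompts for a made-up medical multiple-choice question.
zero_shot = (
    "Answer the following medical question with a single letter.\n"
    "Question: Which vitamin deficiency causes scurvy?\n"
    "Options: (A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D\n"
    "Answer:"
)

few_shot = (
    "Answer each medical question with a single letter.\n\n"
    "Question: Which electrolyte abnormality is most associated with peaked T waves?\n"
    "Options: (A) Hypokalemia (B) Hyperkalemia (C) Hyponatremia (D) Hypercalcemia\n"
    "Answer: B\n\n"      # in-context demonstration the model can learn from
    "Question: Which vitamin deficiency causes scurvy?\n"
    "Options: (A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D\n"
    "Answer:"
)
```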

Figure 8. Examples of the 7 prompting methods. The summary of these methods is inspired by Kaddour et al. (Kaddour et al., 2023).

Chain of Thought Prompting: Chain of thought (CoT) prompting enhances the accuracy and interpretability of responses generated by LLMs or MLLMs by prompting them to generate a series of intermediate reasoning steps (Wei et al., 2022b), aiming to simulate the cognitive and reasoning processes humans use when solving new problems. As a prompting strategy, CoT does not conflict with zero- and few-shot prompting and is often combined with them. For example, zero-shot CoT prompting significantly improves model performance by adding "Let's think step by step" to the instruction without providing example demonstrations (Kojima et al., 2022). Few-shot CoT prompting provides examples with reasoning steps to help the model learn reasoning methods and thereby improve accuracy on new tasks. In the medical field, CoT prompting is employed in models such as Med-PaLM, Med-PaLM 2, and MEDITRON-70B to request LLMs to think step by step and provide their reasoning, thus offering more explainable diagnostic results. Additionally, the concept of CoT can be extended to the training phase, such as introducing CoT datasets during model fine-tuning (Lu et al., 2022), fundamentally enhancing the model's logical reasoning abilities. Unfortunately, such CoT datasets are not yet available in the medical domain.

Self-Consistency Prompting: Building on CoT prompting, Wang et al. (Wang et al., 2022d) proposed self-consistency (SC) prompting, which samples a set of different reasoning paths and then selects the most consistent answer by marginalizing out the sampled reasoning paths. During reasoning, the correct answer may be reached through multiple paths, and the goal is to select the answer that is consistent across paths, so that even an occasional wrong reasoning path does not affect the final answer. SC prompting is particularly suitable for tasks with complex reasoning paths, such as mathematics (Wang et al., 2022d) and medicine (Nori et al., 2023b), and has been demonstrated to be effective. In the medical domain, for instance, SC prompting enabled MEDITRON-70B to achieve the highest average accuracy, and Flan-PaLM also exhibited significant improvements compared to standard few-shot prompting.
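
A minimal sketch of the SC procedure is shown below: several reasoning paths are sampled at a non-zero temperature and the most frequent final answer is kept. The `generate` callable stands in for any LLM sampling interface and is an assumption, as is the convention that the final answer appears on the last line.

```python
# A minimal sketch of self-consistency: sample several chain-of-thought completions and
# keep the most frequent final answer. `generate` stands in for any LLM sampling call.
from collections import Counter

def self_consistent_answer(generate, prompt, n_samples=5, temperature=0.7):
    answers = []
    for _ in range(n_samples):
        completion = generate(prompt, temperature=temperature)   # one sampled reasoning path
        answers.append(completion.strip().splitlines()[-1])      # assume the last line is the answer
    return Counter(answers).most_common(1)[0][0]                 # majority vote
```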

Tree of Thoughts Prompting: Tree of thoughts (ToT) prompting (Yao et al., 2024) extends CoT prompting to a thought tree containing multiple reasoning paths, where each node can be seen as a thought that serves as an intermediate step in problem-solving and can be further subdivided. ToT prompting allows the model to look ahead or backtrack when necessary to make global choices, addressing the poor performance caused by a strictly left-to-right decision process in tasks that require exploration or strategic lookahead, or where initial decisions play a pivotal role.

Self-Refine Prompting: Humans continually refine themselves through self-feedback, while neural networks improve by iteratively backpropagating errors and updating parameters. Drawing on this idea of continuous feedback and self-improvement, Madaan et al. (Madaan et al., 2024) proposed self-refine prompting, which prompts the model to provide feedback on its own response, improve the previously generated response based on that feedback, and repeat this process over several iterations to obtain the final response.

Least-to-Most Prompting: While CoT prompting provides reasoning examples to help models learn reasoning methods and solve problems efficiently, it often struggles when faced with problems more challenging than those presented in the prompts. To solve this problem, Zhou et al. (Zhou et al., 2022) proposed least-to-most prompting, which decomposes a complex problem into a series of simpler sub-problems and generates the final output step by step by solving these sub-problems sequentially, using the answers to the solved sub-problems as prompts for subsequent ones. Experimental results have shown that least-to-most prompting enables models to tackle problems more difficult than those presented in the prompts and significantly outperforms CoT prompting in some tasks (Zhou et al., 2022).

6. Applications of LLMs and MLLMs in Medicine

The excellent performance of GPT-4 and Med-PaLM 2 on medical tasks highlights the immense potential of these powerful general or medical LLMs and MLLMs in medical applications (Singhal et al., 2023b; Nori et al., 2023a; Lee et al., 2023a). To help practitioners quickly understand the developmental directions of LLMs and MLLMs in medicine, in this section we summarize the current potential applications of LLMs and MLLMs in medicine and healthcare, as shown in Fig. 9, and briefly discuss how these models can be leveraged to perform various medical tasks.

Figure 9. Applications, challenges and future directions of LLMs and MLLMs in medicine.

6.1. Medical Diagnosis

The development of AI in medical diagnosis has spanned several decades (Szolovits et al., 1988; Kononenko, 2001; Bakator and Radosav, 2018), and despite achieving some breakthroughs, its role has primarily been limited to assisting tasks within the diagnostic process, such as medical image segmentation (Zhou et al., 2018; Xiao et al., 2023) and lesion detection and classification (Albahri et al., 2020; Tiu et al., 2022). Only in recent years, with the development of LLMs and MLLMs, have doctors and patients been able to look to these large models for end-to-end diagnosis. Specifically, doctors or patients can provide the models with subjective descriptions of disease symptoms (Li et al., 2023e; Wang et al., 2023d; Xiong et al., 2023) or medical images such as X-rays (Li et al., 2024b; Wang et al., 2023a; Thawkar et al., 2023), and the models can rely on this information and their embedded medical knowledge to directly make a diagnosis, which greatly increases the flexibility of diagnosis.

Currently, Med-PaLM 2, as one of the top-performing medical LLMs, generates answers to consumer medical questions and adversarial questions that outperform physician-generated answers on multiple assessment axes (Singhal et al., 2023b), demonstrating the viability of medical LLMs as medical diagnostic assistants. To further broaden the application scope of LLMs as medical diagnostic assistants, researchers have fine-tuned these models on Chinese datasets (Wang et al., 2023d; Zhang et al., 2023a; Chen et al., 2023e; Yang et al., 2024b, 2023d; Xiong et al., 2023; Wang et al., 2023g), enhancing diagnostic performance in Chinese contexts. Particularly, TCM-GPT (Yang et al., 2023d) excels in traditional Chinese medicine, outperforming other models in tasks related to traditional Chinese medical examinations and diagnostics, contributing to the advancement of traditional medicine. Additionally, inspired by general MLLMs (Achiam et al., 2023; Liu et al., 2024a; Zhu et al., 2023), researchers have developed multimodal medical diagnostic assistants (Li et al., 2024b; Wang et al., 2023a, h; Tu et al., 2024a; Shu et al., 2023; Liu et al., 2023h; Thawkar et al., 2023), expanding diagnostic basis from text to medical images, thereby improving diagnostic reliability. Furthermore, to enhance the diagnostic accuracy of medical LLMs and MLLMs as diagnostic assistants, researchers have attempted to incorporate retrieval mechanisms (Li et al., 2023e; Zhao et al., 2023b), enabling models to retrieve reference information from medical websites, Wikipedia, or offline medical knowledge bases.

Medical LLMs and MLLMs as medical diagnostic assistants offer users remote consultation and diagnosis, providing a more flexible approach to medical diagnosis. However, due to some inherent limitations of LLMs and MLLMs (Rawte et al., 2023; Deshpande et al., 2023), these medical LLMs and MLLMs can currently only be used to assist doctors' diagnosis, and the generated diagnoses can only serve as references rather than final diagnostic results.

6.2. Clinical Report Generation

Clinical reports are various standardized documents written by doctors for patients. Manually drafting clinical reports is typically a tedious, time-consuming but crucial task, undeniably increasing the workload of clinicians and diminishing work efficiency. Medical LLMs and MLLMs, which possess extensive medical knowledge and excel at generative tasks, stand as efficient tools for clinical report generation.

For example, during medical diagnosis, doctors usually record important information from their communication with patients to serve as a basis for judging the condition or as a source for other report content, and medical LLMs and MLLMs can act as clinical note-taking tools to do this job for doctors (Toma et al., 2023). Specifically, doctors merely need to provide the model with recordings of interactions with patients, and after brief processing of the recordings, the model can generate medical notes (Lee et al., 2023a); doctors can also prompt the model to simplify medical notes, removing intricate details and generating summaries for easy review and analysis (Van Veen et al., 2023b). After medical diagnosis, doctors typically write a corresponding diagnostic report, such as a radiology report. Leveraging medical LLMs and MLLMs, doctors only need to provide the template for the diagnostic report and the patient's diagnostic information, and the model automatically generates the corresponding report (Van Veen et al., 2023a; Wu et al., 2023; Yang et al., 2023c). During treatment, doctors explain the cause of the disease, the treatment process, and other detailed clinical information to the patient through clinic letters. By generating clinic letters with the help of LLMs, clinicians can eliminate this tedious process, and the letters generated by LLMs are similar to human-written ones in terms of coherence, accuracy, and humaneness (Ali et al., 2023). After the patient has recovered, clinicians spend considerable time writing discharge summaries, which may delay discharge. By utilizing LLMs, clinicians can obtain complete discharge summaries in seconds by simply providing a template and some necessary requirements (Patel and Lam, 2023), and the quality of these summaries can even exceed that of summaries written by junior doctors to some extent (Clough et al., 2024).

By utilizing powerful LLMs and MLLMs, various clinical reports from patient admission to discharge can be automatically generated, and they can be more comprehensive and accurate than reports generated by humans (Van Veen et al., 2023b; Clough et al., 2024). This significantly alleviates the workload of doctors, allowing them to dedicate more time to patient care (Patel and Lam, 2023). However, we expect these powerful LLMs and MLLMs to serve solely as auxiliary tools for generating clinical reports: they can draft, modify, and summarize reports, but the final reports still need to be reviewed, edited, and approved by clinicians, who remain accountable for them (Thirunavukarasu et al., 2023; Moor et al., 2023a).

6.3. Medical Education

GPT-4 and Med-PaLM 2 passed the USMLE with scores of over 86% (Nori et al., 2023a), and GPT-4V (Yang et al., 2023b) reached 90.7%, outperforming most medical students on medical image-related questions (Yang et al., 2023f). This indicates that some LLMs and MLLMs are equipped to provide knowledge services to medical students, which offers an important opportunity to enhance medical education (Karabacak et al., 2023; Kung et al., 2023).

For example, Khanmigo (Khan, 2023) and Duolingo (Team, 2023) are considering the utilization of tools such as GPT-4 to optimize online teaching, which not only addresses medical students' questions but also offers explanations and novel insights. Beyond simply answering questions, medical LLMs and MLLMs can create more complex scenarios for medical students to practice, such as generating diverse exam content, simulating clinical scenarios, and creating digital patients (Karabacak et al., 2023; De Choudhury et al., 2023), thereby enhancing students' professional competence and practical skills. Additionally, based on students' performance in simulated exercises, medical LLMs and MLLMs can tailor personalized learning plans for them, something that is typically time-consuming in practice but that LLMs and MLLMs can achieve more economically and efficiently (Karabacak et al., 2023). In summary, leveraging powerful LLMs and MLLMs can provide medical students with rich medical content, create highly realistic and diverse medical scenarios, and broaden students' horizons in the medical field, thus laying a solid foundation for entering clinical practice.

The potential of powerful LLMs and MLLMs in medical education surpasses that of some regular medical training courses, as teachers in such courses often cannot interact with students at all times or provide personalized learning plans. Although such models hold significant potential in medical education, they can only serve as auxiliary teaching tools and cannot replace medical educators, because the inherent biases and hallucinations of LLMs and MLLMs make it difficult for students to assess the accuracy of the generated content (Han et al., 2023b; Ahn, 2023); if the models consistently provide medical students with erroneous content that is difficult to detect, they may easily mislead students over time. Therefore, LLMs and MLLMs can only play a supportive role in medical education, and students need to use these tools under the guidance and supervision of teachers.

6.4. Mental Health Services

With increasing societal pressures, there is a growing demand for mental health services globally (Prince et al., 2007), while there is a severe shortage of mental health specialists in some regions due to limited development and resources (van Heerden et al., 2023; Thomas et al., 2009). In mental health services, the main focus is on conversation-driven psychological counseling, so chatbots based on LLMs (Liu et al., 2023a; Chen et al., 2023f) may serve as one of the ways to provide mental health services in the future.

Because patients with mental illnesses tend to be more vulnerable and psychologically sensitive, mental health service chatbots usually also need to be empathetic, trusting, understanding, and comforting during conversations, rather than merely providing advice (Chen et al., 2023f). Compared to professional mental health experts, LLMs serving as mental health chatbots offer better accessibility and can provide mental health services to remote areas or areas with a shortage of mental health professionals. Additionally, a characteristic of these LLM-based chatbots is that they can provide more personalized interaction styles based on patients' historical conditions and interaction records, such as specific emotional patterns, styles, or tones (De Choudhury et al., 2023). Furthermore, the high cost of psychological counseling and therapy may deter many individuals from seeking mental health services, but LLM-based mental health chatbots can significantly reduce this cost (Zhou et al., 2023a; Stock et al., 2023), thereby lowering the threshold for accessing services. Moreover, research has shown that people are more likely to disclose their negative emotions when interacting with chatbots, as some topics might be awkward to discuss with humans but feel more comfortable to share with a robot (Chaves and Gerosa, 2021). Therefore, LLM-based mental health chatbots outperform mental health professionals in terms of convenience, cost, and acceptability, which may motivate more individuals with mental illnesses to seek mental health services (De Choudhury et al., 2023).

Mental health services are characterized by trust, mutual respect, and emotional connection, and although research is improving the empathy of LLMs (Chen et al., 2023f), they still fall short of human empathy. Moreover, despite efforts to align LLMs with human values and ethical norms through approaches such as SFT and RLHF, they may still generate content that is aggressive or psychologically harmful (Deshpande et al., 2023), which is unacceptable for psychologically vulnerable patients. Before LLM-based mental health chatbots are integrated into practical applications, more work is needed to address these issues, and greater control measures need to be implemented for such products.

6.5. Medical Language Translation

The language barrier is a key obstacle to global cultural exchange, and the same holds in medicine; with the help of LLMs, this barrier can be overcome, because LLMs are usually trained on large corpora containing multiple languages and can therefore master multiple languages with powerful translation capability (Javaid et al., 2023). In addition to cross-language translation, LLMs also enable the translation of texts containing medical terminology into understandable plain language (Zhou et al., 2023a; Lyu et al., 2023).

In recent years, machine translation has been an important tool for addressing language barriers in the medical field, having been shown to be 7% more accurate than traditional services (Karabacak et al., 2023), and powerful LLMs such as ChatGPT and GPT-4 have raised machine translation to an even higher level (Siu, 2023). With the support of such LLMs, medical professionals from diverse regions can engage in medical communication in a more inclusive environment, thereby fostering the advancement of global medicine (Karabacak et al., 2023). Additionally, medical LLMs possess extensive medical knowledge, enabling them to translate reports containing numerous medical terms into plain language, helping patients better understand their condition and improving compliance (Lyu et al., 2023). Moreover, translating medical texts containing specialized terms into plain language, such as translating traditional Chinese medicine texts, aids in disseminating valuable medical knowledge within society, contributing to its preservation and popularization.

Using LLMs as medical language translation tools is a promising application, but they still have some limitations. For instance, translated reports may omit key points, resulting in incompleteness. Another issue is the uncertainty in the model's responses: even with the same prompts, LLMs may provide inconsistent translations and present information in variable formats (Lyu et al., 2023). Therefore, before deploying LLMs as medical language translation tools, certain work needs to be done, such as further fine-tuning to enhance the completeness of model translations and reduce uncertainty.

6.6. Surgical Assistance

Medical robots have seen rapid development in the past few decades, playing a particularly significant role in enhancing the capabilities of surgeons and expanding the potential of minimally invasive surgery (Barua, 2024). In recent years, medical robotics has entered a new phase with the development of MLLMs, which can not only endow medical robots with visual capability but also provide better interactivity and a friendlier interaction environment compared to traditional medical robots.

Currently, efforts have begun to explore the application of MLLMs in surgical procedures (Seenivasan et al., 2023). Integrating MLLMs into surgical robots can enable them to perform crucial auxiliary tasks during surgery, such as assisting in endoscopic examinations (Moor et al., 2023a), where the powerful visual capability and specialized knowledge of MLLMs can provide valuable diagnostic conclusions and feasible surgical plans based on endoscopic images. In addition, while surgeons perform surgical procedures, MLLMs can use the video stream to annotate the surgical process, analyze and summarize the steps taken during the procedure, and record non-compliant operations to facilitate post-surgical review and examination.

Although medical MLLMs show promising potential in surgical assistance and may play a role in certain medical scenarios, they are not yet suitable for emergency surgeries, because erroneous information provided by MLLMs could adversely affect the surgeon's judgment, leading to irreversible consequences. Additionally, current MLLM research predominantly focuses on the vision-text modality, and we anticipate that future work will explore other modalities, such as audio and time series, to enable surgical robots to perform more comprehensive and accurate auxiliary tasks and provide more flexible interaction methods.

In this section, we have combined the characteristics of LLMs and MLLMs to discuss their potential applications in medicine and healthcare. Regardless of the task to which LLMs and MLLMs are applied in medicine, we want to emphasize that these models can only be used as assistants to medical practitioners for auxiliary tasks, not as final decision makers. The content generated by LLMs and MLLMs requires scrutiny and modification by medical practitioners before it can be applied in clinical settings, and medical practitioners remain responsible for the final content.

7. Challenges and Future Directions of LLMs and MLLMs in Medicine

Although LLMs and MLLMs have caused a wave in the AI community and achieved initial successes in medicine, the unique characteristics of medicine pose numerous challenges and risks to the development and deployment of LLMs and MLLMs. In this section, we discuss and analyze the current challenges of LLMs and MLLMs in the medical field in detail and provide some possible solutions.

7.1. Hallucinations

Hallucinations refer to the generation of seemingly plausible but unverified or incorrect information by LLMs and MLLMs (Rawte et al., 2023; Ji et al., 2023), which can lead to issues such as radiology reports containing misleading information and the dissemination of incorrect medical knowledge in medical education (Lee et al., 2023a). Such false responses can be difficult to distinguish because the model often presents them in a convincing way and the responses seem reasonable (Lee et al., 2023a). Therefore, hallucinations pose a key challenge to the practical application of LLMs and MLLMs in medicine. The hallucination problem may arise from various factors, such as unclear user instructions or a lack of relevant knowledge in the training data; moreover, autoregressive models such as ChatGPT predict subsequent tokens based on previous content, which may lead to the cumulative propagation of hallucinations (Zhang et al., 2023c). Given the particularity of the medical domain, where misdiagnoses caused by hallucinations could result in severe medical incidents, solving the hallucination problem is a key step toward the practical deployment of medical LLMs and MLLMs.

To address this challenge, some efforts have proposed new benchmark datasets for hallucination testing of medical LLMs and MLLMs (Umapathi et al., 2023). However, such benchmarks can only detect hallucination phenomena and do not directly mitigate the problem. Other research has pointed out that the knowledge of LLMs is primarily acquired during the pre-training phase (Zhou et al., 2024), and that noisy data such as erroneous information in the training set may encourage hallucinations, so the most fundamental approach to reducing hallucinations is to manually or automatically clean unreliable data from the pre-training corpus (Ji et al., 2023). However, the pre-training corpora of LLMs and MLLMs typically consist of vast amounts of data, including data crawled directly from the web, which is very difficult to clean and requires effective selection and filtering strategies. Therefore, it may be advisable to use high-quality medical datasets to reduce hallucinations during fine-tuning stages such as SFT and RLHF (Zhang et al., 2023c; Elaraby et al., 2023). The amount of data needed for fine-tuning is much smaller than that needed for pre-training, making it more feasible to manually curate and clean these datasets; by fine-tuning on such high-quality data, LLMs and MLLMs can exhibit higher levels of authenticity and factual accuracy (Cao et al., 2023; Chen et al., 2023c). To further reduce the cost of mitigating hallucinations, existing efforts have attempted to address hallucinations during the inference stage. For example, prompting LLMs or MLLMs to verify their own responses has proven effective in alleviating hallucinations (Lee et al., 2023a, b). Chain-of-Verification (CoVe) (Dhuliawala et al., 2023) is an efficient validation method in which the model first drafts an initial response, then plans verification questions based on the response, answers these verification questions to check the draft, and finally generates an optimized answer. Experiments have shown that self-verification methods like CoVe can reduce hallucinations in various tasks. Additionally, retrieval-augmented generation (RAG) has also proven to be an effective approach to reducing hallucinations (Shuster et al., 2021); it allows the model to retrieve relevant knowledge from external webpages or knowledge bases for reference during response generation (Li et al., 2023e; Sun et al., 2023), thus significantly mitigating the hallucination problem.
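
The sketch below illustrates the basic RAG loop described above: the passages most similar to the question are retrieved from an external knowledge base and prepended to the prompt. The `embed` and `generate` callables and the `knowledge_base` list are assumptions standing in for a real embedding model, LLM, and medical corpus.

```python
# A minimal sketch of retrieval-augmented generation for grounding medical answers.
# `embed` and `generate` are assumed model calls; `knowledge_base` is a hypothetical list
# of reference passages (e.g., guideline snippets or encyclopedia entries).
import numpy as np

def rag_answer(question, knowledge_base, embed, generate, top_k=3):
    q_vec = embed(question)
    doc_vecs = np.stack([embed(doc) for doc in knowledge_base])
    # Cosine similarity between the question and every passage.
    scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(knowledge_base[i] for i in np.argsort(scores)[::-1][:top_k])
    prompt = ("Answer the medical question using only the reference passages below, and "
              "say so if they are insufficient.\n"
              f"References:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate(prompt)
```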

7.2. Visual Perception Limitations

Although MLLMs possess visual abilities, their visual perception is still limited, particularly in spatial localization (Zhu et al., 2023). This limitation may stem from two factors. First, visual information is lost during modality alignment: mapping visual features directly into the word embedding space with simple linear layers (Li et al., 2024b; Liu et al., 2023d) or an MLP (Zhang et al., 2023e) discards information, and approaches like Q-Former, which represent an image with only 32 learnable query vectors, may likewise lose information. Second, MLLMs are trained on relatively simple tasks, often in the form of VQA, and lack more challenging training objectives such as object detection and image segmentation.
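
As a concrete illustration of the two alignment strategies mentioned above, the following sketch contrasts a simple linear projector, which keeps every patch token, with a simplified Q-Former-style resampler that compresses an image into 32 learnable query tokens; the dimensions and the single cross-attention layer are illustrative assumptions and do not reproduce any specific model.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not taken from any specific model).
NUM_PATCHES, VISION_DIM, LLM_DIM, NUM_QUERIES = 256, 1024, 4096, 32

class LinearProjector(nn.Module):
    """Maps every patch feature into the LLM embedding space (keeps all tokens)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, patch_feats):           # (B, 256, 1024)
        return self.proj(patch_feats)          # (B, 256, 4096)

class QueryResampler(nn.Module):
    """Simplified Q-Former-style resampler: 32 learnable queries cross-attend to
    the patch features, compressing the image into 32 tokens (and possibly
    discarding fine-grained spatial detail)."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, VISION_DIM))
        self.attn = nn.MultiheadAttention(VISION_DIM, num_heads=8, batch_first=True)
        self.proj = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, patch_feats):                           # (B, 256, 1024)
        q = self.queries.expand(patch_feats.size(0), -1, -1)  # (B, 32, 1024)
        fused, _ = self.attn(q, patch_feats, patch_feats)     # (B, 32, 1024)
        return self.proj(fused)                               # (B, 32, 4096)
```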

To address these factors, a possible solution is to introduce large vision models such as SAM (Kirillov et al., 2023), which not only capture visual information more effectively but also excel at more challenging tasks such as image segmentation. For instance, LISA (Lai et al., 2023), building upon LLaVA, incorporates ViT-H SAM (Kirillov et al., 2023) as its visual backbone and introduces an additional vision decoder for generating masks; it thus inherits the language generation capability of MLLMs while gaining the visual perception ability to output segmentation masks given complex and implicit query texts. Building on this foundation, GLaMM (Rasheed et al., 2023) provides denser pixel-wise object grounding, i.e., it can accomplish multi-target segmentation, further enhancing visual perception. Additionally, u-LLaVA (Xu et al., 2023b) uses Vicuna and CLIP ViT-L/14 as the LLM backbone and visual encoder, respectively, and incorporates ViT-H SAM, Grounding DINO (Liu et al., 2023f), and Stable Diffusion (Rombach et al., 2022) as the segmentation, grounding, and in-painting modules, respectively, unifying multimodal tasks while improving the model's visual perception.

All of the above models build on MLLMs and add extra visual modules to improve visual perception and accomplish diverse visual tasks, which largely alleviates the limited perception ability and the difficulty in spatial localization of current MLLMs. However, LISA, GLaMM, and u-LLaVA are all general-purpose models, and we expect medical MLLMs with multiple capabilities such as segmentation, grounding, and in-painting to appear in the medical field in the future.

7.3. Training and Deployment Challenges

Although large-scale datasets and model parameters endow LLMs and MLLMs with powerful capabilities, they likewise increase the demand for computational resources, leading to high computational costs; for example, LLaMA-65B was trained on 2048 A100 GPUs for 21 days. While the common strategy for building medical LLMs and MLLMs is to fine-tune a general foundation model, this still requires substantial computational resources: MEDITRON-70B utilized 128 A100 GPUs, and even the smaller LLaVA-Med employed 8 A100 GPUs. It is therefore difficult for general hospitals to independently undertake the training and fine-tuning of medical LLMs and MLLMs, and they usually need to rely on external computing support. Furthermore, even after training and fine-tuning are complete, deployment and inference remain costly due to the large model scale (Qiu et al., 2023), making it extremely challenging for most hospitals to locally deploy and apply medical LLMs and MLLMs in practice. To enable training and deployment in hospitals with limited computational resources, this section discusses four solutions: optimizing the training process, reducing model parameters, modifying model architectures, and optimizing hardware devices.

The PEFT methods mentioned in Section 4.2.2 (Li and Liang, 2021; Houlsby et al., 2019; Hu et al., 2023, 2021; Lester et al., 2021; Zhao et al., 2023a; Liu et al., 2023g) address the high training cost of medical LLMs and MLLMs by keeping most of the pre-trained parameters frozen and updating only a small subset. However, PEFT only makes training efficient and does not ease deployment; for that problem, model lightweighting is a feasible solution (Chen et al., 2024; Chu et al., 2023; Yuan et al., 2023; Chu et al., 2024). For example, MobileVLM (Chu et al., 2023) is an MLLM customized for mobile scenarios that reduces the training and inference budget by shrinking LLaMA and designing an efficient projector; it can run on mobile devices while remaining competitive with other MLLMs on most tasks. Additionally, medical LLMs and MLLMs fine-tuned from general foundation models typically retain medically irrelevant knowledge stored across the model's parameters. Knowledge distillation can distill the medical knowledge of such models into a more compact model (Liu et al., 2024b), discarding medically irrelevant knowledge and thus reducing the parameter count, which is more conducive to deployment.
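
To illustrate the parameter-efficient idea behind methods such as LoRA (Hu et al., 2021), the following minimal sketch wraps a frozen linear layer with a trainable low-rank update; the rank, scaling, and initialization here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage sketch: wrap the attention projections of a frozen LLM with LoRALinear,
# then fine-tune only A and B on a (much smaller) medical dataset.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```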

Currently, almost all LLMs and MLLMs are built on the Transformer architecture, whose computational complexity grows quadratically with sequence length, resulting in low efficiency on long sequences. To fundamentally address the challenges of training and deploying medical LLMs and MLLMs, choosing architectures that are more efficient in computation and inference is a viable option (Peng et al., 2023a; Gu and Dao, 2023). For example, RWKV (Peng et al., 2023a) combines the efficient parallel training of Transformers with the efficient inference of RNNs, ensuring constant computational and memory complexity per token during inference while maintaining performance comparable to similarly scaled Transformer models. Furthermore, Mamba (Gu and Dao, 2023), based on the State Space Model (SSM), matches or exceeds Transformers of comparable scale in performance while achieving roughly five times higher inference throughput. Extending these computation- and inference-efficient architectures to general or medical LLMs and MLLMs would help overcome the current deployment difficulties.
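
The contrast in inference cost can be illustrated schematically: autoregressive attention must attend to an ever-growing key-value cache, whereas an RWKV- or SSM-style recurrence carries a fixed-size state. The sketch below is only a schematic comparison under these simplifying assumptions and does not implement either model.

```python
import torch

d, seq_len = 64, 1024

# Transformer-style decoding: each new token attends to all previous keys,
# so the cache and per-token cost grow with the sequence (O(n^2) overall).
keys, values = [], []
def attention_step(x):
    keys.append(x); values.append(x)
    K = torch.stack(keys); V = torch.stack(values)        # (t, d) -- grows every step
    w = torch.softmax(K @ x / d ** 0.5, dim=0)            # attend over all past tokens
    return w @ V

# Recurrent (RWKV/SSM-style) decoding: a fixed-size state is updated per token,
# so per-token cost and memory stay constant regardless of sequence length.
state = torch.zeros(d)
decay = 0.9  # illustrative scalar decay; real models use learned, per-channel dynamics
def recurrent_step(x):
    global state
    state = decay * state + x
    return state

for t in range(seq_len):
    x = torch.randn(d)
    attention_step(x)   # memory grows with t
    recurrent_step(x)   # memory stays O(d)
```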

In addition to improving computational and inference efficiency at the model level, further advances in specialized hardware accelerators are desired within the community (Miao et al., 2023). For example, NVIDIA's Hopper GPU (Choquette, 2023), coupled with the NVIDIA Grace CPU (Elster and Haugdahl, 2022) through the NVLink-C2C interconnect, increases communication speed between CPU and GPU by more than 7 times compared to PCIe 5.0, thereby enhancing the computational and inference efficiency of models at the hardware level.

7.4. Lack of Recency

Once medical LLMs and MLLMs are trained, the knowledge they have acquired becomes fixed. However, medical knowledge is constantly being updated, and the absence of new medical concepts and knowledge will exacerbate the inaccuracy and hallucination problems of the models; in particular, when encountering new terms that do not exist in the training corpus, the models will be unable to comprehend them (Thirunavukarasu et al., 2023). Therefore, the lack of recency will seriously hinder the deployment of medical LLMs and MLLMs in real-world applications.

To address the lack of recency caused by the offline learning of medical LLMs and MLLMs, continually updating parameters through fine-tuning to keep them synchronized with current human knowledge is a feasible solution (Wu et al., 2024b). While fine-tuning can inject new medical concepts and knowledge into the model, updating the parameters also introduces two challenges. The first is catastrophic forgetting, where the model forgets previously learned knowledge after acquiring new knowledge (French, 1999; Zhai et al., 2023). The second is negative forward transfer, wherein performance on unseen tasks deteriorates when learning new tasks (Zheng et al., 2024b). To address these issues, researchers have introduced model editing (Yao et al., 2023), for example by adding trainable parameters that correct erroneous responses caused by outdated knowledge while keeping the original parameters unchanged (Huang et al., 2023b; Hartvigsen et al., 2024), or by locating the parameters associated with certain knowledge in the model and updating them to integrate and edit the relevant new knowledge (Meng et al., 2022a, b; Li et al., 2024a). In addition to model editing, RAG can also be used to update the knowledge of medical LLMs and MLLMs by connecting the model to an information retrieval component, enabling it to retrieve relevant content from external knowledge bases as a reference (Li et al., 2023e; Sun et al., 2023) and thus generate more reliable responses, as in the sketch below.
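
A minimal retrieval-augmented generation loop for supplying up-to-date medical knowledge might look like the following sketch; the document snippets, the `embed` and `chat` placeholders, and the prompt wording are illustrative assumptions and are not tied to any specific system cited above.

```python
import numpy as np

# Toy in-memory knowledge base; in practice this would be an updatable index of
# guidelines, drug labels, or literature abstracts that is refreshed regularly.
documents = [
    "Hypothetical entry: drug X was approved in 2024 for condition Y ...",
    "Hypothetical entry: updated guideline recommends therapy Z as first-line ...",
]

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in any text-embedding model here")

def chat(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM chat-completion call here")

def rag_answer(question: str, top_k: int = 2) -> str:
    # Rank documents by cosine similarity to the question.
    doc_vecs = np.stack([embed(d) for d in documents])
    q_vec = embed(question)
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(documents[i] for i in np.argsort(-sims)[:top_k])
    # Ground the generation in the retrieved, up-to-date context.
    return chat(
        "Answer using only the retrieved context; say so if it is insufficient.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```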

7.5. Privacy and Security

Medical LLMs and MLLMs are trained on large-scale medical corpora in which some data, such as EHRs and doctor-patient dialogues, may contain patient-identifying information such as names, phone numbers, and email addresses; this information can be extracted from LLMs or MLLMs with direct prompts (Carlini et al., 2021), leading to serious privacy and security concerns. Despite the extra efforts developers have made toward conversational safety, such as specialized SFT and RLHF for security, it is still possible to obtain personal private data from the training data using tactics such as multi-step jailbreaking prompts (Li et al., 2023b).

To facilitate the practical implementation of medical LLMs and MLLMs, protecting patients' personal privacy is crucial. Currently, common practice is to either remove personal information from datasets (Li et al., 2023e; Liu et al., 2023a) or add controlled noise or randomness to the data so that privacy is safeguarded without compromising data analysis (Turgay et al., 2023). In addition, using high-quality synthetic data generated by ChatGPT or GPT-4 for training (Tang et al., 2023) ensures both the controllability and the diversity of the training datasets while mitigating the risk of privacy breaches. Furthermore, we expect relevant legal regulations to be further refined, oversight of the acquisition and usage of training datasets to be strengthened, and users to be prohibited from obtaining patient privacy data from models by any means.
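
As a simple illustration of the two data-side measures above, the following sketch redacts obvious identifiers with regular expressions and perturbs a numeric attribute with Laplace noise in the spirit of differential privacy; the patterns and the noise scale are illustrative assumptions, and real de-identification pipelines are considerably more thorough.

```python
import re
import numpy as np

def redact_pii(text: str) -> str:
    """Replace obvious identifiers (emails, phone-like numbers) with placeholders."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\+?\d[\d\-\s]{7,}\d", "[PHONE]", text)
    return text

def laplace_perturb(value: float, sensitivity: float = 1.0, epsilon: float = 1.0) -> float:
    """Add Laplace noise scaled by sensitivity/epsilon (differential-privacy style)."""
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical record used only to demonstrate the two operations.
record = "Contact John at john.doe@example.com or +1 555-123-4567; age 47."
print(redact_pii(record))
print(laplace_perturb(47.0, sensitivity=1.0, epsilon=0.5))
```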

7.6. Bias and Toxicity

Large-scale corpora, especially data obtained from the internet, inevitably contain a variety of biased viewpoints, and LLMs and MLLMs may learn these biases (Qiu et al., 2023; Ferrara, 2023), such as biases related to race (Yang et al., 2024a), gender (Kotek et al., 2023), and politics (Liu et al., 2022). At the same time, language models may produce toxic responses, such as aggressive and hurtful views, and specific groups are more likely to be targeted because of these biases (Deshpande et al., 2023). Such biases and toxicity carry over to medical LLMs and MLLMs, posing potential harms and threats to patients, with possibly serious consequences for patients with mental illness.

Reducing bias in the training data is the most fundamental way to address bias in models. Specifically, carefully curating and screening more diverse, balanced, and representative training data ensures that models learn from a broader range of perspectives and experiences, leading to a more comprehensive understanding and reduced bias (Ferrara, 2023). For model toxicity, utilizing empathetic data has been shown to reduce the output of toxic content (Lahnala et al., 2022). However, re-screening the pre-training datasets and re-training a less biased and less toxic model is expensive, so curating high-quality anti-bias and anti-toxicity datasets for the SFT and RLHF phases is a far more cost-effective approach. Beyond training, the evaluation of model bias and toxicity also needs further improvement: designing comprehensive benchmarks for bias and toxicity facilitates the detection of these issues and enables developers to review models regularly (Cui et al., 2023; Ferrara, 2023).

8. Conclusion

In recent years, the development of LLMs has led to breakthroughs in NLP, and researchers have taken a significant step toward AGI by extending LLMs to the multimodal domain to form MLLMs. Meanwhile, the rapid development and strong performance of LLMs and MLLMs have facilitated the emergence of a large number of medical LLMs and MLLMs. To help researchers and medical practitioners understand the current technical details and developmental status of medical LLMs and MLLMs, this survey centers on the paradigm shift of LLMs and MLLMs and delineates the entire development background, emphasizing the evolution from initial feature engineering to structure engineering and objective engineering, and now to prompt engineering and data engineering. To furnish comprehensive foundational knowledge, this survey summarizes the mainstream architectures of current LLMs and MLLMs and assembles a list of existing medical LLMs and MLLMs. Furthermore, it offers a comprehensive guide encompassing existing medical datasets, model construction methods, evaluation methods, and usage tips to assist researchers and medical practitioners in developing, deploying, and utilizing their own medical LLMs and MLLMs. Moreover, the survey explores the application prospects of medical LLMs and MLLMs in medical diagnosis, clinical report generation, medical education, mental health services, medical language translation, and surgical assistance, and analyzes their great potential in these clinical applications. Despite the notable achievements of medical LLMs and MLLMs, several significant challenges and limitations persist, hindering their practical deployment in clinical settings. Consequently, this survey discusses the challenges faced by current medical LLMs and MLLMs, including hallucinations, visual perception limitations, training and deployment challenges, lack of recency, privacy and security, and bias and toxicity, and provides potential solutions to address these issues, thus facilitating the practical application of subsequent medical LLMs and MLLMs.

In conclusion, this survey provides a comprehensive analysis of medical LLMs and MLLMs, from background and principles to applications, aiming to accelerate the development of LLM- and MLLM-based products in clinical medicine and further promote the integration of AI and the medical field. We expect more intelligent AI products based on LLMs and MLLMs in the future, such as medical agents and embodied intelligence, to further drive innovation in medical AI. Finally, we emphasize that medical LLMs and MLLMs are intended to enhance the quality of medical services and physician efficiency and to alleviate workload, rather than to replace healthcare professionals.

9. Acknowledgments

This work was sponsored by the Natural Science Foundation of Chongqing (Grant Nos. CSTB2023TIAD-STX0020, CSTB2023NSCQ-LZX0068, and CSTB2022NSCQ-MSX0837). This study does not involve any ethical issues.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Ahn (2023) Sangzin Ahn. 2023. The impending impacts of large language models on medical education. Korean Journal of Medical Education 35, 1 (2023), 103.
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35 (2022), 23716–23736.
  • Albahri et al. (2020) OS Albahri, AA Zaidan, AS Albahri, BB Zaidan, Karrar Hameed Abdulkareem, ZT Al-Qaysi, AH Alamoodi, AM Aleesa, MA Chyad, RM Alesa, et al. 2020. Systematic review of artificial intelligence techniques in the detection and classification of COVID-19 medical images in terms of evaluation and benchmarking: Taxonomy analysis, challenges, future solutions and methodological aspects. Journal of infection and public health 13, 10 (2020), 1381–1396.
  • Ali et al. (2023) Stephen R Ali, Thomas D Dobbs, Hayley A Hutchings, and Iain S Whitaker. 2023. Using ChatGPT to write patient clinic letters. The Lancet Digital Health 5, 4 (2023), e179–e181.
  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403 (2023).
  • Anthropic (2024) Anthropic. 2024. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family.
  • Awadalla et al. (2023) Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023).
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022).
  • Bakator and Radosav (2018) Mihalj Bakator and Dragica Radosav. 2018. Deep learning and medical diagnosis: A review of literature. Multimodal Technologies and Interaction 2, 3 (2018), 47.
  • Barua (2024) Ranjit Barua. 2024. Innovations in Minimally Invasive Surgery: The Rise of Smart Flexible Surgical Robots. In Emerging Technologies for Health Literacy and Medical Practice. IGI Global, 110–131.
  • Basaldella et al. (2020) Marco Basaldella, Fangyu Liu, Ehsan Shareghi, and Nigel Collier. 2020. COMETA: A corpus for medical entity linking in the social media. arXiv preprint arXiv:2010.03295 (2020).
  • Ben Abacha and Demner-Fushman (2019) Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC bioinformatics 20 (2019), 1–23.
  • Bhayana (2024) Rajesh Bhayana. 2024. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 310, 1 (2024), e232756.
  • Bodenreider (2004) Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), D267–D270.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Byambasuren et al. (2019) Odma Byambasuren, Yunfei Yang, Zhifang Sui, Damai Dai, Baobao Chang, Sujian Li, and Hongying Zan. 2019. Preliminary study on the construction of Chinese medical knowledge graph. Journal of Chinese Information Processing 33, 10 (2019), 1–9.
  • Caffagni et al. (2024) Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024. The (R) Evolution of Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2402.12451 (2024).
  • Cao et al. (2023) Yihan Cao, Yanbin Kang, and Lichao Sun. 2023. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290 (2023).
  • Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
  • Caruana and Niculescu-Mizil (2006) Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning. 161–168.
  • Casper et al. (2023) Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217 (2023).
  • Chaves and Gerosa (2021) Ana Paula Chaves and Marco Aurelio Gerosa. 2021. How should my chatbot interact? A survey on social characteristics in human–chatbot interaction design. International Journal of Human–Computer Interaction 37, 8 (2021), 729–758.
  • Chen et al. (2023c) Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023c. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701 (2023).
  • Chen et al. (2023b) Qiuhui Chen, Xinyue Hu, Zirui Wang, and Yi Hong. 2023b. Medblip: Bootstrapping language-image pre-training from 3d medical images and texts. arXiv preprint arXiv:2305.10799 (2023).
  • Chen et al. (2024) Xiaodong Chen, Yuxuan Hu, and Jing Zhang. 2024. Compressing large language models by streamlining the unimportant layer. arXiv preprint arXiv:2403.19135 (2024).
  • Chen et al. (2023d) Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. 2023d. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199 (2023).
  • Chen et al. (2022) Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022).
  • Chen et al. (2023e) Yirong Chen, Zhenyu Wang, Xiaofen Xing, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, Xiangmin Xu, et al. 2023e. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt. arXiv preprint arXiv:2310.15896 (2023).
  • Chen et al. (2023f) Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. 2023f. Soulchat: Improving llms’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023. 1170–1183.
  • Chen et al. (2023a) Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. 2023a. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079 (2023).
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
  • Choquette (2023) Jack Choquette. 2023. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro (2023).
  • Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research 24, 240 (2023), 1–113.
  • Chu et al. (2023) Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886 (2023).
  • Chu et al. (2024) Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. 2024. MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024).
  • Clough et al. (2024) Reece Alexander James Clough, William Anthony Sparkes, Oliver Thomas Clough, Joshua Thomas Sykes, Alexander Thomas Steventon, and Kate King. 2024. Transforming healthcare documentation: harnessing the potential of AI to generate discharge summaries. BJGP open (2024).
  • Cui et al. (2023) Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, and Tingwen Liu. 2023. Fft: Towards harmlessness evaluation and analysis for llms with factuality, fairness, toxicity. arXiv preprint arXiv:2311.18580 (2023).
  • Cunningham et al. (2008) Pádraig Cunningham, Matthieu Cord, and Sarah Jane Delany. 2008. Supervised learning. In Machine learning techniques for multimedia: case studies on organization and retrieval. Springer, 21–49.
  • Dai et al. (2022) Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2022. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. arXiv preprint arXiv:2212.10559 (2022).
  • Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 36 (2024).
  • De Choudhury et al. (2023) Munmun De Choudhury, Sachin R Pendse, and Neha Kumar. 2023. Benefits and harms of large language models in digital mental health. arXiv preprint arXiv:2311.14693 (2023).
  • Dehghani et al. (2023) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. 2023. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning. PMLR, 7480–7512.
  • Demner-Fushman et al. (2016) Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. 2016. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23, 2 (2016), 304–310.
  • Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335 (2023).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023).
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360 (2021).
  • Elaraby et al. (2023) Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, and Shizhu Liu. 2023. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv preprint arXiv:2308.11764 (2023).
  • Elster and Haugdahl (2022) Anne C Elster and Tor A Haugdahl. 2022. Nvidia hopper gpu and grace cpu highlights. Computing in Science & Engineering 24, 2 (2022), 95–100.
  • Fang et al. (2023) Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19358–19369.
  • Ferrara (2023) Emilio Ferrara. 2023. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738 (2023).
  • French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3, 4 (1999), 128–135.
  • Gao et al. (2023) Weihao Gao, Zhuo Deng, Zhiyuan Niu, Fuju Rong, Chucheng Chen, Zheng Gong, Wenze Zhang, Daimin Xiao, Fang Li, Zhenjie Cao, et al. 2023. Ophglm: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue. arXiv preprint arXiv:2306.12174 (2023).
  • Gema et al. (2023) Aryo Gema, Luke Daines, Pasquale Minervini, and Beatrice Alex. 2023. Parameter-efficient fine-tuning of llama for the clinical domain. arXiv preprint arXiv:2307.03042 (2023).
  • Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).
  • Gu et al. (2021) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 3, 1 (2021), 1–23.
  • Han et al. (2023c) Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. 2023c. Onellm: One framework to align all modalities with language. arXiv preprint arXiv:2312.03700 (2023).
  • Han et al. (2023a) Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. 2023a. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247 (2023).
  • Han et al. (2023b) Zhiyong Han, Fortunato Battaglia, Abinav Udaiyar, Allen Fooks, and Stanley R Terlecky. 2023b. An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. Medical Teacher (2023), 1–8.
  • Hartvigsen et al. (2024) Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2024. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems 36 (2024).
  • He et al. (2024) Jinlong He, Pengfei Li, Gang Liu, Zixu Zhao, and Shenjun Zhong. 2024. PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering. arXiv preprint arXiv:2401.02797 (2024).
  • He et al. (2023) Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, and Erik Cambria. 2023. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. arXiv preprint arXiv:2310.05694 (2023).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • He et al. (2020a) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020a. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020).
  • He et al. (2020b) Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. 2020b. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020).
  • Hendricks et al. (2021) Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. 2021. Decoupling the role of data, attention, and losses in multimodal transformers. Transactions of the Association for Computational Linguistics 9 (2021), 570–585.
  • Herrett et al. (2015) Emily Herrett, Arlene M Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd Van Staa, and Liam Smeeth. 2015. Data resource profile: clinical practice research datalink (CPRD). International journal of epidemiology 44, 3 (2015), 827–836.
  • Heston and Khun (2023) Thomas F Heston and Charya Khun. 2023. Prompt engineering in medical education. International Medical Education 2, 3 (2023), 198–205.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International conference on machine learning. PMLR, 2790–2799.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Hu et al. (2023) Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Ka-Wei Lee. 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933 (2023).
  • Huang et al. (2023a) Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. 2023a. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine 29, 9 (2023), 2307–2316.
  • Huang et al. (2023b) Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023b. Transformer-patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785 (2023).
  • Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 590–597.
  • Jaegle et al. (2021) Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning. PMLR, 4651–4664.
  • Javaid et al. (2023) Mohd Javaid, Abid Haleem, and Ravi Pratap Singh. 2023. ChatGPT for healthcare services: An emerging stage for an innovative perspective. BenchCouncil Transactions on Benchmarks, Standards and Evaluations 3, 1 (2023), 100105.
  • Ji et al. (2021) Shaoxiong Ji, Tianlin Zhang, Luna Ansari, Jie Fu, Prayag Tiwari, and Erik Cambria. 2021. Mentalbert: Publicly available pretrained language models for mental healthcare. arXiv preprint arXiv:2110.15621 (2021).
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
  • Jian et al. (2024) Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2024. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. Advances in Neural Information Processing Systems 36 (2024).
  • Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 14 (2021), 6421.
  • Jin et al. (2019) Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. 2019. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019).
  • Johnson et al. (2023) Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. 2023. MIMIC-IV, a freely accessible electronic health record dataset. Scientific data 10, 1 (2023), 1.
  • Johnson et al. (2019) Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. 2019. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6, 1 (2019), 317.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data 3, 1 (2016), 1–9.
  • Kaddour et al. (2023) Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169 (2023).
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
  • Karabacak et al. (2023) Mert Karabacak, Burak Berksu Ozkara, Konstantinos Margetis, Max Wintermark, and Sotirios Bisdas. 2023. The advent of generative language models in medical education. JMIR Medical Education 9 (2023), e48163.
  • Khan (2023) Sal Khan. 2023. Harnessing GPT-4 so that all students benefit. A nonprofit approach for equal access. https://blog.khanacademy.org/harnessing-ai-so-that-all-students-benefit-a-nonprofit-approach-for-equal-access/. Khan Academy Blog (2023).
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4015–4026.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
  • Kononenko (2001) Igor Kononenko. 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in medicine 23, 1 (2001), 89–109.
  • Kotek et al. (2023) Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference. 12–24.
  • Kung et al. (2023) Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS digital health 2, 2 (2023), e0000198.
  • Lahnala et al. (2022) Allison Lahnala, Charles Welch, Béla Neuendorf, and Lucie Flek. 2022. Mitigating toxic degeneration with empathetic data: Exploring the relationship between toxicity and empathy. arXiv preprint arXiv:2205.07233 (2022).
  • Lai et al. (2023) Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. 2023. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023).
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
  • Lau et al. (2018) Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5, 1 (2018), 1–10.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436–444.
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
  • Lee et al. (2023a) Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023a. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine 388, 13 (2023), 1233–1239.
  • Lee et al. (2023b) Seongyun Lee, Sue Hyun Park, Yongrae Jo, and Minjoon Seo. 2023b. Volcano: mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362 (2023).
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
  • Li et al. (2023a) Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. 2023a. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020 1, 2 (2023), 2.
  • Li et al. (2024b) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2024b. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36 (2024).
  • Li et al. (2023b) Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023b. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197 (2023).
  • Li et al. (2015) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055 (2015).
  • Li et al. (2023d) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023d. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742.
  • Li et al. (2021) Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34 (2021), 9694–9705.
  • Li et al. (2023f) Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023f. Huatuo-26m, a large-scale chinese medical qa dataset. arXiv preprint arXiv:2305.01526 (2023).
  • Li et al. (2023c) KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023c. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023).
  • Li et al. (2024a) Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, and Jie Yu. 2024a. Pmet: Precise model editing in a transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18564–18572.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
  • Li et al. (2023e) Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. 2023e. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus 15, 6 (2023).
  • Liévin et al. (2023) Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. 2023. Can large language models reason about medical questions? Patterns (2023).
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out. 74–81.
  • Lin et al. (2023) Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Pmc-clip: Contrastive language-image pre-training using biomedical documents. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 525–536.
  • Liu et al. (2021) Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, 1650–1654.
  • Liu et al. (2023h) Fenglin Liu, Tingting Zhu, Xian Wu, Bang Yang, Chenyu You, Chenyang Wang, Lei Lu, Zhangdaihong Liu, Yefeng Zheng, Xu Sun, et al. 2023h. A medical multimodal large language model for future pandemics. NPJ Digital Medicine 6, 1 (2023), 226.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023b. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023).
  • Liu et al. (2024a) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024a. Visual instruction tuning. Advances in neural information processing systems 36 (2024).
  • Liu and Yu (2005) Huan Liu and Lei Yu. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on knowledge and data engineering 17, 4 (2005), 491–502.
  • Liu et al. (2023d) Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua. 2023d. Qilin-med-vl: Towards chinese large vision-language model for general healthcare. arXiv preprint arXiv:2310.17956 (2023).
  • Liu et al. (2023a) June M Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023a. Chatcounselor: A large language models for mental health support. arXiv preprint arXiv:2309.15461 (2023).
  • Liu et al. (2023e) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023e. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
  • Liu et al. (2024b) Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Zijian Zhang, Feng Tian, and Yefeng Zheng. 2024b. Large Language Model Distilling Medication Recommendation Model. arXiv preprint arXiv:2402.02803 (2024).
  • Liu et al. (2022) Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, and Soroush Vosoughi. 2022. Quantifying and alleviating political bias in language models. Artificial Intelligence 304 (2022), 103654.
  • Liu et al. (2023f) Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023f. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023).
  • Liu et al. (2023g) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2023g. GPT understands, too. AI Open (2023).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al. (2023c) Zhengliang Liu, Yiwei Li, Peng Shu, Aoxiao Zhong, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Jie Luo, Cheng Chen, et al. 2023c. Radiology-llama2: Best-in-class large language model for radiology. arXiv preprint arXiv:2309.06419 (2023).
  • Lu et al. (2022) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35 (2022), 2507–2521.
  • Luo et al. (2023) Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2023. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621 (2023).
  • Lyu et al. (2023) Qing Lyu, Josh Tan, Michael E Zapadka, Janardhana Ponnatapura, Chuang Niu, Kyle J Myers, Ge Wang, and Christopher T Whitlow. 2023. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Visual Computing for Industry, Biomedicine, and Art 6, 1 (2023), 9.
  • Madaan et al. (2024) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2024).
  • Meng et al. (2022a) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022a. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.
  • Meng et al. (2022b) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2022b. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229 (2022).
  • Meng et al. (2020) Xing Meng, Craig H Ganoe, Ryan T Sieberg, Yvonne Y Cheung, and Saeed Hassanpour. 2020. Self-supervised contextual language representation of radiology reports to improve the identification of communication urgency. AMIA Summits on Translational Science Proceedings 2020 (2020), 413.
  • Meskó (2023) Bertalan Meskó. 2023. Prompt engineering as an important emerging skill for medical professionals: tutorial. Journal of Medical Internet Research 25 (2023), e50638.
  • Miao et al. (2023) Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. 2023. Towards efficient generative large language model serving: A survey from algorithms to systems. arXiv preprint arXiv:2312.15234 (2023).
  • Moor et al. (2023a) Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. 2023a. Foundation models for generalist medical artificial intelligence. Nature 616, 7956 (2023), 259–265.
  • Moor et al. (2023b) Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. 2023b. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H). PMLR, 353–367.
  • Nori et al. (2023a) Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023a. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023).
  • Nori et al. (2023b) Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. 2023b. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452 (2023).
  • Och et al. (2004) Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, Anoop Sarkar, Kenji Yamada, Alexander Fraser, Shankar Kumar, Libin Shen, David A Smith, Katherine Eng, et al. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004. 161–168.
  • Omiye et al. (2023) Jesutofunmi A Omiye, Haiwen Gui, Shawheen J Rezaei, James Zou, and Roxana Daneshjou. 2023. Large language models in medicine: the potentials and pitfalls. arXiv preprint arXiv:2309.00087 (2023).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
  • Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning. PMLR, 248–260.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318.
  • Patel and Lam (2023) Sajan B Patel and Kyle Lam. 2023. ChatGPT: the future of discharge summaries? The Lancet Digital Health 5, 3 (2023), e107–e108.
  • Pelka et al. (2018) Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and Christoph M Friedrich. 2018. Radiology objects in context (roco): a multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3. Springer, 180–189.
  • Peng et al. (2023a) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. 2023a. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048 (2023).
  • Peng et al. (2023b) Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al. 2023b. A study of generative large language model for medical research and healthcare. NPJ Digital Medicine 6, 1 (2023), 210.
  • Prince et al. (2007) Martin Prince, Vikram Patel, Shekhar Saxena, Mario Maj, Joanna Maselko, Michael R Phillips, and Atif Rahman. 2007. No health without mental health. The lancet 370, 9590 (2007), 859–877.
  • Qiu et al. (2023) Jianing Qiu, Lin Li, Jiankai Sun, Jiachuan Peng, Peilun Shi, Ruiyang Zhang, Yinzhao Dong, Kyle Lam, Frank P-W Lo, Bo Xiao, et al. 2023. Large ai models in health informatics: Applications, challenges, and the future. IEEE Journal of Biomedical and Health Informatics (2023).
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21, 140 (2020), 1–67.
  • Rasheed et al. (2023) Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. 2023. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356 (2023).
  • Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922 (2023).
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Seenivasan et al. (2023) Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, and Hongliang Ren. 2023. SurgicalGPT: End-to-end language-vision GPT for visual question answering in surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 281–290.
  • Shu et al. (2023) Chang Shu, Baian Chen, Fangyu Liu, Zihao Fu, Ehsan Shareghi, and Nigel Collier. 2023. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities. https://github.com/cambridgeltl/visual-med-alpaca.
  • Shuster et al. (2021) Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567 (2021).
  • Singhal et al. (2023a) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023a. Large language models encode clinical knowledge. Nature 620, 7972 (2023), 172–180.
  • Singhal et al. (2023b) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023b. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).
  • Siu (2023) Sai Cheong Siu. 2023. Chatgpt and GPT-4 for professional translators: Exploring the potential of large language models in translation. Available at SSRN 4448091 (2023).
  • Song et al. (2023) Shezheng Song, Xiaopeng Li, and Shasha Li. 2023. How to bridge the gap between modalities: A comprehensive survey on multimodal large language model. arXiv preprint arXiv:2311.07594 (2023).
  • Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize from human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021.
  • Stock et al. (2023) Anna Stock, Stephan Schlögl, and Aleksander Groth. 2023. Tell me, what are you most afraid of? Exploring the Effects of Agent Representation on Information Disclosure in Human-Chatbot Interaction. In International Conference on Human-Computer Interaction. Springer, 179–191.
  • Subramanian et al. (2020) Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Hannaneh Hajishirzi. 2020. Medicat: A dataset of medical images, captions, and textual references. arXiv preprint arXiv:2010.06000 (2020).
  • Sun et al. (2023) Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Zhongyi Shui, Xiaoxuan Yu, Yizhi Zhao, Honglin Li, Yunlong Zhang, Ruojia Zhao, et al. 2023. Pathasst: Redefining pathology through generative foundation ai assistant for pathology. arXiv preprint arXiv:2305.15072 (2023).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
  • Szolovits et al. (1988) Peter Szolovits, Ramesh S Patil, and William B Schwartz. 1988. Artificial intelligence in medical diagnosis. Annals of internal medicine 108, 1 (1988), 80–87.
  • Tan et al. (2023) Yang Tan, Mingchen Li, Zijie Huang, Huiqun Yu, and Guisheng Fan. 2023. Medchatzh: a better medical adviser learns from better instructions. arXiv preprint arXiv:2309.01114 (2023).
  • Tang et al. (2023) Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360 (2023).
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
  • Tay et al. (2022) Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. 2022. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131 (2022).
  • Team (2023) Duolingo Team. 2023. Introducing Duolingo Max, a learning experience powered by GPT-4. Retrieved March 15, 2023.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
  • Thapa and Adhikari (2023) Surendrabikram Thapa and Surabhi Adhikari. 2023. ChatGPT, bard, and large language models for biomedical research: opportunities and pitfalls. Annals of biomedical engineering 51, 12 (2023), 2647–2651.
  • Thawkar et al. (2023) Omkar Thawkar, Abdelrahman Shaker, Sahal Shaji Mullappilly, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, and Fahad Shahbaz Khan. 2023. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 (2023).
  • Thirunavukarasu et al. (2023) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940.
  • Thomas et al. (2009) Kathleen C Thomas, Alan R Ellis, Thomas R Konrad, Charles E Holzer, and Joseph P Morrissey. 2009. County-level estimates of mental health professional shortage in the United States. Psychiatric services 60, 10 (2009), 1323–1328.
  • Tiu et al. (2022) Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. 2022. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering 6, 12 (2022), 1399–1406.
  • Toma et al. (2023) Augustin Toma, Patrick R Lawler, Jimmy Ba, Rahul G Krishnan, Barry B Rubin, and Bo Wang. 2023. Clinical camel: An open-source expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031 (2023).
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Tsai et al. (2024) Yun-Da Tsai, Ting-Yu Yen, Pei-Fu Guo, Zhe-Yan Li, and Shou-De Lin. 2024. Text-centric Alignment for Multi-Modality Learning. arXiv preprint arXiv:2402.08086 (2024).
  • Tu et al. (2024a) Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. 2024a. Towards generalist biomedical ai. NEJM AI 1, 3 (2024), AIoa2300138.
  • Tu et al. (2024b) Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, et al. 2024b. Towards conversational diagnostic ai. arXiv preprint arXiv:2401.05654 (2024).
  • Turgay et al. (2023) Safiye Turgay, İlker İlter, et al. 2023. Perturbation Methods for Protecting Data Privacy: A Review of Techniques and Applications. Automation and Machine Learning 4, 2 (2023), 31–41.
  • Umapathi et al. (2023) Logesh Kumar Umapathi, Ankit Pal, and Malaikannan Sankarasubbu. 2023. Med-halt: Medical domain hallucination test for large language models. arXiv preprint arXiv:2307.15343 (2023).
  • van Heerden et al. (2023) Alastair C van Heerden, Julia R Pozuelo, and Brandon A Kohrt. 2023. Global mental health services and the impact of artificial intelligence–powered large language models. JAMA psychiatry 80, 7 (2023), 662–664.
  • Van Sonsbeek et al. (2023) Tom Van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees GM Snoek, and Marcel Worring. 2023. Open-ended medical visual question answering through prefix tuning of language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 726–736.
  • Van Veen et al. (2023a) Dave Van Veen, Cara Van Uden, Maayane Attias, Anuj Pareek, Christian Bluethgen, Malgorzata Polacin, Wah Chiu, Jean-Benoit Delbrouck, Juan Manuel Zambrano Chaves, Curtis P Langlotz, et al. 2023a. RadAdapt: Radiology report summarization via lightweight domain adaptation of large language models. arXiv preprint arXiv:2305.01146 (2023).
  • Van Veen et al. (2023b) Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerova, et al. 2023b. Clinical text summarization: Adapting large language models can outperform human experts. Research Square (2023).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4566–4575.
  • Wang et al. (2023f) Benyou Wang, Qianqian Xie, Jiahuan Pei, Zhihong Chen, Prayag Tiwari, Zhao Li, and Jie Fu. 2023f. Pre-trained language models in biomedical domain: A systematic survey. Comput. Surveys 56, 3 (2023), 1–52.
  • Wang et al. (2023g) Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. 2023g. ClinicalGPT: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv preprint arXiv:2306.09968 (2023).
  • Wang et al. (2023d) Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. 2023d. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975 (2023).
  • Wang et al. (2023b) Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023b. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048 (2023).
  • Wang et al. (2023e) Jiaqi Wang, Enze Shi, Sigang Yu, Zihao Wu, Chong Ma, Haixing Dai, Qiushi Yang, Yanqing Kang, Jinru Wu, Huawen Hu, et al. 2023e. Prompt engineering for healthcare: Methodologies and applications. arXiv preprint arXiv:2304.14670 (2023).
  • Wang et al. (2020) Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Douglas Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, et al. 2020. Cord-19: The covid-19 open research dataset. ArXiv (2020).
  • Wang et al. (2023a) Rongsheng Wang, Yaofei Duan, Junrong Li, Patrick Pang, and Tao Tan. 2023a. XrayGLM: The first Chinese Medical Multimodal Model that Chest Radiographs Summarization. https://github.com/WangRongsheng/XrayGLM.
  • Wang et al. (2023h) Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. 2023h. Chatcad: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257 (2023).
  • Wang et al. (2022c) Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. 2022c. What language model architecture and pretraining objective works best for zero-shot generalization?. In International Conference on Machine Learning. PMLR, 22964–22984.
  • Wang et al. (2022d) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022d. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
  • Wang et al. (2022a) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022a. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2022).
  • Wang et al. (2022b) Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. 2022b. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191 (2022).
  • Wang et al. (2023c) Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. 2023c. R2gengpt: Radiology report generation with frozen llms. Meta-Radiology 1, 3 (2023), 100033.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021).
  • Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
  • Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • White et al. (2023) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023).
  • Wu et al. (2024a) Chaoyi Wu, Weixiong Lin, Xiaoman Zhang, Ya Zhang, Weidi Xie, and Yanfeng Wang. 2024a. PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association (2024), ocae045.
  • Wu et al. (2023) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023. Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463 (2023).
  • Wu et al. (2024b) Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, and Gholamreza Haffari. 2024b. Continual learning for large language models: A survey. arXiv preprint arXiv:2402.01364 (2024).
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  • Xiao et al. (2023) Hanguang Xiao, Li Li, Qiyuan Liu, Xiuhong Zhu, and Qihang Zhang. 2023. Transformers in medical image segmentation: A review. Biomedical Signal Processing and Control 84 (2023), 104791.
  • Xiong et al. (2023) Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Linlin Huang, Qian Wang, and Dinggang Shen. 2023. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097 (2023).
  • Xu et al. (2023a) Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023a. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196 (2023).
  • Xu et al. (2023b) Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, and Yaqian Li. 2023b. u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model. arXiv preprint arXiv:2311.05348 (2023).
  • Yang et al. (2023e) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023e. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023).
  • Yang et al. (2023c) Bang Yang, Asif Raza, Yuexian Zou, and Tong Zhang. 2023c. Customizing general-purpose foundation models for medical report generation. arXiv preprint arXiv:2306.05642 (2023).
  • Yang et al. (2023d) Guoxing Yang, Jianyu Shi, Zan Wang, Xiaohong Liu, and Guangyu Wang. 2023d. TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine. arXiv preprint arXiv:2311.01786 (2023).
  • Yang et al. (2023a) Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. 2023a. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data (2023).
  • Yang et al. (2024b) Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, and Hongying Zan. 2024b. Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19368–19376.
  • Yang et al. (2022) Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B Costa, Mona G Flores, et al. 2022. A large language model for electronic health records. NPJ digital medicine 5, 1 (2022), 194.
  • Yang et al. (2024a) Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, and Zhiyong Lu. 2024a. Unmasking and Quantifying Racial Bias of Large Language Models in Medical Report Generation. arXiv preprint arXiv:2401.13867 (2024).
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).
  • Yang et al. (2023b) Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421 (2023).
  • Yang et al. (2023f) Zhichao Yang, Zonghai Yao, Mahbuba Tasmin, Parth Vashisht, Won Seok Jang, Feiyun Ouyang, Beining Wang, Dan Berlowitz, and Hong Yu. 2023f. Performance of multimodal gpt-4v on usmle with image: Potential for imaging diagnostic support with explanations. medRxiv (2023).
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024).
  • Yao et al. (2023) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172 (2023).
  • Ye et al. (2023) Qichen Ye, Junling Liu, Dading Chong, Peilin Zhou, Yining Hua, and Andrew Liu. 2023. Qilin-med: Multi-stage knowledge injection advanced medical large language model. arXiv preprint arXiv:2310.09089 (2023).
  • Yin et al. (2023) Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023).
  • Yuan et al. (2023) Zhengqing Yuan, Zhaoxu Li, and Lichao Sun. 2023. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862 (2023).
  • Zellers et al. (2021) Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021. Merlot: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems 34 (2021), 23634–23651.
  • Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022).
  • Zeng et al. (2020) Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, et al. 2020. MedDialog: Large-scale medical dialogue datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9241–9250.
  • Zhai et al. (2023) Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. 2023. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313 (2023).
  • Zhang et al. (2024b) Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. 2024b. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601 (2024).
  • Zhang et al. (2023a) Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. 2023a. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075 (2023).
  • Zhang et al. (2023b) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, et al. 2023b. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792 (2023).
  • Zhang et al. (2023f) Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. 2023f. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915 (2023).
  • Zhang et al. (2018) Sheng Zhang, Xin Zhang, Hui Wang, Lixiang Guo, and Shanshan Liu. 2018. Multi-scale attentive interaction networks for chinese medical question answer selection. IEEE Access 6 (2018), 74061–74071.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019).
  • Zhang et al. (2023d) Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. 2023d. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558 (2023).
  • Zhang et al. (2023e) Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023e. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023).
  • Zhang et al. (2024a) Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, and Dequan Wang. 2024a. Data-centric foundation models in computational healthcare: A survey. arXiv preprint arXiv:2401.02458 (2024).
  • Zhang et al. (2023c) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023c. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
  • Zhang and Wallace (2015) Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).
  • Zhao et al. (2023a) Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, and Cihang Xie. 2023a. Tuning LayerNorm in Attention: Towards efficient multi-modal llm finetuning. arXiv preprint arXiv:2312.11420 (2023).
  • Zhao et al. (2023b) Zihao Zhao, Sheng Wang, Jinchen Gu, Yitao Zhu, Lanzhuju Mei, Zixu Zhuang, Zhiming Cui, Qian Wang, and Dinggang Shen. 2023b. Chatcad+: Towards a universal and reliable interactive cad using llms. arXiv preprint arXiv:2305.15964 (2023).
  • Zheng et al. (2024b) Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, and Huawen Feng. 2024b. Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer. arXiv preprint arXiv:2401.09181 (2024).
  • Zheng et al. (2024a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024a. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2024).
  • Zhou et al. (2023b) Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. 2023b. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419 (2023).
  • Zhou et al. (2024) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024).
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625 (2022).
  • Zhou et al. (2023a) Hongjian Zhou, Boyang Gu, Xinyu Zou, Yiru Li, Sam S Chen, Peilin Zhou, Junling Liu, Yining Hua, Chengfeng Mao, Xian Wu, et al. 2023a. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112 (2023).
  • Zhou et al. (2018) Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2018. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer, 3–11.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023).