GMAI-MMBench

A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Pengcheng Chen*1,2, Jin Ye*†1,3, Guoan Wang*1,4, Yanjun Li1,4,
Zhongying Deng5, Wei Li1,6, Tianbin Li1, Haodong Duan1, Ziyan Huang1,6, Yanzhou Su1, Benyou Wang7,8, Shaoting Zhang1, Bin Fu9, Jianfei Cai3, Bohan Zhuang3, Eric J Seibel2, Junjun He†1, Yu Qiao†1

1Shanghai AI Laboratory, 2University of Washington, 3Monash University, 4East China Normal University,
5University of Cambridge, 6Shanghai Jiao Tong University, 7The Chinese University of Hong Kong, Shenzhen, 8Shenzhen Research Institute of Big Data, 9Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences

*Core Contributors
†Correspondence to: jin.ye@monash.edu, hejunjun@pjlab.org.cn, qiaoyu@pjlab.org.cn
Figure: Overview of the GMAI-MMBench. The benchmark is meticulously designed for testing LVLMs’ abilities in real-world clinical scenarios with three key features: (1) Comprehensive medical knowledge: It consists of 285 diverse clinical-related datasets from worldwide sources, covering 39 modalities. (2) Well-categorized data structure: It features 18 clinical VQA tasks and 18 clinical departments, meticulously organized into a lexical tree. (3) Multi-perceptual granularity: Interactive methods span from image to region level, offering varying degrees of perceptual detail.

Abstract

Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals, and can be applied in various fields. In the medical field, LVLMs have a high potential to offer substantial assistance for diagnosis and treatment. Before they can be deployed, however, it is crucial to develop benchmarks that evaluate LVLMs’ effectiveness in various medical applications. Current benchmarks are often built upon specific academic literature, mainly focus on a single domain, and lack varying perceptual granularities. Thus, they face specific challenges, including limited clinical relevance, incomplete evaluations, and insufficient guidance for interactive LVLMs. To address these limitations, we developed GMAI-MMBench, the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities. It is constructed from 285 datasets across 39 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format. Additionally, we implemented a lexical tree structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of 53.96%, indicating significant room for improvement. Moreover, we identified five key insufficiencies in current cutting-edge LVLMs that need to be addressed to advance the development of better medical applications. We believe that GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

Figure: Examples of GMAI-MMBench. The benchmark covers a variety of clinical tasks, departments, and perceptual granularities from worldwide data sources.

GMAI-MMBench

Overview

We propose GMAI-MMBench, an innovative benchmark meticulously designed for the medical field, capable of providing comprehensive evaluations of LVLMs across various aspects of healthcare. We collect 285 datasets from public sources and hospitals, covering the medical imaging tasks of detection, classification, and segmentation, to form the data foundation of the benchmark; the datasets are listed in detail in the supplementary material. On top of this foundation, we design a reliable pipeline to generate question-answer pairs and organize them from different perspectives, with manual validation. Finally, we carefully select approximately 26K questions with varying levels of perceptual granularity from the manually validated cases to construct the final GMAI-MMBench.
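For concreteness, the sketch below shows the kind of single-choice VQA item and accompanying labels that such a pipeline could produce. The field names, the `to_single_choice` helper, and the example granularity values are illustrative assumptions, not the official GMAI-MMBench schema or tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VQAItem:
    """One single-choice VQA item plus the labels used to organize the benchmark.

    Field names are illustrative assumptions, not the official GMAI-MMBench schema.
    """
    image_path: str
    question: str
    options: List[str]   # e.g. ["A. Diabetic retinopathy", "B. Glaucoma", ...]
    answer: str          # correct option letter, e.g. "A"
    department: str      # one of the 18 clinical departments
    task: str            # one of the 18 clinical VQA tasks
    modality: str        # one of the 39 medical image modalities
    granularity: str     # perceptual granularity, e.g. image level or box level

def to_single_choice(image_path: str, question: str, label: str,
                     distractors: List[str], **labels) -> VQAItem:
    """Turn a ground-truth label plus distractor labels into a single-choice item."""
    choices = sorted(distractors + [label])
    letters = "ABCDEFG"
    options = [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return VQAItem(image_path=image_path, question=question, options=options,
                   answer=letters[choices.index(label)], **labels)
```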

Statistics

Lexical Tree

To make GMAI-MMBench more intuitive and user-friendly, we have systematized our labels and structured the entire dataset into a lexical tree, which is presented in HTML format. Users can freely select the test contents based on this lexical tree. We believe that this customizable benchmark will effectively guide the improvement of models in specific areas. For instance, as mentioned in the main text, most models perform poorly on bounding-box-level perception. Users can then update their models and test accuracy at the bounding-box level using this lexical tree, thereby achieving targeted improvements in model performance.

Figure: Example of how to use the Lexical Tree for customizing evaluations for the ophthalmology department and fundus photography modality. The process involves selecting the department (ophthalmology), choosing the modality (fundus photography), filtering questions using relevant keywords, and evaluating different models based on their accuracy in answering the filtered questions.
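In code, the customization shown above amounts to filtering benchmark items by their lexical-tree labels and scoring each model on the resulting subset. The sketch below assumes each item is a plain dict carrying `department`, `modality`, `task`, `granularity`, and `answer` keys; these names, and the `predict` callable, are hypothetical and not part of any released GMAI-MMBench loader.

```python
from typing import Callable, Dict, Iterable, List, Optional

def select_subset(items: Iterable[Dict],
                  department: Optional[str] = None,
                  modality: Optional[str] = None,
                  task: Optional[str] = None,
                  granularity: Optional[str] = None) -> List[Dict]:
    """Filter benchmark items by lexical-tree labels; None means 'any value'."""
    wanted = {"department": department, "modality": modality,
              "task": task, "granularity": granularity}
    return [item for item in items
            if all(v is None or item.get(k) == v for k, v in wanted.items())]

def accuracy(items: List[Dict], predict: Callable[[Dict], str]) -> float:
    """Score a model's option-letter predictions (e.g. "A") against the ground truth."""
    if not items:
        return 0.0
    return sum(predict(item) == item["answer"] for item in items) / len(items)

# Example: evaluate a model only on ophthalmology / fundus photography questions.
# subset = select_subset(all_items, department="ophthalmology",
#                        modality="fundus photography")
# print(f"fundus photography accuracy: {accuracy(subset, my_model.predict):.2%}")
```

The same two calls cover the bounding-box example from the previous section: restricting `granularity` to the bounding-box level (whatever label the release uses for it) isolates exactly the questions on which most models currently struggle.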

Experiment Results

Leaderboard

All values are accuracy (%); the abbreviated task columns correspond to the 18 clinical VQA tasks in GMAI-MMBench.

Model name Overall (val) Overall (test) AR BVR B CR C DD IQG MR M NT OR-A OR-HN OR-P OR-T SG SAR SIR SWR
Random 25.70 25.94 38.20 22.73 22.92 22.72 24.06 26.66 27.13 27.00 20.00 24.75 21.37 22.93 22.33 21.18 32.43 24.23 21.39 23.71
Medical Special Model
MedVInT 2.29 1.96 5.75 0.00 0.00 0.00 2.56 2.11 4.05 0.00 0.00 0.00 0.11 0.00 0.00 0.12 7.36 0.00 1.88 0.00
Med-Flamingo 12.74 11.64 6.67 10.14 9.23 11.27 6.62 13.43 12.15 6.38 8.00 18.18 9.26 18.27 11.00 11.53 12.16 5.19 8.47 11.43
LLaVA-Med 20.54 19.60 24.51 17.83 17.08 19.86 15.04 19.81 20.24 21.51 13.20 15.15 20.42 23.73 17.67 19.65 21.70 19.81 14.11 20.86
Qilin-Med-VL-Chat 22.34 22.06 29.57 19.41 16.46 23.79 15.79 24.19 21.86 16.62 7.20 13.64 24.00 14.67 12.67 15.53 26.13 24.42 17.37 25.71
RadFM 22.95 22.93 27.16 20.63 13.23 19.14 20.45 24.51 23.48 22.85 15.60 16.16 14.32 24.93 17.33 21.53 29.73 17.12 19.59 31.14
MedDr 41.95 43.69 41.20 50.70 37.85 29.87 28.27 52.53 36.03 31.45 29.60 47.47 33.37 51.33 32.67 44.47 35.14 25.19 25.58 32.29
Open-Source LVLMs
CogVLM-grounding-generalist 5.20 5.66 3.11 4.02 2.92 3.22 10.83 7.98 9.72 0.15 0.00 11.11 8.32 1.87 1.67 2.00 1.65 0.00 4.02 0.57
XComposer 8.92 7.67 1.38 7.69 8.31 12.34 22.86 7.31 6.07 5.49 2.80 16.16 5.05 8.67 2.00 9.76 11.94 7.31 3.17 4.00
PandaGPT 13B 16.69 16.27 24.51 23.60 22.15 23.61 14.29 14.95 13.36 12.17 18.40 28.79 18.63 27.33 18.67 16.71 11.04 9.23 13.43 9.71
Flamingo v2 25.58 26.34 37.74 21.50 20.62 22.00 22.41 27.29 25.91 27.45 18.00 28.79 25.16 22.13 22.00 22.00 34.61 22.88 20.44 27.43
VisualGLM-6B 29.58 30.45 40.16 33.92 24.92 25.22 24.21 32.99 29.96 29.53 21.20 37.88 30.32 24.80 13.33 29.88 33.11 19.62 19.16 37.43
Idefics-9B-Instruct 29.74 31.13 40.39 30.59 26.46 33.63 22.56 34.38 25.51 26.71 21.60 27.78 27.47 32.80 24.67 23.41 32.66 23.08 21.39 30.57
InstructBLIP-7B 31.80 30.95 42.12 26.92 24.92 28.09 21.65 34.58 31.58 29.23 22.40 30.30 28.95 27.47 23.00 24.82 32.88 19.81 21.64 26.57
Mini-Gemini-7B 32.17 31.09 29.69 39.16 31.85 28.26 10.38 35.58 29.96 28.78 20.80 34.34 29.58 36.53 24.00 31.76 22.45 25.96 18.56 29.43
MMAlaya 32.19 32.30 41.20 35.14 32.15 34.17 27.82 35.09 28.34 30.27 18.00 46.97 20.21 31.20 16.00 34.59 32.28 23.65 22.93 30.29
Qwen-VL 34.80 36.05 37.05 37.24 35.85 28.98 24.81 43.60 24.70 30.12 19.20 44.44 29.68 31.87 25.00 31.18 30.26 21.54 20.10 26.86
Yi-VL-6B 34.82 34.31 41.66 39.16 26.62 30.23 31.88 38.01 26.72 24.93 25.20 37.37 29.58 31.20 32.33 30.59 36.71 24.81 23.18 31.43
LLaVA-NeXT-vicuna-7B 34.86 35.42 40.62 38.64 21.08 35.42 23.91 41.22 32.39 28.04 20.53 44.95 27.92 34.98 20.22 32.82 33.63 23.08 25.06 34.86
Qwen-VL-Chat 35.07 36.96 38.09 40.56 38.00 32.20 25.71 44.07 24.70 30.56 24.00 40.91 29.37 36.53 26.00 27.29 35.14 16.54 20.10 34.00
CogVLM-Chat 35.23 36.08 40.97 30.77 27.69 32.74 19.40 41.10 36.84 34.72 24.00 40.91 36.74 37.33 26.00 33.65 36.56 20.19 23.95 26.57
Monkey 35.48 36.39 38.32 35.31 35.54 34.53 23.16 43.40 31.98 30.12 19.20 33.33 30.00 32.53 25.33 31.65 34.46 20.00 20.27 30.29
mPLUG-Owl2 35.62 36.21 37.51 41.08 30.92 38.10 27.82 41.59 28.34 32.79 22.40 40.91 24.74 38.27 23.33 36.59 33.48 20.58 23.01 32.86
ShareCaptioner 36.37 36.19 42.35 32.69 31.08 27.19 30.83 41.19 30.36 33.23 28.40 42.93 27.79 33.73 28.33 40.71 29.58 20.96 28.83 30.00
Emu2-Chat 36.50 37.59 43.27 47.73 26.31 40.07 28.12 44.00 36.44 28.49 20.40 31.82 26.74 37.60 26.67 29.76 33.63 23.27 26.43 29.43
XComposer2-4KHD 36.66 38.54 41.89 39.86 28.77 40.43 20.60 44.25 35.22 33.53 22.80 42.42 34.84 29.60 44.00 39.53 35.21 21.54 27.20 38.00
ShareGPT4V-7B 36.71 36.70 43.96 37.59 21.54 37.57 18.80 43.26 32.39 27.30 22.80 43.43 29.47 37.33 22.00 31.76 34.98 24.42 25.06 30.00
LLaVA-NeXT-mistral-7B 37.20 37.16 38.43 27.98 20.31 29.16 20.60 47.19 30.36 32.64 22.40 55.56 32.75 25.58 17.56 34.04 28.38 23.27 24.12 37.43
LLAVA-V1.5-13b-xtuner 37.82 38.74 44.65 29.02 27.08 38.28 28.87 45.32 32.79 30.12 20.40 45.96 33.47 42.53 44.33 37.53 33.48 19.62 22.58 35.43
OmniLMM-12B 37.89 39.30 39.82 40.56 32.62 37.57 24.81 46.68 35.63 35.01 27.60 57.58 28.42 34.00 25.00 29.18 34.46 24.42 27.54 40.29
InternVL-Chat-V1.1 38.16 39.41 42.46 43.88 35.23 45.08 23.31 45.96 38.87 29.23 29.60 40.40 31.68 41.87 26.67 38.82 32.13 19.42 25.58 30.29
LLAVA-V1.5-7B 38.23 37.96 45.45 34.27 30.92 41.32 21.65 44.68 34.01 27.74 23.60 43.43 28.00 42.13 29.00 35.06 33.41 22.12 23.61 29.14
Monkey-Chat 38.39 39.50 40.62 41.43 37.08 35.24 23.76 47.73 29.96 32.94 26.00 37.88 34.84 32.67 24.67 33.18 34.91 21.73 22.24 34.00
LLAVA-V1.5-7B-xtuner 38.68 38.22 38.90 40.03 28.00 40.25 30.08 44.08 33.60 32.49 21.20 40.91 29.47 40.40 30.33 38.59 31.46 23.85 26.95 36.86
XComposer2 38.68 39.20 41.89 37.59 33.69 40.79 22.26 45.87 36.44 32.94 27.20 58.59 26.11 36.40 43.67 37.29 32.06 23.46 27.80 32.86
LLAVA-InternLM-7b 38.71 39.11 36.36 36.54 32.62 38.10 30.68 46.53 34.82 28.19 25.20 48.99 28.11 40.53 33.33 36.00 34.08 26.73 24.12 29.71
TransCore-M 38.86 38.70 40.74 41.78 20.77 35.06 34.74 45.69 32.39 32.94 24.40 44.95 31.05 38.93 27.00 33.76 33.86 23.46 25.49 31.14
InternVL-Chat-V1.5 38.86 39.73 43.84 44.58 34.00 33.99 31.28 45.59 33.20 38.28 32.40 42.42 31.89 42.80 27.00 36.82 34.76 23.27 24.72 32.57
InternVL-Chat-V1.2-Plus 39.41 40.79 42.58 42.31 32.46 37.03 31.43 47.49 42.51 35.01 21.20 50.51 34.95 42.93 22.67 42.47 35.74 22.31 24.98 28.29
InternVL-Chat-V1.2 39.52 40.01 41.66 44.06 27.38 38.46 34.29 46.99 33.60 34.42 21.20 47.98 30.63 42.80 27.67 35.88 35.59 23.85 24.98 28.00
LLAVA-InternLM2-7b 40.07 40.45 39.82 37.94 30.62 35.24 29.77 48.97 34.01 25.96 20.80 53.03 30.95 42.67 32.00 39.88 32.43 21.73 24.38 38.00
DeepSeek-VL-1.3B 40.25 40.77 38.55 35.14 38.92 40.07 27.97 48.12 35.63 31.75 22.80 46.97 40.74 44.93 31.00 40.47 33.33 22.31 21.39 31.71
MiniCPM-V 40.95 41.05 39.70 46.50 36.31 39.36 22.26 48.09 34.82 35.76 24.00 45.45 34.11 44.80 23.00 44.47 36.19 21.15 23.95 35.14
DeepSeek-VL-7B 41.73 43.43 38.43 47.03 42.31 37.03 26.47 51.11 33.20 31.16 26.00 44.95 36.00 58.13 36.33 47.29 34.91 18.08 25.49 39.43
MiniCPM-V2 41.79 42.54 40.74 43.01 36.46 37.57 27.82 51.08 28.74 29.08 26.80 47.47 37.05 46.40 25.33 46.59 35.89 22.31 23.44 31.71
Proprietary LVLMs
Claude3-Opus 32.37 32.44 1.61 39.51 34.31 31.66 12.63 39.26 28.74 30.86 22.40 37.37 25.79 41.07 29.33 33.18 31.31 21.35 23.87 4.00
Qwen-VL-Max 41.34 42.16 32.68 44.58 31.38 40.79 10.68 50.53 32.79 44.36 29.20 51.52 41.37 58.00 30.67 41.65 26.95 25.00 24.64 39.14
GPT-4V 42.50 44.08 29.92 48.95 44.00 37.39 12.93 52.88 32.79 44.21 32.80 63.64 39.89 54.13 37.00 50.59 27.55 23.08 25.75 37.43
Gemini 1.0 44.38 44.93 42.12 45.10 46.46 37.57 20.45 53.29 35.22 36.94 25.20 51.01 34.74 59.60 34.00 50.00 36.64 23.65 23.87 35.43
Gemini 1.5 47.42 48.36 43.50 56.12 51.23 47.58 2.26 55.33 38.87 48.07 30.00 76.26 51.05 75.87 46.33 62.24 20.57 27.69 30.54 40.57
GPT-4o 53.53 53.96 38.32 61.01 57.08 49.02 46.62 61.45 46.56 56.38 34.00 75.25 53.79 69.47 48.67 65.88 33.93 22.88 29.51 39.43

Analysis

In our analysis of cases, we categorize errors arising in the process from input to output into five major classes: Question Misunderstanding, Perceptual Error, Knowledge Deficiency, Irrelevant Responses, and Refusal to Answer. We posit that a model, upon receiving visual and textual information, undergoes two distinct stages: perception and inference. Initially, the text and visual data are input into the model. If the model comprehends the semantic content of the text, it proceeds to scrutinize the image based on the textual information; failure to grasp the text’s meaning constitutes a Question Misunderstanding. Subsequently, the model perceives the content and details of the image, and an inability to accurately perceive the image’s content is classified as a Perceptual Error. Within Perceptual Error, if the model overlooks critical details, the error is termed Perceptual Error - Detail Missing (PE - Detail Missing); if the model does not overlook details but still fails to recognize key information, it is categorized as Perceptual Error - Misinterpretation (PE - Misinterpretation).

If the model correctly perceives the image, it advances to the inference stage. Errors occurring during inference are classified as Knowledge Deficiency. If the model derives an incorrect or outdated response based on accurate perception, this constitutes Lack of Knowledge. If the model erroneously deems the information provided by the image insufficient to formulate a correct response, despite adequate information being available, this error is termed Unable to Determine. Additionally, if the model produces responses devoid of logical coherence, rendering the response logic unanalyzable, such errors fall under Irrelevant Responses. Another scenario arises when the model refrains from answering due to adherence to information security protocols, precluding analysis of its response logic. This error is designated as Refusal to Answer.

Figure: The illustration of the entire logical process from input to output in our case study.
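For bookkeeping during such a case study, the taxonomy above maps onto a small, fixed label set. The enum below is an illustrative sketch for annotating failure cases; it is not part of any released GMAI-MMBench tooling.

```python
from enum import Enum

class ErrorType(Enum):
    """Failure-case labels following the error taxonomy described above (illustrative)."""
    QUESTION_MISUNDERSTANDING = "question misunderstanding"            # text semantics not grasped
    PE_DETAIL_MISSING = "perceptual error - detail missing"            # critical image detail overlooked
    PE_MISINTERPRETATION = "perceptual error - misinterpretation"      # key information misread
    LACK_OF_KNOWLEDGE = "knowledge deficiency - lack of knowledge"     # wrong or outdated inference
    UNABLE_TO_DETERMINE = "knowledge deficiency - unable to determine" # wrongly claims insufficient information
    IRRELEVANT_RESPONSE = "irrelevant response"                        # no analyzable response logic
    REFUSAL_TO_ANSWER = "refusal to answer"                            # declined due to safety protocols

# Hypothetical annotation of one failure case:
# record = {"question_id": "0001", "model": "GPT-4o", "error": ErrorType.PE_DETAIL_MISSING}
```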

Error Examples

Correct Examples

BibTeX

@misc{chen2024gmaimmbenchcomprehensivemultimodalevaluation,
  title={GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI},
  author={Pengcheng Chen and Jin Ye and Guoan Wang and Yanjun Li and Zhongying Deng and Wei Li and Tianbin Li and Haodong Duan and Ziyan Huang and Yanzhou Su and Benyou Wang and Shaoting Zhang and Bin Fu and Jianfei Cai and Bohan Zhuang and Eric J Seibel and Junjun He and Yu Qiao},
  year={2024},
  eprint={2408.03361},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2408.03361},
}