Publications

arXiv · 2025

MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations

Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu

We present MedQ-Deg, a comprehensive benchmark for evaluating medical multimodal large language models (MLLMs) under clinically realistic image quality degradations. MedQ-Deg spans 18 degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities with 24,894 question-answer pairs. We introduce the Calibration Shift metric to quantify the gap between model confidence and actual performance, revealing the AI Dunning-Kruger Effect where models maintain inappropriately high confidence despite severe accuracy collapse. Evaluation of 40 mainstream MLLMs shows systematic performance degradation as severity increases.

Paper

arXiv · 2025

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He

We propose UniMedVL, the first medical unified multimodal model that combines image understanding and generation within a single architecture without manually reloading model checkpoints. Built on the Observation-Knowledge-Analysis (OKA) framework, we construct UniMed-5M comprising over 5.6M multimodal medical samples and design Progressive Curriculum Learning for systematic capability building. UniMedVL achieves superior performance on 5 medical image understanding benchmarks while matching specialized models in generation quality across 8 imaging modalities, demonstrating that bidirectional knowledge sharing improves both understanding and generation.

arXiv · 2026

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Zhongying Deng, et al.

A comprehensive survey and open-source platform integrating 1,000+ open-access medical imaging datasets. We propose a Metadata-Driven Fusion Paradigm (MDFP) to consolidate fragmented small datasets into large, coherent data resources, and build an interactive discovery portal for end-to-end automated dataset integration.

Paper Code Dataset HuggingFace

CVPR · 2025

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, Junjun He

We present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images (WSIs). We construct SlideInstruction — a dataset of 4,181 slide-report pairs and 175,753 QA pairs across 13 pathology categories, 20× larger than prior datasets. SlideChat achieves 81.17% accuracy on SlideBench-VQA (TCGA) and surpasses state-of-the-art results on 18 of 22 tasks, establishing a data–model– benchmark framework for next-generation medical vision-language models.

Paper Code Dataset

MICCAI · 2025 · Oral

Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model

Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijing Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He

We propose Ophora, a text-guided surgical video generation model trained on Ophora-160K (160K video-text pairs). A privacy-preserving fine-tuning pipeline using LVLM-based frame filtering produces Ophora-28K. Ophora outperforms existing methods on FID, FVD, and CLIPScore metrics, and synthesized videos significantly improve downstream surgical workflow understanding (e.g., MViTv2 Top-1: 37.92% → 42.24%).

Paper Code HuggingFace

MICCAI · 2025

RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

Junzhi Ning, Cheng Tang, Kaijing Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Huihui Xu, Yanzhou Su, Tianbin Li, Jiyao Liu, Jin Ye, Sheng Zhang, Yuanfeng Ji, Junjun He

RetinaLogos addresses the chronic shortage of annotated retinal images by generating high-resolution fundus photographs conditioned on free-form natural language. We build RetinaLogos-1400k (14M image-text pairs) and train a three-stage progressive pipeline at 1024×1024 resolution. Expert blind evaluation shows 62.07% of synthetic images are judged as real. DR grading accuracy improves 5–10%, glaucoma detection F1 reaches 0.93.

Paper Code

MICCAI · 2025

Towards Interpretable Counterfactual Generation via Multimodal Autoregression

Chenglong Ma, Yuanfeng Ji, Jin Ye, Lu Zhang, Ying Chen, Tianbin Li, Mingjie Li, Junjun He, Hongming Shan

We introduce the Interpretable Counterfactual Generation (ICG) task for medical imaging, requiring a model to jointly predict future images and textual radiological explanations given disease progression hypotheses. We build ICG-CXR (11,439 high-quality samples) and propose a multimodal autoregressive model (ProgEmu) that achieves FID 29.21 and ROUGE-L 0.2606, surpassing all existing SOTA methods.

Paper Code

MICCAI · 2025

Multi-modal MRI Translation via Evidential Regression and Distribution Calibration

Jiyao Liu, Shangqi Gao, Yuxin Li, Lihao Liu, Xin Gao, Zhaohu Xing, Junzhi Ning, Yanzhou Su, Xiao-Yong Zhang, Junjun He, Ningsheng Xu, Xiahai Zhuang

We propose a novel MRI cross-modal translation framework combining Evidential Regression and Distribution Calibration to address missing modality sequences in clinical settings. Each source modality predicts a Normal-Inverse Gamma distribution, fused via MoNIG for uncertainty-aware synthesis. Validated on BraTS2023 and low-field African MRI datasets, achieving PSNR 28.48 dB and UCE 0.089 on cross-center generalization.

Paper

CVPR · 2025

Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

Junlong Cheng, Bin Fu, Jin Ye, Guoan Wang, Tianbin Li, Haoyu Wang, Ruoyu Li, He Yao, Yanzhou Su, Junjun He

We introduce IMIS-Bench, a comprehensive benchmark for interactive medical image segmentation with over 361 million masks across diverse imaging modalities. We propose a baseline method and evaluation protocol for interactive segmentation in the medical domain, establishing a standard framework for comparing promptable segmentation approaches in clinical applications.

Paper

arXiv · 2025

GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

Yanzhou Su, Tianbin Li, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He

GMAI-VL-R1 introduces reinforcement learning to medical vision-language models, achieving approximately 30% average accuracy improvement across eight imaging modalities. The model demonstrates that RL-based reasoning can surpass models 36× larger on medical multimodal benchmarks, establishing a new paradigm for efficient medical AI.

Paper Code

arXiv · 2025

MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs

Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu

MedQ-Bench introduces a novel perception-reasoning paradigm for medical image quality assessment covering 5 modalities and 40+ quality attributes (3,308 samples). It integrates real clinical images, physical degradation simulations, and AI-synthesized data across three evaluation tracks. Zero-shot evaluation of 14 leading MLLMs shows GPT-4 achieves 68.97% perception accuracy, still 13.5% below expert performance.

Paper Code Dataset

arXiv · 2025

MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation

Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan

MedITok is the first unified visual tokenizer designed for medical images, addressing the conflict between structural fidelity and semantic richness in joint synthesis and understanding tasks. A two-stage training framework first pre-trains on 30M+ unlabelled images for visual alignment, then fine-tunes on 2M image-text pairs for semantic alignment. MedITok achieves SOTA across reconstruction, classification, generation, and VQA tasks spanning 9 imaging modalities and 30+ datasets.

Paper Code HuggingFace

ICCV · 2025

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He, Zongyuan Ge

OphCLIP is a hierarchical retrieval-augmented vision-language pre-training framework designed for understanding ophthalmic surgical workflows. Built on OphVL — a large-scale dataset of 375,000+ hierarchically structured video-text pairs covering surgical phases, instruments, and clinical outcomes — OphCLIP simultaneously learns fine-grained and long-horizon visual representations. Evaluated on 11 datasets for phase recognition and multi-instrument detection, demonstrating robust generalisation and superior performance.

Paper Code HuggingFace

IEEE JBHI · 2026

MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation

Ziyan Huang, Haoyu Wang, Jin Ye, Yuanfeng Ji, Xiaowei Hu, Lihao Liu, Zhikai Yang, Wei Li, Ming Hu, Yanzhou Su, Tianbin Li, Yun Gu, Shaoting Zhang, Yu Qiao, Lixu Gu, Junjun He

MedSegAgent is a multi-agent system for instructive medical image segmentation. Instead of training one universal segmentation model, it orchestrates specialized dataset-specific models through natural language understanding, coarse-to-fine dataset matching, and execution-time result integration. The system integrates 23 datasets and supports 343 segmentation targets across CT, MRI, PET/CT, and ultrasound-related scenarios.

Paper Code

arXiv · 2024

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He

GMAI-VL is a general-purpose medical vision-language model trained on GMAI-VL-5.5M — a dataset aggregating hundreds of specialist medical sub-datasets unified into 5.5M high-quality image-text pairs spanning 18 clinical specialties and 10+ imaging modalities. A three-stage training strategy progressively strengthens visual-language alignment and fusion. GMAI-VL achieves or surpasses SOTA on multiple medical multimodal VQA and diagnostic reasoning benchmarks.

Paper Code HuggingFace

NeurIPS · 2024

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao

GMAI-MMBench is the most comprehensive general medical AI evaluation platform, covering 284 datasets, 38 imaging modalities, 18 clinical tasks, 18 clinical departments, and 4-level perception granularity with a hierarchical task taxonomy. Evaluation of 50 LVLMs reveals that even top-performing models (e.g., GPT-4o) achieve only 53.96% accuracy, quantifying the challenge of medical multimodal understanding and identifying five common failure modes of current frontier models.

Paper Code Dataset

MICCAI · 2024

SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation

Guoan Wang, Jin Ye, Junlong Cheng, Tianbin Li, Zhaolin Chen, Jianfei Cai, Junjun He, Bohan Zhuang

SAM-Med3D-MoE extends SAM-Med3D with a Mixture-of-Experts architecture to address catastrophic forgetting during task-specific finetuning. By routing different medical imaging tasks to specialized expert modules, the model maintains general segmentation capability while achieving strong task-specific performance across multiple 3D medical imaging benchmarks.

Paper

CVPR · 2024

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

OmniMedVQA is a large-scale medical multimodal VQA benchmark integrating 73 datasets across 12 imaging modalities and 20+ anatomical regions, built from real clinical scenarios. Large-scale experiments reveal two key findings: current LVLMs generally underperform on medical VQA, and surprisingly many medical-specific models lag behind general-purpose models — exposing systematic weaknesses in cross-modal alignment, knowledge generalisation, and robustness.

Paper Code

IEEE TNNLS · 2025 · ECCV 2024 Workshop Oral

SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images

Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, Bin Fu, Shaoting Zhang, Junjun He, Yu Qiao

SAM-Med3D adapts the Segment Anything Model (SAM) to 3D volumetric medical images. We construct a large-scale 3D dataset with 21K medical volumes and 131K 3D masks covering 247 categories, then train SAM-Med3D to model three-dimensional context at the voxel level. SAM-Med3D significantly improves segmentation accuracy and robustness over 2D slice-based approaches, providing a universally promptable segmentation backbone for clinical and research applications.

Paper Code HuggingFace

arXiv · 2023

SAM-Med2D

Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, Yu Qiao

SAM-Med2D adapts the Segment Anything Model (SAM) to 2D medical image segmentation. We construct SA-Med2D-20M — a large-scale benchmark — and train a reliable baseline model that better adapts to medical image structures and distribution shifts. SAM-Med2D significantly improves segmentation accuracy and stability, bridging the gap between natural image segmentation models and clinical medical image analysis.

Paper Code Dataset

arXiv · 2023

STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training

Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, Yu Qiao

STU-Net is a family of scalable and transferable U-Net models pre-trained on large-scale annotated medical image segmentation datasets. With parameter scales ranging from 14M to 1.4B, STU-Net-H (1.4B parameters) is the largest medical image segmentation model to date. After large-scale supervised pre-training, the models achieve excellent performance on both direct inference and fine-tuning, demonstrating strong scalability and transfer capability.

Paper Code HuggingFace

arXiv · 2025

A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, et al., Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou

A comprehensive survey of Scientific Large Language Models (Sci-LLMs) produced in collaboration with 20+ leading global institutions, covering 1000+ papers, 600+ key datasets, and SOTA models. The survey systematically reviews the development history, data foundations, model evolution, evaluation frameworks, and agent frontiers of Sci-LLMs, and proposes a roadmap for AI-assisted scientific discovery ecosystems.

Paper Code