Publications
Google Scholar →
MedQ-Deg: A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations
arXiv · 2025
Jiyao Liu, Junzhi Ning, Chenglong Ma, Wanying Qu, Jianghan Shen, Siqi Luo, Jinjie Wei, Jin Ye, Pengze Li, Tianbin Li, Jiashi Lin, Hongming Shan, Xinzhe Luo, Xiaohong Liu, Lihao Liu, Junjun He, Ningsheng Xu
We present MedQ-Deg, a comprehensive benchmark for evaluating medical multimodal large language models (MLLMs) under clinically realistic image quality degradations. MedQ-Deg spans 18 degradation types, 30 fine-grained capability dimensions, and 7 imaging modalities, with 24,894 question-answer pairs. We introduce the Calibration Shift metric to quantify the gap between model confidence and actual performance, revealing an "AI Dunning-Kruger Effect" in which models maintain inappropriately high confidence despite severe accuracy collapse. Evaluation of 40 mainstream MLLMs shows systematic performance degradation as degradation severity increases.
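To make the confidence-accuracy gap concrete, below is a minimal sketch of how a calibration-shift-style score could be computed over a set of answered questions; the paper's exact definition may differ, and the function and variable names are hypothetical.

```python
import numpy as np

def calibration_shift(confidences, correct):
    """Hypothetical sketch: mean self-reported confidence minus observed accuracy.

    confidences: model-reported confidences in [0, 1], one per question
    correct:     1/0 (or True/False) per question, 1 where the answer was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return confidences.mean() - correct.mean()  # > 0 indicates overconfidence

# Toy example: confidence stays high while accuracy collapses under degradation
print(calibration_shift([0.90, 0.85, 0.95, 0.80], [1, 0, 0, 0]))  # ≈ 0.625
```

A positive shift that grows with degradation severity is exactly the overconfidence pattern described above.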
UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
arXiv · 2025
Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
We propose UniMedVL, the first unified medical multimodal model to combine image understanding and generation within a single architecture, without switching between separate model checkpoints. Built on the Observation-Knowledge-Analysis (OKA) framework, we construct UniMed-5M, comprising over 5.6M multimodal medical samples, and design a Progressive Curriculum Learning strategy for systematic capability building. UniMedVL achieves superior performance on 5 medical image understanding benchmarks while matching specialized models in generation quality across 8 imaging modalities, demonstrating that bidirectional knowledge sharing improves both understanding and generation.
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
arXiv · 2026
Zhongying Deng, et al.
A comprehensive survey and open-source platform integrating 1,000+ open-access medical imaging datasets. We propose a Metadata-Driven Fusion Paradigm (MDFP) to consolidate fragmented small datasets into large, coherent data resources, and build an interactive discovery portal for end-to-end automated dataset integration.
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
CVPR · 2025
Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, Junjun He
We present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images (WSIs). We construct SlideInstruction — a dataset of 4,181 slide-report pairs and 175,753 QA pairs across 13 pathology categories, 20× larger than prior datasets. SlideChat achieves 81.17% accuracy on SlideBench-VQA (TCGA) and surpasses state-of-the-art results on 18 of 22 tasks, establishing a data–model–benchmark framework for next-generation medical vision-language models.
Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
MICCAI · 2025 · Oral
Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijing Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He
We propose Ophora, a text-guided surgical video generation model trained on Ophora-160K (160K video-text pairs). A privacy-preserving fine-tuning pipeline using LVLM-based frame filtering produces Ophora-28K. Ophora outperforms existing methods on FID, FVD, and CLIPScore metrics, and synthesized videos significantly improve downstream surgical workflow understanding (e.g., MViTv2 Top-1: 37.92% → 42.24%).
RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions
MICCAI · 2025
Junzhi Ning, Cheng Tang, Kaijing Zhou, Diping Song, Lihao Liu, Ming Hu, Wei Li, Huihui Xu, Yanzhou Su, Tianbin Li, Jiyao Liu, Jin Ye, Sheng Zhang, Yuanfeng Ji, Junjun He
RetinaLogos addresses the chronic shortage of annotated retinal images by generating high-resolution fundus photographs conditioned on free-form natural language. We build RetinaLogos-1400k (1.4M image-text pairs) and train a three-stage progressive pipeline at 1024×1024 resolution. In expert blind evaluation, 62.07% of synthetic images are judged as real; DR grading accuracy improves by 5–10%, and glaucoma detection F1 reaches 0.93.
Towards Interpretable Counterfactual Generation via Multimodal Autoregression
MICCAI · 2025
Chenglong Ma, Yuanfeng Ji, Jin Ye, Lu Zhang, Ying Chen, Tianbin Li, Mingjie Li, Junjun He, Hongming Shan
We introduce the Interpretable Counterfactual Generation (ICG) task for medical imaging, which requires a model to jointly predict future images and textual radiological explanations given disease progression hypotheses. We build ICG-CXR (11,439 high-quality samples) and propose a multimodal autoregressive model, ProgEmu, which achieves FID 29.21 and ROUGE-L 0.2606, surpassing prior state-of-the-art methods.
Multi-modal MRI Translation via Evidential Regression and Distribution Calibration
MICCAI · 2025
Jiyao Liu, Shangqi Gao, Yuxin Li, Lihao Liu, Xin Gao, Zhaohu Xing, Junzhi Ning, Yanzhou Su, Xiao-Yong Zhang, Junjun He, Ningsheng Xu, Xiahai Zhuang
We propose a novel MRI cross-modal translation framework that combines Evidential Regression with Distribution Calibration to address missing modality sequences in clinical settings. Each source-modality branch predicts a Normal-Inverse Gamma (NIG) distribution, and the predictions are fused via a Mixture of NIG (MoNIG) for uncertainty-aware synthesis. The framework is validated on BraTS2023 and low-field African MRI datasets, achieving PSNR 28.48 dB and UCE 0.089 in cross-center generalization.
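As a rough illustration of how uncertainty can be read off a predicted Normal-Inverse Gamma distribution, the sketch below uses the standard deep-evidential-regression decomposition; the paper's exact parameterisation and the MoNIG fusion step may differ, and the function name and toy values are hypothetical.

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Sketch of the usual NIG(gamma, nu, alpha, beta) read-out in evidential regression.

    Returns (prediction, aleatoric, epistemic), e.g. per voxel of the synthesized image.
    """
    prediction = gamma                       # E[mu]: the predicted intensity
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent to the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: model uncertainty, shrinks with evidence
    return prediction, aleatoric, epistemic

# Toy example: one voxel's NIG parameters from a single source-modality branch
pred, alea, epis = nig_uncertainties(gamma=0.42, nu=3.0, alpha=2.5, beta=0.9)
print(pred, alea, epis)  # 0.42 0.6 0.2
```

Intuitively, in a MoNIG-style fusion the branches with lower epistemic uncertainty contribute more to the fused estimate of the missing target modality.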
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
CVPR · 2025
Junlong Cheng, Bin Fu, Jin Ye, Guoan Wang, Tianbin Li, Haoyu Wang, Ruoyu Li, He Yao, Yanzhou Su, Junjun He
We introduce IMIS-Bench, a comprehensive benchmark for interactive medical image segmentation with over 361 million masks across diverse imaging modalities. We propose a baseline method and evaluation protocol for interactive segmentation in the medical domain, establishing a standard framework for comparing promptable segmentation approaches in clinical applications.
GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning
arXiv · 2025
Yanzhou Su, Tianbin Li, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He
GMAI-VL-R1 introduces reinforcement learning to medical vision-language models, achieving approximately 30% average accuracy improvement across eight imaging modalities. The model demonstrates that RL-based reasoning can surpass models 36× larger on medical multimodal benchmarks, establishing a new paradigm for efficient medical AI.
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment Abilities in MLLMs
arXiv · 2025
Jiyao Liu, Jinjie Wei, Wanying Qu, Chenglong Ma, Junzhi Ning, Yunheng Li, Ying Chen, Xinzhe Luo, Pengcheng Chen, Xin Gao, Ming Hu, Huihui Xu, Xin Wang, Shujian Gao, Dingkang Yang, Zhongying Deng, Jin Ye, Lihao Liu, Junjun He, Ningsheng Xu
MedQ-Bench introduces a novel perception-reasoning paradigm for medical image quality assessment covering 5 modalities and 40+ quality attributes (3,308 samples). It integrates real clinical images, physical degradation simulations, and AI-synthesized data across three evaluation tracks. Zero-shot evaluation of 14 leading MLLMs shows GPT-4 achieves 68.97% perception accuracy, still 13.5% below expert performance.
MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
arXiv · 2025
Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan
MedITok is the first unified visual tokenizer designed for medical images, addressing the conflict between structural fidelity and semantic richness in joint synthesis and understanding tasks. A two-stage training framework first pre-trains on 30M+ unlabelled images for visual alignment, then fine-tunes on 2M image-text pairs for semantic alignment. MedITok achieves SOTA across reconstruction, classification, generation, and VQA tasks spanning 9 imaging modalities and 30+ datasets.
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
ICCV · 2025
Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He, Zongyuan Ge
OphCLIP is a hierarchical retrieval-augmented vision-language pre-training framework designed for understanding ophthalmic surgical workflows. Built on OphVL — a large-scale dataset of 375,000+ hierarchically structured video-text pairs covering surgical phases, instruments, and clinical outcomes — OphCLIP simultaneously learns fine-grained and long-horizon visual representations. Evaluated on 11 datasets for phase recognition and multi-instrument detection, it demonstrates robust generalisation and superior performance.
MedSegAgent: A Universal and Scalable Multi-Agent System for Instructive Medical Image Segmentation
IEEE JBHI · 2026
Ziyan Huang, Haoyu Wang, Jin Ye, Yuanfeng Ji, Xiaowei Hu, Lihao Liu, Zhikai Yang, Wei Li, Ming Hu, Yanzhou Su, Tianbin Li, Yun Gu, Shaoting Zhang, Yu Qiao, Lixu Gu, Junjun He
MedSegAgent is a multi-agent system for instructive medical image segmentation. Instead of training one universal segmentation model, it orchestrates specialized dataset-specific models through natural language understanding, coarse-to-fine dataset matching, and execution-time result integration. The system integrates 23 datasets and supports 343 segmentation targets across CT, MRI, PET/CT, and ultrasound scenarios.
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
arXiv · 2024
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He
GMAI-VL is a general-purpose medical vision-language model trained on GMAI-VL-5.5M — a dataset aggregating hundreds of specialist medical sub-datasets unified into 5.5M high-quality image-text pairs spanning 18 clinical specialties and 10+ imaging modalities. A three-stage training strategy progressively strengthens visual-language alignment and fusion. GMAI-VL achieves or surpasses SOTA on multiple medical multimodal VQA and diagnostic reasoning benchmarks.
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
NeurIPS · 2024
Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Junjun He, Yu Qiao
GMAI-MMBench is the most comprehensive general medical AI evaluation platform, covering 284 datasets, 38 imaging modalities, 18 clinical tasks, 18 clinical departments, and 4-level perception granularity with a hierarchical task taxonomy. Evaluation of 50 LVLMs reveals that even top-performing models (e.g., GPT-4o) achieve only 53.96% accuracy, quantifying the challenge of medical multimodal understanding and identifying five common failure modes of current frontier models.
SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation
MICCAI · 2024
Guoan Wang, Jin Ye, Junlong Cheng, Tianbin Li, Zhaolin Chen, Jianfei Cai, Junjun He, Bohan Zhuang
SAM-Med3D-MoE extends SAM-Med3D with a Mixture-of-Experts architecture to address catastrophic forgetting during task-specific finetuning. By routing different medical imaging tasks to specialized expert modules, the model maintains general segmentation capability while achieving strong task-specific performance across multiple 3D medical imaging benchmarks.
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
CVPR · 2024
Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo
OmniMedVQA is a large-scale medical multimodal VQA benchmark integrating 73 datasets across 12 imaging modalities and 20+ anatomical regions, built from real clinical scenarios. Large-scale experiments reveal two key findings: current LVLMs generally underperform on medical VQA, and, surprisingly, many medical-specific models lag behind general-purpose models — exposing systematic weaknesses in cross-modal alignment, knowledge generalisation, and robustness.
SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images
IEEE TNNLS · 2025 · ECCV 2024 Workshop Oral
Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, Bin Fu, Shaoting Zhang, Junjun He, Yu Qiao
SAM-Med3D adapts the Segment Anything Model (SAM) to 3D volumetric medical images. We construct a large-scale 3D dataset with 21K medical volumes and 131K 3D masks covering 247 categories, then train SAM-Med3D to model three-dimensional context at the voxel level. SAM-Med3D significantly improves segmentation accuracy and robustness over 2D slice-based approaches, providing a universally promptable segmentation backbone for clinical and research applications.
SAM-Med2D
arXiv · 2023
Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, Yu Qiao
SAM-Med2D adapts the Segment Anything Model (SAM) to 2D medical image segmentation. We construct SA-Med2D-20M — a large-scale benchmark — and train a reliable baseline model that better adapts to medical image structures and distribution shifts. SAM-Med2D significantly improves segmentation accuracy and stability, bridging the gap between natural image segmentation models and clinical medical image analysis.
STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training
arXiv · 2023
Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, Yu Qiao
STU-Net is a family of scalable and transferable U-Net models pre-trained on large-scale annotated medical image segmentation datasets. With parameter scales ranging from 14M to 1.4B, STU-Net-H (1.4B parameters) is the largest medical image segmentation model to date. After large-scale supervised pre-training, the models achieve excellent performance on both direct inference and fine-tuning, demonstrating strong scalability and transfer capability.
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
arXiv · 2025
Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, et al., Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou
A comprehensive survey of Scientific Large Language Models (Sci-LLMs) produced in collaboration with 20+ leading global institutions, covering 1000+ papers, 600+ key datasets, and SOTA models. The survey systematically reviews the development history, data foundations, model evolution, evaluation frameworks, and agent frontiers of Sci-LLMs, and proposes a roadmap for AI-assisted scientific discovery ecosystems.