Medical AI · Unified Multimodal Modeling · ICML 2026

UniMedVL

Unifying Medical Multimodal Understanding and Generation through Observation–Knowledge–Analysis

A unified medical multimodal AI project for jointly learning image understanding, medical image generation, and interleaved visual-textual reasoning within a single model. Developed by the UniMedVL authors across collaborating research and academic institutions.
GitHub arXiv Paper 🤗 Dataset
UniMedVL task overview across understanding, generation, interleaved, and traditional medical imaging tasks
UniMedVL overview. The model covers understanding tasks, image generation tasks, and interleaved medical workflows across 5.6M samples, 8 medical imaging modalities, and 5 understanding benchmarks.

Medical AI systems increasingly need to do more than classify an image or answer a single question. A realistic diagnostic workflow often requires a model to read medical images, integrate domain knowledge, explain findings in text, localize abnormalities, compare modalities, and sometimes generate clinically meaningful visual outputs. Existing medical multimodal systems usually divide these capabilities across separate models: one model for visual question answering, another for report generation, another for segmentation, and another for image synthesis. This fragmentation creates a mismatch with clinical workflows, where reasoning and output generation are tightly coupled.

UniMedVL (accepted at ICML 2026) addresses this gap by treating medical understanding and generation as mutually reinforcing capabilities rather than isolated tasks. The project introduces a unified medical vision-language model trained with one set of parameters, together with UniMedVL-5M, a large-scale multimodal medical corpus containing more than 5.6 million instances across 8 medical imaging modalities. Through an Observation–Knowledge–Analysis framework and a three-stage progressive curriculum, UniMedVL learns to process multimodal medical inputs and produce textual, visual, and interleaved multimodal outputs in a single inference framework.

Comparison between task-specific medical AI systems and UniMedVL shared multimodal representation
Motivation. Conventional medical AI pipelines decouple VQA, reporting, segmentation, and generation, while UniMedVL uses a shared multimodal representation for unified medical understanding and generation.

Core Highlights

01 — UniMedVL-5M: A Large-Scale Medical Multimodal Corpus

UniMedVL-5M reformulates fragmented medical resources into standardized multimodal input-output pairs. Instead of treating image-caption data, medical VQA data, image generation data, and image translation data as separate silos, the dataset organizes them into a unified training substrate for understanding, generation, and interleaved multimodal tasks. It covers eight major medical imaging modalities: color fundus photography, chest X-ray, CT, histopathology, MRI, OCT, ultrasound, and endoscopy. This breadth allows the model to learn broad cross-modal medical correspondences rather than overfitting to a single modality.
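
To make the idea of a "standardized multimodal input-output pair" concrete, here is a minimal sketch of what such a record could look like. The class names, field names, file names, and the rule used to assign a task family are all hypothetical illustrations, not the actual UniMedVL-5M schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One piece of a sequence: a text span or a reference to an image."""
    kind: str     # "text" or "image"
    content: str  # raw text, or a path/ID pointing at the image


@dataclass
class UnifiedSample:
    """Hypothetical layout for one standardized input-output pair."""
    modality: str            # e.g. "MRI", "chest X-ray"
    inputs: List[Segment]
    outputs: List[Segment]

    def task_family(self) -> str:
        # Assumed rule: text-only outputs -> understanding; image output from
        # text-only input -> generation; anything else that emits an image
        # (image-to-image, or image plus text output) -> interleaved.
        out_img = any(s.kind == "image" for s in self.outputs)
        if not out_img:
            return "understanding"
        in_img = any(s.kind == "image" for s in self.inputs)
        out_txt = any(s.kind == "text" for s in self.outputs)
        return "interleaved" if (in_img or out_txt) else "generation"


vqa = UnifiedSample(
    "chest X-ray",
    [Segment("image", "cxr_001.png"), Segment("text", "Is there a pleural effusion?")],
    [Segment("text", "Yes, a small left-sided effusion.")],
)
text_to_image = UnifiedSample(
    "color fundus photography",
    [Segment("text", "Fundus photograph with optic disc swelling.")],
    [Segment("image", "generated_fundus.png")],
)
virtual_staining = UnifiedSample(
    "histopathology",
    [Segment("image", "he_tile.png"), Segment("text", "Produce the IHC stain.")],
    [Segment("image", "ihc_tile.png"), Segment("text", "Stained tile with a short description.")],
)
```

Under this layout, a VQA pair, a text-to-image pair, and a virtual-staining pair share one record type, which is the property a unified training substrate needs.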

The curation pipeline combines coarse modality-specific filtering, text-length and image-resolution checks, medical image-text alignment scoring, and expert audit. For alignment, candidate captions are generated for each image and compared with original text through semantic embeddings and medical-specific MedSigLIP similarity. The retained high-quality subset is then further enriched with interleaved task supervision, including medical prompt segmentation, super-resolution, counterfactual generation, virtual immunohistochemistry staining, and cross-modal synthesis.
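
The automatic part of this pipeline can be sketched as a chain of per-sample checks. The sketch below assumes the caption embedding, the original-text embedding, and the MedSigLIP image-text similarity have already been computed elsewhere and are passed in as plain numbers. It does not call any real model API, every threshold is a hypothetical placeholder, and the expert audit step is not modeled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def keep_sample(
    text: str,
    width: int,
    height: int,
    caption_emb: Sequence[float],   # embedding of a generated candidate caption
    text_emb: Sequence[float],      # embedding of the original text
    image_text_sim: float,          # precomputed medical image-text similarity
    *,
    min_words: int = 5,             # hypothetical threshold
    min_side: int = 224,            # hypothetical threshold
    min_caption_sim: float = 0.6,   # hypothetical threshold
    min_image_text_sim: float = 0.3,  # hypothetical threshold
) -> bool:
    """Return True if the sample passes every automatic check."""
    if len(text.split()) < min_words:
        return False  # text-length check
    if min(width, height) < min_side:
        return False  # image-resolution check
    if cosine_similarity(caption_emb, text_emb) < min_caption_sim:
        return False  # generated caption disagrees with the original text
    return image_text_sim >= min_image_text_sim
```

Ordering the cheap checks (length, resolution) before the embedding comparisons is a design choice of this sketch, made so that most rejected samples never reach the expensive scoring step.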

Observation-Knowledge-Analysis framework for UniMedVL data curation and progressive curriculum training
Observation and Knowledge levels. UniMedVL-5M is built from heterogeneous medical data through quality filtering, alignment scoring, and interleaved-task construction; training then proceeds through foundation training, instruction tuning, and unified multimodal training.

02 — Observation–Knowledge–Analysis: A Framework for Medical Unification

The central design of UniMedVL is the Observation–Knowledge–Analysis framework. At the Observation level, diverse medical datasets are converted into aligned multimodal samples. At the Knowledge level, the model is trained through progressive curriculum learning: foundation training establishes basic medical vision-language alignment; instruction tuning improves task following with high-quality medical instructions; and unified multimodal training couples understanding and generation through interleaved inputs and outputs. At the Analysis level, the resulting model performs both comprehension and generation with a single parameter set.

This design is important because naive multitask training can easily cause task interference. UniMedVL instead stages the learning process so that low-level cross-modal alignment is learned before more complex instruction following and interleaved reasoning. The final stage exposes the model to tasks where textual and visual outputs must be produced together, encouraging the model to learn shared representations useful for both diagnostic reasoning and visual synthesis.
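
The staging described above can be pictured as a schedule that maps a training step to the active stage. The following is a minimal sketch: the stage names follow the text, but the step budgets and the `focus` descriptions are hypothetical, and the real recipe may differ in how data from earlier stages is mixed into later ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int   # hypothetical step budget for this stage
    focus: str   # what the stage is meant to teach, per the text above


CURRICULUM = (
    Stage("foundation_training", 1000, "basic medical vision-language alignment"),
    Stage("instruction_tuning", 500, "task following with medical instructions"),
    Stage("unified_multimodal_training", 500, "interleaved inputs and outputs"),
)


def stage_at(step: int) -> Stage:
    """Return the stage that is active at a given global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in CURRICULUM:
        boundary += stage.steps
        if step < boundary:
            return stage
    return CURRICULUM[-1]  # past the total budget: stay in the final stage
```

A data loader could call `stage_at(step)` to decide which task mixture to sample from, so that interleaved tasks only appear once the earlier alignment and instruction stages are complete.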

03 — One Model for Understanding, Generation, and Interleaved Outputs

UniMedVL adopts a unified architecture with dual visual encoders and a Transformer backbone. A semantic vision encoder extracts tokens for medical image understanding, while a VAE-based visual pathway supports image generation. These visual tokens are integrated with text tokens inside a shared sequence modeling framework. Specialized feed-forward layers handle understanding-specific and generation-specific representations, while shared self-attention layers allow cross-task information exchange. Text outputs are optimized with next-token prediction, and visual outputs are optimized with rectified flow matching in the VAE latent space.
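
The two training objectives named above can be illustrated numerically. The sketch below uses the common rectified-flow convention in which a noise sample x0 and a data latent x1 are joined by the straight line x_t = (1 - t) x0 + t x1, and the model regresses the constant velocity x1 - x0. Whether UniMedVL uses exactly this parameterization is an assumption, and `velocity_model` and `visual_weight` are hypothetical stand-ins rather than names from the project.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def interpolate(noise: Sequence[float], latent: Sequence[float], t: float) -> Vector:
    """Point on the straight path from noise (t=0) to the data latent (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(noise, latent)]


def flow_matching_loss(
    velocity_model: Callable[[Vector, float], Vector],  # hypothetical model stand-in
    noise: Sequence[float],
    latent: Sequence[float],
    t: float,
) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, latent, t)
    predicted = velocity_model(x_t, t)
    target = [b - a for a, b in zip(noise, latent)]  # straight-line velocity
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def next_token_loss(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average negative log-likelihood of the target token at each position."""
    return -sum(math.log(p[y]) for p, y in zip(probs, targets)) / len(targets)


def total_loss(text_loss: float, visual_loss: float, visual_weight: float = 1.0) -> float:
    """Weighted sum of both objectives; the weight is a hypothetical knob."""
    return text_loss + visual_weight * visual_loss
```

A model that predicts the exact straight-line velocity gets zero visual loss, which is what makes the regression target easy to check in isolation before it is combined with the text objective.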

The resulting system supports three families of tasks: understanding tasks such as medical VQA, image captioning, diagnostic reasoning, and report generation; generation tasks such as text-guided medical image synthesis; and interleaved tasks such as virtual staining, super-resolution, counterfactual generation, and cross-modal image synthesis where the model must jointly produce visual and textual outputs.

Qualitative UniMedVL examples for text-driven generation, virtual staining, super-resolution, counterfactual generation, and cross-modal synthesis
Qualitative capability overview. UniMedVL supports text-to-image generation, virtual staining, super-resolution, counterfactual generation, and cross-modal synthesis under one unified medical multimodal framework.

04 — Competitive Understanding and Strong Multi-Modality Generation

UniMedVL is evaluated on five medical visual understanding benchmarks: VQA-RAD, SLAKE, PathVQA, OmniMedVQA, and GMAI-MMBench. Despite being a unified understanding-and-generation model rather than an understanding-only specialist, UniMedVL reaches a 67.47 average score across these benchmarks and achieves strong results on challenging settings such as OmniMedVQA and GMAI-MMBench. Compared with prior unified medical models that rely on task-specific checkpoints, UniMedVL keeps inference unified under a single model.

On generation, UniMedVL reports an average FID of 96.29 (lower is better) across eight medical imaging modalities, improving over the generation-only variant and general unified baselines. The model also achieves an average BioMedCLIP Score of 0.706 (higher is better), indicating stronger semantic alignment between medical prompts and generated images. External held-out generation evaluation further suggests that the gains are not limited to the training distribution.
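
For readers unfamiliar with the metric, the standard definition of the Fréchet Inception Distance is reproduced below. Here μ_r, Σ_r and μ_g, Σ_g are the mean and covariance of feature embeddings of real and generated images. The feature extractor and sample counts used for the numbers above are not stated on this page, so this is the generic formula rather than a description of the exact evaluation protocol.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Both terms are non-negative and vanish when the two feature distributions have identical means and covariances, which is why a lower FID indicates generated images that are statistically closer to real ones.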

Comparison table of UniMedVL against LVLMs and unified multimodal models on medical visual understanding tasks
Medical visual understanding benchmarks. UniMedVL reaches a 67.47 average score across VQA-RAD, SLAKE, PathVQA, OmniMedVQA, and GMAI-MMBench while preserving a unified understanding-and-generation architecture.
BioMedCLIP score radar chart across eight medical imaging modalities
Multi-modality generation performance. UniMedVL achieves strong BioMedCLIP alignment across eight medical imaging modalities, showing that unified training can improve the semantic alignment of generated images rather than degrade it.

05 — Evidence of Bidirectional Transfer between Understanding and Generation

A key empirical finding is that medical understanding and medical generation do not necessarily compete. Ablation studies show that adding generation training improves understanding performance, while adding understanding supervision improves generation fidelity. In the foundation stage, joint training lifts GMAI-MMBench accuracy from 0.505 to 0.593 compared with the understanding-only variant. Across generation experiments, incorporating understanding supervision reduces average FID compared with generation-only training.

The interleaved task results further support this conclusion. UniMedVL achieves 20.27 dB PSNR for H&E-to-IHC virtual staining, 27.29 dB PSNR / 0.890 SSIM for MRI super-resolution, and 25.07 dB PSNR / 0.882 SSIM on average for bidirectional T2-FLAIR MRI translation. These results suggest that unified multimodal training can preserve task-specific utility while enabling broader medical workflow coverage.
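
As a reminder of what the PSNR figures measure, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), applied to flattened pixel sequences. The intensity range (`max_value`) and any preprocessing used for the reported numbers are not stated on this page, so the defaults below are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: no error at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error, which helps when comparing the staining, super-resolution, and translation results against each other.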

Bidirectional transfer and progressive curriculum trajectory in UniMedVL
Bidirectional transfer. Ablations show that generation supervision can improve understanding, while understanding supervision can improve generation quality; progressive curriculum learning further strengthens this cross-task synergy.

Additional Results and Ablations

Ablation tables for understanding-generation synergy and progressive training stages
Training-stage ablations. Joint training outperforms single-task variants, and the progressive stages bring cumulative gains across understanding and generation metrics.
Ablation table on data augmentation for medical understanding tasks
Understanding-task augmentation. Interleaved supervision improves medical visual understanding scores across the reported benchmarks.
Ablation table on data augmentation for generation quality
Generation augmentation. Caption-augmented and interleaved data improve generation quality, reducing gFID while increasing BioMedCLIP score.
External generation, modality-specialized generator, virtual staining, and super-resolution result tables
Interleaved and generation results. UniMedVL is evaluated on held-out generation datasets, modality-specialized generation, virtual staining, and MRI super-resolution.
Medical image translation and counterfactual generation result tables
Medical image translation and counterfactual generation results. UniMedVL reports competitive bidirectional MRI translation and counterfactual generation performance.

Conclusion

UniMedVL is a step toward unified medical multimodal modeling: a single model that can understand medical images, generate medical images, and handle interleaved visual-textual workflows. Its main message is not merely that one model can cover many tasks, but that carefully aligned data, progressive curriculum design, and joint objectives can make understanding and generation reinforce each other. The current system remains a research model rather than a deployable clinical solution: it focuses on 2D medical imaging, relies on automatic evaluation metrics, and still requires further clinical validation before real-world use. Nevertheless, UniMedVL provides a reusable dataset, training recipe, and model design for future work on general-purpose medical multimodal AI.

Authors

Junzhi Ning*, Wei Li*, Cheng Tang*, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su†, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He
* Equal contribution  ·  † Corresponding authors
