Clinical diagnosis demands models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs, including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems fragment this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This separation leaves gaps in data representation, feature integration, and task-level multimodal capabilities.
To this end, we propose a multi-level framework that mirrors clinical diagnosis through the Observation-Knowledge-Analysis (OKA) paradigm. At the observation level, we construct UniMed-5M, a dataset of over 5.6 million samples that reformats diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning, which systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, a unified medical multimodal model designed within the OKA paradigm to perform both image understanding and generation tasks in a single architecture.
UniMedVL achieves superior performance on five medical image understanding benchmarks while matching specialized models in generation quality across eight medical imaging modalities. Crucially, the unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features and understanding tasks improve generation, demonstrating that integrating these traditionally separate capabilities within a single medical framework unlocks improvements across diverse clinical scenarios.
Consider a radiologist examining suspected lung pathology: they systematically process chest X-rays (visual), prior CT scans (cross-modal comparison), and patient history (textual) to generate multiple complementary outputs, such as a written report that explains the reasoning and visual annotations that localize the findings.
This exemplifies how clinical diagnosis requires unified processing of multimodal inputs to generate diverse multimodal outputs, where neither textual reports alone (lacking spatial localization) nor visual annotations alone (lacking reasoning context) suffice.
Although multimodal fusion has demonstrated substantial improvements in clinical decision-making, current medical AI remains fragmented at three critical levels:
① Data Level: Medical datasets remain predominantly single-modal despite clear evidence that multimodal integration substantially improves diagnostic accuracy. Most datasets lack the paired multimodal structure needed for unified training.
② Feature Level: Current approaches lack systematic progressive training strategies for learning deep cross-modal relationships. Most methods simply concatenate features rather than progressively building from basic pattern recognition to sophisticated multimodal reasoning.
③ Task Level: While general-domain models have made progress in unified architectures, the medical domain still lacks truly unified models. For instance, HealthGPT demonstrates both understanding and generation capabilities but requires reloading different model checkpoints to switch between task types—a limitation that prevents seamless multi-task operation in clinical workflows.
📊 Performance Gap: Current medical AI systems achieve less than 60% accuracy compared to over 90% for human experts on diagnostic challenges, highlighting the urgent need for unified approaches.
UniMedVL leverages the OKA framework to achieve comprehensive medical multimodal understanding and generation within a single model checkpoint. Once loaded, it seamlessly handles medical visual question answering, report generation, text-driven image generation, and interleaved understanding-generation tasks.
Key Advantage: No offline checkpoint switching—single model completes all tasks ✨
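As a minimal sketch of what this single-checkpoint workflow looks like in practice, the snippet below routes understanding and generation requests through one loaded model object. The class and method names are illustrative stand-ins, not the released UniMedVL interface.

```python
# Minimal sketch only: UnifiedMedicalModel is a hypothetical stand-in, not the
# actual UniMedVL API. The point is that the task is expressed in the request,
# so no checkpoint swap is needed between understanding and generation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    task: str                    # "vqa" | "report" | "generate" | "interleaved"
    text: str                    # question, instruction, or generation prompt
    image: Optional[str] = None  # image reference for image-conditioned tasks

class UnifiedMedicalModel:
    """Toy stand-in for a unified understanding + generation model."""
    def __call__(self, req: Request) -> dict:
        if req.task in ("vqa", "report"):
            return {"text": f"[textual output for: {req.text}]"}
        if req.task == "generate":
            return {"image": f"[synthesized image for: {req.text}]"}
        return {"text": "[explanation]", "image": "[annotated image]"}  # interleaved

model = UnifiedMedicalModel()  # loaded once, reused for every task type
print(model(Request("vqa", "Is there a pleural effusion?", image="cxr_001.png")))
print(model(Request("generate", "Chest X-ray with mild cardiomegaly")))
```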
5.6M+ training samples across 9 medical imaging modalities
96.29 average gFID, matching specialized generation models
UniMedVL follows a clinical workflow-guided three-level framework that mirrors how physicians process medical information:
Figure: Comprehensive data processing pipeline and model architecture overview.
Three-Stage Progressive Curriculum Learning: medical multimodal knowledge is introduced progressively, building from basic pattern recognition to sophisticated cross-modal reasoning (an illustrative stage schedule is sketched below).
A single model provides comprehensive coverage of understanding, generation, and interleaved tasks.
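The following sketch shows one way such a three-stage curriculum could be expressed as a training configuration. The stage goals loosely follow the Observation-Knowledge-Analysis levels described above; the task lists, sample weightings, and step counts are assumptions for illustration, not the paper's actual schedule.

```python
# Illustrative curriculum config; stage contents and weights are assumptions,
# not the published UniMedVL training recipe.
CURRICULUM = [
    {
        "stage": 1,  # observation: align medical images with paired text
        "tasks": ["image-text alignment", "captioning"],
        "mixture": {"understanding": 1.0, "generation": 0.0},
    },
    {
        "stage": 2,  # knowledge: instruction-style multimodal supervision
        "tasks": ["medical VQA", "report generation", "text-to-image synthesis"],
        "mixture": {"understanding": 0.6, "generation": 0.4},
    },
    {
        "stage": 3,  # analysis: joint understanding + generation, incl. interleaved tasks
        "tasks": ["VQA", "report generation", "image generation", "interleaved image-text"],
        "mixture": {"understanding": 0.5, "generation": 0.5},
    },
]

def active_stage(step: int, steps_per_stage: int = 10_000) -> dict:
    """Return the curriculum stage that applies at a given training step."""
    return CURRICULUM[min(step // steps_per_stage, len(CURRICULUM) - 1)]

print(active_stage(25_000)["tasks"])  # -> stage 3 task list
```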
Medical image understanding benchmarks:
PathVQA: 53.5% (vs. HealthGPT-L14: 44.4%)
OmniMedVQA: 85.8% (vs. GMAI-VL: 88.5%)
VQA-RAD: 61.9% (vs. GMAI-VL: 66.3%)
GMAI-MMBench: 60.75% (comprehensive medical multimodal benchmark)
Text-driven image generation quality by modality (gFID, lower is better):
Chest X-ray (CXR): 73.04
Histopathology (HIS): 149.01
Fundus Photo (CFP): 53.20
CT Scan: 73.04
MRI: 90.36
OCT: 99.27
Ultrasound: 95.38
Endoscopy: 133.11
Average gFID: 96.29 | BioMedCLIP score: 0.706
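For readers who want to reproduce this style of evaluation, the sketch below computes an FID score with torchmetrics on placeholder tensors. This is not the paper's evaluation code, and the BioMedCLIP image-text score reported above would be computed separately with a domain-specific CLIP model.

```python
# Hedged sketch of a gFID-style computation with torchmetrics (requires the
# torchmetrics and torch-fidelity packages); tensors below are random placeholders
# standing in for real and generated images of one modality.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]

real_images = torch.rand(32, 3, 256, 256)       # placeholder: held-out real images
generated_images = torch.rand(32, 3, 256, 256)  # placeholder: model outputs

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(f"FID: {float(fid.compute()):.2f}")  # lower is better; use full test sets in practice
```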
Downstream image-to-image tasks (PSNR / SSIM):
Virtual Immunohistochemistry Staining: 20.27 / 0.456
MRI Super-Resolution (4×): 27.29 / 0.890
Cross-Modal Synthesis (T2↔FLAIR): 25.07 / 0.882 (average)
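PSNR and SSIM for these paired tasks can be scored with standard reference implementations; the snippet below uses scikit-image on placeholder arrays and is not the paper's evaluation script.

```python
# Hedged sketch of PSNR/SSIM scoring with scikit-image; arrays are random
# placeholders for a ground-truth slice and a model prediction in [0, 1].
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

target = np.random.rand(256, 256).astype(np.float32)
prediction = np.clip(target + 0.05 * np.random.randn(256, 256), 0.0, 1.0).astype(np.float32)

psnr = peak_signal_noise_ratio(target, prediction, data_range=1.0)
ssim = structural_similarity(target, prediction, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB | SSIM: {ssim:.3f}")
```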
Counterfactual Generation (image + text explanation): gFID 27.17, AUROC 0.797; BLEU-3 0.2641, METEOR 0.4486, ROUGE-L 0.4649
Seamless task switching: completes all tasks without switching checkpoints.
Bidirectional knowledge sharing: generation tasks enhance understanding, and understanding tasks optimize generation.
Clinical workflow integration: the Observation-Knowledge-Analysis workflow aligns with clinical practice.
Figure: Performance comparison across training stages and modalities.
Figure: Visualization of UniMedVL's multimodal capabilities.
Medical VQA accuracy: SLAKE 75.4% | PathVQA 53.5%
Report generation: detailed medical diagnostic reports
Text-driven image generation: high-quality synthesis across 8 medical imaging modalities (average gFID 96.29 | BioMedCLIP 0.706)
Figure: Qualitative comparison across diverse medical imaging modalities.
Eight different medical imaging modalities supported by UniMedVL
Chest X-ray (CXR)
CT Scan
MRI
Ultrasound
OCT (Optical Coherence Tomography)
Retinal Fundus Photo
Histopathology (HIS)
Endoscopy