UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Junzhi Ning1*, Wei Li1,3*, Cheng Tang1,4*, Jiashi Lin1, Chenglong Ma2,5,
Chaoyang Zhang2, Jiyao Liu1,5, Ying Chen1, Shujian Gao1,5, Lihao Liu1,
Yuandong Pu1,3, Huihui Xu1,11, Chenhui Gou7, Ziyan Huang1, Yi Xin1,2,
Qi Qin1, Zhongying Deng6, Diping Song1, Bin Fu1, Guang Yang9,
Yuanfeng Ji10, Tianbin Li1, Yanzhou Su8, Jin Ye1,7, Shixiang Tang1, Ming Hu1,7,
Junjun He1,2†
1Shanghai Artificial Intelligence Laboratory, 2Shanghai Innovation Institute, 3Shanghai Jiao Tong University,
4Shanghai Institute of Optics and Fine Mechanics, 5Fudan University, 6University of Cambridge, 7Monash University,
8Fuzhou University, 9Imperial College London, 10The University of Hong Kong,
11The Hong Kong University of Science and Technology
*Equal contribution. †Corresponding author.
UniMedVL Overview

Abstract

Clinical diagnosis demands models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities.

To this end, we propose a multi-level framework that mirrors clinical diagnosis through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6 million samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL—a medical unified multimodal model designed within the OKA paradigm for comprehensive analysis of image understanding and generation tasks within a single architecture.

UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing—generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse clinical scenarios.

Core Innovation

🏥 Clinical Diagnosis: A Multimodal Process

Consider a radiologist examining suspected lung pathology: they systematically process chest X-rays (visual), prior CT scans (cross-modal comparison), and patient history (textual) to generate multiple complementary outputs:

  • Detailed reports describing findings and reasoning
  • Visual annotations highlighting specific regions of concern
  • Comparative visualizations for treatment planning and surgical guidance

This exemplifies how clinical diagnosis requires unified processing of multimodal inputs to generate diverse multimodal outputs, where neither textual reports alone (lacking spatial localization) nor visual annotations alone (lacking reasoning context) suffice.

🔴 Three-Level Fragmentation in Existing Medical AI

Although multimodal fusion has been shown to substantially improve clinical decision-making, current medical AI remains fragmented at three critical levels:

① Data Level: Medical datasets remain predominantly single-modal despite clear evidence that multimodal integration substantially improves diagnostic accuracy. Most datasets lack the paired multimodal structure needed for unified training.

② Feature Level: Current approaches lack systematic progressive training strategies for deep cross-modal relationships. Most methods simply concatenate features rather than progressively building from basic pattern recognition to sophisticated multimodal reasoning.

③ Task Level: While general-domain models have made progress in unified architectures, the medical domain still lacks truly unified models. For instance, HealthGPT demonstrates both understanding and generation capabilities but requires reloading different model checkpoints to switch between task types—a limitation that prevents seamless multi-task operation in clinical workflows.

📊 Performance Gap: Current medical AI systems achieve less than 60% accuracy compared to over 90% for human experts on diagnostic challenges, highlighting the urgent need for unified approaches.

✅ UniMedVL: Unified Architecture for Medical Data across 8 Modalities

UniMedVL leverages the OKA framework to achieve comprehensive medical multimodal understanding and generation within a single model checkpoint. Once loaded, it seamlessly handles:

  • 📖 Understanding Tasks: Medical VQA, image captioning, diagnostic report generation
  • 🎨 Generation Tasks: Text-to-image synthesis, cross-modal translation (CT↔MRI), virtual staining
  • 🔀 Interleaved Tasks: Counterfactual generation (simultaneous image + explanatory text), super-resolution, segmentation

Key Advantage: no offline checkpoint switching; a single loaded model completes all tasks ✨ (see the sketch below)
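Conceptually, the single-checkpoint design amounts to one set of loaded weights behind a task-routing interface, so switching between understanding, generation, and interleaved requests is a change of input, not a model reload. The sketch below is a minimal illustration of that pattern only; the class, methods, and checkpoint name are hypothetical placeholders, not UniMedVL's released API.

```python
# Minimal sketch of the single-checkpoint, multi-task pattern (illustrative only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    task: str                      # "understand" | "generate" | "interleaved"
    text: str
    image_path: Optional[str] = None

class UnifiedMedicalModel:
    """One set of weights serves every task family; only the prompt and the
    requested output type change, so no checkpoint swap is ever needed."""
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint   # weights would be loaded exactly once here

    def run(self, req: Request) -> dict:
        if req.task == "understand":   # VQA, captioning, report generation -> text
            return {"text": f"[report for {req.image_path}]"}
        if req.task == "generate":     # text-to-image, modality translation -> image
            return {"image": f"[image synthesized from '{req.text}']"}
        if req.task == "interleaved":  # counterfactuals, etc. -> image + text
            return {"image": "[edited image]", "text": "[explanation]"}
        raise ValueError(f"unknown task: {req.task}")

model = UnifiedMedicalModel("unimedvl.ckpt")   # hypothetical checkpoint, loaded once
for r in (Request("understand", "Is there pleural effusion?", "cxr.png"),
          Request("generate", "Axial CT slice with a small liver lesion"),
          Request("interleaved", "Remove the consolidation and explain", "cxr.png")):
    print(model.run(r))
```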

Key Statistics

  • 5.6M+ training samples spanning 9 medical imaging modalities
  • Average gFID of 96.29, matching specialized generation models

OKA Framework: Observation-Knowledge-Analysis

UniMedVL follows a clinical workflow-guided three-level framework that mirrors how physicians process medical information:

  1. Observation Level (Data): Construct UniMed-5M dataset with quality control and expert validation
  2. Knowledge Level (Features): Progressive curriculum learning and cross-modal knowledge fusion
  3. Analysis Level (Tasks): Unified architecture producing multimodal outputs (reports, images, annotations)

Data Pipeline and Model Architecture

Comprehensive data processing pipeline and model architecture overview


Progressive Training Strategy

Three-Stage Progressive Curriculum Learning (a data-mixing sketch follows the stage summaries below):

🔧 Stage 1 - Foundation Training

  • 85K training steps
  • Basic medical pattern recognition
  • Visual-language alignment
  • Data ratio: 75% I2T, 25% T2I

📚 Stage 2 - Instruction Tuning

  • 120K training steps
  • Cross-modal understanding enhancement
  • Medical expertise development
  • Data ratio: 40% I2T, 45% T2I, 10% Interleaved

🚀 Stage 3 - Unified Training

  • 70K training steps
  • Advanced multimodal synthesis
  • Interleaved task mastery
  • Data ratio: 37% I2T, 35% T2I, 25% Interleaved
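The stage schedule above can be read as a data-mixing configuration. The following sketch encodes the stated step counts and task ratios and samples a task type per batch; the sampler itself is an assumption for illustration, not the released training code, and the small unlisted remainder of each mix is left unspecified.

```python
# Sketch of the three-stage curriculum as a data-mixing configuration.
import random

STAGES = [
    {"name": "foundation",  "steps": 85_000,  "mix": {"i2t": 0.75, "t2i": 0.25}},
    {"name": "instruction", "steps": 120_000, "mix": {"i2t": 0.40, "t2i": 0.45, "interleaved": 0.10}},
    {"name": "unified",     "steps": 70_000,  "mix": {"i2t": 0.37, "t2i": 0.35, "interleaved": 0.25}},
]

def sample_task(mix: dict) -> str:
    """Pick the task type of the next batch according to the stage's data ratio.
    Weights need not sum to 1; random.choices normalizes them."""
    tasks, weights = zip(*mix.items())
    return random.choices(tasks, weights=weights, k=1)[0]

for stage in STAGES:
    counts = {t: 0 for t in stage["mix"]}
    for _ in range(1_000):                      # short simulation, not real training
        counts[sample_task(stage["mix"])] += 1
    print(stage["name"], stage["steps"], counts)
```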

Performance Highlights

Single model, comprehensive coverage of understanding, generation, and interleaved tasks

📖 Medical Image Understanding

  • PathVQA: 53.5% (vs. HealthGPT-L14: 44.4%)
  • OmniMedVQA: 85.8% (vs. GMAI-VL: 88.5%)
  • VQA-RAD: 61.9% (vs. GMAI-VL: 66.3%)
  • GMAI-MMBench: 60.75% (comprehensive medical multimodal benchmark)

🎨 Medical Image Generation

  • Chest X-ray (CXR): gFID 73.04
  • Histopathology (HIS): gFID 149.01
  • Fundus Photo (CFP): gFID 53.20
  • CT Scan: gFID 73.04
  • MRI: gFID 90.36
  • OCT: gFID 99.27
  • Ultrasound: gFID 95.38
  • Endoscopy: gFID 133.11

Average gFID: 96.29 | BioMedCLIP score: 0.706
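For reference, a generative FID (gFID) of this kind compares feature statistics of real versus generated images. The sketch below uses torchmetrics' InceptionV3-based FID purely as an illustration; the paper's exact feature extractor, preprocessing, and sample counts are assumptions not shown here.

```python
# Illustrative gFID computation with torchmetrics (random tensors stand in for data).
# Requires torchmetrics with its image extras (torch-fidelity backend).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)   # pooled InceptionV3 features

# uint8 batches of shape (N, 3, H, W); grayscale medical slices would be
# replicated to three channels before this step.
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real, real=True)    # accumulate statistics of reference images
fid.update(fake, real=False)   # accumulate statistics of generated images
print(float(fid.compute()))    # lower is better
```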

🔀 Interleaved Tasks (Understanding + Generation)

  • Virtual Immunohistochemistry Staining: PSNR 20.27 / SSIM 0.456
  • MRI Super-Resolution (4×): PSNR 27.29 / SSIM 0.890
  • Cross-Modal Synthesis (T2↔FLAIR): average PSNR 25.07 / SSIM 0.882
  • Counterfactual Generation (image + text explanation): gFID 27.17 | AUROC 0.797 | BLEU-3 0.2641 | METEOR 0.4486 | ROUGE-L 0.4649
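PSNR and SSIM for the paired tasks above are standard full-reference metrics. A minimal sketch of how such numbers are typically computed with scikit-image follows, assuming 2-D slices normalized to [0, 1]; the synthetic data and loading here are illustrative assumptions, not the authors' evaluation pipeline.

```python
# Illustrative PSNR / SSIM evaluation for a predicted vs. reference slice.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """Both arrays are 2-D slices scaled to [0, 1]; higher is better for both metrics."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    ssim = structural_similarity(target, pred, data_range=1.0)
    return psnr, ssim

rng = np.random.default_rng(0)
target = rng.random((256, 256))                                              # stand-in reference slice
pred = np.clip(target + 0.05 * rng.standard_normal((256, 256)), 0.0, 1.0)   # noisy "prediction"
print(evaluate_pair(pred, target))
```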

🌟 Clinical Advantages of Unified Architecture

  • Seamless Task Switching: complete all tasks without switching checkpoints
  • Bidirectional Knowledge Sharing: generation tasks enhance understanding, and understanding tasks optimize generation
  • Clinical Workflow Integration: the Observation-Knowledge-Analysis workflow aligns with clinical practice

Experimental Results Visualization

Performance Visualization Comparison

Comprehensive performance comparison across different training stages and modalities

Multimodal Task Demonstrations

Comprehensive visualization of UniMedVL's multimodal capabilities

💬 Medical Visual Question Answering

Accuracy: SLAKE 75.4% | PathVQA 53.5%

📄 Medical Report Generation

Generating detailed medical diagnostic reports

🎨 Text-Driven Medical Image Generation

High-quality text-driven image generation across 8 medical imaging modalities
Average gFID: 96.29 | BioMedCLIP score: 0.706

🔬 VAE Reconstruction Quality

Qualitative comparison across diverse medical imaging modalities

🔬 Medical Imaging Modalities

Eight different medical imaging modalities supported by UniMedVL

  • Chest X-ray (CXR)
  • CT Scan
  • MRI
  • Ultrasound
  • OCT (Optical Coherence Tomography)
  • Retinal Fundus Photo
  • Histopathology (HIS)
  • Endoscopy