Benchmark Construction

MedQ-Deg is built on two hierarchies, one stratifying model capabilities and one stratifying clinical degradation patterns, supported by a human-curated data pipeline.

Fig. 2: Overview of the MedQ-Deg benchmark framework.

Capability Hierarchy

A three-tier framework decomposing clinical competence:

T1 High-level: Medical Perception & Clinical Reasoning
T2 Mid-level: 6 clinical tasks (CU, IP, AR, BS, Diag, Treat)
T3 Fine-grained: 30 specific clinical skills

Degradation Hierarchy

Two-level organization grounded in clinical imaging physics:

Artifacts: 7 types
Motion Interference: 2 types
Intensity Jitter: 3 types
Noise: 2 types
Resolution & Blur: 5 types

Each degradation is calibrated by expert radiologists at 3 severity levels (L0/L1/L2).
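To make the severity levels concrete, here is a minimal sketch of one degradation, reduce_contrast, applied at each level. The scaling factors are hypothetical placeholders, since the radiologist-calibrated parameters are not given here, and treating L0 as the clean image is an assumption based on the pipeline applying degradations only at L1 and L2.

```python
# Hypothetical contrast-scaling factors per severity level. These are
# illustrative only; the benchmark's calibrated values are not stated here.
# L0 is assumed to be the clean, undegraded image.
SEVERITY_FACTORS = {"L0": 1.0, "L1": 0.6, "L2": 0.3}


def reduce_contrast(image, level):
    """Pull every pixel toward the image mean; a smaller factor means lower contrast."""
    factor = SEVERITY_FACTORS[level]
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[mean + factor * (p - mean) for p in row] for row in image]


def contrast_range(image):
    """Difference between the brightest and darkest pixel."""
    pixels = [p for row in image for p in row]
    return max(pixels) - min(pixels)


# A tiny 2x2 grayscale image with intensities in [0, 1].
clean = [[0.0, 0.5], [0.5, 1.0]]
for level in ("L0", "L1", "L2"):
    degraded = reduce_contrast(clean, level)
    print(level, round(contrast_range(degraded), 3))
```

The dynamic range shrinks as severity increases, which is the property a calibrated severity scale needs to preserve for each degradation type.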

Data Pipeline

Human-in-the-loop quality assurance:

1. 4,306 clean medical images from 3 top-tier benchmarks
2. Modality-specific degradation application at L1 & L2
3. 3 board-certified radiologists independently reviewed each pair
4. 8.3% of pairs removed by the human filter, leaving 24,894 QA pairs

Degradation Type Statistics

19 degradation types across 5 categories, spanning 7 medical imaging modalities.

Name	Parent Category	Count	Ratio in Total	Modality
Artifacts	-	5,322	25.85%	All
limited_angle	Artifacts	1,440	6.99%	CT
sparse_view	Artifacts	1,440	6.99%	CT
bias_field_artifact	Artifacts	936	4.55%	MRI
undersampling_artifact	Artifacts	672	3.26%	MRI
ghosting_artifact	Artifacts	352	1.71%	MRI
blood_cell_artifact	Artifacts	304	1.48%	Pathology
dark_spots_artifact	Artifacts	178	0.86%	Pathology
Motion Interference	-	2,992	14.53%	All
object_rotation	Motion Interference	2,718	13.20%	All
object_movement	Motion Interference	274	1.33%	All
Intensity Jitter	-	1,846	8.97%	All
adjust_brightness	Intensity Jitter	916	4.45%	All
exposure	Intensity Jitter	518	2.52%	All
reduce_contrast	Intensity Jitter	412	2.00%	All
Noise	-	3,796	18.44%	All
gaussian_noise	Noise	2,680	13.02%	All
low_dose	Noise	1,116	5.42%	CT
Resolution & Blur	-	6,632	32.21%	All
low_resolution	Resolution & Blur	3,046	14.80%	All
motion_blur	Resolution & Blur	2,820	13.70%	All
gaussian_blur	Resolution & Blur	470	2.28%	All
bubble	Resolution & Blur	296	1.44%	Pathology
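The table's category totals and ratios can be checked with a few lines of arithmetic. The sketch below copies the per-type counts from the table and recomputes each category's sum and share. The degraded counts add up to 20,588, and adding the 4,306 clean images gives 24,894, the QA pair count reported for the pipeline. Reading "Ratio in Total" as a share of degraded samples only is an interpretation of that arithmetic, not something stated in the table.

```python
# Per-type counts copied from the table above.
counts = {
    "Artifacts": {
        "limited_angle": 1440, "sparse_view": 1440, "bias_field_artifact": 936,
        "undersampling_artifact": 672, "ghosting_artifact": 352,
        "blood_cell_artifact": 304, "dark_spots_artifact": 178,
    },
    "Motion Interference": {"object_rotation": 2718, "object_movement": 274},
    "Intensity Jitter": {"adjust_brightness": 916, "exposure": 518, "reduce_contrast": 412},
    "Noise": {"gaussian_noise": 2680, "low_dose": 1116},
    "Resolution & Blur": {
        "low_resolution": 3046, "motion_blur": 2820, "gaussian_blur": 470, "bubble": 296,
    },
}

# Sum each category, then the grand total over all degraded samples.
category_totals = {name: sum(types.values()) for name, types in counts.items()}
total = sum(category_totals.values())

# Share of each category, as a percentage rounded to two decimals.
ratios = {name: round(100 * n / total, 2) for name, n in category_totals.items()}

for name in counts:
    print(f"{name}: {category_totals[name]} ({ratios[name]}%)")
print("degraded total:", total)
print("with 4,306 clean images:", total + 4306)
```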

Degradation Examples

Visual examples showing how each degradation type affects medical images at different severity levels.

Evaluation Metrics

We combine accuracy and uncertainty metrics to quantify both actual performance and calibration shift.

Actual Performance

Acc(f, D) = (1/|D|) Σ 1[ŷ(x) = y*(x)]

Standard accuracy metric via multiple-choice selection.

Perceived Confidence

C(x) = 1 - H(x) / log K

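Measured via voting-based prediction consistency over T=10 independent trials, where H(x) is the entropy of the model's answers across trials and K is the number of answer options.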
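The confidence score can be sketched as follows. This assumes H(x) is the Shannon entropy of the empirical answer distribution over the T trials and K is the number of multiple-choice options. The function name and the example votes are hypothetical.

```python
import math
from collections import Counter


def perceived_confidence(votes, num_options):
    """C(x) = 1 - H(x) / log K, with H(x) taken as the entropy of the votes.

    votes: the answers chosen across independent trials (e.g. T=10).
    num_options: K, the number of answer choices for the question.
    """
    total = len(votes)
    probs = [count / total for count in Counter(votes).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(num_options)


# Ten trials on a four-option question (illustrative votes only).
unanimous = ["A"] * 10
split = ["A"] * 5 + ["B"] * 5

print(perceived_confidence(unanimous, 4))  # fully consistent answers
print(perceived_confidence(split, 4))      # answers split between two options
```

Under this reading, a model that answers identically in every trial scores 1, and one whose answers are spread uniformly over all K options scores 0.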

Calibration Shift

Δ_calib = (1/|D|) Σ C(x) - Acc(f, D)

Positive values indicate overconfidence (AI Dunning-Kruger Effect).
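As a worked illustration of the two formulas, the sketch below computes accuracy and the calibration shift for a toy dataset. The function names and the sample values are hypothetical, and per-sample confidences are assumed to have been computed already.

```python
def accuracy(predictions, answers):
    """Acc(f, D): fraction of samples where the prediction matches the answer."""
    correct = sum(pred == ans for pred, ans in zip(predictions, answers))
    return correct / len(answers)


def calibration_shift(confidences, predictions, answers):
    """Delta_calib: mean perceived confidence minus accuracy."""
    mean_confidence = sum(confidences) / len(confidences)
    return mean_confidence - accuracy(predictions, answers)


# Toy example: four questions, two answered correctly, all with high confidence.
predictions = ["A", "B", "C", "D"]
answers = ["A", "B", "A", "A"]
confidences = [1.0, 0.8, 0.9, 0.7]

shift = calibration_shift(confidences, predictions, answers)
print(round(accuracy(predictions, answers), 2))
print(round(shift, 2))
```

Here the mean confidence is 0.85 against an accuracy of 0.5, so the shift is positive and the toy model is overconfident.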

40 Models Evaluated

Three groups spanning commercial, open-source general, and medical-specialized models.

Commercial MLLMs

GPT-5, GPT-5.1, GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4o-mini, Gemini-2.5-Pro, Gemini-2.5-Flash, Claude-Sonnet-4.5

Open-Source General

InternVL3-78B, Qwen3-VL-235B, GLM-4.5v, Qwen2.5-VL-72B, Gemma-3(4B), Idefics3(8B), ShowO2(7B), DeepSeek-VL2, BAGEL-MoT, Janus-Pro(7B), and 11 more

Medical-Specialized

UniMedVL, Hulu-Med(7B), Hulu-Med(14B), Lingshu(7B), Lingshu(32B), HealthGPT-L(14B), MedGemma(4B), HealthGPT-XL(32B), HealthGPT-M(3B), MedGemma(27B)
