Benchmark Construction

MedQ-Deg is built upon a dual-tier hierarchy that stratifies both model capabilities and clinical degradation patterns, supported by a human-curated data pipeline.

MedQ-Deg Framework

Fig. 2: Overview of the MedQ-Deg benchmark framework.

1. Capability Hierarchy

A three-tier framework decomposing clinical competence:

  • T1 High-level: Medical Perception & Clinical Reasoning
  • T2 Mid-level: 6 clinical tasks (CU, IP, AR, BS, Diag, Treat)
  • T3 Fine-grained: 30 specific clinical skills

2. Degradation Hierarchy

Two-level organization grounded in clinical imaging physics:

  • Artifacts: 7 types
  • Motion Interference: 2 types
  • Intensity Jitter: 3 types
  • Noise: 2 types
  • Resolution & Blur: 5 types
Each degradation is calibrated at 3 severity levels (L0/L1/L2) by expert radiologists.

3. Data Pipeline

Human-in-the-loop quality assurance:

  1. 4,306 clean medical images collected from 3 top-tier benchmarks
  2. Modality-specific degradations applied at L1 & L2
  3. Each pair independently reviewed by 3 board-certified radiologists
  4. 8.3% removed by the human filter, leaving 24,894 QA pairs
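
As a rough illustration of step 2, the sketch below applies one degradation type (gaussian_noise) to a clean image at the two degraded severity levels. It is a minimal sketch, not the benchmark's implementation: the `NOISE_SIGMA` values and the function names `add_gaussian_noise` and `build_variants` are hypothetical, and the radiologist-calibrated parameters are not given on this page. Images are plain nested lists of intensities in [0, 1] to keep the example dependency-free.

```python
import random

# Hypothetical noise standard deviations per severity level (assumed values,
# not the calibrated parameters used by the benchmark).
NOISE_SIGMA = {"L1": 0.05, "L2": 0.15}


def add_gaussian_noise(image, level, seed=0):
    """Add zero-mean Gaussian noise to an image given as rows of floats in [0, 1]."""
    rng = random.Random(seed)
    sigma = NOISE_SIGMA[level]
    return [
        # Clip each noisy pixel back into the valid intensity range.
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def build_variants(image, seed=0):
    """Return the clean image (L0) alongside its L1 and L2 degraded versions."""
    return {
        "L0": image,
        "L1": add_gaussian_noise(image, "L1", seed),
        "L2": add_gaussian_noise(image, "L2", seed),
    }


def mean_abs_deviation(a, b):
    """Average absolute pixel difference between two same-sized images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)
```

A production pipeline would more likely operate on arrays and use modality-specific operators (for example, sparse-view reconstruction for CT), which this sketch does not attempt.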

Degradation Type Statistics

19 degradation types across 5 categories, spanning 7 medical imaging modalities.

Name                    Parent Category     Count   Ratio in Total  Modality
Artifacts               -                   5,322   25.85%          All
limited_angle           Artifacts           1,440   6.99%           CT
sparse_view             Artifacts           1,440   6.99%           CT
bias_field_artifact     Artifacts           936     4.55%           MRI
undersampling_artifact  Artifacts           672     3.26%           MRI
ghosting_artifact       Artifacts           352     1.71%           MRI
blood_cell_artifact     Artifacts           304     1.48%           Pathology
dark_spots_artifact     Artifacts           178     0.86%           Pathology
Motion Interference     -                   2,992   14.53%          All
object_rotation         Motion              2,718   13.20%          All
object_movement         Motion              274     1.33%           All
Intensity Jitter        -                   1,846   8.97%           All
adjust_brightness       Intensity           916     4.45%           All
exposure                Intensity           518     2.52%           All
reduce_contrast         Intensity           412     2.00%           All
Noise                   -                   3,796   18.44%          All
gaussian_noise          Noise               2,680   13.02%          All
low_dose                Noise               1,116   5.42%           CT
Resolution & Blur       -                   6,632   32.21%          All
low_resolution          Resolution & Blur   3,046   14.80%          All
motion_blur             Resolution & Blur   2,820   13.70%          All
gaussian_blur           Resolution & Blur   470     2.28%           All
bubble                  Resolution & Blur   296     1.44%           Pathology
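
The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the per-type counts from the table, sums them per category, and recomputes the ratio column. The dictionary carries only the 18 type rows that appear in the table. The final check, that the degraded total plus the 4,306 clean images equals the 24,894 QA pairs reported for the pipeline, rests on the assumption that the clean (L0) pairs are counted in that figure; the page does not state this explicitly. The names `COUNTS`, `REPORTED_TOTALS`, `category_totals`, and `ratio_percent` are illustrative.

```python
# Per-type counts copied from the degradation statistics table.
COUNTS = {
    "Artifacts": {
        "limited_angle": 1440, "sparse_view": 1440, "bias_field_artifact": 936,
        "undersampling_artifact": 672, "ghosting_artifact": 352,
        "blood_cell_artifact": 304, "dark_spots_artifact": 178,
    },
    "Motion Interference": {"object_rotation": 2718, "object_movement": 274},
    "Intensity Jitter": {"adjust_brightness": 916, "exposure": 518, "reduce_contrast": 412},
    "Noise": {"gaussian_noise": 2680, "low_dose": 1116},
    "Resolution & Blur": {
        "low_resolution": 3046, "motion_blur": 2820, "gaussian_blur": 470, "bubble": 296,
    },
}

# Category totals as reported in the table's summary rows.
REPORTED_TOTALS = {
    "Artifacts": 5322, "Motion Interference": 2992, "Intensity Jitter": 1846,
    "Noise": 3796, "Resolution & Blur": 6632,
}

CLEAN_IMAGES = 4306     # from the data pipeline description
TOTAL_QA_PAIRS = 24894  # from the data pipeline description


def category_totals(counts):
    """Sum the per-type counts within each category."""
    return {cat: sum(types.values()) for cat, types in counts.items()}


def ratio_percent(count, counts):
    """Share of one count among all degraded samples, in percent."""
    grand_total = sum(category_totals(counts).values())
    return 100.0 * count / grand_total


totals = category_totals(COUNTS)
degraded_total = sum(totals.values())
```

Under these figures the per-category sums agree with the summary rows, and the degraded total plus the clean images matches the reported number of QA pairs.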

Degradation Examples

Visual examples showing how each degradation type affects medical images at different severity levels.

[Figures: Degradation Examples, Pages 1-3]

Evaluation Metrics

We combine accuracy and uncertainty metrics to quantify both actual performance and calibration shift.

Actual Performance

Acc(f, D) = (1/|D|) Σ_{x∈D} 1[ŷ(x) = y*(x)]

Standard multiple-choice accuracy: the fraction of samples in D where the model's selected option ŷ(x) matches the ground-truth answer y*(x).
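
The accuracy formula can be sketched directly. The function name `accuracy` and the use of option letters as labels are illustrative choices, not taken from the benchmark's code.

```python
def accuracy(predictions, answers):
    """Fraction of samples whose predicted option equals the ground-truth option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty dataset")
    # The indicator 1[y_hat(x) = y*(x)] is the boolean comparison below.
    correct = sum(pred == ans for pred, ans in zip(predictions, answers))
    return correct / len(answers)
```

For example, `accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])` counts three matches out of four samples.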

Perceived Confidence

C(x) = 1 - H(x) / log K

Measured via voting-based prediction consistency over T=10 independent trials.
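
One plausible reading of this definition is sketched below. It assumes that H(x) is the Shannon entropy of the empirical answer distribution across the T trials and that K is the number of answer options; the page does not define either symbol, so both are assumptions. The function name `perceived_confidence` is illustrative.

```python
import math
from collections import Counter


def perceived_confidence(votes, num_options):
    """Compute C(x) = 1 - H(x) / log K from the answers of repeated trials.

    votes: the option chosen in each trial (e.g. T = 10 entries).
    num_options: K, the number of answer options (assumed meaning of K).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    if num_options < 2:
        raise ValueError("need at least two options to normalize entropy")
    total = len(votes)
    # Assumed: H(x) is the Shannon entropy of the empirical vote distribution.
    entropy = -sum(
        (count / total) * math.log(count / total)
        for count in Counter(votes).values()
    )
    return 1.0 - entropy / math.log(num_options)
```

Under this reading, ten identical answers give a confidence of 1, while answers spread evenly over all K options give a confidence of 0.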

Calibration Shift

Δcalib = [(1/|D|) Σ_{x∈D} C(x)] - Acc(f, D)

Positive values indicate overconfidence (AI Dunning-Kruger Effect).
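
The calibration shift is the mean perceived confidence minus accuracy. A minimal sketch follows, assuming per-sample confidences C(x) and per-sample correctness flags are already available; the function name `calibration_shift` is illustrative.

```python
def calibration_shift(confidences, correct):
    """Mean perceived confidence minus accuracy over the same samples.

    confidences: C(x) for each sample, in [0, 1].
    correct: booleans, True where the model's answer matched the ground truth.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("cannot compute calibration shift on an empty dataset")
    mean_confidence = sum(confidences) / len(confidences)
    acc = sum(correct) / len(correct)
    # Positive result: the model is more confident than it is accurate.
    return mean_confidence - acc
```

For instance, a mean confidence of 0.85 with an accuracy of 0.5 yields a shift of about 0.35, which would count as overconfidence.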

40 Models Evaluated

Three groups spanning commercial, open-source general, and medical-specialized models.

Commercial MLLMs (9)

GPT-5, GPT-5.1, GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4o-mini, Gemini-2.5-Pro, Gemini-2.5-Flash, Claude-Sonnet-4.5

Open-Source General (21)

InternVL3-78B, Qwen3-VL-235B, GLM-4.5v, Qwen2.5-VL-72B, Gemma-3(4B), Idefics3(8B), ShowO2(7B), DeepSeek-VL2, BAGEL-MoT, Janus-Pro(7B), +11 more

Medical-Specialized (10)

UniMedVL, Hulu-Med(7B), Hulu-Med(14B), Lingshu(7B), Lingshu(32B), HealthGPT-L(14B), MedGemma(4B), HealthGPT-XL(32B), HealthGPT-M(3B), MedGemma(27B)

Explore the Results

See how 40 MLLMs perform under medical image quality degradations.