Benchmark Construction
MedQ-Deg is built on a dual hierarchy, stratifying both model capabilities and clinical degradation patterns, and is supported by a human-curated data pipeline.
Fig. 2: Overview of the MedQ-Deg benchmark framework.
Capability Hierarchy
A three-tier framework decomposing clinical competence:
- T1 High-level: Medical Perception & Clinical Reasoning
- T2 Mid-level: 6 clinical tasks (CU, IP, AR, BS, Diag, Treat)
- T3 Fine-grained: 30 specific clinical skills
Degradation Hierarchy
Two-level organization grounded in clinical imaging physics:
- Artifacts: 7 types
- Motion Interference: 2 types
- Intensity Jitter: 3 types
- Noise: 2 types
- Resolution & Blur: 5 types
Data Pipeline
Human-in-the-loop quality assurance:
1. 4,306 clean medical images collected from 3 top-tier benchmarks
2. Modality-specific degradations applied at two severity levels (L1 & L2)
3. 3 board-certified radiologists independently reviewed each clean/degraded pair
4. 8.3% of pairs removed by the human filter, yielding 24,894 QA pairs
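The pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_pairs`, the dictionary schema, and the `"L1"`/`"L2"` severity labels are assumptions based on the steps listed.

```python
# Hypothetical sketch of the MedQ-Deg pairing step: each clean image is
# combined with every degradation type valid for its modality, at two
# severity levels. Field names and severity labels are assumptions.
def build_pairs(clean_images, degradations_by_modality):
    """Pair each clean image with modality-appropriate degraded versions."""
    pairs = []
    for img in clean_images:
        for deg in degradations_by_modality[img["modality"]]:
            for severity in ("L1", "L2"):  # two severity levels per type
                pairs.append({
                    "clean": img["id"],
                    "degradation": deg,
                    "severity": severity,
                })
    return pairs
```

In the real pipeline, each generated pair would then pass through the radiologist review before entering the final QA set.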
Degradation Type Statistics
19 degradation types across 5 categories, spanning 7 medical imaging modalities.
| Name | Parent Category | Count | Share of Total | Modality |
|---|---|---|---|---|
| Artifacts | - | 5,322 | 25.85% | All |
| limited_angle | Artifacts | 1,440 | 6.99% | CT |
| sparse_view | Artifacts | 1,440 | 6.99% | CT |
| bias_field_artifact | Artifacts | 936 | 4.55% | MRI |
| undersampling_artifact | Artifacts | 672 | 3.26% | MRI |
| ghosting_artifact | Artifacts | 352 | 1.71% | MRI |
| blood_cell_artifact | Artifacts | 304 | 1.48% | Pathology |
| dark_spots_artifact | Artifacts | 178 | 0.86% | Pathology |
| Motion Interference | - | 2,992 | 14.53% | All |
| object_rotation | Motion Interference | 2,718 | 13.20% | All |
| object_movement | Motion Interference | 274 | 1.33% | All |
| Intensity Jitter | - | 1,846 | 8.97% | All |
| adjust_brightness | Intensity Jitter | 916 | 4.45% | All |
| exposure | Intensity Jitter | 518 | 2.52% | All |
| reduce_contrast | Intensity Jitter | 412 | 2.00% | All |
| Noise | - | 3,796 | 18.44% | All |
| gaussian_noise | Noise | 2,680 | 13.02% | All |
| low_dose | Noise | 1,116 | 5.42% | CT |
| Resolution & Blur | - | 6,632 | 32.21% | All |
| low_resolution | Resolution & Blur | 3,046 | 14.80% | All |
| motion_blur | Resolution & Blur | 2,820 | 13.70% | All |
| gaussian_blur | Resolution & Blur | 470 | 2.28% | All |
| bubble | Resolution & Blur | 296 | 1.44% | Pathology |
Degradation Examples
Visual examples showing how each degradation type affects medical images at different severity levels.
Evaluation Metrics
We combine an accuracy metric with an uncertainty metric to quantify both actual performance and calibration shift.
Actual Performance
Standard accuracy, measured via multiple-choice answer selection.
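As a sketch of this metric (a minimal illustration, assuming predictions and answers are option letters such as "A"–"D"):

```python
def accuracy(predictions, answers):
    """Fraction of questions where the chosen option matches the key.

    A minimal sketch of standard multiple-choice accuracy; the benchmark's
    answer-extraction details (parsing model output into a letter) are omitted.
    """
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```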
Perceived Confidence
Measured via voting-based prediction consistency over T=10 independent trials.
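One plausible reading of voting-based consistency, sketched below: over T independent trials per question, confidence is the share of trials agreeing with the majority answer. The exact aggregation used by the benchmark may differ.

```python
from collections import Counter

def perceived_confidence(trial_predictions):
    """Majority-vote consistency for one question.

    trial_predictions: the answers from T independent trials (T = 10 in
    the benchmark). Returns the fraction of trials that agree with the
    modal answer. A sketch, not the authors' exact formulation.
    """
    top_answer, top_count = Counter(trial_predictions).most_common(1)[0]
    return top_count / len(trial_predictions)
```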
Calibration Shift
Positive values indicate overconfidence (AI Dunning-Kruger Effect).
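A calibration shift where positive values mean overconfidence suggests confidence minus accuracy; the sketch below computes it per question under that assumption (the benchmark's exact formula, e.g. dataset-level averaging, may differ).

```python
from collections import Counter

def calibration_shift(trial_predictions, answer):
    """Per-question calibration shift, assumed to be perceived confidence
    (majority-vote share over T trials) minus correctness of the majority
    answer. Positive values indicate overconfidence; a sketch only.
    """
    top_answer, top_count = Counter(trial_predictions).most_common(1)[0]
    confidence = top_count / len(trial_predictions)
    correct = 1.0 if top_answer == answer else 0.0
    return confidence - correct
```

For example, a model that answers "A" in 8 of 10 trials when the key is "B" is highly consistent yet wrong, producing a large positive shift.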
40 Models Evaluated
Three groups spanning commercial, open-source general, and medical-specialized models.
Commercial MLLMs
Open-Source General
Medical-Specialized
Explore the Results
See how 40 MLLMs perform under medical image quality degradations.