MedQ-Deg
A Multidimensional Benchmark for Evaluating MLLMs Across Medical Image Quality Degradations
Benchmark Framework
Two orthogonal hierarchies structure the evaluation: a capability hierarchy decomposing clinical competence into 30 fine-grained skills, and a degradation hierarchy covering 19 degradation types across 7 modalities.
Fig. 2: Overview of the MedQ-Deg benchmark framework. Left: Medical MLLM Capability Hierarchy. Middle: Benchmark Construction pipeline. Right: Medical Image Degradation Hierarchy.
Key Contributions
Comprehensive Dataset
24,894 QA pairs across 18 degradation types, 7 medical imaging modalities, and 30 fine-grained clinical skills
Extensive Evaluation
40 MLLMs evaluated with the Calibration Shift metric to assess model reliability and confidence calibration
AI Dunning-Kruger Effect
Models with lower performance often exhibit higher confidence, a pattern we call the AI Dunning-Kruger Effect
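To make the confidence-versus-performance pattern concrete, here is a minimal sketch of one way to measure how overconfidence changes between clean and degraded inputs. It is not the benchmark's Calibration Shift definition, which this page does not spell out. The function names `overconfidence_gap` and `gap_shift` and the toy data are hypothetical, and the sketch assumes each answer comes with a correctness flag and a stated confidence in [0, 1].

```python
from statistics import mean


def overconfidence_gap(correct, confidence):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    if len(correct) != len(confidence) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    accuracy = mean(1.0 if c else 0.0 for c in correct)
    return mean(confidence) - accuracy


def gap_shift(clean, degraded):
    """Change in the overconfidence gap when moving from clean to degraded images.

    Each argument is a (correct_flags, confidences) pair.
    """
    return overconfidence_gap(*degraded) - overconfidence_gap(*clean)


# Hypothetical toy data, not benchmark results.
clean = ([True, True, True, False], [0.9, 0.8, 0.9, 0.6])
degraded = ([True, False, False, False], [0.9, 0.8, 0.9, 0.8])

print(round(gap_shift(clean, degraded), 3))
```

A positive shift means accuracy fell on degraded images while stated confidence did not fall with it, which is the behavior the effect describes.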
Degradation Categories
19 degradation types in 5 major categories, each calibrated at 3 severity levels by expert radiologists.
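As an illustration of what a severity-graded degradation can look like, the sketch below applies additive Gaussian noise at three levels to a toy grayscale image. Gaussian noise is used only as a generic example and is not claimed to be one of the benchmark's degradation types. The `SEVERITY_SIGMA` values are hypothetical placeholders, not the radiologist-calibrated settings.

```python
import random

# Hypothetical noise levels per severity; the benchmark's actual levels
# were calibrated by expert radiologists and are not reproduced here.
SEVERITY_SIGMA = {1: 0.05, 2: 0.15, 3: 0.30}


def add_gaussian_noise(image, severity, seed=0):
    """Return a noisy copy of a grayscale image with pixel values in [0, 1]."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    sigma = SEVERITY_SIGMA[severity]
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def mean_abs_error(a, b):
    """Average absolute pixel difference between two same-sized images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Toy 32x32 mid-gray image standing in for a medical scan.
clean = [[0.5] * 32 for _ in range(32)]

for level in (1, 2, 3):
    noisy = add_gaussian_noise(clean, level)
    print(level, round(mean_abs_error(clean, noisy), 3))
```

The printed error grows with the severity level, which is the property a graded degradation scale needs before the same question can be asked at each level.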
Ready to Explore?
Dive into our comprehensive medical image degradation benchmark.