Medical Image Segmentation

STU-Net: Scalable and Transferable Medical Image Segmentation Models

A family of scalable U-Net models ranging from 14M to 1.4B parameters, pre-trained on TotalSegmentator for universal medical image segmentation

Led by Shanghai AI Laboratory in collaboration with Shanghai Jiao Tong University.
GitHub · arXiv Paper
🏆 MICCAI 2023 ATLAS Challenge — Champion
🏆 MICCAI 2023 SPPIN Challenge — Champion
🥈 MICCAI 2023 AutoPET II Challenge — Runner-up (Highest DSC)
🥈 MICCAI 2023 BraTS2023 — Runner-up (+ two 3rd-place finishes)
🥉 FLARE 2023 — 3rd Place
STU-Net architecture overview
Figure 1. STU-Net architecture overview. (a) Encoder-decoder structure with residual blocks. (b) Residual block design. (c) Downsampling block with dual-branch shortcut. (d) Stem module for input channel conversion. (e) Segmentation head. (f) Nearest-interpolation upsampling block for transferable weights.

Large-scale pre-trained models have transformed natural language processing and computer vision, yet medical image segmentation has remained dominated by small-scale models with only tens of millions of parameters. Whether these models could be scaled up by orders of magnitude, and whether larger models would actually transfer better across clinical tasks, were open questions before STU-Net.

We designed a series of Scalable and Transferable U-Net (STU-Net) models with parameter counts ranging from 14M (STU-Net-S) to 1.4B (STU-Net-H). STU-Net-H is the largest medical image segmentation model to date. All variants are built on the nnU-Net framework with key architectural refinements: residual connections for deep scalability, and weight-free interpolation-based upsampling to eliminate the weight-mismatch problem during transfer learning.
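
The interpolation-based upsampling is what makes the decoder weights portable: unlike a transposed convolution, nearest-neighbour interpolation carries no learnable parameters tied to the scale factor. A minimal PyTorch sketch of the idea (our own simplification of the block in Figure 1f, with hypothetical names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpUpBlock(nn.Module):
    """Weight-free upsampling: nearest interpolation + 1x1x1 channel projection.

    Interpolation has no parameters that depend on the scale factor, so
    pre-trained weights transfer even when a downstream task needs a
    different upsampling ratio (no shape mismatch, unlike transposed conv).
    """

    def __init__(self, in_ch: int, out_ch: int, scale=(2, 2, 2)):
        super().__init__()
        self.scale = scale
        # The 1x1x1 conv only adjusts channel count; its weight shape is
        # independent of the spatial scale factor.
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.proj(x)

feat = torch.randn(1, 64, 8, 8, 8)   # (batch, channels, D, H, W)
up = InterpUpBlock(64, 32)
print(up(feat).shape)                 # torch.Size([1, 32, 16, 16, 16])
```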

Pre-trained on TotalSegmentator — 1,204 CT volumes covering 104 anatomical structures — STU-Net demonstrates that scaling consistently improves segmentation accuracy. On the TotalSegmentator benchmark, STU-Net-H achieves 90.06% mean DSC, outperforming all CNN and Transformer competitors. Its transferability extends to 14 downstream datasets for direct inference and 3 datasets for fine-tuning, covering diverse modalities (CT, MRI, PET) and segmentation targets.
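
All accuracy figures below are mean Dice similarity coefficients (DSC). For reference, a minimal per-class Dice computation; this is our own sketch, not the benchmark's evaluation code:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

# Mean DSC over a benchmark averages the per-class, per-case scores.
pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1
gt   = np.zeros((4, 4), dtype=int); gt[1:4, 1:4]  = 1
print(f"{dice_score(pred, gt):.3f}")  # 0.615 = 2*4 / (4 + 9)
```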

🌟 Core Highlights

Segmentation performance vs FLOPs on TotalSegmentator
Figure 2. Segmentation performance vs. computational cost (FLOPs) on TotalSegmentator. Bubble area is proportional to FLOPs. STU-Net consistently outperforms nnU-Net, nnFormer, UNETR, and SwinUNETR at every scale.

01 — Scalability: Four Model Sizes from 14M to 1.4B Parameters

STU-Net comes in four sizes — S (14.6M), B (58.3M), L (440M), and H (1.46B parameters). The scaling strategy jointly increases network depth and width, which outperforms scaling either dimension alone. STU-Net-B already surpasses nnU-Net by 0.36% and SwinUNETR-B by 4.48% in mean DSC on TotalSegmentator. STU-Net-H achieves 90.06% mean DSC — the highest ever reported on this benchmark.

The architectural refinements make scaling possible: residual connections in each block mitigate vanishing gradients in very deep networks, while the fixed 6-stage, isotropic-kernel configuration ensures that pre-trained weights are reusable across tasks without shape mismatch.
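
A minimal sketch of a residual block with isotropic 3x3x3 kernels, assuming nnU-Net's usual InstanceNorm + LeakyReLU layout (the exact layout inside STU-Net may differ; see Figure 1b):

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """Residual block with isotropic 3x3x3 kernels (cf. Figure 1b).

    The identity shortcut keeps gradients flowing through very deep
    encoders, which is what makes scaling toward ~1.4B parameters feasible.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.InstanceNorm3d(channels),
            nn.LeakyReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.InstanceNorm3d(channels),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # identity shortcut
```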

02 — Transferability: Strong Zero-Shot and Fine-tuned Performance Across 17 Datasets

Pre-trained on TotalSegmentator, STU-Net can directly infer on 14 downstream CT datasets containing a subset of the 104 pre-training classes — no additional training required. Across these 14 datasets (2,494 cases total), STU-Net-H achieves 84.02% mean DSC vs. nnU-Net's 76.37%, a gain of 7.65%.
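
In practice, direct inference only requires remapping the model's 104-class output onto the label set a downstream dataset actually annotates. A hedged sketch; the class indices below are hypothetical, the real ones come from the TotalSegmentator label definitions:

```python
import numpy as np

# Hypothetical mapping from pre-training class indices (out of 104) to a
# downstream dataset's label ids; consult the TotalSegmentator label
# definitions for the real indices.
PRETRAIN_TO_DOWNSTREAM = {5: 1, 6: 2, 17: 3}

def remap_prediction(pred: np.ndarray) -> np.ndarray:
    """Keep only the classes the downstream dataset annotates."""
    out = np.zeros_like(pred)
    for src, dst in PRETRAIN_TO_DOWNSTREAM.items():
        out[pred == src] = dst
    return out

pred = np.array([[0, 5, 5], [17, 6, 99]])   # raw 104-class prediction
print(remap_prediction(pred))                # [[0 1 1] [3 2 0]]
```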

For fine-tuning on three challenging downstream datasets — FLARE22, AMOS22 (CT + MR), and AutoPET22 (CT + PET) — STU-Net-H-ft reaches 80.69% mean DSC vs. nnU-Net's 77.06%. Remarkably, fine-tuning on non-CT modalities (MRI, PET) also benefits from CT pre-training, suggesting the model captures fundamental anatomical structures that generalise beyond modality-specific features.

Qualitative segmentation results across FLARE22, AMOS-CT, AMOS-MR, AutoPET
Figure 3. Qualitative segmentation results on FLARE22 (Row 1), AMOS-CT (Row 2), AMOS-MR (Row 3), AutoPET-CT (Row 4), and AutoPET-PET (Row 5). Larger STU-Net models produce cleaner boundaries and fewer missed structures.

TotalSegmentator validation results table across 5 anatomical sub-groups
Table 1. Segmentation results on the TotalSegmentator validation set across 5 anatomical sub-groups and all 104 classes. STU-Net-H achieves the best results in every category.

03 — State-of-the-Art Performance on TotalSegmentator

On the TotalSegmentator validation set — the largest publicly available CT segmentation benchmark with 104 structure annotations across organs, vertebrae, cardiac structures, muscles, and ribs — STU-Net-H achieves 90.06% mean DSC. This surpasses the previous best CNN model (nnU-Net: 86.76%) by +3.3% and the best Transformer model (SwinUNETR-B: 82.64%) by +7.4%.

The improvement is consistent across all five anatomical sub-groups, with the most notable gains in vertebrae (nnU-Net: 86.97% → STU-Net-H: 90.43%) and ribs (nnU-Net: 86.11% → STU-Net-H: 90.29%). This demonstrates that scaling genuinely improves comprehensiveness, not just overall average performance.

04 — Universal Models Surpass Specialist Models at Scale

A long-standing assumption in medical image segmentation is that specialist models — trained on a single category group — outperform universal models handling all classes simultaneously. STU-Net challenges this assumption.

We trained five specialist models (organs, vertebrae, cardiac, muscles, ribs) and compared them against a single universal STU-Net trained on all 104 classes. At the STU-Net-H scale (1.4B parameters), the universal model achieves 90.06% overall mean DSC, surpassing the best specialist ensemble (89.07%). This suggests that at sufficient scale, a single unified model can simultaneously master all segmentation targets — a key step toward a true medical segmentation foundation model.

Universal STU-Net vs five category-specific expert models
Figure 4. Universal STU-Net vs. five category-specific expert models. At STU-Net-H scale, the universal model surpasses all expert models with 90.06% overall mean DSC.

05 — Model Variants: Jointly Scaling Depth and Width

The four STU-Net variants are defined by systematic joint scaling of encoder depth and channel width: S (14.6M params, 12.8B FLOPs), B (58.3M, 60.9B), L (440M, 416B), and H (1.46B, 1,623B). Empirical ablations show that depth-only or width-only scaling yields diminishing returns compared to balanced joint scaling. Despite the 100× parameter gap between S and H, all variants share an identical 6-stage encoder-decoder topology and isotropic kernel configuration — this design constraint enables true weight transferability without shape-mismatch adapters.
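
For reference, the variant line-up from the text in one place (a trivial sketch; the actual per-stage block counts and channel widths live in the paper and repository):

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    params: str   # parameter count, from the figures quoted above
    flops: str    # forward-pass FLOPs, from the figures quoted above

# All four variants share the same 6-stage topology; each step up the
# family jointly increases blocks per stage (depth) and channels (width).
VARIANTS = [
    Variant("S", "14.6M", "12.8B"),
    Variant("B", "58.3M", "60.9B"),
    Variant("L", "440M", "416B"),
    Variant("H", "1.46B", "1,623B"),
]

for v in VARIANTS:
    print(f"STU-Net-{v.name}: {v.params} params, {v.flops} FLOPs")
```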

STU-Net model variants: S/B/L/H parameters, FLOPs, and DSC comparison
Table 2. STU-Net model variants (S / B / L / H) with parameter counts, FLOPs, and mean DSC on TotalSegmentator. Jointly scaling depth and width outperforms scaling either dimension alone.

06 — Cross-Modality Transfer: Fine-Tuning on Downstream Datasets

On three challenging fine-tuning benchmarks — FLARE22 (13 abdominal organs), AMOS22 (CT + MRI, 15 organs), and AutoPET22 (CT + PET lesion segmentation) — STU-Net-H fine-tuned from pre-trained weights consistently outperforms nnU-Net trained from scratch with random initialization. The cross-modality transfer result is particularly notable: STU-Net-H pre-trained on CT only, when fine-tuned on AMOS-MRI and AutoPET-PET, achieves higher DSC than nnU-Net trained from scratch on those modalities, suggesting the pre-trained weights encode modality-agnostic anatomical priors.
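
A hedged sketch of the transfer recipe: load the CT pre-trained backbone, re-initialize only the segmentation head for the downstream class count, then fine-tune end to end. The checkpoint format and the attribute names (seg_head, head_in_channels) are hypothetical, not the repository's actual API:

```python
import torch
import torch.nn as nn

def load_for_finetuning(model: nn.Module, ckpt_path: str, num_classes: int) -> nn.Module:
    """Load pre-trained weights, re-initializing only the segmentation head.

    The head's output channels depend on the class count (104 during
    pre-training), so it cannot be reused on a task with a different label
    set; everything else transfers thanks to the fixed 6-stage topology.
    'seg_head' and 'head_in_channels' are hypothetical names.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    # Drop head weights whose shape depends on the old class count.
    state = {k: v for k, v in state.items() if not k.startswith("seg_head")}
    model.load_state_dict(state, strict=False)  # strict=False: head is absent
    model.seg_head = nn.Conv3d(model.head_in_channels, num_classes, kernel_size=1)
    return model
```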

Fine-tuning results on FLARE22, AMOS22, and AutoPET22
Table 3. Fine-tuning results on FLARE22, AMOS22 (CT + MRI), and AutoPET22 (CT + PET). STU-Net-H-ft surpasses nnU-Net on all three datasets, including non-CT modalities, demonstrating cross-modality transferability.

Conclusion

STU-Net establishes that the scaling laws observed in natural language and computer vision also apply to 3D medical image segmentation. With 1.4B parameters and strong transferability across 17 datasets spanning CT, MRI, and PET, STU-Net-H represents the current frontier of universal medical segmentation and a building block for foundation models on the path toward Medical Artificial General Intelligence (MedAGI).

Key Contributions

Scalability: four U-Net variants from 14M to 1.4B parameters, built on the nnU-Net framework by jointly scaling network depth and width.
Transferability: residual blocks and weight-free interpolation-based upsampling keep pre-trained weights reusable across tasks without shape mismatch.
State-of-the-art accuracy: 90.06% mean DSC on TotalSegmentator, plus strong direct-inference results on 14 datasets and fine-tuning gains on 3 more across CT, MRI, and PET.
Universal over specialist: at sufficient scale, a single model trained on all 104 classes surpasses category-specific expert models.

Authors

Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaoting Zhang, Yu Qiao

arXiv 2023

GitHub Repository · arXiv Paper