Surgical AI · Ophthalmic Surgery

OphCLIP

Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Led by Shanghai AI Laboratory and Monash University, in collaboration with University of Strasbourg, TU München, University of Michigan, Hong Kong Polytechnic University, Shanghai Jiao Tong University, and Xiamen University.
GitHub · arXiv Paper
ICCV 2025
OphCLIP: dataset scale comparison and zero-shot accuracy gains over CLIP
Figure 1. Top: OphVL is 15× larger than existing surgical VLP datasets, with 375K clip-text pairs from 44K videos. Bottom: OphCLIP (dark blue) consistently outperforms vanilla CLIP and CLIP* (CLIP fine-tuned on OphVL) across phase recognition benchmarks.

Vision-language pretraining (VLP) has enabled open-world generalisation beyond predefined labels — a critical capability in surgery, where the diversity of procedures, instruments, and patient anatomies makes fixed label sets impractical. However, applying VLP to ophthalmic surgery presents unique challenges: limited vision-language data, intricate procedural workflows spanning hours, and the need for hierarchical understanding from fine-grained surgical actions to global clinical reasoning.

To address these challenges, we introduce OphVL — the first large-scale, hierarchically structured VLP dataset for ophthalmic surgery, containing over 375K video-text pairs from 7.5K hours of surgical video. This makes OphVL 15× larger than existing surgical VLP datasets. It captures a diverse range of attributes: surgical phases, operations, instruments, medications, disease causes, surgical objectives, and postoperative care.

Building on OphVL, we propose OphCLIP, a hierarchical retrieval-augmented VLP framework. OphCLIP learns short-term representations by aligning video clips with detailed narrations, and long-term representations by matching full videos with structured title summaries. It further leverages a knowledge base of silent surgical videos through retrieval-based supervision. Evaluations across 11 benchmark datasets for phase recognition and multi-instrument identification demonstrate OphCLIP's robust zero-shot generalisation, establishing it as a foundation model for ophthalmic surgery.

🌟 Core Highlights

01 — OphVL: The Largest Ophthalmic Surgical VLP Dataset

OphVL contains 375K clip-text pairs extracted from 13,654 narrated videos and 30,636 silent videos (totalling 9,363 hours). Working with three practising ophthalmologists, we compiled over 3,000 ophthalmic surgery terms to guide YouTube channel discovery and video collection. A rigorous curation pipeline, combining narrative-style filtering, ASR transcription with Whisper Large-V3, denoising via SurgicBERTa, and LLM-guided text rewriting, produces high-quality hierarchical video-text pairs.

OphVL captures tens of thousands of attribute combinations spanning surgeries, phases, instruments, medications, eye disease causes, surgical objectives, and postoperative care — making it the most comprehensive ophthalmic surgical dataset to date.

OphVL construction pipeline: video collection, filtering, ASR transcription, text denoising, and clip-text pairing
Figure 2. OphVL construction pipeline — from YouTube channel discovery and narrative-style filtering through ASR transcription, SurgicBERTa-based denoising, LLM text extraction, to hierarchical clip-text pair generation. The pyramid (left) shows the structured surgical concept hierarchy.
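
To make the curation pipeline concrete, below is a minimal sketch of the transcription and clip-text pairing stage, assuming the open-source openai-whisper package. The SurgicBERTa denoising and LLM rewriting stages appear only as a labelled placeholder, and all function names are illustrative rather than taken from the released OphVL tooling.

```python
# Illustrative sketch only: real ASR via openai-whisper, placeholder
# denoising/rewriting. Function names are hypothetical, not from OphVL's code.
import whisper


def transcribe_narration(video_path: str) -> list[dict]:
    """Run ASR on a narrated surgical video and return timestamped segments."""
    model = whisper.load_model("large-v3")  # Whisper Large-V3, as used for OphVL
    result = model.transcribe(video_path)
    # Each segment carries 'start', 'end', and 'text' for clip-text pairing.
    return result["segments"]


def denoise_and_rewrite(segment_text: str) -> str:
    """Placeholder for SurgicBERTa-based denoising and LLM-guided rewriting."""
    # In the actual pipeline, non-surgical chatter is filtered out and the
    # remaining narration is rewritten into clean, structured descriptions.
    return segment_text.strip()


def build_clip_text_pairs(video_path: str) -> list[tuple[tuple[float, float], str]]:
    """Pair each retained narration segment with its (start, end) time span."""
    pairs = []
    for seg in transcribe_narration(video_path):
        text = denoise_and_rewrite(seg["text"])
        if text:  # keep only informative segments
            pairs.append(((seg["start"], seg["end"]), text))
    return pairs
```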

02 — Hierarchical Retrieval-Augmented Pretraining

OphCLIP performs vision-language pretraining at two hierarchical levels. At the clip level, short video segments are aligned with detailed narration texts via contrastive learning, capturing fine-grained surgical actions and instrument usage. At the video level, entire procedure videos are matched with high-level title summaries, building long-term procedural context and clinical reasoning.
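
Both levels can be read as a standard symmetric contrastive (InfoNCE) objective applied to different granularities of video and text. The sketch below is a simplified illustration with random placeholder embeddings; the batch sizes, embedding dimension, and temperature are assumptions, not the paper's training configuration.

```python
# Minimal sketch of the two-level contrastive objective with placeholder tensors.
import torch
import torch.nn.functional as F


def info_nce(visual: torch.Tensor, text: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between matched batches of visual and text embeddings."""
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(visual.size(0), device=visual.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Clip level: short video segments aligned with detailed narrations.
clip_emb, narration_emb = torch.randn(32, 512), torch.randn(32, 512)
# Video level: whole procedures aligned with structured title summaries.
video_emb, title_emb = torch.randn(8, 512), torch.randn(8, 512)

loss = info_nce(clip_emb, narration_emb) + info_nce(video_emb, title_emb)
```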

The key innovation is silent-video retrieval augmentation: a dynamically updated memory bank stores multi-modal embeddings of 30K+ silent surgical videos. Using maximum inner product search (MIPS), the system retrieves the top-K most relevant silent videos for each narrative video, adding them as auxiliary supervisory signals. This facilitates knowledge transfer across narrated and silent procedure videos — mirroring how surgeons learn from both explained and unexplained surgical footage.
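
A minimal sketch of how such a retrieval-augmented memory bank could operate is given below; the bank size, embedding dimension, and top-K value are illustrative, and the class is an assumption rather than the paper's implementation.

```python
# Hypothetical sketch of the silent-video memory bank queried by maximum
# inner product search (MIPS); shapes and k are illustrative placeholders.
import torch


class SilentVideoBank:
    def __init__(self, num_videos: int, dim: int):
        # One embedding slot per silent video, refreshed as videos are re-encoded.
        self.embeddings = torch.zeros(num_videos, dim)

    def update(self, indices: torch.Tensor, new_embeddings: torch.Tensor) -> None:
        """Overwrite stored embeddings for the silent videos just re-encoded."""
        self.embeddings[indices] = new_embeddings

    def retrieve(self, query: torch.Tensor, k: int = 5) -> torch.Tensor:
        """Return the top-k silent-video embeddings by inner product with the query."""
        scores = self.embeddings @ query  # MIPS scores, shape (num_videos,)
        topk = torch.topk(scores, k).indices
        return self.embeddings[topk]


# A narrated video's embedding retrieves auxiliary supervision from silent videos.
bank = SilentVideoBank(num_videos=30_000, dim=512)
bank.update(torch.arange(30_000), torch.randn(30_000, 512))
neighbours = bank.retrieve(query=torch.randn(512), k=5)  # (5, 512) auxiliary targets
```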

OphCLIP framework: clip-level and video-level pretraining with silent video knowledge base
Figure 3. OphCLIP framework. Narrative videos are processed at clip level (narration alignment) and video level (title alignment). Silent videos form a dynamically updated knowledge base; maximum inner product search retrieves top-K relevant entries to augment video-level pretraining.

03 — State-of-the-Art Zero-Shot Surgical Understanding

OphCLIP achieves strong zero-shot performance across 11 benchmark datasets covering phase recognition and multi-instrument identification. On Cataract-1K, OphCLIP reaches 62.8% accuracy / 48.5% F1 — vs. vanilla CLIP's 6.9%/2.0%. On Cat-21, it achieves 41.4% / 28.8% — nearly tripling CLIP's performance. These gains hold across both fine-grained (operation-level) and coarse-grained (phase-level) tasks.
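
Zero-shot phase recognition follows the usual CLIP recipe: each candidate phase is rendered as a text prompt, and a frame or clip is assigned to the prompt whose embedding it matches most closely. The sketch below uses placeholder embeddings and hypothetical prompt wording in place of OphCLIP's actual encoders and templates.

```python
# Illustrative zero-shot classification; embeddings are random stand-ins for
# the outputs of OphCLIP's visual and text towers.
import torch
import torch.nn.functional as F

phases = ["incision", "capsulorhexis", "phacoemulsification",
          "lens insertion", "wound closure"]
prompts = [f"a video frame of the {p} phase in cataract surgery" for p in phases]

text_emb = F.normalize(torch.randn(len(prompts), 512), dim=-1)  # one per prompt
frame_emb = F.normalize(torch.randn(1, 512), dim=-1)            # one query frame

similarity = frame_emb @ text_emb.t()                  # (1, num_phases)
predicted = phases[similarity.argmax(dim=-1).item()]   # highest-similarity phase wins
```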

For multi-instrument recognition, OphCLIP dramatically reduces false positive rates from 100% (CLIP) to 45–51%, while improving mAP. In linear-probe evaluation with 100% of the training data, OphCLIP reaches 72.1% accuracy on Cat-21, demonstrating strong transferable visual representations.
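
The linear-probe protocol itself is simple: the visual backbone stays frozen and only a single classifier is trained on its features. A minimal sketch, assuming precomputed feature arrays and scikit-learn, is shown below.

```python
# Linear probing on frozen features; the random arrays are placeholders for
# features extracted with a frozen OphCLIP visual encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
test_feats, test_labels = rng.normal(size=(200, 512)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000)  # only this linear layer is trained
probe.fit(train_feats, train_labels)
accuracy = probe.score(test_feats, test_labels)
```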

Zero-shot performance comparison table across phase recognition and instrument identification tasks
Table 1. Zero-shot comparison on phase recognition (accuracy/F1) and instrument identification (FPR/mAP). OphCLIP consistently outperforms CLIP, SLIP, and LaCLIP across all benchmarks.

04 — Clinically Meaningful Attention Patterns

Attention map visualisations reveal that OphCLIP learns clinically meaningful focus patterns. For phase recognition (e.g., phacoemulsification), OphCLIP focuses on the relevant instruments and anatomical structures, while vanilla CLIP attends to irrelevant background regions. For instrument recognition, pretraining on OphVL enables both CLIP* and OphCLIP to consistently attend to domain-specific tools like the lens injector.

This cross-modal understanding — where the model prioritises clinically relevant regions in visual data that correspond to surgical concepts in text — demonstrates that OphCLIP has genuinely internalised ophthalmic surgical knowledge rather than relying on spurious visual correlations.

Attention map visualisations: CLIP vs CLIP* vs OphCLIP for phase and instrument recognition
Figure 4. Attention map comparison on unseen Cataract-1K dataset. Left: For phase recognition ("phacoemulsification"), OphCLIP attends to instruments and anatomy. Right: For instrument recognition, OphCLIP correctly focuses on the lens injector. Vanilla CLIP shows diffuse, non-specific attention.

Conclusion

OphCLIP establishes a new paradigm for ophthalmic surgical AI by combining the largest surgical VLP dataset (OphVL, 375K pairs) with a hierarchical retrieval-augmented framework that learns from both narrated and silent surgical videos. Its robust zero-shot performance across 11 benchmarks — coupled with clinically meaningful attention patterns — positions OphCLIP as a foundation model for ophthalmic surgical workflow understanding, opening avenues for more specialised and context-aware AI in eye surgery.

Key Contributions

OphVL: the largest and most comprehensive vision-language pretraining dataset for ophthalmic surgery, with over 375K hierarchically structured clip-text pairs from 44K narrated and silent videos.

OphCLIP: a hierarchical retrieval-augmented pretraining framework that aligns clips with narrations and full videos with title summaries, and transfers knowledge from a memory bank of silent surgical videos.

Robust zero-shot generalisation: state-of-the-art phase recognition and multi-instrument identification across 11 benchmark datasets.

Authors

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He, Zongyuan Ge

ICCV 2025

GitHub Repository · arXiv Paper