Abstract
Video Object Segmentation (VOS) underpins applications from robotic surgery and wildfire monitoring to autonomous driving, yet progress is often limited by the cost of dense video annotation and the difficulty of generalizing without labels. This thesis addresses both data and label efficiency through three complementary contributions.
First, it introduces EVA-VOS, a human-in-the-loop annotation framework that learns both which frame to annotate and how to annotate it, selecting among clicks, scribbles, and full masks. By predicting per-frame quality and optimizing an annotation policy with reinforcement learning, the framework substantially reduces labeling effort while preserving high segmentation quality.
Second, the thesis investigates segmentation without task-specific training by repurposing frozen diffusion features from image models. It identifies effective layers and timesteps for feature extraction, leverages affinity-based matching linked to reliable point correspondences, and applies light prompt optimization at test time. The analysis shows that image-trained diffusion features can rival dedicated segmentation pretraining, enabling strong transfer without additional supervision.
Third, it proposes VideoCoPA, a motion-aware copy-paste augmentation that composites segmented objects into new videos, injecting occlusions, interactions, and richer scene dynamics. VideoCoPA consistently boosts strong baselines across common benchmarks, demonstrating scalable gains from targeted, video-specific augmentation.
Together, these contributions reduce annotation time, enable training-free transfer with diffusion features, and expand data diversity, moving video object segmentation toward practical, scalable deployment in real-world settings.
| Original language | English |
|---|---|
| Publisher | Technical University of Denmark |
| Number of pages | 106 |
| Publication status | Published - 2025 |
Thesis title: Balancing data and label efficiency in Video Object Segmentation in the era of Generative AI
Projects
- Scaling up video annotation towards deeper video understanding (Finished)
  Delatolas, A. (PhD Student), Papadopoulos, D. (Main Supervisor), Dahl, A. B. (Supervisor), Kalogeiton, V. (Supervisor), Picard, D. (Examiner) & Tolias, G. (Examiner)
  15/09/2022 → 10/02/2026
  Project: PhD