FRAME: Feature Representation and Anticipation with Memory

Adobe Research¹, University of Illinois Urbana-Champaign²

*Equal advising, Now at NVIDIA
FRAME Teaser

FRAME outperforms state-of-the-art self-supervised models (DINO, SiamMAE) on multiple dense video tasks. The student (FRAME) surpasses the teacher (DINO) by learning to predict current and future features using memory, improving temporal consistency and visual correspondence. (Right) Example: VOS, where FRAME improves segmentation of the horse and rider. (Left) Tasks shown: VOS = Video Object Segmentation, Part Prop. = Part Propagation, Pose Prop. = Pose Propagation, Seg = Semantic Segmentation of the current and future frame.

Abstract

The Task: Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame.

What's Missing: Existing approaches fall short: image encoders such as DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform image encoders on dense prediction tasks.

Our Solution: We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification.

Our Performance: We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

Method Overview

FRAME is trained in two stages. In Stage 1, we train a student encoder to match dense patch-level and class-level features from frozen image-based teacher models (DINO and CLIP). In Stage 2, we equip the student with lightweight temporal modules—a memory unit that aggregates past context and an anticipation unit that predicts future features.
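For concreteness, the sketch below illustrates the two training stages in PyTorch. It is a minimal, hypothetical rendering: the module names (StudentEncoder, MemoryUnit, AnticipationUnit), the GRU-based memory, and the simple L2/cosine losses are assumptions made for illustration, not the paper's exact architecture or objectives.

```python
# Illustrative sketch of FRAME's two-stage objective (not the authors' code).
# Assumes frozen DINO/CLIP teachers that provide patch / class features per frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):
    """Hypothetical per-frame student; returns (patch_feats, cls_feat)."""
    def __init__(self, dim=384, num_patches=196):
        super().__init__()
        self.backbone = nn.Linear(3 * 224 * 224, num_patches * dim)  # stand-in for a ViT
        self.cls_head = nn.Linear(num_patches * dim, dim)
        self.dim, self.num_patches = dim, num_patches

    def forward(self, frames):                      # frames: (B, 3, 224, 224)
        patches = self.backbone(frames.flatten(1)).view(-1, self.num_patches, self.dim)
        cls = self.cls_head(patches.flatten(1))
        return patches, cls

class MemoryUnit(nn.Module):
    """Aggregates past patch features into a per-patch temporal context (GRU over time)."""
    def __init__(self, dim=384):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, past):                        # past: (B, T, P, D)
        B, T, P, D = past.shape
        seq = past.permute(0, 2, 1, 3).reshape(B * P, T, D)
        out, _ = self.rnn(seq)
        return out[:, -1].view(B, P, D)

class AnticipationUnit(nn.Module):
    """Predicts future-frame teacher features from current features plus memory context."""
    def __init__(self, dim=384):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, current, context):
        return self.proj(torch.cat([current, context], dim=-1))

def stage1_loss(student, frames, dino_patches, clip_cls):
    """Stage 1: match frozen DINO patch features and align the class token with CLIP."""
    patches, cls = student(frames)
    return F.mse_loss(patches, dino_patches) + (1 - F.cosine_similarity(cls, clip_cls).mean())

def stage2_loss(student, memory, anticipation, past_frames, cur_frame,
                dino_cur_patches, dino_future_patches):
    """Stage 2: predict current and future DINO patch features using past context."""
    past_feats = torch.stack([student(f)[0] for f in past_frames.unbind(1)], dim=1)
    cur_feats, _ = student(cur_frame)
    context = memory(past_feats)
    pred_future = anticipation(cur_feats, context)
    return F.mse_loss(cur_feats, dino_cur_patches) + F.mse_loss(pred_future, dino_future_patches)
```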

FRAME Framework

Overview of FRAME Architecture and Two-Stage Training Process.

Video Examples

Downstream Tasks

Performance Comparison on Visual Correspondence Tasks

FRAME substantially outperforms existing self-supervised methods on visual correspondence tasks, indicating strong feature consistency; a sketch of the label-propagation protocol behind these benchmarks follows the table.

| Method | Backbone | DAVIS (J&F) | VIP (mIoU) | JHMDB (PCK@0.1) |
|---|---|---|---|---|
| DINO | ViT-S/16 | 61.8 | 36.2 | 45.6 |
| SiamMAE | ViT-S/16 | 62.0 | 37.3 | 47.0 |
| FRAME (ours) | ViT-S/16 | 65.7 | 41.2 | 48.7 |
| DINO | ViT-S/8 | 69.9 | 39.5 | 56.5 |
| SiamMAE | ViT-S/8 | 71.4 | 45.9 | 61.9 |
| FRAME (ours) | ViT-S/8 | 73.2 | 47.9 | 64.1 |
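These benchmarks are commonly evaluated by propagating reference labels (masks, parts, keypoints) to later frames via nearest-neighbor matching in patch-feature space. The sketch below shows that protocol in simplified form, assuming a single reference frame and top-k softmax attention; standard protocols typically add multiple context frames and a spatial locality constraint, so treat this as an illustration rather than the exact evaluation code.

```python
# Hedged sketch of feature-based label propagation for correspondence benchmarks.
import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_labels, tgt_feats, topk=5, temperature=0.07):
    """Propagate per-patch labels from a reference frame to a target frame.

    ref_feats:  (P, D) patch features of the labeled reference frame
    ref_labels: (P, C) one-hot (or soft) labels per reference patch
    tgt_feats:  (P, D) patch features of the target frame
    returns:    (P, C) predicted soft labels for the target patches
    """
    ref = F.normalize(ref_feats, dim=-1)
    tgt = F.normalize(tgt_feats, dim=-1)
    affinity = tgt @ ref.t() / temperature        # (P_tgt, P_ref) cosine similarities
    vals, idx = affinity.topk(topk, dim=-1)       # keep the k most similar reference patches
    weights = F.softmax(vals, dim=-1)             # normalize over the top-k matches
    return (weights.unsqueeze(-1) * ref_labels[idx]).sum(dim=1)
```

Temporally consistent features make these affinities stable across frames, which is what the J&F, mIoU, and PCK numbers above measure.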

Qualitative Examples

FRAME vs DINO Qualitative Results

Comparison of FRAME and DINO on feature propagation across video frames. FRAME demonstrates greater robustness to viewpoint changes, occlusions, and object reappearances, making it a more suitable video frame encoder.

DINO (Left) vs FRAME (Right) - Comparison

Insights

| Memory | Anticipation | Stages | Data | DAVIS (J&F) | VIP (mIoU) | JHMDB (PCK) |
|---|---|---|---|---|---|---|
|  |  | Stage 1 | Kinetics | 62.1 | 39.0 | 46.6 |
|  |  | 2-Stage | Kinetics | 65.7 | 41.2 | 48.7 |
|  |  | 2-Stage | Kin.+Ego4D | 65.5 | 41.2 | 48.6 |
|  |  | 2-Stage | Kin.+Ego4D | 66.3 | 42.0 | 49.0 |

Both the memory and anticipation components contribute significantly to performance, with two-stage training providing the best results.

Semantic Segmentation and Anticipation

Semantic segmentation on current and future frames


| Model | CamVid Current | CamVid Future | VSPW Current | VSPW Future |
|---|---|---|---|---|
| DINO ViT-S/16 | 60.1 | 50.2 | 36.4 | 25.6 |
| FRAME ViT-S/8 | 62.6 | 54.0 | 38.0 | 27.4 |
| DINOv2 ViT-L/14 | 68.3 | 56.1 | 41.8 | 30.3 |
| FRAME ViT-L/14 | 69.8 | 59.2 | 44.0 | 33.8 |
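If these results are obtained by probing frozen frame features (a common protocol for comparing encoders), the probe can be as simple as the sketch below: a linear head over patch features, upsampled to the label resolution, supervised with the current frame's mask or, for the "future" setting, the mask of a later frame. The head design and training details here are assumptions, not the paper's exact setup.

```python
# Minimal linear-probe sketch for current/future-frame segmentation on frozen features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegHead(nn.Module):
    def __init__(self, feat_dim=384, num_classes=32, grid=14):
        super().__init__()
        self.grid = grid
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, patch_feats, out_size=(224, 224)):
        # patch_feats: (B, P, D) frozen features of the current frame; the target mask
        # comes from the current frame or, for anticipation, from a later frame.
        B, P, D = patch_feats.shape
        fmap = patch_feats.transpose(1, 2).view(B, D, self.grid, self.grid)
        logits = self.classifier(fmap)
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

# usage: loss = F.cross_entropy(head(frozen_feats), target_mask)  # target_mask: (B, H, W) long
```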

Performance vs Parameters

FRAME outperforms DINO with fewer parameters.


Bibtex

If you find this work useful, please consider citing:

@misc{tv2025framepretrainingvideofeature,
      title={FRAME: Pre-Training Video Feature Representations via Anticipation and Memory}, 
      author={Sethuraman TV and Savya Khosla and Vignesh Srinivasakumar and Jiahui Huang and Seoung Wug Oh and Simon Jenni and Derek Hoiem and Joon-Young Lee},
      year={2025},
      eprint={2506.05543},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05543}, 
}