FRAME: Feature Representation and Anticipation with Memory

Adobe Research¹, University of Illinois Urbana-Champaign²

*Equal advising, Now at NVIDIA
FRAME Teaser

FRAME outperforms state-of-the-art self-supervised models (DINO, SiamMAE) on multiple dense video tasks. The student (FRAME) surpasses the teacher (DINO) by learning to predict current and future features using memory, improving temporal consistency and visual correspondence. (Right) Example: VOS, where FRAME improves segmentation of the horse and rider. (Left) Tasks shown: VOS = Video Object Segmentation, Part Prop. = Part Propagation, Pose Prop. = Pose Propagation, Seg = Semantic Segmentation of the current and future frame.

Abstract

The Task: Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame.

What's Missing: Existing approaches fall short: image encoders such as DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform image encoders on dense prediction tasks.

Our Solution: We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification.

Our Performance: We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

Method Overview

FRAME is trained in two stages. In Stage 1, we train a student encoder to match dense patch-level and class-level features from frozen image-based teacher models (DINO and CLIP). In Stage 2, we equip the student with lightweight temporal modules—a memory unit that aggregates past context and an anticipation unit that predicts future features.
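For concreteness, the sketch below illustrates the two training stages in PyTorch. It is a minimal, hypothetical rendering: the module names (StudentEncoder, MemoryUnit, AnticipationUnit), the GRU-based memory, and the simple L2/cosine losses are assumptions made for illustration, not the paper's exact architecture or objectives.

```python
# Illustrative sketch of FRAME's two-stage objective (not the authors' code).
# Assumes frozen DINO/CLIP teachers that provide patch / class features per frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentEncoder(nn.Module):
    """Hypothetical per-frame student; returns (patch_feats, cls_feat)."""
    def __init__(self, dim=384, num_patches=196):
        super().__init__()
        self.backbone = nn.Linear(3 * 224 * 224, num_patches * dim)  # stand-in for a ViT
        self.cls_head = nn.Linear(num_patches * dim, dim)
        self.dim, self.num_patches = dim, num_patches

    def forward(self, frames):                      # frames: (B, 3, 224, 224)
        patches = self.backbone(frames.flatten(1)).view(-1, self.num_patches, self.dim)
        cls = self.cls_head(patches.flatten(1))
        return patches, cls

class MemoryUnit(nn.Module):
    """Aggregates past patch features into a per-patch temporal context (GRU over time)."""
    def __init__(self, dim=384):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, past):                        # past: (B, T, P, D)
        B, T, P, D = past.shape
        seq = past.permute(0, 2, 1, 3).reshape(B * P, T, D)
        out, _ = self.rnn(seq)
        return out[:, -1].view(B, P, D)

class AnticipationUnit(nn.Module):
    """Predicts future-frame teacher features from current features plus memory context."""
    def __init__(self, dim=384):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, current, context):
        return self.proj(torch.cat([current, context], dim=-1))

def stage1_loss(student, frames, dino_patches, clip_cls):
    """Stage 1: match frozen DINO patch features and align the class token with CLIP."""
    patches, cls = student(frames)
    return F.mse_loss(patches, dino_patches) + (1 - F.cosine_similarity(cls, clip_cls).mean())

def stage2_loss(student, memory, anticipation, past_frames, cur_frame,
                dino_cur_patches, dino_future_patches):
    """Stage 2: predict current and future DINO patch features using past context."""
    past_feats = torch.stack([student(f)[0] for f in past_frames.unbind(1)], dim=1)
    cur_feats, _ = student(cur_frame)
    context = memory(past_feats)
    pred_future = anticipation(cur_feats, context)
    return F.mse_loss(cur_feats, dino_cur_patches) + F.mse_loss(pred_future, dino_future_patches)
```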

FRAME Framework

Overview of FRAME Architecture and Two-Stage Training Process.

Video Examples

Downstream Tasks

Performance Comparison on Visual Correspondence Tasks

FRAME substantially outperforms existing self-supervised methods on visual correspondence tasks, indicating strong feature consistency; a sketch of the label-propagation protocol behind these benchmarks follows the table.

| Method | Backbone | DAVIS (J&F) | VIP (mIoU) | JHMDB (PCK@0.1) |
|---|---|---|---|---|
| DINO | ViT-S/16 | 61.8 | 36.2 | 45.6 |
| SiamMAE | ViT-S/16 | 62.0 | 37.3 | 47.0 |
| FRAME (ours) | ViT-S/16 | 65.7 | 41.2 | 48.7 |
| DINO | ViT-S/8 | 69.9 | 39.5 | 56.5 |
| SiamMAE | ViT-S/8 | 71.4 | 45.9 | 61.9 |
| FRAME (ours) | ViT-S/8 | 73.2 | 47.9 | 64.1 |
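These benchmarks are commonly evaluated by propagating reference labels (masks, parts, keypoints) to later frames via nearest-neighbor matching in patch-feature space. The sketch below shows that protocol in simplified form, assuming a single reference frame and top-k softmax attention; standard protocols typically add multiple context frames and a spatial locality constraint, so treat this as an illustration rather than the exact evaluation code.

```python
# Hedged sketch of feature-based label propagation for correspondence benchmarks.
import torch
import torch.nn.functional as F

def propagate_labels(ref_feats, ref_labels, tgt_feats, topk=5, temperature=0.07):
    """Propagate per-patch labels from a reference frame to a target frame.

    ref_feats:  (P, D) patch features of the labeled reference frame
    ref_labels: (P, C) one-hot (or soft) labels per reference patch
    tgt_feats:  (P, D) patch features of the target frame
    returns:    (P, C) predicted soft labels for the target patches
    """
    ref = F.normalize(ref_feats, dim=-1)
    tgt = F.normalize(tgt_feats, dim=-1)
    affinity = tgt @ ref.t() / temperature        # (P_tgt, P_ref) cosine similarities
    vals, idx = affinity.topk(topk, dim=-1)       # keep the k most similar reference patches
    weights = F.softmax(vals, dim=-1)             # normalize over the top-k matches
    return (weights.unsqueeze(-1) * ref_labels[idx]).sum(dim=1)
```

Temporally consistent features make these affinities stable across frames, which is what the J&F, mIoU, and PCK numbers above measure.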

Qualitative Examples

FRAME vs DINO Qualitative Results

Comparison of FRAME and DINO on feature propagation across video frames. FRAME demonstrates greater robustness to viewpoint changes, occlusions, and object reappearances, making it a more suitable video frame encoder.

DINO (Left) vs FRAME (Right) - Comparison

Insights

| Memory | Anticipation | Stages | Data | DAVIS (J&F) | VIP (mIoU) | JHMDB (PCK) |
|---|---|---|---|---|---|---|
|  |  | Stage 1 | Kinetics | 62.1 | 39.0 | 46.6 |
|  |  | 2-Stage | Kinetics | 65.7 | 41.2 | 48.7 |
|  |  | 2-Stage | Kin.+Ego4D | 65.5 | 41.2 | 48.6 |
|  |  | 2-Stage | Kin.+Ego4D | 66.3 | 42.0 | 49.0 |

Both the memory and anticipation components contribute significantly to performance, with two-stage training providing the best results.

Semantic Segmentation and Anticipation

Semantic segmentation on current and future frames


| Model | CamVid Current | CamVid Future | VSPW Current | VSPW Future |
|---|---|---|---|---|
| DINO ViT-S/16 | 60.1 | 50.2 | 36.4 | 25.6 |
| FRAME ViT-S/8 | 62.6 | 54.0 | 38.0 | 27.4 |
| DINOv2 ViT-L/14 | 68.3 | 56.1 | 41.8 | 30.3 |
| FRAME ViT-L/14 | 69.8 | 59.2 | 44.0 | 33.8 |
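If these results are obtained by probing frozen frame features (a common protocol for comparing encoders), the probe can be as simple as the sketch below: a linear head over patch features, upsampled to the label resolution, supervised with the current frame's mask or, for the "future" setting, the mask of a later frame. The head design and training details here are assumptions, not the paper's exact setup.

```python
# Minimal linear-probe sketch for current/future-frame segmentation on frozen features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegHead(nn.Module):
    def __init__(self, feat_dim=384, num_classes=32, grid=14):
        super().__init__()
        self.grid = grid
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, patch_feats, out_size=(224, 224)):
        # patch_feats: (B, P, D) frozen features of the current frame; the target mask
        # comes from the current frame or, for anticipation, from a later frame.
        B, P, D = patch_feats.shape
        fmap = patch_feats.transpose(1, 2).view(B, D, self.grid, self.grid)
        logits = self.classifier(fmap)
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

# usage: loss = F.cross_entropy(head(frozen_feats), target_mask)  # target_mask: (B, H, W) long
```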

Performance vs Parameters

FRAME outperforms DINO with fewer parameters.


Bibtex

If you find this work useful, please consider citing:

@misc{tv2025framepretrainingvideofeature,
      title={FRAME: Pre-Training Video Feature Representations via Anticipation and Memory}, 
      author={Sethuraman TV and Savya Khosla and Vignesh Srinivasakumar and Jiahui Huang and Seoung Wug Oh and Simon Jenni and Derek Hoiem and Joon-Young Lee},
      year={2025},
      eprint={2506.05543},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05543}, 
}