The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential applications in the video domain. Zero-shot methods seek to extend image diffusion models to videos without model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach explicitly updates features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the translated videos. Extensive experiments demonstrate the effectiveness of our framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.
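To make the explicit feature update concrete, the sketch below optimizes a batch of decoder features against a temporal loss (agreement with flow-warped features of neighboring frames) and a spatial loss (preserving the per-frame feature self-similarity of the input video). This is a minimal illustration, not the released implementation: the equal loss weighting, learning rate, step count, and the precomputed `warped_feats`, `occlusion_mask`, and `sim_ref` tensors are all assumptions.

```python
import torch
import torch.nn.functional as F

def fresco_feature_update(feats, warped_feats, occlusion_mask, sim_ref,
                          n_steps=20, lr=0.05):
    """Simplified sketch of an explicit spatial-temporal feature update.

    feats:          [B, C, H, W] decoder features of a batch of frames
    warped_feats:   [B, C, H, W] neighbor-frame features warped by optical flow
    occlusion_mask: [B, 1, H, W] 1 where the flow is valid (non-occluded)
    sim_ref:        [B, HW, HW] feature self-similarity of the input frames
    """
    f = feats.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([f], lr=lr)
    for _ in range(n_steps):
        # Temporal loss: features should agree with flow-warped neighbors
        # wherever the flow is valid (targets are held fixed in this sketch).
        l_temp = (occlusion_mask * (f - warped_feats).abs()).mean()
        # Spatial loss: keep the input frames' intra-frame self-similarity.
        fn = F.normalize(f.flatten(2), dim=1)       # [B, C, HW]
        sim = torch.bmm(fn.transpose(1, 2), fn)     # [B, HW, HW]
        l_spat = (sim - sim_ref).abs().mean()
        loss = l_temp + l_spat
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f.detach()
```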
FRESCO-v2 is further instantiated on two kinds of image manipulation techniques: 1) adapting the image-conditioned generation techniques of ControlNet and SDEdit for video-to-video translation; 2) adapting the text-guided image editing technique of Plug-and-Play for text-guided video editing. Moreover, we instantiate the proposed FRESCO adaptation on two hybrid frameworks based on the video interpolation techniques of EbSynth and TokenFlow, to facilitate long video manipulation.
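As a reference point for the first adaptation, the snippet below runs a per-frame ControlNet + SDEdit baseline with the diffusers library; FRESCO's attention guidance and feature update are omitted, and the checkpoint names, file paths, prompt, and strength value are placeholders.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

# Per-frame baseline: SDEdit (img2img) + ControlNet structure guidance.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

frame = load_image("frames/0001.png")        # input video frame
edge = load_image("frames/0001_canny.png")   # pre-computed Canny edge map
out = pipe(
    prompt="a watercolor painting of a dancing man",
    image=frame,            # SDEdit: start denoising from the noised input frame
    control_image=edge,     # ControlNet: preserve the frame's structure
    strength=0.75,          # SDEdit noise level (lower = closer to the input)
    num_inference_steps=20,
).images[0]
```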
Base Methods: C for ControlNet, S for SDEdit, P for Plug-and-Play, E for EbSynth, T for TokenFlow
We build an effective three-level hybrid framework for very long video manipulation that jointly processes batched frames while maintaining inter-batch consistency, as outlined below.
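One possible outline of this batched processing is sketched here; `fresco_translate` and `interpolate_non_keyframes` are hypothetical placeholders for the FRESCO-guided diffusion backbone and the EbSynth/TokenFlow interpolation backend, and the batch size and keyframe stride are illustrative.

```python
def translate_long_video(frames, batch_size=8, keyframe_stride=4):
    """Hypothetical outline of a batched long-video pipeline:
    1) sample keyframes, 2) translate keyframes batch by batch, reusing the
    previous batch's last result as an inter-batch anchor, 3) propagate the
    keyframe edits to the remaining frames by interpolation.
    """
    keyframes = frames[::keyframe_stride]
    edited_keys, anchor = [], None
    for i in range(0, len(keyframes), batch_size):
        batch = keyframes[i:i + batch_size]
        # Jointly denoise the batch with FRESCO guidance; conditioning on the
        # previous batch's anchor frame enforces inter-batch consistency.
        edited = fresco_translate(batch, anchor=anchor)  # hypothetical helper
        anchor = edited[-1]
        edited_keys.extend(edited)
    # Fill in non-keyframes with an interpolation backend (e.g. EbSynth).
    return interpolate_non_keyframes(frames, edited_keys, keyframe_stride)
```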
We compare with three zero-shot video translation methods: Text2Video-Zero, ControlVideo, and Rerender-A-Video. For a fair comparison, all methods use identical ControlNet, SDEdit, and LoRA settings. Methods that rely on ControlNet conditions without inversion features can suffer a drop in editing quality when those conditions are degraded by factors such as defocus or motion blur. In contrast, our method leverages the robust guidance of FRESCO to produce consistently reliable videos.
We compare with three zero-shot video editing methods: FateZero, Pix2Video, and FLATTEN. Compared to inversion-free methods, inversion-based methods are better at preserving input structures but may suffer from artifacts when merging inversion features with edited features. Our method achieves good temporal consistency and overall editing quality with fewer artifacts.
@article{yang2025zero,
author = {Yang, Shuai and Lin, Junxin and Zhou, Yifan and Liu, Ziwei and Loy, Chen Change},
title = {Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence},
journal = {arXiv},
year = {2025},
}