Research Collection School Of Computing and Information Systems

Zero-shot video translation via token warping

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

2-2026

Abstract

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame's query, key, and value patches, aligning them with the current frame's patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations are available in supplementary materials.
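The warping step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say how warped tokens are combined with the current frame's tokens, so the fusion rule here (blending queries with a weight `alpha`, concatenating keys and values) is an assumption made only for illustration. The function names, the flow convention, and the nearest-neighbour sampling are likewise hypothetical.

```python
import numpy as np

def warp_tokens(tokens, flow):
    """Backward-warp an (H, W, C) token grid taken from the previous frame.

    Assumed convention: flow[y, x] = (dx, dy) points from location (x, y) in
    the current frame to the matching location in the previous frame.
    Nearest-neighbour sampling with border clamping keeps the sketch short.
    """
    h, w, _ = tokens.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return tokens[src_y, src_x]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention over flattened tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def warped_self_attention(cur, prev, flow, alpha=0.5):
    """cur, prev: dicts holding 'q', 'k', 'v' arrays of shape (H, W, C).

    Hypothetical fusion rule (not taken from the paper): blend the current
    query with the warped previous query, and let attention see both the
    current and the warped previous keys/values.
    """
    h, w, c = cur["q"].shape
    wq, wk, wv = (warp_tokens(prev[name], flow) for name in "qkv")
    q = (alpha * cur["q"] + (1 - alpha) * wq).reshape(-1, c)
    k = np.concatenate([cur["k"].reshape(-1, c), wk.reshape(-1, c)])
    v = np.concatenate([cur["v"].reshape(-1, c), wv.reshape(-1, c)])
    return attention(q, k, v).reshape(h, w, c)
```

In a real diffusion pipeline the flow would be estimated from the source video and resized to each attention layer's token resolution, and bilinear sampling would normally replace the nearest-neighbour lookup used here.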

Keywords

Video Translation, Diffusion Model, Attention, Zero-shot

Discipline

Broadcast and Video Studies | Graphics and Human Computer Interfaces | Software Engineering

Research Areas

Intelligent Systems and Optimization

Publication

IEEE Transactions on Visualization and Computer Graphics

Volume

32

Issue

2

First Page

1582

Last Page

1592

ISSN

1077-2626

Identifier

10.1109/TVCG.2025.3636949

Publisher

Institute of Electrical and Electronics Engineers

Citation

ZHU, Haiming; XU, Yangyang; YU, Jun; and HE, Shengfeng. Zero-shot video translation via token warping. (2026). IEEE Transactions on Visualization and Computer Graphics. 32, (2), 1582-1592.
Available at: https://ink.library.smu.edu.sg/sis_research/11050

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Additional URL

https://doi.org/10.1109/TVCG.2025.3636949

Included in

Broadcast and Video Studies Commons, Graphics and Human Computer Interfaces Commons, Software Engineering Commons
