VIDEO FRAMES & PROMPTS


VIDEOS

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text \(\textbf{encoders}\) yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of \(\textbf{decoder-only}\) transformers, which offer three clear benefits for text-to-video (T2V) generation: precise text understanding resulting from their superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap arising from these two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses the challenge with \(\textbf{Mimir}\), an end-to-end training framework featuring a carefully tailored \(\textbf{token fuser}\) to harmonize the outputs of text encoders and LLMs. This design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capabilities of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of our approach in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. The code and models will be made publicly available.

Overall Framework of Mimir


The framework of \(\textbf{Mimir}\). Given a text prompt, we employ a text encoder and a decoder-only large language model to obtain \(e_\theta\) and \(e_\beta\), respectively. Additionally, we add an instruction prompt which, after processing by the decoder-only model, yields the corresponding instruction token \(e_i\). To prevent convergence issues during training caused by the feature distribution gap between \(e_\theta\) and \(e_\beta\), the proposed token fuser first applies a normalization layer and a learnable scale to \(e_\beta\). It then uses a Zero-Conv to preserve the original semantic space in the early stage of training. These modified tokens are then summed to produce \(e \in \mathbb{R}^{n\times4096}\). Meanwhile, we initialize four learnable tokens \(e_l\), which are added to \(e_i\) to form \(e_s\) and stabilize divergent semantic features. Finally, the token fuser concatenates \(e\) and \(e_s\) to condition video generation.
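
As a concrete illustration, the following PyTorch sketch reproduces the fusion step described above under our own assumptions: the tensor shapes, the 4096-dimensional hidden size (taken from \(e \in \mathbb{R}^{n\times4096}\) in the caption), the zero initialization of the learnable tokens, and an instruction token \(e_i\) with the same length as \(e_l\) are illustrative choices, not the official implementation.

import torch
import torch.nn as nn

class TokenFuser(nn.Module):
    """Minimal sketch of the token fuser; shapes and init choices are assumptions."""

    def __init__(self, dim: int = 4096, num_learnable: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                # normalization applied to LLM tokens e_beta
        self.scale = nn.Parameter(torch.ones(dim))   # learnable per-channel scale
        # "Zero-Conv": a 1x1 convolution initialized to zero so the LLM branch
        # contributes nothing at the start of training, preserving the text
        # encoder's semantic space early on.
        self.zero_conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)
        # Four learnable tokens e_l used to stabilize the instruction token e_i
        # (zero init is an assumption).
        self.learnable_tokens = nn.Parameter(torch.zeros(num_learnable, dim))

    def forward(self, e_theta, e_beta, e_i):
        # e_theta: (B, n, dim) text-encoder tokens
        # e_beta:  (B, n, dim) decoder-only LLM tokens for the same prompt
        # e_i:     (B, num_learnable, dim) instruction tokens (length assumed to match e_l)
        e_beta = self.norm(e_beta) * self.scale
        e_beta = self.zero_conv(e_beta.transpose(1, 2)).transpose(1, 2)
        e = e_theta + e_beta                               # fused prompt tokens
        e_s = e_i + self.learnable_tokens.unsqueeze(0)     # stabilized instruction tokens
        return torch.cat([e, e_s], dim=1)                  # (B, n + num_learnable, dim)

In this sketch, the returned sequence of length \(n+4\) would serve as the text conditioning fed to the T2V diffusion backbone; because the Zero-Conv starts at zero, the model initially sees only the familiar text-encoder features and gradually incorporates the LLM features as training progresses.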

Spatial Semantic Understanding

Temporal Semantic Understanding

More Interesting Examples

References

      
@article{Mimir2025,
  title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
  author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2412.03085},
  year={2025}}
        
@article{AnimateX2025,
  title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
  author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2410.10306},
  year={2025}}

@inproceedings{Ranni2024,
  title={Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following},
  author={Feng, Yutong and Gong, Biao and Chen, Di and Shen, Yujun and Liu, Yu and Zhou, Jingren},
  booktitle={CVPR},
  year={2024}}