VIDEO FRAMES & PROMPTS


VIDEOS

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text \(\textbf{encoders}\) yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of \(\textbf{decoder-only}\) transformers, which offer three clear benefits for text-to-video (T2V) generation: precise text understanding resulting from their superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap arising from these two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses the challenge with \(\textbf{Mimir}\), an end-to-end training framework featuring a carefully tailored \(\textbf{token fuser}\) to harmonize the outputs of text encoders and LLMs. This design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capabilities of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of our approach in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. The code and models will be made publicly available.

Overall Framework of Mimir


The framework of \(\textbf{Mimir}\). Given a text prompt, we employ a text encoder and a decoder-only large language model to obtain \(e_\theta\) and \(e_\beta\), respectively. Additionally, we add an instruction prompt which, after processing by the decoder-only model, yields the corresponding instruction token \(e_i\). To prevent convergence issues during training caused by the feature distribution gap between \(e_\theta\) and \(e_\beta\), the proposed token fuser first applies a normalization layer and a learnable scale to \(e_\beta\). It then uses a Zero-Conv to preserve the original semantic space in the early stage of training. These modified tokens are then summed to produce \(e \in \mathbb{R}^{n\times4096}\). Meanwhile, we initialize four learnable tokens \(e_l\), which are added to \(e_i\) to form \(e_s\) and stabilize divergent semantic features. Finally, the token fuser concatenates \(e\) and \(e_s\) to condition video generation.
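
As a concrete illustration, the following PyTorch sketch reproduces the fusion step described above under our own assumptions: the tensor shapes, the 4096-dimensional hidden size (taken from \(e \in \mathbb{R}^{n\times4096}\) in the caption), the zero initialization of the learnable tokens, and an instruction token \(e_i\) with the same length as \(e_l\) are illustrative choices, not the official implementation.

import torch
import torch.nn as nn

class TokenFuser(nn.Module):
    """Minimal sketch of the token fuser; shapes and init choices are assumptions."""

    def __init__(self, dim: int = 4096, num_learnable: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                # normalization applied to LLM tokens e_beta
        self.scale = nn.Parameter(torch.ones(dim))   # learnable per-channel scale
        # "Zero-Conv": a 1x1 convolution initialized to zero so the LLM branch
        # contributes nothing at the start of training, preserving the text
        # encoder's semantic space early on.
        self.zero_conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)
        # Four learnable tokens e_l used to stabilize the instruction token e_i
        # (zero init is an assumption).
        self.learnable_tokens = nn.Parameter(torch.zeros(num_learnable, dim))

    def forward(self, e_theta, e_beta, e_i):
        # e_theta: (B, n, dim) text-encoder tokens
        # e_beta:  (B, n, dim) decoder-only LLM tokens for the same prompt
        # e_i:     (B, num_learnable, dim) instruction tokens (length assumed to match e_l)
        e_beta = self.norm(e_beta) * self.scale
        e_beta = self.zero_conv(e_beta.transpose(1, 2)).transpose(1, 2)
        e = e_theta + e_beta                               # fused prompt tokens
        e_s = e_i + self.learnable_tokens.unsqueeze(0)     # stabilized instruction tokens
        return torch.cat([e, e_s], dim=1)                  # (B, n + num_learnable, dim)

In this sketch, the returned sequence of length \(n+4\) would serve as the text conditioning fed to the T2V diffusion backbone; because the Zero-Conv starts at zero, the model initially sees only the familiar text-encoder features and gradually incorporates the LLM features as training progresses.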

Spatial Semantic Understanding

Temporal Semantic Understanding

More Interesting Examples

References

      
@article{Mimir2025,
  title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
  author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2412.03085},
  year={2025}}
        
@article{AnimateX2025,
  title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
  author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2410.10306},
  year={2025}}

@inproceedings{Ranni2024,
  title={Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following},
  author={Feng, Yutong and Gong, Biao and Chen, Di and Shen, Yujun and Liu, Yu and Zhou, Jingren},
  booktitle={CVPR},
  year={2024}}