Text serves as the key control signal in video generation due to its narrative nature. To render text
descriptions into video clips, current video diffusion models borrow features from text
\(\textbf{encoders}\) yet struggle with limited text comprehension. The recent success of large language
models (LLMs) showcases the power of \(\textbf{decoder-only}\) transformers, which offer three clear
benefits for text-to-video (T2V) generation: precise text understanding resulting from superior
scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to
prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging
from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models.
This work addresses this challenge with \(\textbf{Mimir}\), an end-to-end training framework featuring a
carefully tailored \(\textbf{token fuser}\) to harmonize the outputs from text encoders and LLMs. Such a
design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related
capability of LLMs.
Extensive quantitative and qualitative results demonstrate the effectiveness of our approach in generating
high-quality videos with excellent text comprehension, especially when processing short captions and
managing shifting motions. The code and models will be made publicly available.
Overall Framework of Mimir
The framework of \(\textbf{Mimir}\). Given a text prompt, we employ a text encoder and a decoder-only large
language model to obtain \(e_\theta\) and \(e_\beta\), respectively. Additionally, we add an instruction
prompt which, after processing by the decoder-only model, yields the corresponding instruction token
\(e_i\). To prevent convergence issues in training caused by the feature distribution gap between
\(e_\theta\) and \(e_\beta\), the proposed token fuser first applies a normalization layer and a learnable
scale to \(e_\beta\). It then uses Zero-Conv to preserve the original semantic space in the early stage of
training. These modified tokens are then summed to produce \(e \in \mathbb{R}^{n\times4096}\). Meanwhile, we
initialize four learnable tokens \(e_l\), which are added to \(e_i\) to stabilize divergent semantic
features, yielding \(e_s\). Finally, the token fuser concatenates \(e\) and \(e_s\) as the conditioning for
video generation.
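The data flow described in the caption can be illustrated with a small, dependency-free sketch. It is not the released implementation. The class name `TokenFuser`, the helper functions, the scalar form of the learnable scale, the initialization values, and the toy dimension of 8 (instead of 4096) are all assumptions made for illustration. The sketch also assumes that \(e_\theta\) and \(e_\beta\) already share the same length and width, that Zero-Conv acts only on the LLM branch, and that \(e_s\) denotes the sum of \(e_l\) and \(e_i\).

```python
import math
import random


def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def linear(token, weight, bias):
    """Per-token affine map y = W x + b (how a 1x1 convolution acts on a token)."""
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weight, bias)]


class TokenFuser:
    """Hypothetical sketch of the fusion steps named in the caption."""

    def __init__(self, dim, num_learnable=4, seed=0):
        rng = random.Random(seed)
        # Learnable scale for the normalized LLM tokens (assumed scalar, init 1.0).
        self.scale = 1.0
        # Zero-Conv: weights and bias start at zero, so the LLM branch
        # contributes nothing until training updates these parameters.
        self.zero_w = [[0.0] * dim for _ in range(dim)]
        self.zero_b = [0.0] * dim
        # Four learnable tokens e_l (small random init is an assumption).
        self.e_l = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                    for _ in range(num_learnable)]

    def __call__(self, e_theta, e_beta, e_i):
        fused = []
        for enc_tok, llm_tok in zip(e_theta, e_beta):
            # Normalization layer followed by the learnable scale on e_beta.
            scaled = [self.scale * x for x in layer_norm(llm_tok)]
            # Zero-initialized projection of the LLM branch.
            projected = linear(scaled, self.zero_w, self.zero_b)
            # Sum with the text-encoder token to form one row of e.
            fused.append([a + b for a, b in zip(enc_tok, projected)])
        # e_s = e_l + e_i, token by token.
        e_s = [[a + b for a, b in zip(l_tok, i_tok)]
               for l_tok, i_tok in zip(self.e_l, e_i)]
        # Concatenate e and e_s along the token axis.
        return fused + e_s


# Toy usage with random features standing in for real encoder outputs.
dim, n = 8, 3
rng = random.Random(1)
e_theta = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
e_beta = [[rng.gauss(0.0, 5.0) for _ in range(dim)] for _ in range(n)]
e_i = [[rng.gauss(0.0, 5.0) for _ in range(dim)] for _ in range(4)]

fuser = TokenFuser(dim)
out = fuser(e_theta, e_beta, e_i)
```

Under these assumptions, the first `n` output tokens equal `e_theta` exactly at initialization, because the zero-initialized projection cancels the LLM branch. This is one way to read "preserve the original semantic space" early in training: the pretrained T2V model initially sees the text-encoder features it was trained on, and the LLM features are blended in as the projection weights move away from zero.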
Spatial Semantic Understanding
Color Rendering
A neon pink elephant walking under a glowing green moon.
A green elephant walking under a glowing pink moon.
A gray elephant walking under a glowing red moon.
Turtle in fluorescent pink and rainbow color armor.
Red lion and blue grassland.
Blue cow and Orange Pasture.
Purple tiger and yellow grassland.
Blue desert and red cactus.
Absolute & Relative Position
A mischievous raccoon wearing a tiny hat sits to the right of a floating piece of
cheese.
A mischievous raccoon wearing a tiny hat sits to the left of a floating piece of
cheese.
A mischievous raccoon wearing a tiny hat sits to the bottom of a floating piece of
cheese.
A mischievous raccoon wearing a tiny hat sits to the top of a floating piece of
cheese.
A shoe on the left side of a bowl.
A shoe on the right side of a bowl.
A shoe on the top side of a bowl.
A shoe on the bottom side of a bowl.
A friendly dragon puffing colorful smoke, with a giant donut floating to its right.
A friendly dragon puffing colorful smoke, with a giant donut floating to its left.
A friendly dragon puffing colorful smoke, with a giant donut floating to its top.
A friendly dragon puffing colorful smoke, with a giant donut floating to its bottom.
Counting
One apple becomes two apples.
Two apples become three apples.
Two dogs.
Three birds.
Two butterflies.
Seven pearls.
Temporal Semantic Understanding
Sequential Actions
A race car speeds down a track and, with a burst of energy, changes into a superhero,
launching into the sky to save the day.
A bicycle leisurely rolls along a park path, and suddenly it transforms into a
high-speed jet ski, splashing through a nearby lake.
A puppy looks left and then right.
A lion looks right and then left.
A cat looks up, then down.
A cat looks down, then up.
A cat looks up, then down, and up again.
Illumination Harmonization
As dawn breaks, the once-vivid stars begin to dim, their brilliance softening as the
sky transitions from deep indigo to a pale, serene blue. One by one, the celestial lights vanish,
retreating into the vast expanse above. The faint glow of the morning sun brushes the horizon, casting
gentle hues of peach and gold. In their place, a tranquil light blue sky emerges, vast and endless,
signaling the quiet start of a new day and leaving behind a faint memory of the night.
The horizon glows with a fiery brilliance as a red sun begins its ascent above the
calm sea. Its vibrant hue bathes the sky in shades of crimson and amber, casting a warm, ethereal light
across the water. The sea, once cloaked in darkness, transforms into a shimmering expanse, reflecting the
sun's fiery glow in rippling patterns. As the sun climbs higher, its light floods the world, illuminating
the waves and painting the landscape with radiant warmth, heralding the arrival of a new day in
breathtaking beauty.
Among the forests, mist lingers among the green trees, and the sunlight penetrates
the branches and leaves, shedding bits of golden light. In the evening, the setting sun paints the sky in
a blazing orange-red color.
The fields are awakened by the golden sunlight, and a gentle breeze stirs up a green
wave. In the evening, the setting sun puts on a coat of fiery red for the earth. At night, the fields are
filled with starlight like water, and the Milky Way in the distance quietly guards this peaceful world.
On the vast plains, at dusk, the setting sun colors the clouds into a flaming golden
red, and after nightfall, the deep starry sky shines like a jewel, making the whole world seem serene and
mysterious.
With a ghostly blue glow, the whole world seems pure and mysterious. The setting sun
dyed the sky red, and the ice reflected the warm orange light. At night, the stars are densely packed and
the Milky Way crosses the dome of the sky, reflecting the coldness and quietness of the glacier.
More interesting examples
A weathered, vintage truck, its paint faded and rusted, sits anchored in a serene
bay, half-submerged in the crystal-clear water. The truck's bed is filled with vibrant wildflowers. The
sun sets in the background, casting a golden glow over the scene, while seagulls glide gracefully above.
A young woman with flawless skin and a serene expression sits at a vanity, bathed in
soft morning light. She uses a foundation brush to blend a sheer layer of foundation, creating a natural,
glowing base.
Aerial panoramic view of a breathtaking fantasy land from a drone. The scene features
a vast forest with towering ancient trees, golden leaves shimmering under a mystical twilight sky. A
crystal-clear river winds through the forest, sparkling with an ethereal glow. Snow-capped mountains rise
in the distance, dotted with vibrant, otherworldly flora. A hidden valley holds a grand enchanted castle,
its spires reaching the heavens, surrounded by floating islands and cascading waterfalls. The sky is
painted with purple and pink hues, twinkling stars.
A charming animated scene features a quaint boat with colorful flags sailing on the
serene Seine River, creating ripples that reflect the sunset's soft hues. The Eiffel Tower looms
majestically in the background, while the sky glows with warm oranges and pinks.
Yoda playing guitar on the stage.
Po drinking coffee in a cafe in Paris, Van Gogh style.
A serene coastal scene at sunset with a rocky shoreline extending into the distance.
The sky is a mix of warm oranges and cool blues, with scattered clouds. Palm trees and lush greenery line
the left side, while the calm ocean reflects the colors of the sky. The overall atmosphere is tranquil,
with no visible human presence. The quality is clear, and the style is realistic.
A vibrant underwater scene features a coral reef with large, textured, brown coral
formations in the foreground. Numerous small, colorful fish swim around the corals.
Two animated dogs are perched on a cliff.
A vast, luminous spiral galaxy dominates the scene, with a bright core at its center
emitting a yellowish glow, surrounded by dense blue and white star fields. The arms of the galaxy, adorned
with countless stars, extend outward in a swirling pattern. The background is a deep, dark space dotted
with distant stars, adding to the grandeur of the galaxy. The overall style is realistic, capturing the
immense scale and beauty of the celestial structure.
A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.
Animated scene features a close-up of a short fluffy monster kneeling beside a
melting red candle. The mood of the painting is one of wonder and curiosity, as the monster gazes at the
flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness,
as if it is exploring the world around it for the first time.
A steampunk airship floats above a bustling Victorian city, its brass gears and steam
engines puffing clouds into the sky. Below, the streets are filled with horse-drawn carriages, and the
skyline is dominated by towering smokestacks and ornate buildings, bathed in the golden glow of sunset.
A starlit sky filled with floating lanterns, with a curious cat perched on a moonbeam
to the left.
A sudden eruption of flames engulfs the vehicle, illuminating the surroundings with
an ethereal glow.
Reference
@article{Mimir2025,
  title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
  author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2412.03085},
  year={2025}
}
@article{AnimateX2025,
  title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
  author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2410.10306},
  year={2025}
}
@article{Ranni2024,
  title={Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following},
  author={Feng, Yutong and Gong, Biao and Chen, Di and Shen, Yujun and Liu, Yu and Zhou, Jingren},
  journal={CVPR2024 Oral},
  year={2024}
}