
Abstract

Recent advances in diffusion models have enabled video generation, yet customizing specific motions for diverse subjects remains challenging due to the complexity of semantic alignment and visual dynamics. Most existing motion customization methods focus exclusively on either semantic guidance or visual adaptation, limiting their ability to generate accurate subject-specific customized motions in videos. To address these limitations, we propose \(\texttt{SynMotion}\), a new and powerful framework for motion-customized video generation that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we design an embedding-specific training strategy that alternately optimizes subject and motion embeddings, supported by a newly constructed Subject Prior Video (SPV) dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both text-to-video (T2V) and image-to-video (I2V) settings demonstrate that \(\texttt{SynMotion}\) significantly outperforms existing baselines.
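
The alternating, embedding-specific training described above can be sketched as follows. This is a minimal toy illustration under our own assumptions: the shapes, the stand-in adapter, and the placeholder denoising loss are hypothetical and do not come from the paper; only the alternation between a subject step on Subject Prior Video (SPV) data and a motion step on customized-motion clips reflects the described strategy.

import torch
import torch.nn as nn

# Minimal toy sketch of the alternating, embedding-specific training strategy.
# Shapes, the stand-in adapter, and the placeholder loss are assumptions, not
# the authors' implementation.

dim = 64                                             # toy embedding width
subject_emb = nn.Parameter(torch.zeros(1, 8, dim))   # learnable subject embedding
motion_emb = nn.Parameter(torch.zeros(1, 8, dim))    # learnable motion embedding
adapter = nn.Linear(dim, dim)                        # stand-in for the motion adapters

opt_subject = torch.optim.AdamW([subject_emb], lr=1e-4)
opt_motion = torch.optim.AdamW([motion_emb, *adapter.parameters()], lr=1e-4)

def toy_denoising_loss(batch):
    # Placeholder for the diffusion denoising loss of the conditioned backbone.
    cond = adapter(subject_emb + motion_emb)
    return (cond.mean() - batch.mean()) ** 2

spv_batch = torch.randn(2, 8, dim)      # toy Subject Prior Video (SPV) sample
motion_batch = torch.randn(2, 8, dim)   # toy customized-motion sample

for step in range(100):
    if step % 2 == 0:
        # Subject step: SPV data preserves generalization across diverse subjects.
        opt_subject.zero_grad()
        toy_denoising_loss(spv_batch).backward()
        opt_subject.step()
    else:
        # Motion step: customized-motion clips promote motion specificity.
        opt_motion.zero_grad()
        toy_denoising_loss(motion_batch).backward()
        opt_motion.step()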

Overall Framework of SynMotion


The pipeline of \(\texttt{SynMotion}\). Given a prompt in the form of \(<\text{subject}, \text{motion}>\), we use an MLLM to obtain the corresponding text embedding, which is then decomposed into a subject embedding \(e_{sub}\) and a motion embedding \(e_{mot}\). Each part is augmented with a learnable embedding (i.e., \(e^l_{sub}\) and \(e^l_{mot}\)) through a zero-initialized convolutional residual (Zero-Conv) \(\mathcal{Z}\). These embeddings are then passed through an Embedding Refiner \(\mathcal{R}\), which fuses subject and motion semantics. The refined embeddings are reintegrated via Zero-Conv \(\mathcal{Z}\) and injected into the video generation backbone. An additional Adapter enhances motion-aware features, enabling the final model to generate videos with customized motion across novel subjects.
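
As a reading aid for the caption above, the sketch below wires the named components together in PyTorch. The module choices (a 1x1 Conv1d for the Zero-Conv \(\mathcal{Z}\), a self-attention block for the Embedding Refiner \(\mathcal{R}\)), the tensor shapes, and the final concatenation are our own illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

# Minimal sketch of the semantic branch in the figure above (our own reading
# of the caption; module choices and shapes are assumptions).

class ZeroConv(nn.Module):
    """Zero-initialized convolution used as a residual gate (Z in the figure)."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):                      # x: (B, L, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class EmbeddingRefiner(nn.Module):
    """Stand-in for the refiner R that fuses subject and motion semantics."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, e_sub, e_mot):
        fused = torch.cat([e_sub, e_mot], dim=1)
        out, _ = self.attn(fused, fused, fused)
        return out[:, : e_sub.size(1)], out[:, e_sub.size(1):]

dim, L = 64, 8                                  # toy sizes
e_sub = torch.randn(1, L, dim)                  # subject part of the MLLM text embedding
e_mot = torch.randn(1, L, dim)                  # motion part of the MLLM text embedding
e_sub_l = nn.Parameter(torch.zeros(1, L, dim))  # learnable subject embedding e^l_sub
e_mot_l = nn.Parameter(torch.zeros(1, L, dim))  # learnable motion embedding  e^l_mot

zero_in, zero_out = ZeroConv(dim), ZeroConv(dim)
refiner = EmbeddingRefiner(dim)

# Augment each part with its learnable embedding through a Zero-Conv residual,
# refine jointly, then reinject before conditioning the video backbone.
e_sub_aug = e_sub + zero_in(e_sub_l)
e_mot_aug = e_mot + zero_in(e_mot_l)
r_sub, r_mot = refiner(e_sub_aug, e_mot_aug)
cond = torch.cat([e_sub + zero_out(r_sub), e_mot + zero_out(r_mot)], dim=1)
# `cond` would then condition the video generation backbone, alongside the
# parameter-efficient motion adapters inside its blocks.

Because the residual convolutions start at zero, the learnable embeddings initially leave the pretrained text conditioning unchanged, so customization departs gradually from the backbone's original behavior.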

Motion Video Customization Results of SynMotion

Motion Video Customization Results of SynMotion in the Image-to-Video Setting

Qualitative Comparison of Motion Video Customization

Reference

      
        @article{tan2025SynMotion,
          title={SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation},
          author={Tan, Shuai and Gong, Biao and Wei, Yujie and Zhang, Shiwei and Liu, Zhuoxin and Zheng, Dandan and Chen, Jingdong and Wang, Yan and Ouyang, Hao and Zheng, Kecheng and Shen, Yujun},
          journal={arXiv preprint arXiv:2506.23690},
          year={2025}
        }

        @inproceedings{Mimir2025,
          title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
          author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2025}
        }

        @inproceedings{AnimateX2025,
          title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
          author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
          booktitle={International Conference on Learning Representations},
          year={2025}
        }