
Abstract

Recent advances in diffusion models have enabled video generation, yet customizing specific motions for diverse subjects remains challenging due to the complexity of semantic alignment and visual dynamics. Most existing motion customization methods focus exclusively on either semantic guidance or visual adaptation, limiting their ability to generate accurate subject-specific customized motions in videos. To address these limitations, we propose \(\texttt{SynMotion}\), a new and powerful framework for motion-customized video generation that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism that disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we design an embedding-specific training strategy that alternately optimizes subject and motion embeddings, supported by a newly constructed Subject Prior Video (SPV) dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both text-to-video (T2V) and image-to-video (I2V) settings demonstrate that \(\texttt{SynMotion}\) significantly outperforms existing baselines.
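
The alternating, embedding-specific training described above can be sketched as follows. This is a minimal toy illustration under our own assumptions: the shapes, the stand-in adapter, and the placeholder denoising loss are hypothetical and do not come from the paper; only the alternation between a subject step on Subject Prior Video (SPV) data and a motion step on customized-motion clips reflects the described strategy.

import torch
import torch.nn as nn

# Minimal toy sketch of the alternating, embedding-specific training strategy.
# Shapes, the stand-in adapter, and the placeholder loss are assumptions, not
# the authors' implementation.

dim = 64                                             # toy embedding width
subject_emb = nn.Parameter(torch.zeros(1, 8, dim))   # learnable subject embedding
motion_emb = nn.Parameter(torch.zeros(1, 8, dim))    # learnable motion embedding
adapter = nn.Linear(dim, dim)                        # stand-in for the motion adapters

opt_subject = torch.optim.AdamW([subject_emb], lr=1e-4)
opt_motion = torch.optim.AdamW([motion_emb, *adapter.parameters()], lr=1e-4)

def toy_denoising_loss(batch):
    # Placeholder for the diffusion denoising loss of the conditioned backbone.
    cond = adapter(subject_emb + motion_emb)
    return (cond.mean() - batch.mean()) ** 2

spv_batch = torch.randn(2, 8, dim)      # toy Subject Prior Video (SPV) sample
motion_batch = torch.randn(2, 8, dim)   # toy customized-motion sample

for step in range(100):
    if step % 2 == 0:
        # Subject step: SPV data preserves generalization across diverse subjects.
        opt_subject.zero_grad()
        toy_denoising_loss(spv_batch).backward()
        opt_subject.step()
    else:
        # Motion step: customized-motion clips promote motion specificity.
        opt_motion.zero_grad()
        toy_denoising_loss(motion_batch).backward()
        opt_motion.step()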

Overall Framework of SynMotion


The pipeline of \(\texttt{SynMotion}\). Given a prompt in the form of \(<\text{subject}, \text{motion}>\), we use an MLLM to obtain the corresponding text embedding, which is then decomposed into a subject embedding \(e_{sub}\) and a motion embedding \(e_{mot}\). Each part is augmented with a learnable embedding (i.e., \(e^l_{sub}\) and \(e^l_{mot}\)) through a zero-initialized convolutional residual (Zero-Conv) \(\mathcal{Z}\). These embeddings are then passed through an Embedding Refiner \(\mathcal{R}\), which fuses subject and motion semantics. The refined embeddings are reintegrated via Zero-Conv \(\mathcal{Z}\) and injected into the video generation backbone. An additional Adapter enhances motion-aware features, enabling the final model to generate videos with customized motion across novel subjects.
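
As a reading aid for the caption above, the sketch below wires the named components together in PyTorch. The module choices (a 1x1 Conv1d for the Zero-Conv \(\mathcal{Z}\), a self-attention block for the Embedding Refiner \(\mathcal{R}\)), the tensor shapes, and the final concatenation are our own illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

# Minimal sketch of the semantic branch in the figure above (our own reading
# of the caption; module choices and shapes are assumptions).

class ZeroConv(nn.Module):
    """Zero-initialized convolution used as a residual gate (Z in the figure)."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):                      # x: (B, L, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class EmbeddingRefiner(nn.Module):
    """Stand-in for the refiner R that fuses subject and motion semantics."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, e_sub, e_mot):
        fused = torch.cat([e_sub, e_mot], dim=1)
        out, _ = self.attn(fused, fused, fused)
        return out[:, : e_sub.size(1)], out[:, e_sub.size(1):]

dim, L = 64, 8                                  # toy sizes
e_sub = torch.randn(1, L, dim)                  # subject part of the MLLM text embedding
e_mot = torch.randn(1, L, dim)                  # motion part of the MLLM text embedding
e_sub_l = nn.Parameter(torch.zeros(1, L, dim))  # learnable subject embedding e^l_sub
e_mot_l = nn.Parameter(torch.zeros(1, L, dim))  # learnable motion embedding  e^l_mot

zero_in, zero_out = ZeroConv(dim), ZeroConv(dim)
refiner = EmbeddingRefiner(dim)

# Augment each part with its learnable embedding through a Zero-Conv residual,
# refine jointly, then reinject before conditioning the video backbone.
e_sub_aug = e_sub + zero_in(e_sub_l)
e_mot_aug = e_mot + zero_in(e_mot_l)
r_sub, r_mot = refiner(e_sub_aug, e_mot_aug)
cond = torch.cat([e_sub + zero_out(r_sub), e_mot + zero_out(r_mot)], dim=1)
# `cond` would then condition the video generation backbone, alongside the
# parameter-efficient motion adapters inside its blocks.

Because the residual convolutions start at zero, the learnable embeddings initially leave the pretrained text conditioning unchanged, so customization departs gradually from the backbone's original behavior.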

Motion Video Customization Results of SynMotion

Motion Video Customization Results of SynMotion in the Image-to-Video Setting

Qualitative Comparison of Motion Video Customization

Reference

      
        @article{tan2025SynMotion,
          title={SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation},
          author={Tan, Shuai and Gong, Biao and Wei, Yujie and Zhang, Shiwei and Liu, Zhuoxin and Zheng, Dandan and Chen, Jingdong and Wang, Yan and Ouyang, Hao and Zheng, Kecheng and Shen, Yujun},
          journal={arXiv preprint arXiv:2506.23690},
          year={2025}
        }

        @inproceedings{Mimir2025,
          title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
          author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
          booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2025}
        }

        @inproceedings{AnimateX2025,
          title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
          author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
          booktitle={International Conference on Learning Representations},
          year={2025}
        }