
Abstract

Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel at single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and the reference, and to an inability to consistently rebind motion to the intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that animates arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both the poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module that leverages semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to the intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce CoDanceBench, a new multi-subject benchmark. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves state-of-the-art (SOTA) performance and generalizes remarkably well across diverse subjects and spatial layouts. The code and weights will be open-sourced.
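
To make the Unbind idea concrete, below is a minimal PyTorch sketch of such a stochastic perturbation. It is illustrative only: the function name unbind_perturb, the tensor shapes, and the hyperparameters shift_range and noise_std are assumptions for exposition, not CoDance's actual pose shift encoder.

import torch

def unbind_perturb(pose_maps, pose_feats, shift_range=0.1, noise_std=0.05):
    # pose_maps:  (B, F, C, H, W) rendered driving-pose frames
    # pose_feats: (B, F, D) latent pose features from a pose encoder
    _, _, _, h, w = pose_maps.shape
    # Sample one spatial shift per clip so the whole sequence moves coherently,
    # breaking pixel-wise alignment with the reference image.
    dx = int((2 * torch.rand(1).item() - 1) * shift_range * w)
    dy = int((2 * torch.rand(1).item() - 1) * shift_range * h)
    shifted_maps = torch.roll(pose_maps, shifts=(dy, dx), dims=(-2, -1))
    # Perturb the latent pose features as well, encouraging a
    # location-agnostic motion representation.
    noisy_feats = pose_feats + noise_std * torch.randn_like(pose_feats)
    return shifted_maps, noisy_feats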

Overall Framework of CoDance


The pipeline of CoDance. Given a reference image \(I^r\), a driving pose sequence \(I^p_{1:F}\), a text prompt \(\mathcal{T}\), and a subject mask \(\mathcal{M}\), our model generates an animation video \(I^g_{1:F}\). A VAE encoder extracts the latent feature \(f^r_e\) from \(I^r\). The Unbind module, implemented as a Pose Shift Encoder, processes \(I^p_{1:F}\) to produce pose features, which are concatenated with the patchified tokens of the noisy latent input to the DiT backbone. The Rebind module provides dual guidance: semantic features from the umT5 text encoder are injected via cross-attention, while spatial features from a Mask Encoder are added element-wise to the noisy latent. To bolster the model's semantic comprehension, training alternates between animation data (with probability \(p_\text{ani}\)) and a diverse text-to-video dataset (with probability \(1-p_\text{ani}\)). The DiT is initialized from a pretrained T2V model and fine-tuned with LoRA. Finally, a VAE decoder reconstructs the video. Note that the Unbind module and the mixed-data training are applied exclusively during the training phase.
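
As a rough sketch of how these conditioning signals could be assembled for the DiT backbone, the PyTorch snippet below concatenates pose tokens with patchified noisy-latent tokens, adds mask features element-wise in latent space, and passes text embeddings through as cross-attention context. The module and layer names (ConditionAssembler, pose_proj, mask_proj) and all shapes are assumptions for illustration, not the released CoDance implementation.

import torch
import torch.nn as nn

class ConditionAssembler(nn.Module):
    def __init__(self, latent_dim=16, token_dim=1024, patch=2):
        super().__init__()
        # Patchify the (mask-conditioned) noisy latent into DiT tokens.
        self.patchify = nn.Conv3d(latent_dim, token_dim,
                                  kernel_size=(1, patch, patch),
                                  stride=(1, patch, patch))
        # Stand-ins for the Pose Shift Encoder output projection and the Mask Encoder.
        self.pose_proj = nn.Linear(token_dim, token_dim)
        self.mask_proj = nn.Conv3d(1, latent_dim, kernel_size=1)

    def forward(self, noisy_latent, pose_tokens, mask, text_emb):
        # noisy_latent: (B, C, F, H, W); pose_tokens: (B, M, D)
        # mask: (B, 1, F, H, W);         text_emb: (B, L, D)
        x = noisy_latent + self.mask_proj(mask)               # spatial guidance, element-wise
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, D) latent tokens
        tokens = torch.cat([tokens, self.pose_proj(pose_tokens)], dim=1)  # append pose tokens
        return tokens, text_emb  # text_emb is consumed via cross-attention in each DiT block

Adding the mask features in latent space keeps the spatial guidance resolution-aligned with the video latent, while routing the text features through cross-attention keeps the semantic guidance independent of spatial layout.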

Comparison with SOTA methods

Animating characters in games and cartoons

Animating interesting characters

Animating celebrities

Animating characters using multi-subject pose images

Animating single subject

Animating multiple subjects

Animating long videos with music

(Please click the speaker icon at the bottom right of the video and turn on your speakers)

References

@article{CoDance2025,
  title={CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation},
  author={Tan, Shuai and Gong, Biao and Ma, Ke and Feng, Yutong and Zhang, Qiyuan and Wang, Yan and Shen, Yujun and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2601.11096},
  year={2025}
}
@article{AnimateX++2025,
  title={Animate-X++: Universal Character Image Animation with Dynamic Backgrounds},
  author={Tan, Shuai and Gong, Biao and Liu, Zhuoxin and Wang, Yan and Feng, Yifan and Chen, Xi and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2508.09545},
  year={2025}
}
@inproceedings{AnimateX2025,
  title={Animate-X: Universal Character Image Animation with Enhanced Motion Representation},
  author={Tan, Shuai and Gong, Biao and Wang, Xiang and Zhang, Shiwei and Zheng, Dandan and Zheng, Ruobin and Zheng, Kecheng and Chen, Jingdong and Yang, Ming},
  booktitle={International Conference on Learning Representations},
  year={2025}
}
@article{tan2025SynMotion,
  title={SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation},
  author={Tan, Shuai and Gong, Biao and Wei, Yujie and Zhang, Shiwei and Liu, Zhuoxin and Zheng, Dandan and Chen, Jingdong and Wang, Yan and Ouyang, Hao and Zheng, Kecheng and Shen, Yujun},
  journal={arXiv preprint arXiv:2506.23690},
  year={2025}
}
@inproceedings{Mimir2025,
  title={Mimir: Improving Video Diffusion Models for Precise Text Understanding},
  author={Tan, Shuai and Gong, Biao and Feng, Yutong and Zheng, Kecheng and Zheng, Dandan and Shi, Shuwei and Shen, Yujun and Chen, Jingdong and Yang, Ming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}