
Abstract

Physical principles are fundamental to realistic visual simulation, yet they remain a significant oversight in transformer-based video generation. This gap is most evident in the rendering of rigid-body motion, a core tenet of classical mechanics. While computer graphics engines and physics-based simulators can readily model such collisions with Newtonian mechanics, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated merely as conditions (i.e., suboptimal solutions) during post-training optimization, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring that physics knowledge is strictly applied rather than treated as a condition. We then extend this paradigm into a unified framework, termed the \(\textbf{Mimicry-Discovery Cycle}\) (MDcycle), which allows substantial fine-tuning while fully preserving the model's ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, \(\textbf{PhysRVGBench}\), and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness. Our code and checkpoints will be released publicly soon.

Overall Framework of PhysRVG


The framework of \(\textbf{PhysRVG}\). Given a text prompt and context frames, the model generates future video frames. For both the ground-truth and sampled frames, we derive motion masks \(M\) by prompting SAM2 with object coordinates \(p_1\) from the first frame, which are manually annotated during data preprocessing. We then compute object trajectories \(P\) and perform collision detection. The trajectory offset \(O\) between the sampled and ground-truth trajectories is calculated and reweighted by the collision signal \(w_t\) to yield a weighted trajectory offset \(O_c\), which serves as the per-sample score. All transformer blocks are trained with full parameters.
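The scoring step above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the Euclidean per-frame offset, the averaging, and the distance-threshold collision criterion are all assumptions made for clarity.

```python
import numpy as np

def weighted_trajectory_offset(traj_sampled, traj_gt, collision_weights):
    """Per-sample score: trajectory offset O reweighted by a collision
    signal w_t to yield O_c (illustrative shapes and names).

    traj_sampled, traj_gt: (T, 2) arrays of object centroids per frame.
    collision_weights: (T,) array of weights w_t.
    """
    # Per-frame Euclidean offset between sampled and ground-truth positions
    offset = np.linalg.norm(traj_sampled - traj_gt, axis=1)  # O, shape (T,)
    # Reweight by the collision signal and reduce to a scalar score O_c
    return float(np.mean(collision_weights * offset))

def toy_collision_weights(traj_a, traj_b, radius=5.0, boost=2.0):
    """Toy collision signal: upweight frames where two object trajectories
    come within `radius` pixels of each other (assumed criterion)."""
    dist = np.linalg.norm(traj_a - traj_b, axis=1)
    return np.where(dist < radius, boost, 1.0)
```

For example, a sampled trajectory displaced by one pixel in each axis on every frame, with uniform weights, yields a score of \(\sqrt{2}\); frames flagged by the collision signal contribute proportionally more to the final score.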

Videos Generated by PhysRVG

Qualitative Comparison

Ablation Study of the Collision Detection

AI Billiards Game

Note: The model generation process has been accelerated for a smoother viewing experience.

Reference

      
@article{PhysRVG2025,
  title={PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models},
  author={Zhang, Qiyuan and Gong, Biao and Tan, Shuai and Zhang, Zheng and Shen, Yujun and Zhu, Xing and Li, Yuyuan and Yao, Kelu and Shen, Chunhua and Zou, Changqing},
  journal={arXiv preprint arXiv:2601.11087},
  year={2025}
}