Poster
DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
Chenyang Wang · Zerong Zheng · Tao Yu · Xiaoqian Lv · Bineng Zhong · Shengping Zhang · Liqiang Nie
Arch 4A-E Poster #133
Existing diffusion models for pose-guided human video generation mostly suffer from temporal inconsistency in the generated appearance and poses due to the inherently stochastic nature of the generation process. In this paper, we propose a novel framework, DiffPerformer, to synthesize high-fidelity and temporally consistent human videos. Without complex architecture modifications or costly training, DiffPerformer finetunes a pretrained diffusion model on a single video of the target character and introduces an implicit video representation as a proxy to learn temporally consistent guidance for the diffusion model. The guidance is encoded into the VAE latent space, and an iterative optimization loop is constructed between the implicit video representation and the diffusion model, allowing us to harness the smoothness of the implicit video representation and the generative capabilities of the diffusion model in a mutually beneficial way. Moreover, we propose 3D-aware human flow as a temporal constraint during the optimization to explicitly model the correspondence between driving poses and human appearance. This alleviates the misalignment between the guiding poses and the target performer and therefore maintains appearance coherence under various motions. Extensive experiments demonstrate that our method outperforms state-of-the-art methods.
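To make the iterative-guidance idea concrete, below is a minimal sketch of the optimization loop described in the abstract: an implicit video representation (here assumed to be a small coordinate MLP producing VAE-latent features) is repeatedly fitted to latents refined by a diffusion model, under a temporal consistency term. All names (`ImplicitVideo`, `refine_with_diffusion`, `temporal_flow_loss`), loss weights, and the simplified consistency term are illustrative assumptions, not the authors' implementation; in particular, the paper's 3D-aware human flow constraint is replaced by a plain placeholder.

```python
# Illustrative sketch only; not the authors' code.
import torch
import torch.nn as nn

class ImplicitVideo(nn.Module):
    """Assumed coordinate MLP: (x, y, t) -> latent feature, acting as the proxy
    video in VAE latent space."""
    def __init__(self, latent_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, coords):  # coords: (N, 3) in [-1, 1]
        return self.net(coords)

def refine_with_diffusion(latents):
    """Hypothetical stand-in for one refinement pass of the pretrained diffusion
    model over the proxy latents (e.g. a noise-and-denoise step)."""
    return latents + 0.05 * torch.randn_like(latents)

def temporal_flow_loss(latents_t, latents_t1):
    """Placeholder temporal constraint; the paper instead uses 3D-aware human
    flow to align frame t with frame t+1 before comparison."""
    return (latents_t - latents_t1).pow(2).mean()

video = ImplicitVideo()
opt = torch.optim.Adam(video.parameters(), lr=1e-4)
coords_t = torch.rand(1024, 3) * 2 - 1                 # sampled (x, y, t) coordinates
coords_t1 = coords_t + torch.tensor([0.0, 0.0, 0.1])   # same pixels at the next time step

# Iterative loop between the implicit video representation and the diffusion model.
for step in range(100):
    latents_t = video(coords_t)
    latents_t1 = video(coords_t1)
    target = refine_with_diffusion(latents_t.detach())  # diffusion-refined latents as targets
    recon_loss = (latents_t - target).pow(2).mean()     # pull proxy toward refined latents
    flow_loss = temporal_flow_loss(latents_t, latents_t1)
    loss = recon_loss + 0.1 * flow_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```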