Abstract:
This paper presents DiffusionRegPose, a novel approach to multi-person pose estimation that converts a one-stage, end-to-end keypoint regression model into a diffusion-based sampling process. Existing one-stage deterministic regression methods, though efficient, are prone to missed or false detections in crowded or occluded scenes because they cannot reason about pose ambiguity. To address this limitation, we handle ambiguous poses generatively, i.e., by sampling from the image-conditioned pose distribution characterized by a diffusion probabilistic model. Specifically, given initial pose tokens extracted from the image, noisy pose candidates are progressively refined by interacting with these tokens via attention layers. Extensive evaluations on the COCO and CrowdPose datasets show that DiffusionRegPose improves pose accuracy in crowded scenarios, as evidenced by a 3.3 AP increase in the $AP_H$ metric on the CrowdPose dataset. This demonstrates the model's potential for robust and precise human pose estimation in real-world applications.
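The refinement idea described above can be illustrated with a toy sketch. It is not the architecture of DiffusionRegPose: the function names (`cross_attention`, `refine_poses`), the random projection matrices standing in for learned weights, the step count, and the linear blending schedule are all assumptions made for illustration, and the update is a simplified interpolation rather than a proper DDPM or DDIM step. It only shows the overall flow: pose candidates start as Gaussian noise and are repeatedly updated by attending to image-derived pose tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def refine_poses(pose_tokens, num_keypoints=17, num_steps=4, seed=0):
    """Toy iterative refinement of noisy pose candidates (hypothetical).

    pose_tokens: (N, D) array standing in for image-derived pose tokens.
    Returns an (N, num_keypoints, 2) array of normalized coordinates.
    """
    rng = np.random.default_rng(seed)
    n, d = pose_tokens.shape
    # Assumption: random projections replace the learned layers.
    w_in = rng.normal(scale=d ** -0.5, size=(2 * num_keypoints, d))
    w_out = rng.normal(scale=d ** -0.5, size=(d, 2 * num_keypoints))

    # Start from pure Gaussian noise, one candidate per pose token.
    poses = rng.normal(size=(n, 2 * num_keypoints))
    for step in range(num_steps, 0, -1):
        # Embed the noisy candidates and let them attend to the tokens.
        queries = poses @ w_in
        context = cross_attention(queries, pose_tokens, pose_tokens)
        # Predict a "clean" pose in [-1, 1] from the attended context.
        predicted_clean = np.tanh(context @ w_out)
        # Simplified schedule: keep a shrinking fraction of the noisy pose.
        keep = (step - 1) / num_steps
        poses = keep * poses + (1.0 - keep) * predicted_clean
    return poses.reshape(n, num_keypoints, 2)

tokens = np.random.default_rng(1).normal(size=(5, 32))
candidates = refine_poses(tokens)
print(candidates.shape)
```

Because sampling starts from random noise, different seeds yield different pose hypotheses for the same tokens, which is the property a generative formulation relies on when a pose is ambiguous.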