Abstract:
Human-centric video frame interpolation has great potential for enhancing entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available for video frame interpolation in the community, none of them is dedicated to human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark featuring over 130K high-resolution ($\geq$720p) slow-motion sports video clips, totaling over 1M video frames, sourced from YouTube. We re-train several state-of-the-art methods on our benchmark, and we observed a noticeable decrease in their accuracy compared to other datasets. This highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To tackle these challenges, we propose human-aware loss terms, where we add auxiliary supervision for human segmentation in panoptic settings and keypoints detection. These loss terms are model-agnostic and can be easily plugged into any video frame interpolation approach. Experimental results validate the effectiveness of our proposed human-aware loss terms, leading to consistent performance improvement over existing models. The dataset and code will be publicly released to foster future research.
Chat is not available.