Poster
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
Tao Wu · Runyu He · Gangshan Wu · Limin Wang
Arch 4A-E Poster #379
Video visual relation detection tasks, such as video scene graph generation, play an important role in fine-grained video understanding. However, current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. Firstly, they do not explore complex human-human interactions in multi-person scenarios. Secondly, the relation types they define have a relatively low semantic level and can often be recognized by appearance or prior information, without the need for detailed spatio-temporal context reasoning. Nevertheless, comprehending high-level interactions between humans is crucial for understanding complex multi-person videos, such as sports and surveillance videos. To address this gap, we propose a new video visual relation detection task, video human-human interaction detection, and introduce a novel dataset named SportsHHI. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.