In this paper, we propose a novel method for understanding complex semantic structures in long video inputs. Conventional approaches to video understanding have focused on short clips, training convolutional neural networks or transformer architectures to extract visual representations from them. However, most real-world videos are long, ranging from minutes to hours; dividing them into small clips and learning clip-level representations therefore imposes fundamental limitations on understanding their overall semantic structure. To tackle this problem, we propose a new algorithm that learns video representations at the object level, defining spatiotemporal high-order relationships among these representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method that learns disentangled features for each semantic unit. Using the proposed method, we address a challenging video task: compositional generalization to unseen videos. In experiments on CATER and newly proposed compositional generalization datasets, we demonstrate new state-of-the-art performance.