Temporal grounding of text descriptions in video is an important task in vision-language learning and remains a challenging problem in video understanding. Existing methods focus on grounding a handful of text queries within minute-long videos, yet fail to scale to hour-long videos with hundreds of queries. In this paper, we present a systematic study on the design of scalable video grounding models. We compare design choices for cross-modal fusion, analyze their computational cost, and identify a key insight and a new training scheme that together enable scalable video grounding. We further present a simple model that follows our key findings. Our model attains superior accuracy and efficiency on recent benchmarks for long-form video grounding, while remaining highly competitive on previous benchmarks comprising short videos.