Video small object detection is difficult because small objects provide limited object information. Recent methods add more temporal information to obtain stronger high-level features, but they often fail to identify the information most vital for small objects, which results in insufficient or inappropriate features. Because frames at different temporal positions contribute differently to small objects, a single universal extraction method is unlikely to yield proper features. We find that context information from the long-term frame and temporal information from the short-term frame are two useful cues for video small object detection. To fully exploit these two cues, we propose a long short-term feature enhancement network (LSTFE-Net). First, we develop a plug-and-play spatio-temporal feature alignment module to create temporal correspondences between the short-term and current frames. Then, we propose a frame selection module to select the long-term frame that provides the most additional context information. Finally, we propose a long short-term feature aggregation module to fuse the long-term and short-term features. Compared with other state-of-the-art methods, LSTFE-Net achieves an absolute improvement of 4.4% in AP on the FL-Drones dataset. More details can be found at https://github.com/xiaojs18/LSTFE-Net.
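The select-then-fuse flow described above can be illustrated with a minimal sketch. This is not the LSTFE-Net implementation: the function names, the selection rule (choosing the long-term candidate least similar to the current frame as a proxy for "most additional context"), and the fusion rule (softmax-weighted averaging over cosine similarity) are all assumptions made for illustration. Features are represented as flat vectors, and the short-term feature is assumed to have been aligned to the current frame already.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_long_term(current, candidates):
    """Hypothetical selection rule: return the index of the candidate that is
    least similar to the current frame, as a proxy for added context."""
    return min(range(len(candidates)),
               key=lambda i: cosine(current, candidates[i]))


def aggregate(current, supports):
    """Hypothetical fusion rule: weight the current and support features by a
    softmax over their cosine similarity to the current feature, then average."""
    feats = [current] + list(supports)
    sims = [cosine(current, f) for f in feats]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * f[d] for w, f in zip(weights, feats))
            for d in range(len(current))]


def enhance(current, short_term_aligned, long_term_candidates):
    """Select one long-term frame, then fuse it with the aligned short-term
    feature and the current feature."""
    idx = select_long_term(current, long_term_candidates)
    fused = aggregate(current, [short_term_aligned, long_term_candidates[idx]])
    return idx, fused
```

Calling `enhance(current, short_term_aligned, long_term_candidates)` returns the index of the selected long-term candidate together with the fused feature vector. Because the fusion weights sum to one, each fused component stays within the range spanned by the input features.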