Human-Object Interaction (HOI) detection is a core task in human-centric scene understanding, requiring both precise object detection and accurate interaction recognition. Despite steady advances in detection, recognizing subtle and intricate interactions remains challenging. Recent methods have sought to leverage the rich semantic representations of pre-trained CLIP, yet they fail to efficiently capture the finer-grained spatial features that are highly informative for discriminating interactions. In this work, rather than relying solely on CLIP representations, we fill this gap by proposing a spatial adapter that efficiently exploits the multi-scale spatial information already present in the pre-trained detector, yielding a bilateral adaptation that produces complementary features. Moreover, we design a Conditional Contextual Mining module that further mines informative contextual cues from the spatial features via a tailored cross-attention mechanism. To improve interaction recognition under occlusion, which is common in crowded scenes, we further propose an Occluded Part Extrapolation module that trains the model to recover spatial details from deliberately occluded feature maps. Extensive experiments on the V-COCO and HICO-DET benchmarks demonstrate that our method significantly outperforms prior art in both the conventional and zero-shot settings, establishing a new state of the art. Ablation studies further validate the effectiveness of each component of our method.
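To make the occlusion-driven training signal concrete, below is a minimal PyTorch sketch in the spirit of the Occluded Part Extrapolation idea: random spatial locations of a detector feature map are masked out, and a lightweight decoder is penalized for failing to recover them. This is an illustrative assumption, not the paper's implementation; every name here (OccludedPartExtrapolation, mask_ratio, the decoder layout) is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OccludedPartExtrapolation(nn.Module):
    """Hypothetical sketch: mask random spatial positions of a feature
    map and train a light decoder to extrapolate the occluded parts,
    encouraging robustness to occlusion in crowded scenes."""

    def __init__(self, dim: int, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Lightweight decoder that reconstructs occluded features
        # from the surrounding visible context.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) spatial features from the detector branch.
        B, C, H, W = feats.shape
        # Keep mask: 1 where a location stays visible, 0 where occluded.
        keep = (torch.rand(B, 1, H, W, device=feats.device)
                > self.mask_ratio).float()
        occluded = feats * keep
        # Extrapolate the masked regions from visible context.
        recon = self.decoder(occluded)
        # Reconstruction loss computed only on the occluded locations.
        per_elem = F.mse_loss(recon, feats.detach(), reduction="none")
        masked = per_elem * (1.0 - keep)
        return masked.sum() / ((1.0 - keep).sum() * C + 1e-6)

In training, such a term would simply be added to the usual detection and interaction losses, e.g. total_loss = det_loss + hoi_loss + lambda_ope * ope(feats), with lambda_ope a weighting hyperparameter.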