Video Frame Interpolation (VFI) has witnessed a surge in popularity due to its abundant downstream applications. Event-based VFI (E-VFI) has recently propelled the advancement of VFI: thanks to their high temporal resolution, event cameras can bridge the informational gap between successive video frames. Most state-of-the-art E-VFI methods follow the conventional VFI paradigm, which hinges on motion estimation between consecutive frames to generate intermediate frames through warping and refinement. This reliance, however, makes them heavily dependent on the quality and consistency of keyframes, leaving them vulnerable in extreme real-world scenarios such as moving objects missing from the keyframes and severe occlusion. This study proposes a novel E-VFI framework that directly synthesizes intermediate frames from an event-based reference, obviating the need for explicit motion estimation and substantially improving robustness to motion occlusion. Given the sparse and inherently noisy nature of event data, we prioritize the reliability of this reference and develop an event-aware reconstruction strategy for accurate reference generation. In addition, we perform bi-directional event-guided alignment from the keyframes to the reference with the proposed E-PCD module. Finally, a transformer-based decoder refines the prediction. Comprehensive experiments on both synthetic and real-world datasets demonstrate the superiority of our approach and its ability to perform high-quality VFI.
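To make the described pipeline concrete, below is a minimal PyTorch sketch of the three stages: event-aware reconstruction of the reference, bi-directional event-guided alignment of the two keyframes to that reference, and decoder-based refinement. All module names, signatures, the voxel-grid event representation, and the internals are illustrative assumptions; in particular, the E-PCD module and the transformer-based decoder are replaced by trivial convolutional stand-ins so the sketch stays short and runnable.

```python
# Illustrative sketch only: every module here is a hypothetical stand-in
# for a component named in the abstract, not the authors' implementation.
import torch
import torch.nn as nn


class EventAwareReconstructor(nn.Module):
    """Stage 1: synthesize an event-based intermediate reference frame."""

    def __init__(self, event_bins, ch=64):
        super().__init__()
        # Two RGB keyframes (6 ch) + two event voxel grids -> RGB reference.
        self.net = nn.Sequential(
            nn.Conv2d(6 + 2 * event_bins, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, f0, f1, ev_0t, ev_t1):
        return self.net(torch.cat([f0, f1, ev_0t, ev_t1], dim=1))


class EPCDAlignment(nn.Module):
    """Stage 2: event-guided alignment of one keyframe to the reference.

    A real E-PCD module would align features (e.g. with pyramid, cascading,
    deformable convolutions); a plain conv stands in here for runnability.
    """

    def __init__(self, event_bins, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6 + event_bins, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, frame, ref, events):
        return self.net(torch.cat([frame, ref, events], dim=1))


class RefineDecoder(nn.Module):
    """Stage 3: refine the reference from the aligned keyframe features.

    The paper adopts a transformer-based decoder; a conv head is used here
    purely for brevity.
    """

    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, ref, feat0, feat1):
        # Predict a residual correction on top of the reference.
        return ref + self.net(torch.cat([ref, feat0, feat1], dim=1))


class EVFISketch(nn.Module):
    def __init__(self, event_bins=5, ch=64):
        super().__init__()
        self.reconstruct = EventAwareReconstructor(event_bins, ch)
        self.align = EPCDAlignment(event_bins, ch)
        self.decode = RefineDecoder(ch)

    def forward(self, f0, f1, ev_0t, ev_t1):
        ref = self.reconstruct(f0, f1, ev_0t, ev_t1)  # event-based reference
        feat0 = self.align(f0, ref, ev_0t)            # keyframe 0 -> reference
        feat1 = self.align(f1, ref, ev_t1)            # keyframe 1 -> reference
        return self.decode(ref, feat0, feat1)         # refined intermediate frame


if __name__ == "__main__":
    f0 = torch.rand(1, 3, 64, 64)     # keyframe at t = 0
    f1 = torch.rand(1, 3, 64, 64)     # keyframe at t = 1
    ev_0t = torch.rand(1, 5, 64, 64)  # events in [0, t], voxelized into 5 bins
    ev_t1 = torch.rand(1, 5, 64, 64)  # events in [t, 1]
    out = EVFISketch()(f0, f1, ev_0t, ev_t1)
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```

Note the structural point this sketch tries to capture: the intermediate frame is synthesized around an event-based reference rather than warped from the keyframes, so no explicit motion estimation appears anywhere in the forward pass.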