We present a method for visual SLAM that generalizes to unseen scenes without retraining and offers fast optimization. Existing methods struggle to generalize to novel scenes, as they are optimized on a per-scene basis. Recently, neural scene representations have shown promise in SLAM, producing high-quality dense 3D reconstructions at the cost of long training times. To overcome these limitations on generalization and efficiency, we propose IBD-SLAM, an Image-Based Depth fusion framework for generalizable SLAM. In particular, we adopt a Neural Radiance Field (NeRF) for scene representation. Inspired by image-based rendering, instead of learning a fixed-grid scene representation, we propose to learn image-based depth fusion, deriving xyz-maps from the depth maps inferred from the given images. Once trained, the model can be applied to new uncalibrated monocular RGB-D videos of unseen scenes without retraining. For any new scene, only the pose parameters need to be optimized, which is very efficient. We thoroughly evaluate IBD-SLAM on public visual SLAM benchmarks, where it outperforms the previous state of the art while being 10 times faster.
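To make the notion of an xyz-map concrete, the sketch below illustrates one common way to back-project a depth map into per-pixel 3D world coordinates. This is a minimal illustration, not the authors' implementation: the pinhole intrinsics K, the camera-to-world pose T_cam_to_world, and the function name depth_to_xyz_map are all assumed for illustration.

```python
import numpy as np

def depth_to_xyz_map(depth, K, T_cam_to_world):
    """Back-project a depth map into a per-pixel xyz-map in world coordinates.

    depth:          (H, W) depth values in meters
    K:              (3, 3) pinhole camera intrinsics
    T_cam_to_world: (4, 4) camera-to-world pose
    Returns an (H, W, 3) array of world-space points.
    """
    H, W = depth.shape

    # Build the homogeneous pixel grid [u, v, 1] for every pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)

    # Unproject to camera space: X_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T          # (H, W, 3) viewing rays
    xyz_cam = rays * depth[..., None]        # (H, W, 3) camera-space points

    # Transform to world space with the camera-to-world pose.
    xyz_h = np.concatenate([xyz_cam, np.ones((H, W, 1))], axis=-1)  # (H, W, 4)
    xyz_world = xyz_h @ T_cam_to_world.T                            # (H, W, 4)
    return xyz_world[..., :3]
```

Under this sketch, refining only the pose parameters at test time amounts to keeping the depth network fixed and optimizing T_cam_to_world per frame, which is consistent with the pose-only optimization described above.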