Humans can quickly assess how different parts of a scene would feel if touched. However, this ability still eludes current techniques in scene reconstruction. This work presents a scene representation that brings vision and touch into a shared 3D space, which we call a tactile-augmented radiance field. This representation capitalizes on two key insights: (i) ubiquitous vision-based touch sensors are built on perspective cameras, and (ii) visually and structurally similar regions of a scene share the same tactile features. We leverage these insights to train a conditional diffusion model that, given an RGB image and a depth map rendered from a neural radiance field, generates the corresponding tactile ``image''. To train this diffusion model, we collect the largest dataset of spatially aligned visual and tactile data to date, significantly surpassing the size of the largest prior dataset. Through qualitative and quantitative experiments, we demonstrate the accuracy of our cross-modal generative model and the utility of the collected and rendered visual-tactile pairs across a range of downstream tasks.
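To make the conditioning concrete, the sketch below shows one way a denoiser could take an RGB image and a NeRF-rendered depth map as channel-wise conditioning and predict the noise added to a tactile ``image''. This is a minimal illustration under assumed choices (toy channel counts, a simple timestep embedding, an epsilon-prediction DDPM objective, and random tensors in place of real data), not the authors' implementation.

\begin{verbatim}
# Minimal sketch (assumptions noted above), not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: input = noisy tactile (3 ch) + RGB (3 ch) + depth (1 ch)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                      nn.Linear(hidden, hidden))
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),  # predicted noise on tactile channels
        )

    def forward(self, noisy_tactile, rgb, depth, t):
        # Learned timestep embedding, broadcast over spatial dimensions.
        temb = self.time_mlp(t.float().view(-1, 1))        # (B, hidden)
        x = torch.cat([noisy_tactile, rgb, depth], dim=1)  # channel-wise conditioning
        h = self.net[0](x) + temb[:, :, None, None]        # inject timestep after first conv
        for layer in self.net[1:]:
            h = layer(h)
        return h

# One illustrative DDPM-style training step on random tensors.
B, H, W, T = 4, 64, 64, 1000
model = ConditionalDenoiser()
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

tactile = torch.rand(B, 3, H, W)   # ground-truth tactile images
rgb     = torch.rand(B, 3, H, W)   # spatially aligned RGB renderings
depth   = torch.rand(B, 1, H, W)   # spatially aligned depth renderings
t       = torch.randint(0, T, (B,))
noise   = torch.randn_like(tactile)
ab      = alpha_bar[t].view(B, 1, 1, 1)
noisy   = ab.sqrt() * tactile + (1 - ab).sqrt() * noise

pred = model(noisy, rgb, depth, t)
loss = F.mse_loss(pred, noise)     # standard epsilon-prediction objective
loss.backward()
\end{verbatim}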