As a fundamental problem in multimodal learning, multimodal fusion aims to compensate for the inherent limitations of any single modality. One challenge of multimodal fusion is that unimodal data, each embedded in its own distinct space, often contain potential noise, which corrupts cross-modal interactions. However, in this paper we show that this potential noise in unimodal data can be well quantified and further exploited to learn more stable unimodal embeddings via contrastive learning. Specifically, we propose a novel, generic, and robust multimodal fusion strategy, termed Embracing Aleatoric Uncertainty (EAU), which is simple and applicable to a wide range of modalities. It consists of two key steps: (1) Stable Unimodal Feature Augmentation (SUFA), which learns stable unimodal representations by incorporating aleatoric uncertainty into self-supervised contrastive learning, and (2) Robust Multimodal Feature Integration (RMFI), which leverages an information-theoretic strategy to learn a robust and compact joint representation. We evaluate the proposed EAU method on five multimodal datasets covering video, RGB images, text, audio, and depth images. Extensive experiments demonstrate that EAU is more noise-resistant than existing multimodal fusion strategies and establishes a new state of the art on several benchmarks.
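To make the first step concrete, the sketch below is a minimal PyTorch illustration of one way to fold aleatoric uncertainty into a self-supervised contrastive objective; it is our own simplified example, not the authors' SUFA implementation, and the encoder name, the Gaussian embedding head, and the reparameterized InfoNCE loss are all assumed design choices for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyAwareEncoder(nn.Module):
    """Toy unimodal encoder (hypothetical) that predicts a Gaussian embedding:
    a mean vector plus a per-dimension log-variance modeling aleatoric uncertainty."""
    def __init__(self, in_dim=128, emb_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, emb_dim)
        self.logvar_head = nn.Linear(256, emb_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)


def uncertainty_contrastive_loss(mu1, logvar1, mu2, logvar2, temperature=0.1):
    """InfoNCE-style loss on embeddings sampled with the reparameterization trick,
    so high-variance (noisy) inputs yield softer, less confident positives.
    mu*/logvar*: [batch, dim] tensors from two augmented views of the same inputs."""
    z1 = mu1 + torch.randn_like(mu1) * torch.exp(0.5 * logvar1)
    z2 = mu2 + torch.randn_like(mu2) * torch.exp(0.5 * logvar2)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)


# Minimal usage: two lightly perturbed views of a batch of unimodal features.
encoder = UncertaintyAwareEncoder()
x = torch.randn(32, 128)
view1 = x + 0.05 * torch.randn_like(x)
view2 = x + 0.05 * torch.randn_like(x)
loss = uncertainty_contrastive_loss(*encoder(view1), *encoder(view2))
loss.backward()
```

The key idea this toy example conveys is that the learned variance lets the contrastive objective treat noisy samples less rigidly; the paper's actual SUFA formulation and the RMFI integration step are defined in the method section.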