This paper studies the problem of semi-supervised 2D-3D retrieval, which aims to align both labeled and unlabeled 2D and 3D data into the same embedding space. The problem is challenging due to the complicated heterogeneous relationships between 2D and 3D data. Moreover, label scarcity in real-world applications hinders the learning of discriminative representations. In this paper, we propose a semi-supervised approach named Fine-grained Prototypical Voting with Heterogeneous Mixup (FIVE), which maps both 2D and 3D data into a common embedding space for cross-modal retrieval. Specifically, we generate fine-grained prototypes to model intra-class variation for both 2D and 3D data. Then, considering each unlabeled sample as a query, we retrieve relevant prototypes to vote for reliable and robust pseudo-labels, which serve as guidance for discriminative learning under label scarcity. Furthermore, to bridge the semantic gap between the two modalities, we mix cross-modal pairs with similar semantics in the embedding space and then perform similarity learning, reducing the cross-modal discrepancy in a soft manner. The whole FIVE framework is optimized with consideration of sharpness to mitigate the impact of potential label noise. Extensive experiments on benchmark datasets validate the superiority of FIVE over a range of baselines in different settings. On average, FIVE outperforms the second-best approach by 4.74% on 3D MNIST, 12.94% on ModelNet10, and 22.10% on ModelNet40.
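Since the abstract only describes fine-grained prototypical voting and heterogeneous mixup at a high level, the toy sketch below illustrates one way such a pipeline could look. It is an assumption-laden illustration, not the authors' implementation: the functions `build_fine_grained_prototypes`, `vote_pseudo_label`, and `heterogeneous_mixup`, the cosine-similarity retrieval, the number of prototypes per class, and the Beta-distributed mixing coefficient are all hypothetical choices made for exposition.

```python
# Illustrative sketch only (not the FIVE release code). Assumed names and
# hyperparameters are chosen for clarity, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)


def build_fine_grained_prototypes(embeddings, labels, num_classes, protos_per_class=3):
    """Split each class's labeled embeddings into several sub-groups and use
    their means as fine-grained prototypes (a toy stand-in for clustering)."""
    protos, proto_labels = [], []
    for c in range(num_classes):
        class_emb = embeddings[labels == c]
        parts = np.array_split(rng.permutation(class_emb), protos_per_class)
        for part in parts:
            if len(part) > 0:
                protos.append(part.mean(axis=0))
                proto_labels.append(c)
    return np.stack(protos), np.array(proto_labels)


def vote_pseudo_label(query, protos, proto_labels, num_classes, k=5, tau=0.1):
    """Treat an unlabeled sample as a query: retrieve its k nearest prototypes
    by cosine similarity and let them cast similarity-weighted votes,
    producing a soft pseudo-label distribution."""
    q = query / np.linalg.norm(query)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = p @ q
    topk = np.argsort(-sims)[:k]
    weights = np.exp(sims[topk] / tau)
    votes = np.zeros(num_classes)
    for w, idx in zip(weights, topk):
        votes[proto_labels[idx]] += w
    return votes / votes.sum()


def heterogeneous_mixup(emb_2d, emb_3d, target_2d, target_3d, alpha=0.4):
    """Mix a 2D embedding with a semantically similar 3D embedding (and their
    soft targets), giving interpolated pairs for soft similarity learning."""
    lam = rng.beta(alpha, alpha)
    return lam * emb_2d + (1 - lam) * emb_3d, lam * target_2d + (1 - lam) * target_3d


if __name__ == "__main__":
    dim, num_classes = 64, 10
    labeled = rng.normal(size=(200, dim))
    labels = rng.integers(0, num_classes, size=200)
    protos, proto_labels = build_fine_grained_prototypes(labeled, labels, num_classes)
    pseudo = vote_pseudo_label(rng.normal(size=dim), protos, proto_labels, num_classes)
    print("pseudo-label:", pseudo.argmax(), "confidence:", round(pseudo.max(), 3))
```

In this reading, the soft pseudo-labels from voting supervise the unlabeled samples, while the mixed cross-modal embeddings and interpolated targets provide the soft similarity-learning signal that reduces the 2D-3D discrepancy; the sharpness-aware aspect of the optimization is not shown here.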