The search for effective vision-language pretraining strategies has led to the exploration of masking techniques as a way to improve data efficiency. Previous approaches include random masking and semantic masking, where the latter keeps or drops patches that share similar semantics. Despite its effectiveness, semantic masking typically requires an additional, often complex, model to identify semantically related patches, which increases computational cost. Unlike approaches that rely on text supervision, our method exploits clusters that emerge naturally within images. We mask randomly selected clusters of image patches, using each patch's raw RGB values as its feature representation. This design rests on the observation that even simple visual similarity measures can identify coherent visual structures, such as object parts. Our approach therefore combines the computational efficiency of random patch dropping with the performance gains of masking coherent visual structures.
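To make the procedure concrete, the sketch below illustrates one way such cluster masking could be implemented. It is a minimal, illustrative version, not the paper's exact algorithm: the function name `cluster_mask`, the use of k-means as the clustering step, and all hyperparameters (`patch_size`, `n_clusters`, `mask_ratio`) are assumptions for exposition. The essential ingredients from the text are preserved: patches are represented by their raw RGB values, visually similar patches are grouped, and whole clusters are then masked at random.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_mask(image, patch_size=16, n_clusters=8, mask_ratio=0.5, seed=0):
    """Mask random clusters of patches, clustering on raw RGB values.

    image: (H, W, 3) float array with H and W divisible by patch_size.
    Returns a boolean mask over the patch grid (True = masked).
    Illustrative sketch only; the clustering method and parameters
    are assumptions, not the authors' exact procedure.
    """
    H, W, _ = image.shape
    gh, gw = H // patch_size, W // patch_size

    # Flatten each patch's raw RGB values into a single feature vector.
    patches = image.reshape(gh, patch_size, gw, patch_size, 3)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

    # Group visually similar patches; coherent structures such as
    # object parts tend to land in the same cluster.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(patches)

    # Drop whole clusters in random order until roughly mask_ratio
    # of the patches are masked.
    rng = np.random.default_rng(seed)
    mask = np.zeros(gh * gw, dtype=bool)
    for c in rng.permutation(n_clusters):
        if mask.mean() >= mask_ratio:
            break
        mask[labels == c] = True
    return mask.reshape(gh, gw)
```

Because the only per-patch computation is reading off raw pixel values, this retains the low overhead of random masking while still removing visually coherent regions rather than isolated patches.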