In the Visual Question Answering (VQA) task, recognizing and localizing entities pose significant challenges. Pre-trained vision-and-language models have addressed this problem by providing a text description as the answer. However, in visual scenes with multiple entities, textual descriptions struggle to effectively distinguish entities of the same category. Consequently, existing VQA datasets are constrained by the limitations of text descriptions and cannot adequately cover scenarios involving multiple entities. To address this challenge, we introduce a module, Mask for Align (Mask4Align), which determines the position of the entity in a given image that best matches the user's question. This module incorporates colored masks into the image, enabling the VQA model to handle the discrimination and localization challenges associated with multiple entities. To handle an arbitrary number of similar entities, Mask4Align is designed in a hierarchical manner, discerning subtle differences to achieve precise localization. Furthermore, Mask4Align leverages pre-trained models and introduces no extra task-sensitive parameters; therefore, no additional training or hyper-parameter tuning is required. We evaluate our method on the gaze target prediction task and further test it on different types of questions in the wild to verify its generalization. Code implementations are provided in the supplementary material.
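For intuition, the snippet below is a minimal sketch of the colored-mask idea summarized above: each candidate entity region is tinted with a distinct color so that an off-the-shelf VQA model can be queried about entities by color rather than by ambiguous textual description. The function name, palette, blending weight, and toy masks are illustrative assumptions and do not reproduce the actual Mask4Align implementation in the supplementary material.

```python
import numpy as np
from PIL import Image

# Illustrative palette; the colors actually used by Mask4Align may differ.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]


def overlay_colored_masks(image, masks, alpha=0.5):
    """Blend one distinct color per candidate entity mask into the image.

    image: H x W x 3 uint8 array.
    masks: list of H x W boolean arrays, one per candidate entity.
    Returns the composited image, on which a question such as
    "Which colored region does the child look at?" can be posed
    to a pre-trained VQA model.
    """
    out = image.astype(np.float32)
    for mask, color in zip(masks, PALETTE):
        color_layer = np.zeros_like(out)
        color_layer[mask] = color
        # Alpha-blend the tint only inside the candidate region.
        out[mask] = (1 - alpha) * out[mask] + alpha * color_layer[mask]
    return out.astype(np.uint8)


if __name__ == "__main__":
    # Gray placeholder image and two toy candidate regions standing in
    # for segmentation masks of same-category entities.
    img = np.array(Image.new("RGB", (64, 64), (128, 128, 128)))
    m1 = np.zeros((64, 64), dtype=bool)
    m1[8:24, 8:24] = True
    m2 = np.zeros((64, 64), dtype=bool)
    m2[36:56, 36:56] = True
    composite = overlay_colored_masks(img, [m1, m2])
    Image.fromarray(composite).save("composite.png")
```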