Generative vision-language models (VLMs) have shown impressive zero-shot performance on vision-language tasks through generalization-based text generation. However, recent studies improve this generalization capability further by applying second-stage instruction tuning, which relies on large amounts of human-labeled data or data generated externally by large language models. In this work, we propose the image-conditioned text correction task to enhance zero-shot text generation using unlabeled data. By leveraging the inherent structure of language, we produce image-text pairs containing mismatched concepts, and VLMs are required to identify and correct these errors via text generation grounded in the visual modality. Our experiments demonstrate that our second-stage tuning framework significantly enhances the generalization capabilities of VLMs across a variety of image-to-text generation tasks. This work represents a promising direction for zero-shot inference in VLMs, providing a cost-effective and scalable way to improve generalization performance.
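The sketch below illustrates one way such mismatched image-text pairs might be constructed from unlabeled captions. The concept vocabulary, the `corrupt_caption` helper, and the swap-one-concept corruption strategy are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: build an image-conditioned text correction example from an
# unlabeled image-caption pair. The concept list and the single-word swap are
# assumed for illustration only.
import random

CONCEPTS = ["dog", "cat", "bicycle", "car", "pizza", "guitar"]  # assumed vocabulary

def corrupt_caption(caption: str, rng: random.Random) -> tuple[str, str]:
    """Replace one concept word in the caption with a mismatched one.

    Returns (corrupted_caption, original_caption); the original caption is the
    generation target for the correction task.
    """
    tokens = caption.split()
    # Positions whose token (punctuation stripped) is a known concept word.
    candidates = [i for i, t in enumerate(tokens) if t.lower().strip(".,") in CONCEPTS]
    if not candidates:
        return caption, caption  # nothing to corrupt; keep the pair unchanged
    i = rng.choice(candidates)
    original = tokens[i].lower().strip(".,")
    replacement = rng.choice([c for c in CONCEPTS if c != original])
    tokens[i] = replacement
    return " ".join(tokens), caption

rng = random.Random(0)
corrupted, target = corrupt_caption("A dog rides a bicycle down the street.", rng)
# The VLM is prompted with the image and `corrupted`, and trained to generate
# `target`, i.e. to correct the mismatched concept using the visual input.
print(corrupted)  # e.g. "A dog rides a pizza down the street."
print(target)
```

In this hypothetical setup, no human labels are needed: the original caption itself supplies the supervision signal, which matches the abstract's goal of enhancing zero-shot generation with unlabeled data.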