Despite the remarkable success of Vision Transformers (ViTs) across diverse areas of computer vision, their growing scale makes adaptation to downstream tasks expensive. To address this, Visual Prompt Tuning (VPT) introduces learnable parameters into the input space of a ViT. By freezing the ViT backbone and tuning only the prompts, VPT achieves performance superior to full fine-tuning. Despite this advantage, however, we point out that VPT may lead to serious unfairness in downstream classification. We first investigate the causes of unfairness in VPT and identify bias in the pre-trained ViT as a principal factor. Motivated by this observation, we propose Fair Visual Prompt Tuning (Fair-VPT), which removes biased information from the pre-trained ViT while adapting it to downstream classification tasks. To this end, we categorize prompts into "cleaner prompts" and "target prompts". Based on this categorization, we encode the class token in two different ways, either masking or not masking the target prompts in the self-attention process. The resulting encoded tokens are trained with distinct objective functions, so that the target and cleaner prompts capture different information. Moreover, we introduce a disentanglement loss based on contrastive learning to further decorrelate them. In experiments across diverse benchmarks, the proposed method achieves the best performance in terms of balanced classification accuracy and fairness.
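The two encodings of the class token can be illustrated with a minimal sketch. It is not the Fair-VPT implementation: it assumes a single attention head, random weights, a token order of [class token, cleaner prompts, target prompts, patch tokens], and a mask that hides the target prompts as attention keys. All names (`attend`, `key_mask`, the token counts) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(tokens, w_q, w_k, w_v, key_mask=None):
    """Single-head self-attention. key_mask[j] == False hides token j as a key."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if key_mask is not None:
        # Masked keys get -inf so no query can attend to them.
        scores = np.where(key_mask[None, :], scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Assumed toy sizes (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
dim, n_cleaner, n_target, n_patch = 8, 2, 2, 4
n_tokens = 1 + n_cleaner + n_target + n_patch

# Assumed token order: [class token, cleaner prompts, target prompts, patches].
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Boolean mask that hides only the target prompts.
target_slice = slice(1 + n_cleaner, 1 + n_cleaner + n_target)
key_mask = np.ones(n_tokens, dtype=bool)
key_mask[target_slice] = False

# Encoding 1: every token is visible. Encoding 2: target prompts are masked.
out_full, attn_full = attend(tokens, w_q, w_k, w_v)
out_masked, attn_masked = attend(tokens, w_q, w_k, w_v, key_mask)

cls_full = out_full[0]      # class token encoded with target prompts visible
cls_masked = out_masked[0]  # class token encoded without target prompts
```

In this sketch, `cls_full` and `cls_masked` stand in for the two encoded class tokens that would each be trained with its own objective; the objectives themselves and the contrastive disentanglement loss are not shown.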