Abstract:
We present S$^4$Former, a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S$^4$). At its core, S$^4$Former employs a Vision Transformer within a classic teacher-student framework and leverages three novel technical ingredients: PatchShuffle, a parameter-free input perturbation technique; Patch-Adaptive Self-Attention (PASA), a fine-grained feature modulation method; and a Negative Class Ranking (NCR) regularization loss. Through these regularization modules, which are aligned with Transformer-specific characteristics and operate on the image input, feature, and output dimensions respectively, S$^4$Former exploits the Transformer's ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S$^4$Former not only sets a new state of the art in S$^4$ but also retains a streamlined and scalable architecture. Readily compatible with existing frameworks, S$^4$Former achieves strong improvements (up to 4.9\%) on benchmarks such as Pascal VOC 2012, COCO, and Cityscapes under varying amounts of labeled data. The code will be made publicly available.
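To make the input-level perturbation concrete, the following is a minimal sketch of what a parameter-free patch shuffling operation could look like for a batch of images. The function name, the PyTorch framing, the per-image random permutation, and the 16-pixel patch size (matching a typical ViT tokenizer) are illustrative assumptions, not the paper's released code.

```python
import torch

def patch_shuffle(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Randomly permute the spatial patches of a batch of images.

    Parameter-free: nothing is learned; the only choice is the patch
    size, which would typically match the ViT tokenizer's patch size.
    images: (B, C, H, W) with H and W divisible by patch_size.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    n = gh * gw
    # Split the image into a grid of patches: (B, C, N, p, p)
    patches = images.reshape(b, c, gh, patch_size, gw, patch_size)
    patches = patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, n, patch_size, patch_size)
    # Draw one independent random permutation of the N patches per image
    perm = torch.rand(b, n, device=images.device).argsort(dim=1)
    idx = perm[:, None, :, None, None].expand(-1, c, -1, patch_size, patch_size)
    patches = patches.gather(2, idx)
    # Reassemble the shuffled grid back into an image of the original shape
    patches = patches.reshape(b, c, gh, gw, patch_size, patch_size)
    return patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
```

In a teacher-student consistency setup, such a perturbation would be applied to the student's unlabeled input while the teacher sees the unperturbed image, with the pseudo-label targets shuffled by the same permutation.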