Implementation of the paper: "PCSformer: Pair-wise Cross-scale Sub-prototypes Mining with CNN-Transformers for Weakly Supervised Semantic Segmentation"
Generating initial seeds is an important step in weakly supervised semantic segmentation, and our approach concentrates on generating and refining them. Initial seeds based on convolutional neural networks (CNNs) focus only on the most discriminative regions and lack global information about the target. Approaches based on the Vision Transformer (ViT) can capture long-range feature dependencies thanks to the self-attention mechanism, but we find that they suffer from distractor object leakage and background leakage. Based on these observations, we propose PCSformer, which improves the model's feature extraction through a Pair-wise Cross-scale (PC) strategy and addresses distractor object leakage by mining Sub-Prototypes (SP) to extract further potential target features. In addition, the proposed Conflict Self-Elimination (CSE) module further alleviates background leakage. We validate our approach on the commonly used Pascal VOC 2012 and MS COCO 2014 datasets, where extensive experiments show superior results. We also extend PCSformer to weakly supervised object localization, where it performs well. Our approach is also competitive for semantic segmentation of medical images and of challenging scenes that are deformable, cluttered, and often translucent. The code is available at
The environment is Ubuntu 18.04, CUDA 11.4, and Python 3.9.18. Install the Python dependencies with:
pip install -r requirements.txt
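A quick sanity check after installation can catch a mismatched interpreter early. The snippet below is only a sketch: the function name `environment_warnings` is hypothetical, and it assumes that `torch` is among the packages listed in requirements.txt, which this README does not state explicitly.

```python
import importlib.util
import sys

# Python version stated above; only major.minor is compared.
EXPECTED_PYTHON = (3, 9)


def environment_warnings(expected_python=EXPECTED_PYTHON, required_modules=("torch",)):
    """Return a list of mismatch messages; an empty list means none were found.

    `required_modules` defaults to ("torch",), which is an assumption about
    the contents of requirements.txt, not something confirmed by the README.
    """
    found = []
    current = tuple(sys.version_info[:2])
    if current != tuple(expected_python):
        found.append(
            "Python %d.%d found, but %d.%d is expected"
            % (current + tuple(expected_python))
        )
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            found.append("Module '%s' is not installed" % name)
    return found
```

Calling `environment_warnings()` and printing each returned message is enough for a manual check; other Python versions may well work, so a warning here is a hint rather than an error.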
Download the PASCAL VOC 2012 development kit.
Download Conformer_small_patch16.pth.
Download ilsvrc-cls_rna-a1_cls1000_ep-0001.params.
Download the saliency maps.
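Before training, it may help to confirm that the downloaded weight files are in place. The following is a minimal sketch: the two file names come from the list above, while the helper name `missing_files` and the directory passed to it are hypothetical, since this README does not say where the training scripts expect the files.

```python
from pathlib import Path

# File names as listed above; the directory that holds them is an assumption.
REQUIRED_FILES = (
    "Conformer_small_patch16.pth",
    "ilsvrc-cls_rna-a1_cls1000_ep-0001.params",
)


def missing_files(weights_dir, names=REQUIRED_FILES):
    """Return the names from `names` that are not regular files in `weights_dir`."""
    root = Path(weights_dir)
    return [name for name in names if not (root / name).is_file()]
```

For example, `missing_files("./pretrained")` returns an empty list once both files sit in a (hypothetical) `pretrained` folder.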
1. cd PC_1, then run the script for training PCSformer in the Pair-wise Cross-scale (PC) strategy stage.
2. cd SP_2, then run the script for training PCSformer in the Sub-Prototypes (SP) mining stage.
To train DeepLab-v2, we follow deeplab-pytorch.
| Stage | Backbone | Google Drive | mIoU (%) |
| --- | --- | --- | --- |
| Initial seeds (after PC) | Conformer-S | Weights | 66.4 |
| Initial seeds (after SP) | Conformer-S | Weights | 68.2 |
| Final prediction (on VOC dataset) | ResNet101 | Weights | 72.8 |
| Final prediction (on COCO dataset) | ResNet101 | Weights | 41.9 |
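The mIoU values in the table are mean intersection-over-union scores. As a reminder of how this metric is usually computed, here is a small pure-Python sketch; it is not the evaluation code of this repository. The function name `mean_iou` is hypothetical, the ignore label 255 follows the PASCAL VOC convention for void pixels, and skipping classes that never occur is a choice made for this sketch.

```python
def mean_iou(predictions, labels, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in the predictions or the labels.

    `predictions` and `labels` are flat sequences of integer class ids of the
    same length (for a dataset, concatenate the pixels of all images).
    Pixels whose label equals `ignore_index` are skipped.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    # confusion[t][p] counts pixels with ground truth t that were predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for pred, truth in zip(predictions, labels):
        if truth == ignore_index:
            continue
        confusion[truth][pred] += 1

    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(row[c] for row in confusion) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With labels `[0, 0, 1, 1, 255]` and predictions `[0, 1, 1, 1, 0]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the function returns 7/12; multiplying by 100 gives the percentage form used in the table. Real pipelines typically accumulate the confusion matrix with NumPy arrays for speed.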
This code borrows from TransCAM, SC-CAM, and deeplab-pytorch.