Weakly Supervised Learning Method for Semantic Segmentation
of Large-Scale 3D Point Cloud Based on Transformers

Zhaoning Zhang, Tengfei Wang , Xin Wang, Zongqian Zhan


Introduction


Semantic segmentation is a key technique that assigns a semantic label to each individual point in a point cloud. However, the large demand for supervised data and the difficulty of learning local features of point clouds remain unsolved problems.

To improve 3D point features, inspired by the idea of the transformer, we employ a so-called LCP network that extracts better features by computing attention between target 3D points and their local neighbors via local context propagation.
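As a rough sketch of the idea of attending over a point's local neighborhood (not the authors' implementation; the function name and the plain scaled dot-product form are our assumptions), one attention step for a single target point could look like:

```python
import math

def local_attention(query, neighbors):
    """Attend a target point's feature (query) over its k nearest
    neighbors' features -- a stand-in for one local attention step.
    query: list[float] of dim d; neighbors: list of list[float]."""
    d = len(query)
    # Scaled dot-product score between the query and each neighbor.
    scores = [sum(q * n for q, n in zip(query, nb)) / math.sqrt(d)
              for nb in neighbors]
    # Softmax over the local neighborhood only (not the whole cloud).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Aggregate neighbor features with the attention weights.
    return [sum(w * nb[i] for w, nb in zip(weights, neighbors))
            for i in range(d)]
```

Restricting the softmax to the k neighbors keeps the cost linear in the number of points, which is what makes attention feasible on large-scale clouds.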

Training a transformer-based network needs a large amount of training samples, and annotating them is labor-intensive, costly and error-prone. This work therefore proposes a weakly supervised framework: pseudo-labels are estimated from the feature distances between unlabeled points and class prototypes, which are calculated from the labeled data.
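The prototype idea can be sketched in a few lines (a minimal illustration, not the paper's code; the helper names and the use of Euclidean distance here are our assumptions):

```python
def class_prototypes(features, labels):
    """Mean feature per class, computed from the labeled points only."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def pseudo_label(feature, prototypes):
    """Assign an unlabeled point the class of its nearest prototype
    by feature distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda y: dist2(feature, prototypes[y]))
```

In this way the few labeled points supervise the whole cloud: every unlabeled point inherits the class whose prototype lies closest in feature space.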

The methodology and workflow of our approach are illustrated in the figure below. We begin by feeding the point cloud into an LCP network to predict initial semantic information. Next, we employ a momentum-based prototype strategy to generate pseudo-labels for the unlabeled points. These pseudo-labels, together with the network predictions, are then optimized through the loss function.
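The role of momentum is to keep the prototypes stable across training batches; a common way to realize this is an exponential moving average (a sketch under that assumption — the coefficient value is illustrative, not taken from the paper):

```python
def momentum_update(old_proto, batch_proto, m=0.99):
    """Exponential moving average of a class prototype across batches.
    m (assumed 0.99 here) controls how slowly the prototype drifts
    toward the statistics of the current batch."""
    return [m * o + (1.0 - m) * b for o, b in zip(old_proto, batch_proto)]
```

A large m means a noisy batch barely moves the prototype, so the pseudo-labels derived from it change smoothly over training.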


Network


We propose an effective weakly supervised framework based on the Transformer, and the overview of the framework is illustrated in the figure below. Our approach combines a Transformer network with LCP (Local Context Propagation) modules and pseudo-label generation techniques to achieve better semantic segmentation results with only a small amount of real annotations.


We construct a UNet-like network for semantic segmentation using 4 LCP blocks and 4 up-sampling layers, since dense prediction requires a feature for every point. Before entering the first LCP block, the data passes through a shared MLP.

The feature dimensions of the four encoder stages are 128, 256, 512, and 1024, respectively. The input consists of 40,960 points.
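Under the usual UNet convention that the decoder mirrors the encoder (an assumption on our part; the paper only states the encoder dimensions), the per-stage channel widths would run:

```python
def unet_feature_dims(encoder_dims):
    """Channel dimensions through a UNet-like network: the stated
    encoder stages on the way down, then a mirrored decoder
    (assumed here, not specified in the text) on the way up."""
    decoder_dims = list(reversed(encoder_dims[:-1]))  # mirror, skip bottleneck
    return encoder_dims + decoder_dims
```

So with encoder stages 128, 256, 512, 1024, the up-sampling path would return through 512, 256, 128, ending with a per-point feature for dense prediction.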


Experiments


To demonstrate the efficacy of our proposed PL-LCP, we evaluate 3D semantic segmentation on both indoor and outdoor scenarios using two large-scale point cloud datasets. First, we conduct two ablation experiments to validate the ability of the LCP module to integrate inter-block information and the effect of pseudo-labels. Then, our method is compared with other relevant approaches, primarily to demonstrate the effectiveness of the PL-LCP network architecture. Our experimental environment is: Intel Core i7-8700 CPU (3.70 GHz), 64 GB RAM, NVIDIA GeForce RTX 4090 24 GB GPU, 64-bit Ubuntu 22.04.3 LTS (kernel 5.4.0-149-generic).

We trained the network for 200 epochs using the Adam optimizer, with momentum 0.9, batch size 4 and weight decay 0.0001. The initial learning rate was 0.01 and was decreased by a factor of 10 at epoch 120.
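The schedule above amounts to a single-step decay, which can be written out as (a sketch; the function name is ours):

```python
def learning_rate(epoch, base_lr=0.01, drop_epoch=120, factor=0.1):
    """Step schedule from the training setup: start at base_lr (0.01)
    and multiply by factor (0.1) once training reaches drop_epoch (120)."""
    return base_lr * (factor if epoch >= drop_epoch else 1.0)
```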
Below we list the configurations that have already been tested, with the corresponding results:

Part  LCP  OA(%)  mAcc(%)  mIOU(%)  Labels
1      ✓    90.2    74.3     67.6   fully
2      ×    87.6    74.5     64.6   fully
3      ✓    90.1    74.4     67.1   10%
4      ✓    89.2    73.2     65.9   1%


Performance comparisons with existing SOTA methods on the SensatUrban test set
Methods         OA(%)  mIOU(%)  Ground  Veg.  Build.  Wall  Bridge  Park.  Rail  Traffic  Street  Car   Path  Bike  Water
PointNet         80.8    23.7    68.0   89.5   80.0    0.0    0.0    4.0    0.0   31.6     0.0   35.1   0.0   0.0    0.0
PointNet++       84.3    32.9    72.5   94.2   84.8    2.7    2.1   25.8    0.0   31.5    11.4   38.8   7.1   0.0   56.9
TangentConv      77.0    33.3    71.5   91.4   75.9   35.2    0.0   45.3    0.0   26.7    19.2   67.6   0.0   0.0    0.0
SPGraph          85.3    37.3    69.9   94.6   88.9   32.8   12.6   15.8   15.5   30.6    23.0   56.4   0.5   0.0   44.2
SparseConv       88.7    42.7    74.1   97.9   94.2   63.3    7.5   24.2    0.0   30.1    34.0   74.4   0.0   0.0   54.8
KPConv           93.2    57.6    87.1   98.9   95.3   74.4   28.7   41.4    0.0   56.0    54.4   85.7  40.4   0.0   86.3
RandLA-Net       89.8    52.7    80.1   98.1   91.6   48.9   40.8   51.6    0.0   56.7    33.2   80.1  32.6   0.0   71.3
PL-LCP (ours)    93.9    67.3    83.5   98.7   96.3   72.3   84.2   57.0   46.9   74.5    54.9   90.1  43.5   0.0   72.8
Performance comparisons with previous methods on S3DIS
Methods            OA(%)  mAcc(%)  mIOU(%)
PointNet             -      23.7     41.1
TangentConv         82.5    63.2     52.8
SPGraph             86.4    66.5     58.0
LocalTransformer    87.6    71.9     64.1
RandLA-Net          87.2    71.4     62.4
PSNet               87.8     -       64.9
PL-LCP (ours)       90.2    74.3     67.6
If you would like your test results shown here, please send them to us by e-mail. Thanks for your support!


About us

If you have any questions or advice, please contact us by e-mail.

This work was jointly supported by the Natural Science Foundation of Hubei Province, China (2022CFB727) and the National Natural Science Foundation of China (42301507).