Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation

L. Zeng*, A. Lengyel, N. Tömen, J.C. van Gemert

*Corresponding author for this work

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-reviewed



In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space are self-contained and invariant to the context in which they appear. Given their estimated depth, we group coherent, semantically related pixels into coherent depth regions and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet we need no pre-training on ImageNet or COCO and are more computationally efficient. Our code is available on
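The core augmentation described above can be illustrated with a minimal sketch: quantize a depth map into coherent regions, then copy-paste one region onto a different image to change its context. This is not the authors' implementation; the binning scheme, region count, and helper names here are illustrative assumptions.

```python
import numpy as np

def depth_regions(depth, n_bins=4):
    """Partition a depth map into regions by quantizing depth.

    Simplified stand-in for depth-based grouping: pixels whose
    (estimated) depth falls in the same quantile bin form one region.
    """
    edges = np.quantile(depth, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges only; labels are bin indices in [0, n_bins - 1].
    return np.digitize(depth, edges[1:-1])

def copy_paste_region(src_img, src_mask, dst_img):
    """Paste the masked region of src_img onto dst_img (same shape),
    synthetically changing the region's context."""
    out = dst_img.copy()
    out[src_mask] = src_img[src_mask]
    return out

# Toy example: two 8x8 grayscale "images" and a synthetic depth map.
rng = np.random.default_rng(0)
img_a = rng.random((8, 8))
img_b = rng.random((8, 8))
depth = rng.random((8, 8))

labels = depth_regions(depth, n_bins=4)
mask = labels == 0                     # pick one coherent depth region
pasted = copy_paste_region(img_a, mask, img_b)

# Pasted pixels come from img_a; the rest still come from img_b, so the
# same region now appears in two contexts (a cross-context pair).
assert np.allclose(pasted[mask], img_a[mask])
assert np.allclose(pasted[~mask], img_b[~mask])
```

In the contrastive objective, features of the region in its original image and in the pasted image would then be treated as a positive pair, encouraging a context-invariant representation.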
Original language: English
Title of host publication: 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022
Publisher: BMVA Press
Number of pages: 18
Publication status: Published - 2022
Event: 33rd British Machine Vision Conference 2022 - London, United Kingdom
Duration: 21 Nov 2022 – 24 Nov 2022
Conference number: 33


Conference: 33rd British Machine Vision Conference 2022
Abbreviated title: BMVC 2022
Country/Territory: United Kingdom


