The human-made world is well organized in space. A prominent feature of this artificial world is the presence of repetitive structures and coherent patterns, such as lines, junctions, building wireframes, and city footprints. These structures and patterns facilitate visual scene understanding by providing abundant geometric information. Humans can easily recognize diverse geometric structures. How, then, can we instruct an autonomous agent to interpret the visual world as we do? Over the past few decades, there has been great interest in automatically understanding the geometric world from images. Conventional approaches first detect hand-crafted edge features and then group them into parametric shapes such as lines. Recently, this strategy has been gradually replaced by deep neural networks, because features learned from large annotated datasets give a richer representation than hand-designed features. Although the progress is inspiring, several concerns remain about deploying neural networks in the real world. A primary concern is the availability of massive labeled data, as the performance of deep networks deteriorates substantially when training data is scarce. This thesis introduces novel strategies to enhance the performance of neural networks in a small-data regime by adding geometric priors to learning. We start with the Hough Transform, a well-known prior for straight lines, and offer a principled way to add this prior to neural networks for data-efficient end-to-end learning. On the wireframe parsing task, our model advances the state of the art substantially on various subsets with much less training data. Subsequently, we extend the Hough Transform line priors to semi-supervised lane detection, which requires only a small amount of labeled data, and show that this approach improves overall performance by leveraging a massive amount of unlabeled data.
We explore a second geometric prior, the Gaussian sphere mapping, for vanishing point detection. We present an end-to-end framework for detecting multiple non-orthogonal vanishing points without relying on large quantities of training samples. Moreover, the proposed model exhibits consistent performance across multiple datasets without fine-tuning, demonstrating the effectiveness of geometric priors in tackling data variation. Next, we study the detection of 3D mirror symmetry from single-view images. We explicitly incorporate 3D mirror geometry into the identification of symmetry planes. To reduce the computational footprint, we design multi-stage spherical convolutions that hierarchically pinpoint the optimal plane in the parameter space. Our model not only improves overall performance but also reduces inference latency substantially. Finally, we explore the possibility of detecting polygonal shapes from images using transformers. We provide a full picture of the strengths and weaknesses of auto-regressive and parallel transformers for detecting polygons viewed as collections of points. We demonstrate on a toy dataset that auto-regressive transformers can be a reasonable option for learning polygonal representations from real-world images. Taken together, this thesis shows that incorporating geometric priors into modern deep learning reduces the need for expensive, manually annotated data.
|Qualification|Doctor of Philosophy|
|Award date|25 Apr 2022|
|Publication status|Published - 2022|
- Computer Vision
- Deep Learning