Localizing objects and actions in space and time is vital for humans. We therefore expect computer vision algorithms to be able to spatially and temporally localize objects and actions as well. These algorithms generally learn from data and discover patterns, parts, motions, and their locations by exploiting inductive biases that are essential for learning. However, localization is complex, error-prone, and hard to inspect. In this thesis, we investigate location biases and how CNNs explore and exploit spatial and temporal information in the image and video domains.

An interesting finding of the thesis is that heuristics about what lies outside the image (border handling) enable CNNs to exploit absolute spatial location and break translation equivariance. The thesis proposes a simple solution that eliminates these spatial location biases, improves translation equivariance, and increases data efficiency and robustness.

Furthermore, the thesis investigates object and part locations in images. First, it studies the object-context relationships of modern object detectors and reveals insights about helpful location biases. In addition, it examines the effect of unhelpful location biases on a visual verification task. These analyses show that object detectors can hallucinate the location of an object with a high confidence score even when the object is not in the image. Based on these insights, the thesis offers suggestions for researchers on how to choose an object detector for their specific task. Another interesting finding of this thesis reveals the limitations of data augmentation for resolving the robustness issues of pose estimation methods under occlusion: although data augmentation alleviates some problems caused by sampling biases, it yields only limited improvement, and performance saturates after applying a stack of augmentations.
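The border-handling effect can be illustrated with a minimal sketch (the `conv2d_same` helper below is hypothetical, not the thesis's implementation): with zero padding, filter responses near the image border differ from interior ones, so even a constant image produces a feature map that encodes absolute position.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive 2D cross-correlation with zero padding ('same' output size),
    mimicking the border handling of a typical CNN layer."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")  # zeros outside
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A constant image carries no content cues about position...
image = np.ones((6, 6))
kernel = np.ones((3, 3))
response = conv2d_same(image, kernel)

# ...yet the zero border makes corner responses differ from interior ones,
# so absolute spatial location leaks into the feature map.
print(response[0, 0])  # corner: only 4 of the 9 filter taps see the image -> 4.0
print(response[3, 3])  # interior: all 9 taps see the image -> 9.0
```

Stacking many such layers lets this positional signal propagate deep into the network, which is one way translation equivariance can be broken in practice.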
Finally, the thesis investigates temporal location information and demonstrates spatio-temporal location biases in video data. A time-efficient video labeling solution that uses latent-space feature similarity is proposed to annotate long untrimmed videos. Moreover, using only keyframe labels with Positive-Unlabeled learning achieves high-quality action proposals that can be used with many temporal action localization methods, providing both data and label efficiency. Taken together, this thesis investigates how CNNs use location information and how location biases can have positive as well as negative effects on various computer vision tasks.
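The labeling idea can be sketched generically (hypothetical names and a toy threshold; the thesis's actual method may differ): a single labeled keyframe's label is propagated to frames whose latent features are sufficiently similar, leaving the rest unlabeled.

```python
import numpy as np

def propagate_keyframe_label(features, keyframe_idx, label, threshold=0.9):
    """Assign the keyframe's label to frames whose latent features have
    cosine similarity >= threshold with the keyframe; others stay -1."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[keyframe_idx]  # cosine similarity to the keyframe
    return np.where(sims >= threshold, label, -1)

# Toy latent features for 4 frames; frame 0 is the labeled keyframe.
feats = np.array([[1.0, 0.0],
                  [0.99, 0.1],    # similar to the keyframe
                  [0.0, 1.0],     # dissimilar
                  [1.0, 0.05]])   # similar
labels = propagate_keyframe_label(feats, keyframe_idx=0, label=3)
print(labels)  # -> [ 3  3 -1  3]
```

In a Positive-Unlabeled setting, the propagated frames serve as positives while the remaining frames are treated as unlabeled rather than negative, which is what makes keyframe-only supervision viable.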
Award date: 21 Feb 2022
Publication status: Published - 2022