Abstract
Computer vision systems, such as image classifiers, object detectors and video analysis tools, serve diverse applications, ranging from autonomous vehicles and drone navigation to medical image analysis and anomaly inspection in the manufacturing industry. The development of these systems relies heavily on well-established practices, which include the adoption of conventional training and evaluation metrics and benchmark datasets. However, we argue that standard approaches are sub-optimal with respect to the ultimate objectives of computer vision systems. In this thesis, we question whether the training and evaluation of computer vision systems for object detection and long-term action recognition are aligned with human-defined end goals.
Object detectors are deployed for object tracking in autonomous vehicles and drones, but also as user-assistive tools in medical image analysis and anomaly inspection in industry. Regardless of the end use, object detectors are trained with standard optimization and evaluation strategies. By investigating whether the optimization and evaluation methods of object detectors correlate with human quality judgments, we discover a discrepancy between established metrics and human preferences. To address this, we propose an alternative training loss that better aligns object detectors with human preferences.
Subsequently, we ask whether object detections can be used to improve long-term human action recognition in videos. We find that explicitly focusing on the region containing the detected human is beneficial to long-term action recognition models. Unexpectedly, we also find that including a temporal attention module does not help in recognizing the videos. Motivated by this result, we investigate how much temporal information is needed to solve long-term action recognition in three popular video datasets. Our results show that most of these videos can be recognized without any long-term temporal information. This suggests that models trained on these videos might exploit short-term shortcuts instead of learning long-term temporal dependencies. Importantly, these models would not perform successfully on new videos where long-term reasoning is necessary.
As a follow-up, we investigate the impact of the temporal receptive field in long-term action recognition models. The size of the temporal receptive field determines the capability to encode long-term information in videos, such as the order and duration of actions. We experimentally verify that large temporal receptive fields are sensitive to order and can overfit on the exact action orders seen at training time. In contrast, short temporal receptive fields are more robust to order permutations and perform better on a current long-term video dataset. This result further demonstrates the irrelevance of long-term information in current long-term action recognition datasets. Our research findings highlight the importance of using training and evaluation metrics that match the intended use of computer vision systems, and of choosing training and evaluation datasets that carefully represent the problem at hand.
| Original language | English |
| --- | --- |
| Qualification | Doctor of Philosophy |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 31 Oct 2024 |
| Print ISBNs | 978-94-6366-935-1 |
| DOIs | |
| Publication status | Published - 2024 |
Keywords
- computer vision
- shortcut learning
- action recognition
- human evaluation
- object detection