Video face clustering is a fundamental step in automatically annotating a video in terms of when and where (i.e., in which video shot and where in a video frame) a given person is visible. State-of-the-art face clustering solutions typically rely on information derived from the visual appearance of the face images. This is challenging because visual appearance varies strongly with factors such as scale, viewpoint, head pose and facial expression. As a result, either the generated face clusters are not sufficiently pure, or their number is much higher than the number of people appearing in the video. A possible way towards improved clustering performance is to analyze the visual appearance of faces in specific contexts and to take the contextual information into account when designing the clustering algorithm. In this paper, we focus on the context of quasi-static scenes, in which we can assume that people's positions in a scene are (quasi-)stationary. We present a novel video face clustering algorithm that exploits this property to match faces and efficiently propagate face labels across the range of viewpoints, scales and zoom levels that characterize the different frames and shots of a video. We also present a novel, publicly available dataset of manually annotated quasi-static scene videos. Experimental assessment on this dataset indicates that exploiting information derived from the scene and from the spatial relationships between people can substantially improve clustering performance compared to the state of the art in the field.
- Video face annotation
- Face clustering