Music Information Retrieval beyond Audio: A Vision-based Approach for Real-world Data

Alessio Bazzica

Research output: Thesis / Dissertation (TU Delft)

Digital music platforms have recently become the primary revenue stream for recorded music, making record labels and content owners increasingly interested in developing new digital features for their users.
Besides listening to expert-curated playlists and automatically recommended music, users can also benefit from a more informative, non-linearly accessible experience accommodating multiple perspectives on the content.
For example, an alternative version of a piece can be suggested automatically. Users can navigate a long classical music piece guided by a visualization of its structure (e.g., movements, recurring themes). They can also switch viewpoints while watching a music video instead of sticking to the editor's choice.

Developing such features requires innovation in automated content-based methods that extract musical knowledge. Traditionally, Music Information Retrieval (Music IR) researchers have tackled this problem mostly from an audio-only perspective.
Several works have shown, however, that other types of data, such as social tags, listening behaviors, and symbolic music scores, can substantially improve the performance of audio-only algorithms, or even enable tasks that cannot be solved at all using audio alone.

In this thesis, we focus on the relatively unexplored field of vision-based Music IR, which studies how to analyze the visual channel accompanying a music recording in order to learn more about the music piece being performed.
Several existing methods require obtrusive settings, such as 3D motion capture systems, which are not applicable in professional environments (e.g., during a live classical music concert). Other methods rely instead on favorable viewpoints, static cameras, and uniform backgrounds to simplify the analysis of the musicians' movements.
In both cases, the devised algorithms may not be suitable for commercial music platforms, especially those dealing with real-world data, i.e., unstructured and unconstrained music videos.
We therefore consider tasks, algorithms, and datasets with the challenges of real-world data in mind, advancing the state of the art in two ways: (i) we investigate how to process videos of a single musician in order to extract musically relevant cues that can be exploited to solve existing, as well as new, Music IR problems, and (ii) we address the challenging case of large ensembles, proposing a way to parse complex scenes and link musician-wise cues to identity and instrumental part annotations.

In more detail, this thesis first presents a global motion feature which aims to represent musicians' movements over time.
While lightweight and instrument-generic, it shows limitations in the presence of camera motion.
For this reason, we switch to detecting "playing/non-playing" (P/NP) labels, which can be inferred from different viewpoints and at different scales, and which can be used to encode the instrumentation of a performance over time.
We first show the value of such a semantic feature by demonstrating that it makes it possible to roughly synchronize a symbolic music score to a performance recording.
We then focus on the visual analysis of videos of large classical music ensembles, presenting a semi-automatic framework for P/NP annotation.
The experiments show that video face clustering is a critical problem to solve; we therefore illustrate a novel method that exploits the quasi-static scene properties of classical music videos to generate better face clusters by relying on an automatically built map of the scene.
Finally, we address the challenging problem of detecting note onsets for clarinetist videos as a case study for woodwind and brass instruments. We propose a novel convolutional network architecture based on multiple streams and absence of temporal pooling, aiming to capture the fine spatio-temporal information conveyed by finger movements.
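
As an illustration of how P/NP labels could support rough score-to-performance synchronization, the following minimal sketch aligns two P/NP sequences, one derived from the score and one from the video. The abstract does not state which alignment method the thesis uses: dynamic time warping, the Hamming distance between frames, and the names `hamming` and `align_pnp` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# One 0/1 flag per instrument: 1 = playing, 0 = not playing.
Frame = Sequence[int]


def hamming(a: Frame, b: Frame) -> int:
    """Count the instruments whose P/NP state differs between two frames."""
    return sum(x != y for x, y in zip(a, b))


def align_pnp(score: List[Frame], video: List[Frame]) -> List[Tuple[int, int]]:
    """Align two P/NP sequences with dynamic time warping (assumed method).

    Returns a list of (score_index, video_index) pairs from start to end.
    """
    n, m = len(score), len(video)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning score[:i] with video[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = hamming(score[i - 1], video[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path
```

Calling `align_pnp(score, video)` with a short score-derived sequence and a longer video-derived sequence maps each video segment to the score segment with the most similar instrumentation. Because P/NP labels are coarse, such an alignment can only be rough, which matches the claim made above.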

Our proposed methods, outcomes, and envisioned applications show that real-world music videos are an unexploited asset rather than a problem to avoid.
Furthermore, the light this thesis sheds on vision-based Music IR indicates where future Computer Vision and Music IR research agendas can meet, bringing further innovation to the digital music platforms market.
Original language: English
Awarding Institution
  • Delft University of Technology
  • Hanjalic, A., Supervisor
  • Liem, C.C.S., Supervisor
Award date: 15 Dec 2017
Print ISBNs: 978-94-6299-807-0
Publication status: Published - 2017


  • music information retrieval
  • computer vision
  • cross-modal analysis
