Information on whether a musician in a large symphonic orchestra plays her instrument at a given time stamp or not is valuable for a wide variety of applications aiming at mimicking and enriching the classical music concert experience on modern multimedia platforms. In this work, we propose a novel method for generating playing/non-playing labels per musician over time by efficiently and effectively combining an automatic analysis of the video recording of a symphonic concert and human annotation. In this way, we address the inherent deficiencies of traditional audio-only approaches in the case of large ensembles, as well as those of standard human action recognition methods based on visual models. The potential of our approach is demonstrated on two representative concert videos (about 7 hours of content) using a synchronized symbolic music score as ground truth. In order to identify the open challenges and the limitations of the proposed method, we carry out a detailed investigation of how different modules of the system affect the overall performance.