Multi-modal human aggression detection

J. F.P. Kooij; M. C. Liem; J. D. Krijnders; T.C. Andringa; D. M. Gavrila

doi:10.1016/j.cviu.2015.06.009

Corresponding author: D. M. Gavrila

Research output: Contribution to journal › Article › Scientific › peer-review

60 Citations (Scopus)

Abstract

This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of complementary audio and video cues to disambiguate scene activity in real-life environments. From the video side, the system uses overlapping cameras to track persons in 3D and to extract features regarding the limb motion relative to the torso. From the audio side, it classifies instances of speech, screaming, singing, and kicking-object. The audio and video cues are fused with contextual cues (interaction, auxiliary objects); a Dynamic Bayesian Network (DBN) produces an estimate of the ambient aggression level. Our prototype system is validated on a realistic set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.

Original language	English
Pages (from-to)	106-120
Number of pages	15
Journal	Computer Vision and Image Understanding
Volume	144
DOIs	https://doi.org/10.1016/j.cviu.2015.06.009
Publication status	Published - 2016
Externally published	Yes

Keywords

Aggression detection
Automated video surveillance
Dynamic Bayesian Network
Multi-modal sensor fusion

Cite this

@article{32e7270d9283465f9c145c1bdf8accda,

title = "Multi-modal human aggression detection",

abstract = "This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of complementary audio and video cues to disambiguate scene activity in real-life environments. From the video side, the system uses overlapping cameras to track persons in 3D and to extract features regarding the limb motion relative to the torso. From the audio side, it classifies instances of speech, screaming, singing, and kicking-object. The audio and video cues are fused with contextual cues (interaction, auxiliary objects); a Dynamic Bayesian Network (DBN) produces an estimate of the ambient aggression level. Our prototype system is validated on a realistic set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.",

keywords = "Aggression detection, Automated video surveillance, Dynamic Bayesian Network, Multi-modal sensor fusion",

author = "Kooij, {J. F.P.} and Liem, {M. C.} and Krijnders, {J. D.} and Andringa, {T.C.} and Gavrila, {D. M.}",

year = "2016",

doi = "10.1016/j.cviu.2015.06.009",

language = "English",

volume = "144",

pages = "106--120",

journal = "Computer Vision and Image Understanding",

issn = "1077-3142",

publisher = "Academic Press",

}

TY - JOUR

T1 - Multi-modal human aggression detection

AU - Kooij, J. F.P.

AU - Liem, M. C.

AU - Krijnders, J. D.

AU - Andringa, T.C.

AU - Gavrila, D. M.

PY - 2016

Y1 - 2016

N2 - This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of complementary audio and video cues to disambiguate scene activity in real-life environments. From the video side, the system uses overlapping cameras to track persons in 3D and to extract features regarding the limb motion relative to the torso. From the audio side, it classifies instances of speech, screaming, singing, and kicking-object. The audio and video cues are fused with contextual cues (interaction, auxiliary objects); a Dynamic Bayesian Network (DBN) produces an estimate of the ambient aggression level. Our prototype system is validated on a realistic set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.

AB - This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is the exploitation of complementary audio and video cues to disambiguate scene activity in real-life environments. From the video side, the system uses overlapping cameras to track persons in 3D and to extract features regarding the limb motion relative to the torso. From the audio side, it classifies instances of speech, screaming, singing, and kicking-object. The audio and video cues are fused with contextual cues (interaction, auxiliary objects); a Dynamic Bayesian Network (DBN) produces an estimate of the ambient aggression level. Our prototype system is validated on a realistic set of scenarios performed by professional actors at an actual train station to ensure a realistic audio and video noise setting.

KW - Aggression detection

KW - Automated video surveillance

KW - Dynamic Bayesian Network

KW - Multi-modal sensor fusion

UR - http://www.scopus.com/inward/record.url?scp=84956691496&partnerID=8YFLogxK

U2 - 10.1016/j.cviu.2015.06.009

DO - 10.1016/j.cviu.2015.06.009

M3 - Article

AN - SCOPUS:84956691496

SN - 1077-3142

VL - 144

SP - 106

EP - 120

JO - Computer Vision and Image Understanding

JF - Computer Vision and Image Understanding

ER -
