TY - GEN
T1 - Multimodal Human Action Recognition Based on a Fusion of Dynamic Images Using CNN Descriptors
AU - Escobedo Cardenas, Edwin Jonathan
AU - Camara Chavez, Guillermo
N1 - Funding Information:
The authors thank the Graduate Program in Computer Science (PPGCC) at the Federal University of Ouro Preto (UFOP), the Coordination for the Improvement of Higher Education Personnel (CAPES), and the Brazilian funding agency CNPq.
Publisher Copyright:
© 2018 IEEE.
PY - 2019/1/15
Y1 - 2019/1/15
N2 - In this paper, we propose the use of a dynamic-image-based approach for action recognition. Specifically, we exploit the multimodal information recorded by a Kinect sensor (RGB-D and skeleton joint data). We combine several ideas from rank pooling and skeleton optical spectra to generate dynamic images that summarize an action sequence into single flow images. We group our dynamic images into five groups: a dynamic color group (DC); a dynamic depth group (DD); and three dynamic skeleton groups (DXY, DYZ, DXZ). As an action is composed of different postures over time, we generate N different dynamic images capturing the main postures for each dynamic group. Next, we apply a pre-trained flow-CNN to extract spatiotemporal features with a max-mean aggregation. The proposed method was evaluated on a public benchmark dataset, the UTD-MHAD, and achieved state-of-the-art results.
AB - In this paper, we propose the use of a dynamic-image-based approach for action recognition. Specifically, we exploit the multimodal information recorded by a Kinect sensor (RGB-D and skeleton joint data). We combine several ideas from rank pooling and skeleton optical spectra to generate dynamic images that summarize an action sequence into single flow images. We group our dynamic images into five groups: a dynamic color group (DC); a dynamic depth group (DD); and three dynamic skeleton groups (DXY, DYZ, DXZ). As an action is composed of different postures over time, we generate N different dynamic images capturing the main postures for each dynamic group. Next, we apply a pre-trained flow-CNN to extract spatiotemporal features with a max-mean aggregation. The proposed method was evaluated on a public benchmark dataset, the UTD-MHAD, and achieved state-of-the-art results.
KW - action recognition
KW - CNN
KW - dynamic images
KW - RGB-D data
UR - http://www.scopus.com/inward/record.url?scp=85062229311&partnerID=8YFLogxK
U2 - 10.1109/SIBGRAPI.2018.00019
DO - 10.1109/SIBGRAPI.2018.00019
M3 - Conference contribution
AN - SCOPUS:85062229311
T3 - Proceedings - 31st Conference on Graphics, Patterns and Images, SIBGRAPI 2018
SP - 95
EP - 102
BT - Proceedings - 31st Conference on Graphics, Patterns and Images, SIBGRAPI 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st Conference on Graphics, Patterns and Images, SIBGRAPI 2018
Y2 - 29 October 2018 through 1 November 2018
ER -