Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Ziliang Ren; Xiongjiang Xiao; Huabei Nie

doi:10.3390/s24237682

Sensors (Basel). 2024 Nov 30;24(23):7682. doi: 10.3390/s24237682.

Authors

Ziliang Ren¹, Xiongjiang Xiao¹, Huabei Nie²

Affiliations

¹ School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523820, China.
² School of Artificial Intelligence, Dongguan City University, Dongguan 523419, China.

PMID: 39686219

Abstract

Action recognition based on 3D heatmap volumes has received increasing attention because such volumes are well suited as input to 3D CNNs and can improve the recognition performance of deep networks. However, these models struggle to capture global dependencies because of their restricted receptive fields. To capture long-range dependencies effectively while keeping computation balanced, we propose a novel model, PoseTransformer3D with Global Cross Blocks (GCBs), for pose-based action recognition. The proposed model extracts spatio-temporal features from processed 3D heatmap volumes. Moreover, we design a further recognition framework, RGB-PoseTransformer3D with Global Cross Complementary Blocks (GCCBs), for multimodal feature learning from both pose and RGB data. To verify the effectiveness of the proposed models, we conducted extensive experiments on four popular video datasets: FineGYM, HMDB51, NTU RGB+D 60, and NTU RGB+D 120. Experimental results show that the proposed recognition framework achieves state-of-the-art recognition performance on these datasets and substantially improves multimodal learning for action recognition.

Keywords: 3D heatmap volumes; action recognition; global cross learning; pose modality; vision transformers (ViTs).
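
Illustrative note: the abstract does not say how the 3D heatmap volumes are built. The sketch below assumes a common construction in pose-based recognition, in which each 2D keypoint becomes a Gaussian heatmap and the per-frame heatmaps are stacked along time. The function names, the `sigma` parameter, and the toy coordinates are hypothetical and are not taken from the paper.

```python
import math

def keypoint_heatmap(x, y, conf, height, width, sigma=1.0):
    """Return a height x width Gaussian heatmap centred on (x, y), scaled by conf."""
    return [
        [conf * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]

def heatmap_volume(keypoints, height, width, sigma=1.0):
    """Stack heatmaps into a volume indexed [joint][frame][row][col].

    keypoints[t][k] is an (x, y, confidence) triple for joint k in frame t.
    This layout is an assumption made for illustration only.
    """
    num_frames = len(keypoints)
    num_joints = len(keypoints[0])
    return [
        [keypoint_heatmap(*keypoints[t][k], height, width, sigma)
         for t in range(num_frames)]
        for k in range(num_joints)
    ]

# Toy input: 2 frames, 2 joints on an 8 x 8 grid (made-up coordinates).
toy_keypoints = [
    [(2, 3, 1.0), (5, 5, 0.8)],
    [(3, 3, 1.0), (5, 6, 0.9)],
]
volume = heatmap_volume(toy_keypoints, height=8, width=8)
shape = (len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0]))
print(shape)  # (2, 2, 8, 8): joints, frames, height, width
```

A volume of this shape (joints as channels, then time, height, and width) is the kind of input a 3D CNN can convolve over jointly in space and time. A practical implementation would normally use an array library rather than nested lists.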