Abstract
Human action recognition, with its varied use cases across surveillance, robotics, human-object interaction analysis and many other fields, has gained critical importance and attention in computer vision. Traditionally based entirely on RGB sequences, the action recognition domain has in recent years shifted its focus towards skeleton sequences, driven by the easy availability of skeleton-capturing apparatus and the release of large-scale datasets. Skeleton-based human action recognition, being superior to traditional RGB-based action recognition in terms of privacy, robustness and computational efficiency, is the primary focus of this thesis.
Ever since the release of the large-scale skeleton action datasets NTURGB+D and NTURGB+D 120, the community has focused on developing increasingly complex approaches, ranging from CNNs to GCNs and, more recently, transformers, to achieve the best classification accuracy on these datasets. However, in this race for state-of-the-art performance, the community has overlooked a major drawback at the data level that bottlenecks even the most sophisticated approaches. This drawback is where we begin our explorations in this thesis.
The pose tree provided in the NTURGB+D datasets contains only 25 joints, of which only 6 (3 per hand) are finger joints. This is a major drawback, since 3 finger-level joints are not sufficient to distinguish between action categories such as "Thumbs up" and "Thumbs down", or "Make ok sign" and "Make victory sign". To specifically address this bottleneck, we introduce two new pose-based human action datasets - NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTURGB+D. In addition to the 25 body joints per skeleton as in NTURGB+D, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify state-of-the-art approaches to enable training on the introduced datasets. Our results demonstrate the effectiveness of these NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on previously worst-performing action categories.
Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small groups of joints, such as those of the hands (e.g. 'Thumbs up') or legs (e.g. 'Kicking'). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTURGB+D 60/120 datasets and the dense-joint skeleton datasets NTU60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
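To make the unified part-stream representation concrete, the following minimal sketch shows one way the idea can be realized: joints are grouped into part streams while remaining in the global coordinate frame, and multiple modalities (here only joint position and joint velocity) are concatenated within each stream instead of being trained as separate networks. The joint indices, group names and function are illustrative assumptions, not the exact PSUMNet implementation.

```python
import numpy as np

def build_part_streams(joints, part_groups):
    """Split a skeleton sequence into part streams kept in the global frame,
    each carrying unified modalities (joint position + joint velocity).

    joints: array of shape (T, V, C) - T frames, V joints, C coordinates.
    part_groups: dict mapping a part-stream name to a list of joint indices.
    """
    streams = {}
    for name, idx in part_groups.items():
        part = joints[:, idx, :]              # joints of this part, global coordinates
        velocity = np.zeros_like(part)
        velocity[1:] = part[1:] - part[:-1]   # temporal difference as joint velocity
        # Unify modalities along the channel axis rather than training separate streams
        streams[name] = np.concatenate([part, velocity], axis=-1)
    return streams

# Hypothetical example: a 25-joint NTURGB+D-style skeleton over 64 frames
skeleton = np.random.randn(64, 25, 3)
part_groups = {
    "body":  list(range(25)),                        # full-body stream
    "hands": [6, 7, 10, 11, 21, 22, 23, 24],         # illustrative hand-joint indices
    "legs":  [13, 14, 15, 16, 17, 18, 19],           # illustrative leg-joint indices
}
streams = build_part_streams(skeleton, part_groups)
print({k: v.shape for k, v in streams.items()})
```

Each resulting stream can then be processed by its own lightweight network, which is what allows a part-stream design to remain parameter-efficient compared to running the full model once per modality.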
Finally, we conclude this thesis by exploring new and more challenging frontiers under the umbrella of skeleton action recognition, namely "in the wild" skeleton action recognition and "non-contextual" skeleton action recognition. We introduce Skeletics-152, a curated 3D pose dataset derived from the RGB videos included in the larger Kinetics-700 dataset, to explore in-the-wild skeleton action recognition. We further introduce Skeleton-Mimetics, a 3D pose dataset derived from the recently introduced non-contextual action dataset, Mimetics. By benchmarking and analysing various approaches on these two new datasets, we lay the groundwork for future exploration of these two challenging problems within skeleton action recognition.
Overall, in this thesis we draw attention to prevailing drawbacks in the existing skeleton action datasets and introduce extensions of these datasets to counter their shortcomings. We also introduce a novel, efficient and highly scalable approach, PSUMNet, for pose-based action recognition, and explore the challenging frontiers of in-the-wild and non-contextual skeleton action recognition through new benchmark datasets.