Abstract
The ability to synthesize novel and diverse human motion at scale is indispensable not only to computer vision but also to allied fields such as animation, human-computer interaction, robotics, and human-robot interaction. Over the years, various approaches have been proposed, including physics-based simulation, key-framing, and database-driven methods. Since the renaissance of deep learning and the rapid development of computing, however, the generation of synthetic human motion using deep learning based methods has received significant attention. Apart from pixel-based video data, the availability of reliable motion capture systems has enabled pose-based human action synthesis. Much of this progress is owed to the development of frugal motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action generation.
To begin with, we study an approach for large-scale skeleton-based action generation. In doing so, we introduce MUGL, a novel deep neural model for large-scale, diverse generation of single- and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra/inter-category diversity, we model the latent generative space using a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple the local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences from NTU-RGBD-120. To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler than the baselines, MUGL provides better-quality generations, paving the way for practical and controllable large-scale human action generation.
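As a rough illustration of the category-conditioned Gaussian mixture latent space and the pose/trajectory decoupling, the minimal PyTorch sketch below samples a latent from a per-category Gaussian component and decodes it into root-relative poses plus a global root trajectory. The class names (ConditionalGMMPrior, DecoupledDecoder), layer choices, and all dimensions are illustrative assumptions and do not reflect MUGL's actual architecture.

import torch
import torch.nn as nn

class ConditionalGMMPrior(nn.Module):
    """Illustrative sketch: one learnable Gaussian component per action category."""

    def __init__(self, num_classes=120, latent_dim=32):
        super().__init__()
        # Per-category mean and log-variance of the latent Gaussian component.
        self.mu = nn.Embedding(num_classes, latent_dim)
        self.logvar = nn.Embedding(num_classes, latent_dim)

    def sample(self, labels):
        # Sample z ~ N(mu_c, sigma_c^2) for each category label c.
        mu, logvar = self.mu(labels), self.logvar(labels)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class DecoupledDecoder(nn.Module):
    """Decode the latent into local pose offsets plus a global root trajectory."""

    def __init__(self, latent_dim=32, num_joints=24, seq_len=64):
        super().__init__()
        self.seq_len, self.num_joints = seq_len, num_joints
        self.pose_head = nn.Linear(latent_dim, seq_len * num_joints * 3)
        self.traj_head = nn.Linear(latent_dim, seq_len * 3)

    def forward(self, z):
        pose = self.pose_head(z).view(-1, self.seq_len, self.num_joints, 3)
        traj = self.traj_head(z).view(-1, self.seq_len, 1, 3)
        # Global motion is added back to the root-relative poses.
        return pose + traj

# Usage: sample category-conditioned sequences for three action labels.
prior, decoder = ConditionalGMMPrior(), DecoupledDecoder()
labels = torch.tensor([3, 57, 99])
sequences = decoder(prior.sample(labels))   # shape: (3, 64, 24, 3)

Keeping separate heads for pose and trajectory mirrors the decoupling described above: the pose head only has to model root-relative articulation, while the trajectory head models where the body moves in the scene.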
Further, we study approaches that generalize across datasets with varying properties, as well as methods for dense skeleton action generation. In this backdrop, we introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing. These design choices enable generations across a large range of body joint counts (24-52), frame rates (13-50), global body movement (in-place, locomotion), and action categories (12-120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG's significant improvements over the state-of-the-art, its suitability for action-conditioned generation at scale, and its applicability to the challenging task of long-term motion prediction.
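To make the spatiotemporal transformation blocks concrete, the sketch below pairs multi-head self-attention over joints within each frame with a simple 1D convolution over frames for temporal processing. The module layout, names, and sizes are assumptions for illustration only, not DSAG's exact block design.

import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Illustrative block: spatial attention over joints, then temporal convolution over frames."""

    def __init__(self, dim=64, heads=4, kernel_size=3):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.temporal = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        # Spatial attention: joints attend to each other within each frame.
        s = x.reshape(b * t, j, d)
        s = s + self.spatial_attn(s, s, s, need_weights=False)[0]
        x = self.norm1(s).reshape(b, t, j, d)
        # Temporal processing: convolve each joint's features across frames.
        tseq = x.permute(0, 2, 3, 1).reshape(b * j, d, t)
        tseq = tseq + self.temporal(tseq)
        return self.norm2(tseq.reshape(b, j, d, t).permute(0, 3, 1, 2))

# Usage: a batch of 2 sequences, 50 frames, 52 joints, 64-dim joint features.
block = SpatioTemporalBlock()
out = block(torch.randn(2, 50, 52, 64))   # output shape: (2, 50, 52, 64)

Because the block is agnostic to the number of joints and frames at the attention/convolution level, a stack of such blocks can, in principle, accommodate the varying joint counts and frame rates of the datasets listed above.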