Recognizing an action from a sequence of 3D skeletal poses is a challenging task. First, different actors may perform the same action in various performing styles. Second, the estimated poses are sometimes inaccurate due to sensory noises. These challenges can cause large variations between instances of the same class. Third, the datasets are usually small, with only a few actors performing few repetitions of each action. Hence training complex classifiers risks over-fitting the data. We address this task by mining a set of key-pose-motifs for each action class. A key-pose-motif contains a set of ordered poses or action units(a short sequence of poses), which are required to be close but not necessarily adjacent in the action sequences. The representation is robust to style variations and outlier poses. The key-pose-motifs are represented in terms of a dictionary using soft-quantization (probabilities) to deal with inaccuracies caused by quantization. We propose an efficient algorithm to mine key-pose-motifs taking into account these probabilities. We classify a sequence by matching it to the motifs of each class and select the class that maximizes the matching score. This simple classifier obtains state-of-the-art performance on two benchmark datasets and outperforms a deep network approach.
Supplementary notes can be added here, including code and math.