Classification of Multi-class Daily Human Motion using Discriminative Body Parts and Sentence Descriptions
Yusuke Goutsu · Wataru Takano · Yoshihiko Nakamura

Center for Mathematical Modeling and Data Science, Osaka University, 1-3 Machikaneyamacho, Toyonaka-shi, Osaka, Japan
Computer Vision Research Group, Advanced Industrial Science and Technology (AIST), Central 1, 1-1-1 Umezono, Tsukuba, Ibaraki, Japan
Department of Mechano-Informatics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan

Communicated by Koichi Kise
In this paper, we propose a motion model that focuses on the discriminative parts of the human body related to target motions in order to classify human motions into specific categories, and we apply this model to multi-class daily motion classification. We extend the model into a motion recognition system that generates multiple sentences associated with human motions. The motion model is evaluated on four datasets acquired by a Kinect sensor or by multiple infrared cameras in a motion capture studio: UCF-kinect, UT-kinect, HDM05-mocap, and YNL-mocap. We also evaluate the sentences generated from a dataset of motion and language pairs. The experimental results indicate that the motion model improves classification accuracy and that our approach outperforms other state-of-the-art methods on datasets that include human-object interactions with variations in motion duration, such as daily human motions. We achieve a classification rate of 81.1% for multi-class daily motion classification in a non-cross-subject setting. Additionally, the sentences generated by the motion recognition system are semantically and syntactically appropriate descriptions of the target motion, which may lead to human-robot interaction using natural language.
Keywords: Hidden Markov model · Fisher vector · Multiple kernel learning · Motion classification · Multi-class · Sentence description
1 Introduction
As a result of the shift in social demand from industrial uses to service uses, robots and systems have become more intelligent and a familiar presence in our daily lives. Along with this change, intelligent robots and systems operating in human living spaces are expected to observe humans closely, understand their behavior, grasp their intentions, and provide appropriate support in daily life. Classifying daily human motions into specific categories plays an important role here, because a failure to do so could cause danger or inconvenience to humans.
An intuitive and common method of representing human motion is to use sequences of skeleton configurations. Optical motion capture systems provide accurate 3D marker positions by using multiple infrared cameras, but these systems are limited to use in motion capture studios, and subjects have to wear cumbersome devices while performing motions. However, the release of low-cost, marker-less motion sensors, such as the Kinect developed by Microsoft, has recently made skeleton extraction much easier and more practical for skeleton-based motion classification (Shotton et al. 2013). Presti and Cascia (2016) have reviewed the many works related to skeleton-based motion classification.
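For illustration, a skeleton-based motion can be stored as a time series of 3D joint positions. The minimal Python sketch below is not taken from any particular sensor SDK; the joint list and frame count are hypothetical assumptions (the Kinect, for example, tracks a fixed set of 20 joints).

```python
import numpy as np

# Hypothetical 15-joint skeleton; real sensors expose a fixed joint set,
# so this list is only an illustrative assumption.
JOINTS = ["head", "neck", "torso",
          "l_shoulder", "l_elbow", "l_hand",
          "r_shoulder", "r_elbow", "r_hand",
          "l_hip", "l_knee", "l_foot",
          "r_hip", "r_knee", "r_foot"]

def random_motion(num_frames: int) -> np.ndarray:
    """Stand-in for sensor output: array of shape (frames, joints, xyz)."""
    return np.random.randn(num_frames, len(JOINTS), 3)

motion = random_motion(num_frames=120)  # e.g., 4 s of motion at 30 fps
print(motion.shape)                     # (120, 15, 3)
```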
In this context, we proceed on the basis of the following two findings. First, local motion features derived from the discriminative parts of the human body are more useful than a global motion feature derived from the whole body, because the discriminative body parts differ according to the target motion. For example, the “punch” motion mainly uses one arm, the “clap” motion mainly uses both arms, and the “run” motion mainly uses both legs. Second, it is also desirable to classify daily human motions systematically by focusing on the discriminative body parts related to the target motion, because human motion is an interaction between objects in the environment and the body parts in contact with them. For example, the relative positions of a hand and the face become important in the “make a phone call” and “drink water” motions because of the contact between an object and the ear or the mouth, respectively.
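To make the first finding concrete, here is a minimal sketch of per-part (local) features, assuming the hypothetical 15-joint layout from the previous sketch. The part grouping and the simple velocity-statistics descriptor are illustrative assumptions only, not the feature extraction used in our model, which is defined in later sections.

```python
import numpy as np

# Illustrative grouping of the 15 hypothetical joints into five body
# parts; the part definitions used in this paper may differ.
BODY_PARTS = {
    "torso":     [0, 1, 2],
    "left_arm":  [3, 4, 5],
    "right_arm": [6, 7, 8],
    "left_leg":  [9, 10, 11],
    "right_leg": [12, 13, 14],
}

def local_features(motion: np.ndarray) -> dict:
    """Toy per-part descriptor: mean and std of joint velocities.

    `motion` has shape (frames, joints, 3). For a motion such as
    "clap", the arm parts show larger velocity statistics than the
    legs, which is what makes them discriminative.
    """
    velocity = np.diff(motion, axis=0)            # (frames-1, joints, 3)
    features = {}
    for part, joint_ids in BODY_PARTS.items():
        v = velocity[:, joint_ids, :]             # this part's joints only
        features[part] = np.concatenate([v.mean(axis=(0, 1)),
                                         v.std(axis=(0, 1))])
    return features

motion = np.random.randn(120, 15, 3)              # placeholder skeleton data
features = local_features(motion)
print({part: f.shape for part, f in features.items()})  # each part: (6,)
```

A classifier can then weight such per-part descriptors, for instance via multiple kernel learning as listed in the keywords, so that the parts most relevant to each motion class dominate the decision.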
However, simply classifying human motions cannot directly lead to behavior support. A connection to other information is also required for the highly intelligent processing referred to as “motion recognition”. Here, humans differ from other animals in that they can understand the real world through natural language and engage in complex communication with others. To understand the real world in the same way, it is important for intelligent robots and systems to link the real world with natural language. Therefore, we also exploit the properties of natural language, which offers scalability, through the use of large-scale language corpora, and interpretability by humans. By connecting human motions to common words, motion
classification expands to include a (...truncated)