




Learning Multimodal Representations for Sample-Efficient Recognition of Human Actions

Miguel Vasco¹, Francisco S. Melo¹, David Martins de Matos¹, Ana Paiva¹, Tetsunari Inamura²

¹M. Vasco, F. S. Melo, D. M. de Matos and A. Paiva are with INESC-ID and Instituto Superior Técnico, University of Lisbon, Portugal. E-mail: miguel.vasco@gaips.inesc-id.pt and {fmelo, david.matos, ana.paiva}@inesc-id.pt
²T. Inamura is with the National Institute of Informatics and the Department of Informatics, SOKENDAI (The Graduate University for Advanced Studies), 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan. E-mail: inamura@nii.ac.jp

Abstract: Humans interact in rich and diverse ways with the environment. However, the representation of such behavior by artificial agents is often limited. In this work, we present motion concepts, a novel multimodal representation of human actions in a household environment. A motion concept encompasses a probabilistic description of the kinematics of the action, along with its contextual background, namely the location and the objects held during the performance. We introduce a novel algorithm which learns and recognizes motion concepts from action demonstrations, named Online Motion Concept Learning (OMCL). The algorithm is evaluated in a virtual-reality household environment with the presence of a human avatar. OMCL outperforms standard motion recognition algorithms on a one-shot recognition task, attesting to its potential for sample-efficient recognition of human actions.

I. INTRODUCTION

Humans are able to interact with their environment in rich and diverse ways. Such richness and variety make it impossible to program an artificial agent that is able to recognize all possible actions performed by a human user. One common approach is to program agents to learn to recognize new human actions from demonstrations. However, it is unrealistic to assume that such learning will depend on large amounts of data, as required by many current action recognition algorithms. Instead, the agent should be able to learn and recognize novel actions from just a few demonstrations provided by the human.

To attain such efficient learning, the learning process should take into account the multimodal information provided by the human to create a rich representation of the action to be recognized. However, the conventional methodology of learning human action representations considers only motion pattern data and neglects the rich contextual background of the demonstration. This negligence results in a limited representation of the human action, hindering its recognition and introducing difficulties in the distinction between actions with similar motion patterns.

In this work, we address the problem of learning and recognizing human actions from few demonstrations provided by a human in a household environment. The main contributions of the paper are twofold:

- We propose a novel representation that allows for multiple demonstrations of a given action, named motion concept. A motion concept encompasses a probabilistic description of the motion patterns observed, augmenting it with their contextual background information, namely the location of the action and the objects used during the demonstrations. Moreover, the motion concept takes into account information provided directly through interaction with a human and allows the agent to reason about the importance of each contextual feature for its recognition.
- We introduce the Online Motion Concept Learning (OMCL) algorithm, responsible for the creation of new motion concepts through interaction with a human user. The algorithm is able to recognize motion concepts from a single training demonstration and continuously update motion concepts as more demonstrations are provided.
We evaluate the algorithm's performance on an offline one-shot motion recognition task, revealing the importance of contextual information for the recognition of motion concepts built from a single training demonstration. The obtained results attest to the potential of OMCL for sample-efficient recognition of human actions.

II. RELATED WORK

Several representations have been proposed to model human action based on motion data. Xia et al. [1] propose a view-invariant action representation based on histograms of 3D joint positions, modelled according to discrete HMMs. The authors in [2] propose a novel feature for human action recognition based on the differences in the position of joints in the skeletal model of the human, employing a Naive-Bayes Nearest-Neighbor (NBNN) classifier for recognition of the action classes. The authors in [3] propose an interpretable representation of an action based on the sequence of joints in the skeletal model of the human which, at each time instant, are considered to be the most informative of a given demonstration. The informative criteria are based on predefined measures, such as the mean and variance of joint angle trajectories. Vemulapalli et al. [4] model an action as a curve in a Lie group manifold. The curve of each action is generated based on a skeletal-based representation that explicitly models the geometric relationships between various body parts, using rotations and translations in 3D space. However, all the presented representations neglect the rich contextual background of an action, which is fundamental for the distinction of action classes with similar motion patterns. In this work, we consider contextual information provided by the human (the objects used and the location of the demonstration) in the creation of the action representations.

Several deep learning frameworks have recently been proposed for motion and action recognition. Du et al. [5] proposed a hierarchical recurrent neural network (RNN) framework for skeleton-based action recognition, in which the human skeleton is divided according to the human's physical structure. Simonyan et al. [6] propose a convolutional network architecture for action recognition in video that incorporates spatial and temporal networks. While these architectures obtain impressive recognition results, they require large amounts of training data. In this work, we build and recognize action representations from a limited number of demonstrations.

Multimodal approaches to the creation of action representations have also been explored in the literature. The authors in [7] represent actions as an ensemble model and have proposed novel features for depth data which capture human motion and human-object interaction data from a demonstration. Using image data, Yao et al. [8] learn action representations composed of attributes (words that describe the properties of human actions) and action parts (the objects and poselets that are related to the actions). The authors in [9] model an action by integrating multiple feature channels, such as objects, scenes and people, extracted from video sequences. In this work, we build multimodal action representations in a learning-from-demonstration scenario, from a small number of demonstrations.
III. CONCEPTUAL REPRESENTATION OF AN ACTION

A. The setup

We consider the following setup for learning from demonstration. A human user demonstrates an action, which may involve interaction with objects in the environment. The environment comprises a number of locations of interest, and the human user may be in any of these locations at the time of demonstration.

We assume that the environment is engineered with a number of sensors, providing information regarding the location and pose of the human user, as well as the objects that the user interacts with (see Fig. 1 for an illustration). The sensors act as input channels for the system and, as such, we henceforth refer to sensors generally as channels: a sensor deployed to provide object information is referred to simply as an object channel; sensors deployed to track the human pose are referred to as motion channels; and the location of the user in the environment is provided by a dedicated sensing module, referred to as the location channel.

[Fig. 1: Depiction of the setup used throughout the paper. Sensors correspond to input channels and provide streams of observations of length $T + 1$.]

A demonstration by a human user yields a number of data streams, arising from the different input channels. In particular:

- Each motion channel $k$, $k = 1, \ldots, K$, provides two streams of length $T$: $x^k_{0:T}$ and $R^k_{0:T}$, where each $x^k_t$ indicates the position of a body element (joint, limb) at time step $t$, $t = 0, \ldots, T$, measured with respect to a common fixed world frame, and each $R^k_t$ is a rotation matrix representing the orientation of that same body element at time step $t$.
- Each object channel $m$, $m = 1, \ldots, M$, provides one stream of length $T$: $o^m_{0:T}$, where each individual observation $o^m_t$ corresponds to a binary vector indicating the objects, from a predefined finite set of objects $O$, that the user is interacting with at time step $t$, $t = 0, \ldots, T$, according to channel $m$.
- Finally, the location module provides a stream of length $T$: $l_{0:T}$, where $l_t$ indicates the location of the user at time step $t$. We assume that the location of the user takes values in a finite set $L$ of possible locations.

Learning a representation of an action consists of taking the streams from the different input channels and compiling them into a unique, compact representation that we refer to as a motion prototype.
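To make this input format concrete, the following minimal Python sketch shows one way the streams of a single demonstration could be organized. All class and field names are our own illustrative choices; the paper specifies only the mathematical form of the streams, not an implementation.

```python
# Illustrative containers for the streams of a single demonstration.
# Hypothetical names: the paper defines only the mathematical objects.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MotionChannel:
    positions: np.ndarray  # shape (T + 1, 3): x_t, body-element positions in a fixed world frame
    rotations: np.ndarray  # shape (T + 1, 3, 3): R_t, body-element orientation matrices

@dataclass
class Demonstration:
    motion: List[MotionChannel]  # K motion channels (pose trackers)
    objects: np.ndarray          # shape (M, T + 1, |O|): binary object indicators o_t^m
    location: np.ndarray         # shape (T + 1,): l_t, indices into the finite location set L
```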
B. Motion prototype

The central constituent in our proposed action representation is the motion prototype, which provides a compact representation for a single demonstration of an action by a human user. In particular, motion prototypes capture, in a probabilistic manner, motion, object and location information. Formally, we represent a motion prototype as a tuple $P = \langle \mu, \pi, \lambda \rangle$, where $\pi$ and $\lambda$ summarize the associated context information (namely, object and location information) and $\mu$ summarizes the motion observed in the demonstration. Specifically:

$\pi = \{\pi_m, m = 1, \ldots, M\}$, where $M$ is the total number of object channels. For every object $o \in O$,
$$\pi_{m,o} = P(o^m_t(o) = 1), \qquad t = 0, \ldots, T,$$
where $o^m_t(o)$ is a random variable indicating whether object $o$ was observed in object channel $m$ at time step $t$. We assume that the observation of an object $o \in O$ in channel $m$ at any moment during the human demonstration can be described probabilistically as a Bernoulli random variable with parameter $\pi_{m,o}$.

For every location $l \in L$,
$$\lambda_l = P(l_t = l), \qquad t = 0, \ldots, T,$$
where $l_t$ is a random variable indicating the location of the human demonstrator at time step $t$. We assume that the location of the human user during the demonstration can be described probabilistically as a categorical distribution with parameters $\lambda_l$, $l \in L$.

Finally, we have that $\mu = \{\mu_k, k = 1, \ldots, K\}$, where $K$ is the number of motion channels, and each $\mu_k$ is a sequence of motion primitives $\{\eta_n, n = 1, \ldots, N\}$, a concept widely explored to describe and represent animal and robot motion [10], [11]. In this work, a motion primitive $\eta_n$ is a probability distribution over the space of trajectories: given an arbitrary trajectory $(x_{0:T}, R_{0:T})$,
$$\eta_n(x_{0:T}, R_{0:T}) = P(x^n_{0:T} = x_{0:T}, R^n_{0:T} = R_{0:T}).$$

For the purpose of learning and recognition, it is convenient to treat an action not as comprising a single trajectory $(x_{0:T}, R_{0:T})$ but, instead, as a sequence of smaller trajectories
$$(x_{0:t_1}, R_{0:t_1}), (x_{t_1:t_2}, R_{t_1:t_2}), \ldots, (x_{t_{N-1}:t_N}, R_{t_{N-1}:t_N}),$$
which are then encoded as a sequence of motion primitives $\{\eta_n, n = 1, \ldots, N\}$, each $\eta_n$ providing a compact description of $(x_{t_{n-1}:t_n}, R_{t_{n-1}:t_n})$. Each motion primitive $\eta_n$ is selected from a library $\Lambda$ of available motion primitives so as to maximize the likelihood of the observed trajectory, i.e.,
$$\eta_n = \arg\max_{\eta \in \Lambda} \eta(x_{t_{n-1}:t_n}, R_{t_{n-1}:t_n}). \tag{1}$$

Summarizing, a motion prototype compactly encodes a demonstration of an action in the form of a tuple $\langle \mu, \pi, \lambda \rangle$, where $\mu$ is a collection of trajectories (one for each motion channel), each represented as a sequence of motion primitives; $\pi$ is a collection of probability distributions (one for each object channel) describing how the human interacted with each object in the environment; and $\lambda$ is a probability distribution describing where the human was located during the demonstration (see Fig. 2).

[Fig. 2: Summary representation of a motion prototype.]

C. Motion concept

It is possible for a single action to be performed in multiple different ways. A motion prototype, while providing a convenient representation for a single demonstration and corresponding context, is insufficient to capture the diversity that a broader notion of action entails. We introduce the motion concept as a higher-level representation of an action. A motion concept seeks to accommodate the different ways by which an action may be performed while, at the same time, encoding the distinctive aspects that are central in recognizing such an action. For example, to distinguish actions such as "waving goodbye" and "washing a window", it is important to note that the latter involves interaction with an object (such as a sponge) while the first does not.

Formally, a motion concept of action $a$, depicted in Fig. 3, is a tuple $M_a = \langle \mathcal{P}, \delta, k_\pi, k_\lambda \rangle$, where:

- $\mathcal{P} = \{P_1, \ldots, P_p\}$, where each $P_i$ is a motion prototype describing one possible way by which the action $a$ can be performed;
- $\delta$ is a designation, a name provided by the user to refer to the action $a$; for example, it may consist of a label for $a$ or an utterance that corresponds to the spoken designation of $a$;
- $k_\pi$ and $k_\lambda$ are two constants used to weight the importance of object information and location information for the recognition of action $a$.

[Fig. 3: Schematic representation of a motion concept.]
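Before turning to how OMCL builds these representations from data, the sketch below illustrates how the components of a single prototype could be computed: the Bernoulli parameters $\pi$ via their maximum-likelihood estimate (the fraction of time steps in which each object is observed), the categorical parameters $\lambda$ via empirical location frequencies, and the primitive sequence $\mu_k$ via the likelihood maximization of Eq. (1). The function names and the log-likelihood interface assumed for the primitive library are hypothetical.

```python
import numpy as np

def estimate_pi(objects: np.ndarray) -> np.ndarray:
    """MLE of the Bernoulli parameters pi_{m,o}: the fraction of time steps
    at which object o is observed in channel m; `objects` has shape (M, T + 1, |O|)."""
    return objects.mean(axis=1)

def estimate_lambda(location: np.ndarray, num_locations: int) -> np.ndarray:
    """MLE of the categorical parameters lambda_l: empirical frequency of each location."""
    counts = np.bincount(location, minlength=num_locations)
    return counts / counts.sum()

def select_primitives(subtrajectories, library) -> list:
    """Eq. (1): map each sub-trajectory to the library primitive maximizing its
    likelihood. `library` is assumed to be a list of density objects exposing a
    log_likelihood(trajectory) method (e.g., densities maintained by online KDE)."""
    sequence = []
    for traj in subtrajectories:
        scores = [prim.log_likelihood(traj) for prim in library]
        sequence.append(int(np.argmax(scores)))  # index of the selected primitive eta_n
    return sequence  # the representation mu_k for one motion channel
```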
IV. LEARNING MOTION CONCEPTS

We now introduce the OMCL (Online Motion Concept Learning) algorithm, designed to construct motion concepts from the demonstration data provided by a human user. The OMCL algorithm can roughly be understood as working at two different abstraction levels. At a lower level, OMCL takes the data from a single demonstration and constructs a motion prototype from such data. At a higher level, OMCL combines information from multiple demonstrations to build a motion concept that potentially contains multiple motion prototypes. We now detail OMCL at each of these two abstraction levels.

A. Learning a motion prototype from data

Given a demonstration of action $a$, OMCL starts by segmenting the single trajectory of motion channel $k$, $(x^k_{0:T}, R^k_{0:T})$, into multiple sub-trajectories
$$(x^k_{0:t_1}, R^k_{0:t_1}), \ldots, (x^k_{t_{N-1}:t_N}, R^k_{t_{N-1}:t_N}), \tag{2}$$
which can be achieved using any segmentation method from the literature; OMCL is agnostic to the particular segmentation method used. OMCL uses online kernel density estimation (in our implementation, we use the XOKDE algorithm from [12]) to build new motion primitives from sub-trajectory data. The list of segmented sub-trajectories (Eq. 2) is used to update the previous library $\Lambda$ of available motion primitives. Subsequently, each sub-trajectory $(x^k_{t_{n-1}:t_n}, R^k_{t_{n-1}:t_n})$, $n = 1, \ldots, N$, is evaluated against the updated $\Lambda$, and a primitive $\eta_n$ is selected according to Eq. (1). The resulting sequence of motion primitives $\{\eta^k_1, \ldots, \eta^k_N\}$ corresponds to the trajectory representation $\mu_k$ previously described. This procedure is repeated for all motion channels in order to build $\mu_a$. As for $\pi_a$ and $\lambda_a$, we use standard maximum likelihood estimation to compute the parameters $\{\pi_{a,o}, o \in O\}$ and $\{\lambda_{a,l}, l \in L\}$ from the data $o_{0:T}$ and $l_{0:T}$, respectively.

B. Building a motion concept

From a provided demonstration of action $a$, the OMCL algorithm learns a motion prototype $P_a = \langle \mu_a, \pi_a, \lambda_a \rangle$. If $P_a$ is the first motion prototype of action $a$ provided, a new motion concept $M_a$ is built through the following procedure:

- The motion prototype is added to the (initially empty) list of prototypes: $\mathcal{P} = \mathcal{P} \cup \{P_a\}$.
- The importance weights $k_\pi$ and $k_\lambda$ are initialized to predetermined values: $k_\pi = k_\pi^0$, $k_\lambda = k_\lambda^0$.
- The human is queried for the designation $\delta$ of the action.

If the motion prototype $P_a$ concerns an action $a$ previously demonstrated, it is used to update the respective motion concept $M_a$, as follows:

- The motion prototype is added to the list of prototypes: $\mathcal{P} = \mathcal{P} \cup \{P_a\}$.
- We update the values of the importance weights $k_\pi$ and $k_\lambda$. If, for the majority of the prototypes $P_i \neq P_a$ previously contained in $\mathcal{P}$,
$$\arg\max_{o \in O} \pi^i_{m,o} = \arg\max_{o \in O} \pi^a_{m,o}, \qquad m = 1, \ldots, M,$$
we increase the value of $k_\pi$ by a percentage $\Delta k$ of its value; otherwise, we decrease it by the same percentage. The same update rule is applied to $k_\lambda$, resorting to the location data.

V. RECOGNIZING MOTION CONCEPTS

The OMCL algorithm is also capable of recognizing previously observed actions and assessing the novelty of previously unobserved actions. Given a motion prototype $P$ from an unknown action, OMCL compares $P$ with the prototypes contained in every motion concept in the current library of motion concepts. The cost of assigning $P$ to the motion concept $M_i$ is given by
$$C(M_i, P) = C_M(P, \mathcal{P}_i) + k_\pi\, C_O(P, \mathcal{P}_i) + k_\lambda\, C_L(P, \mathcal{P}_i), \tag{3}$$
where:

- $C_M(P, \mathcal{P}_i)$ is the average distance between the sequence of motion primitives in $P$ and the sequence of motion primitives of every motion prototype $P_j \in \mathcal{P}_i$ of $M_i$. To compute this cost, we use dynamic time warping [13] between the sequences of each motion channel, with a 0/1 loss.
- $C_O(P, \mathcal{P}_i)$ is the average distance between $\pi$ and the collection of object probability distributions of every motion prototype $P_j \in \mathcal{P}_i$ of $M_i$. The algorithm is agnostic to the type of metric used to compute the distance between distributions.
- $C_L(P, \mathcal{P}_i)$ is the average distance between $\lambda$ and the location probability distributions of every motion prototype $P_j \in \mathcal{P}_i$ of $M_i$.
- $k_\pi$ and $k_\lambda$ are the object and location information weights of $M_i$.

For recognition purposes, the cost is computed for all motion concepts in the library, and $P$ is assigned to the motion concept $M_R$ with the lowest assignment cost $C(M_R, P)$.

To address the question of the possible novelty of the action demonstrated, after the recognition procedure OMCL determines whether $P$ belongs to the assigned motion concept or to a novel action class. The decision takes into account the average cost $\bar{C}_W$ of assigning $P$ to the motion concepts in the library. In case
$$C(M_R, P) \le \bar{C}_W - \Delta_C\, C(M_R, P), \tag{4}$$
the provisional assignment of $P$ to $M_R$ is confirmed, where $\Delta_C$ is a predefined constant. Otherwise, the assignment can be considered quasi-random, and $P$ will be used to build a new motion concept, as detailed in the previous section.
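The sketch below illustrates this recognition procedure: the weighted cost of Eq. (3) is computed against every stored concept, using DTW with a 0/1 loss for the motion term, and the novelty test of Eq. (4) decides whether the best assignment is kept. The distribution distance, the attribute names, the default value of delta_c, and our reading of Eq. (4) (accept when the best cost sits sufficiently below the library-wide average) are assumptions, not the paper's prescription.

```python
import numpy as np

def dtw_01(seq_a, seq_b):
    """Dynamic time warping between two primitive-index sequences with a
    0/1 loss (cost 0 if primitives match, 1 otherwise)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def assignment_cost(query, concept, dist=lambda p, q: np.abs(p - q).mean()):
    """Eq. (3): C(M_i, P) = C_M + k_pi * C_O + k_lambda * C_L, with each term
    averaged over the prototypes stored in the concept. `dist` is a placeholder
    distance between distributions (the paper is agnostic to this choice)."""
    c_m = np.mean([np.mean([dtw_01(q, p) for q, p in zip(query.mu, proto.mu)])
                   for proto in concept.prototypes])
    c_o = np.mean([dist(query.pi, proto.pi) for proto in concept.prototypes])
    c_l = np.mean([dist(query.lam, proto.lam) for proto in concept.prototypes])
    return c_m + concept.k_pi * c_o + concept.k_lam * c_l

def recognize(query, concepts, delta_c=0.5):
    """Assign `query` to the lowest-cost concept; flag it as novel when the
    best cost is not sufficiently below the average cost (our reading of Eq. 4)."""
    costs = [assignment_cost(query, c) for c in concepts]
    best = int(np.argmin(costs))
    if costs[best] <= np.mean(costs) - delta_c * costs[best]:
        return best   # assignment confirmed
    return None       # quasi-random assignment: treat as a new concept
```

Under this reading, delta_c trades off recall of known concepts against sensitivity to novel actions: larger values send more queries to new concepts.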
VI. EVALUATION

A. Experimental Setup

The evaluation of the OMCL algorithm was pe