Alejandro Jaimes, Nicu Sebe, "Multimodal human-computer interaction: A survey," Computer Vision and Image Understanding, 2007.

Multimodal Human-Computer Interaction: A Survey (Translation)

Abstract

This paper summarizes the main approaches in multimodal human-computer interaction (MMHCI), giving an overview of the field from a computer vision perspective. We place particular emphasis on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in speech), discuss user and task modeling and multimodal fusion, and highlight challenges, open issues, and emerging applications in MMHCI research.

1. Introduction

Multimodal human-computer interaction (MMHCI) lies at the intersection of several research areas, including computer vision, psychology, and artificial intelligence. We study MMHCI to make computer technology more usable by people, which always requires understanding at least three aspects: the user who interacts with the computer, the system (the computer technology and its usability), and the interaction between user and system. Considering these aspects, it is clear that MMHCI is a multidisciplinary subject, since the designer of an interactive system should have expertise in a range of relevant areas: psychology and cognitive science, to understand the user's perceptual, cognitive, and problem-solving skills; sociology, to understand the wider context of interaction; ergonomics, to understand the user's physical capabilities; graphic design, to produce effective interface presentations; computer science and engineering, to build the necessary technology; and so on. The multidisciplinary nature of MMHCI motivates this survey. Rather than focusing only on the computer vision techniques used in MMHCI, we give an overview of the field, discussing the main approaches and topics in MMHCI from a computer vision perspective.

1.1 Motivation

Human-to-human communication inherently involves interpreting a mix of speech and visual signals. Researchers in many fields have recognized this, and advances in unimodal techniques (speech and audio processing, computer vision, etc.) and in hardware technologies (inexpensive cameras and other types of sensors) have enabled significant progress in MMHCI research. Unlike traditional HCI applications (a single user facing a computer and interacting with it via a mouse or keyboard), in new applications (e.g., intelligent homes [105], remote collaboration, arts, etc.) interactions are not always explicit commands and often involve multiple users. This is partly because, over the past few years, processor speed, memory, and storage capacity have improved dramatically, matched by the availability of many novel input and output devices that are making ubiquitous computing [185][67][66] a reality. The devices include phones, embedded systems, PDAs, laptops, wall-size displays, and so on. The availability of so many devices with differing computing power and input/output capabilities means that the future of computing will include novel ways of interacting. Some of the approaches include gestures [136], speech [143], haptics [9], eye blinks [58], and others: glove-mounted devices [19], graspable user interfaces [48], and tangible user interfaces now seem ripe for exploration, and pointing devices with haptic feedback, eye tracking, and blink detection [69] have also appeared. However, just as in human-to-human communication, effective communication is likely to take place when different input devices are used in combination.

Multimodal interfaces have many advantages [34]: they can prevent errors, bring robustness to the interface, help the user correct errors or recover from them more easily, bring a wider communication bandwidth, and add alternative methods of communication for different situations and environments. In many systems, using a multimodal interface to disambiguate error-prone modalities is one of the main motivations for multimodal applications. As Oviatt [123] notes, error-prone technologies can compensate for each other rather than add redundancy to the interface, reducing the need for error correction. It must be pointed out, however, that multiple modalities alone do not bring benefits to an interface: the use of multiple modalities may be ineffective or even disadvantageous. Accordingly, Oviatt [124] has identified common misconceptions ("myths") about multimodal interfaces, most of which concern the use of speech as an input modality.

In this paper, we survey the research areas that we consider essential to MMHCI, summarize the state of the art, and, based on our survey, identify major trends and open issues in MMHCI. We group vision techniques according to the human body, as shown in Fig. 1: large-scale body movement, gesture, and gaze analysis are used for tasks such as expression recognition in affective interaction and for a variety of other applications. We discuss affective computer interaction, open issues in multimodal fusion, modeling, and data collection, and various emerging MMHCI applications. Since MMHCI is a very dynamic and broad research area, we do not attempt to present an exhaustive overview. The main contribution of this paper is therefore to summarize the main computer vision techniques used in MMHCI while giving an overview of the major research areas, techniques, applications, and open issues in MMHCI.

Fig. 1. Overview of human-centered multimodal interaction.

1.2 Related surveys

Extensive surveys have already been published in several areas, such as face detection [190][63], face recognition [196], facial expression analysis [47][131], vocal emotion [119][109], gesture recognition [96][174][136], human motion analysis [65][182][56][3][46][107], audio-visual automatic speech recognition [143], and eye tracking [41][36]. Surveys of vision-based HCI are presented in [142] and [73], with emphasis on head tracking, face and facial expression recognition, eye tracking, and gesture recognition. Adaptive and intelligent HCI is discussed in [40], mainly a survey of computer vision for human motion analysis together with a discussion of techniques for lower-arm movement detection, face processing, and gaze analysis. Multimodal interfaces are discussed in [125][128][144][158][135][171], and real-time vision techniques for HCI, including human pose, object tracking, gestures, attention, and face pose, are discussed in [84] and [77]. Here, we do not discuss the work covered by previous surveys (e.g., [84][40][142][126][115]); instead, we add areas they do not cover, discuss new applications in emerging areas, and highlight the main research issues.

Related conferences and workshops include ACM CHI, IFIP Interact, IEEE CVPR, IEEE ICCV, ACM Multimedia, the International Workshop on Human-Centered Multimedia (HCM, in conjunction with ACM Multimedia), the International Workshops on Human-Computer Interaction (in conjunction with ICCV and ECCV), the Intelligent User Interfaces (IUI) conference, and the International Conference on Multimodal Interfaces (ICMI).

2. Overview of multimodal interaction

The term "multimodal" has been used in many contexts and has acquired multiple interpretations (see, e.g., the discussions of modality in [10][12]). For our purposes,
a multimodal HCI system is simply a system that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). We take a human-centered approach: by "modality" we mean a mode of communication according to the human senses, as well as computer input devices activated by humans or measuring human quantities (e.g., a blood pressure monitor), as shown in Fig. 1. The human senses include sight, touch, hearing, smell, and taste. The input modalities of many computer input devices correspond to human senses: cameras (sight), haptic sensors (touch) [9], microphones (hearing), olfactory devices (smell), and taste devices [92]. Many other human-activated computer input devices, however, correspond to a combination of human senses or to none at all: keyboard, mouse, writing tablet, motion input (e.g., devices whose own movement is used for interaction), galvanic skin response, and other biometric sensors.

In our definition, the word "input" is the most important one, since in practice most interaction with computers takes place through several modalities. For example, when we type we touch the keys on the keyboard to enter data into the computer, but some of us also use sight to read what we type or to locate the keys to press. It is therefore important to keep in mind the difference between what the human is doing during the interaction and what the system is actually receiving as input. For example, a computer equipped with a microphone might understand multiple languages, or merely different types of sounds (e.g., a humming interface used for music retrieval). Although the term "multimodal" has often been applied to such situations (e.g., multilingual input is considered multimodal in [13]), in this paper we call multimodal only those systems that combine different modalities, that is, different communication channels, as depicted in Fig. 1. For example, a system that responds to facial expressions and hand gestures using only cameras is not multimodal, even if the input signals come from multiple cameras. By the same token, a system with multiple keys is not multimodal, whereas one with mouse and keyboard input is. Although multimodal interaction with multiple devices such as mouse and keyboard, or keyboard and pen, has been studied, this paper addresses only human-computer interaction techniques in which visual (camera) input is combined with other types of input.

In HCI, multimodal techniques can be used to construct many different types of interfaces (Fig. 1). We are particularly interested in perceptual, attentive, and enactive interfaces, as defined in [177]. Perceptual interfaces [176] are highly interactive, multimodal interfaces that enable rich, natural, and efficient interaction with the computer. Perceptual interfaces seek to leverage sensing (input) and rendering (output) technologies to provide interaction that cannot be achieved with standard interfaces and common I/O devices such as the keyboard, mouse, and monitor [177], and they make computer vision a central component in many cases. Attentive interfaces [180] are context-aware interfaces [160] that rely on a person's attention as the primary input; that is, attentive interfaces [120] use the gathered information to estimate the best time and approach for communicating with the user. Since attention is expressed mainly through eye contact [160] and gestures (although measures such as mouse movement can also be indicative), computer vision plays a major role in attentive interfaces. Enactive interfaces are those that help users communicate a form of knowledge based on the active use of the hands or body for apprehension tasks. Enactive knowledge is not simply multisensory mediated knowledge; it is stored in the form of motor responses and acquired by "doing." Typical examples are the competences required by tasks such as typing, driving a car, dancing, playing a musical instrument, and modeling clay, all of which are difficult to describe in an iconic or symbolic form.

3. Human-centered vision

We classify vision techniques for MMHCI using a human-centered approach, dividing them according to the human body: (1) large-scale body movements, (2) hand gestures, and (3) gaze. We make a distinction between command actions (which can be used to explicitly execute commands, select menus, etc.) and non-command interfaces (actions or events used to indirectly tune the system to the user's needs) [111][23]. In general, vision-based human motion analysis systems used for MMHCI can be thought of as having four main stages: (1) motion segmentation, (2) object classification, (3) tracking, and (4) interpretation. Some approaches use geometric primitives to model different components (e.g., cylinders to model limbs, head, and torso for body movements, or hand and fingers in gesture recognition), while others use feature representations based on appearance (appearance-based methods). In the first approach, external markers are often used to estimate body posture and the relevant parameters. While markers can be accurate, they place restrictions on clothing and require calibration, so they are not desirable in many applications. Moreover, fitting geometric shapes to body parts can be computationally expensive, and these methods are often not suitable for real-time processing. Appearance-based methods, on the other hand, do not require markers but do require training (e.g., with machine learning or probabilistic approaches). Since they do not require markers, they place fewer constraints on the user and are therefore more desirable.
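To make the four-stage decomposition concrete, the following minimal sketch chains the stages together. It is an illustration only: the frame-differencing segmenter, the area-based "classifier," and all thresholds are hypothetical placeholders, not techniques prescribed by any of the systems surveyed here.

```python
# A minimal, self-contained sketch of the four-stage pipeline:
# segmentation -> classification -> tracking -> interpretation.
# All names and thresholds are illustrative.
import numpy as np

def segment_motion(frame, background, thresh=25):
    """Stage 1: foreground mask by simple background subtraction."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh

def classify_blob(mask):
    """Stage 2: toy classifier that labels the blob by its area."""
    return "body" if mask.sum() > 0.05 * mask.size else "hand"

def track(centroid, prev_centroid):
    """Stage 3: frame-to-frame displacement of the blob centroid."""
    return centroid - prev_centroid

def interpret(velocity, still_thresh=1.0):
    """Stage 4: map the motion to a (very coarse) event label."""
    return "moving" if np.linalg.norm(velocity) > still_thresh else "still"

def process(frame, background, prev_centroid):
    """Run one grayscale frame through all four stages."""
    mask = segment_motion(frame, background)
    if not mask.any():
        return "no foreground", prev_centroid
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])
    event = interpret(track(centroid, prev_centroid))
    return f"{classify_blob(mask)}: {event}", centroid
```

Real systems replace each stage with far more sophisticated machinery (background models, learned classifiers, probabilistic trackers), but the data flow between the stages is typically the one shown.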
Next, we briefly discuss some specific techniques for body, gesture, and gaze. The motion analysis steps are similar, so there is some inevitable overlap in the discussions; some of the issues for gesture recognition, for instance, apply to body movements and gaze detection as well.

3.1 Large-scale body movements

Tracking of large-scale body movements (head, arms, torso, and legs) is necessary to interpret pose and motion in many MMHCI applications. However, since extensive surveys have been published in this area [182][56][107][183], we discuss the topic only briefly. There are three important issues in articulated motion analysis [188]: representation (joint angles or motion of all the sub-parts), computational paradigms (deterministic or probabilistic), and computation reduction. Body posture analysis is important in many MMHCI applications. For example, in [172] the authors use a stereo and thermal infrared video system to estimate the driver's posture for the deployment of smart air bags. The authors of [148] propose a method for recovering articulated body pose without initialization and tracking, using learning. The authors of [8] use pose and velocity vectors to recognize body parts and detect different activities, while the authors of [17] use temporal templates. In some emerging MMHCI applications, group and non-command actions play an important role. In [102], visual features are extracted from head and hand/forearm blobs: the head blob is represented by the vertical position of its centroid, and hand blobs are represented by eccentricity and angle with respect to the horizontal. These features, together with audio features (e.g., energy, pitch, and speaking rate, among others), are used for segmenting meeting videos according to actions such as monologue, presentation, white-board discussion, and note-taking. The authors of [60] use only computer vision but make a distinction between body movements, events, and behaviors within a rule-based system framework. Important issues for large-scale body tracking include whether the approach uses 2D or 3D, the desired accuracy, speed, occlusion, and other constraints. Some of the issues pertaining to gesture recognition, discussed next, can also apply to body tracking.

3.2 Hand gesture recognition

Although in human-human communication gestures are often performed using a variety of body parts (e.g., arms, eyebrows, legs, the entire body, etc.), most researchers in computer vision use the term gesture recognition to refer exclusively to hand gestures. We will use the term accordingly and focus on hand gesture recognition in this section. Psycholinguistic studies of human-to-human communication [103] describe gestures as the critical link between our conceptualizing capacities and our linguistic abilities. Humans use a very wide variety of gestures, ranging from simple actions of using the hand to point at objects to more complex actions that express feelings and allow communication with others. Gestures should therefore play an essential role in MMHCI [83][186][52], as they seem intrinsic to natural interaction between the human and the computer-controlled interface in many applications, ranging from virtual environments [82] and smart surveillance [174] to remote collaboration [52]. There are several important issues that should be considered when designing a gesture recognition system [136]. The first phase of a recognition task is choosing a mathematical model that may consider both the spatial and the temporal characteristics of the hand and of hand gestures. The approach used for modeling plays a crucial role in the nature and performance of gesture interpretation. Typically, features are extracted from the images or video, and once these features are extracted, model parameters are estimated based on subsets of them until a right match is found. For example, the system might detect n points and attempt to determine whether these n points, or a subset of them, could match the characteristics of points extracted from a hand in a particular pose or performing a particular action. The parameters of the model are then a description of the hand pose or trajectory and depend on the modeling approach used.
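The point-matching step just described can be illustrated with a small, hypothetical sketch: detected feature points are normalized for translation and scale and compared against stored pose templates, and the closest template within a tolerance is reported. The templates, the distance measure, and the assumption that points arrive in corresponding order are simplifications for illustration, not the method of any particular system cited here.

```python
# Hypothetical nearest-template matching of detected hand points.
import numpy as np

def normalize(points):
    """Remove translation and scale so poses are compared by shape only."""
    p = points - points.mean(axis=0)
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else p

def match_pose(detected, templates, tol=0.5):
    """Return the best-matching pose label, or None if nothing fits."""
    d = normalize(np.asarray(detected, dtype=float))
    best_label, best_dist = None, tol
    for label, template in templates.items():
        t = normalize(np.asarray(template, dtype=float))
        if t.shape != d.shape:        # point counts must agree here
            continue
        dist = np.linalg.norm(d - t)  # assumes corresponding point order
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Usage: templates map a pose name to its characteristic points.
templates = {"point": [[0, 0], [1, 0], [2, 0]],
             "fist":  [[0, 0], [0.5, 0.5], [1, 0]]}
print(match_pose([[0, 0], [1.1, 0.05], [2, 0]], templates))  # -> point
```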
Among the important problems involved in the analysis are hand localization [187], hand tracking [194], and the selection of suitable features [83]. After the parameters are computed, the gestures they represent need to be classified and interpreted based on the accepted model and on grammar rules that reflect the internal syntax of gestural commands. The grammar may also encode the interaction of gestures with other communication modes such as speech, gaze, or facial expressions. As an alternative to modeling, some authors have explored the use of combinations of simple 2D motion-based detectors for gesture recognition [71]. In any case, to fully exploit the potential of gestures for an MMHCI application, the class of possible recognized gestures should be as broad as possible, and ideally any gesture performed by the user should be unambiguously interpretable by the interface. However, most gesture-based HCI systems allow only symbolic commands based on hand posture or 3D pointing. This is due to the complexity associated with gesture analysis and the desire to build real-time interfaces. Also, most systems accommodate only single-hand gestures. Yet human gestures, especially communicative ones, naturally employ actions of both hands. However, if two-hand gestures are to be allowed, several ambiguous situations may appear (e.g., occlusion of hands, intentional vs. unintentional, etc.), and the processing time will likely increase. Another important aspect that is increasingly being considered is the use of other modalities (e.g., speech) to augment the MMHCI system [127][162]. The use of such multimodal approaches can reduce the complexity and increase the naturalness of the interface for MMHCI [126].

3.3 Gaze detection

Gaze, defined as the direction to which the eyes are pointing in space, is a strong indicator of attention, and it has been studied extensively, since as early as 1879 in psychology and more recently in neuroscience and in computing applications [41]. While early eye-tracking research focused only on systems for in-lab experiments, many commercial and experimental systems are available today for a wide range of applications. Eye-tracking systems can be grouped into wearable or non-wearable, and infrared-based or appearance-based. In infrared-based systems, a light shining on the subject whose gaze is to be tracked creates the "red-eye" effect: the difference in reflection between the cornea and the pupil is used to determine the direction of sight. In appearance-based systems, computer vision techniques are used to find the eyes in the image and then determine their orientation. While wearable systems are the most accurate (approximate error rates below 1.4°, vs. errors below 1.7° for non-wearable infrared systems), they are also the most intrusive. Infrared systems are more accurate than appearance-based ones, but there are concerns over the safety of prolonged exposure to infrared lights. In addition, most non-wearable systems require often-cumbersome calibration for each individual [108][121].
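One widely used calibration scheme for infrared trackers of this kind (the pupil-centre/corneal-reflection approach, mentioned here as a representative example rather than a system from this survey) maps the vector between the pupil centre and the corneal glint to screen coordinates with a low-order polynomial fitted by least squares during a short calibration session:

```python
# Sketch of polynomial gaze calibration for a pupil/corneal-reflection
# tracker. The second-order mapping is a common but assumed choice;
# at least six calibration targets are needed to fit it.
import numpy as np

def design_matrix(v):
    """Second-order polynomial terms of pupil-glint vectors (x, y)."""
    x, y = v[:, 0], v[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])

def calibrate(vectors, screen_points):
    """Fit mapping coefficients from calibration samples."""
    A = design_matrix(np.asarray(vectors, dtype=float))
    b = np.asarray(screen_points, dtype=float)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef  # shape (6, 2): one column per screen axis

def gaze_point(vector, coef):
    """Map one pupil-glint vector to an estimated on-screen point."""
    return (design_matrix(np.asarray([vector], dtype=float)) @ coef)[0]
```

The per-user calibration burden mentioned above corresponds to collecting the (vector, screen point) pairs that `calibrate` consumes.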
Appearance-based systems usually capture both eyes, using two cameras, to predict gaze direction. Due to the computational cost of processing two streams simultaneously, the resolution of the image of each eye is often small. This makes such systems less accurate, although increasing computational power and lower costs mean that more computationally intensive algorithms can be run in real time. As an alternative, in [181] the authors propose using a single high-resolution image of one eye to improve accuracy. On the other hand, infrared-based systems usually use only one camera, but the use of two cameras has been proposed to further increase accuracy [152]. Although most research on non-wearable systems has focused on desktop users, the ubiquity of computing devices has allowed for applications in other domains in which the user is stationary (e.g., [168][152]). For example, the authors of [168] monitor driver visual attention using a single non-wearable camera placed on a car's dashboard to track face features and for gaze detection. Wearable eye trackers have also been investigated, mostly for desktop applications or for users that do not walk while wearing the device. Also, because of advances in hardware (e.g., reduction in size and weight) and lower costs, researchers have been able to investigate uses in novel applications. For example, in [193] eye-tracking data are combined with video from the user's perspective, head directions, and hand motions to learn words from natural interactions with users; the authors of [137] use a wearable eye tracker to understand hand-eye coordination in natural tasks; and the authors of [38] use a wearable eye tracker to detect eye contact and record video for blogging.

The main issues in developing gaze-tracking systems are intrusiveness, speed, robustness, and accuracy. The type of hardware and the algorithms necessary, however, depend highly on the level of analysis desired. Gaze analysis can be performed at three different levels [23]: (a) highly detailed low-level micro-events, (b) low-level intentional events, and (c) coarse-level goal-based events. Micro-events include micro-saccades, jitter, nystagmus, and brief fixations, which are studied for their physiological and psychological relevance by vision scientists and psychologists. Low-level intentional events are the smallest coherent units of movement that the user is aware of during visual activity, which include sustained fixations and revisits. Although most of the work on HCI has focused on coarse-level goal-based events (e.g., using gaze as a pointer [165]), it is easy to foresee the importance of analysis at lower levels, particularly to infer the user's cognitive state in affective interfaces (e.g., [62]).

Within this context, an important issue often overlooked is how to interpret eye-tracking data. In other words, as the user moves his eyes during interaction, the system must decide what the movements mean in order to react accordingly. We move our eyes 2-3 times per second, so a system may have to process large amounts of data within a short time, a task that is not trivial even if processing does not occur in real time. One way to interpret eye-tracking data is to cluster fixation points and assume, for instance, that clusters correspond to areas of interest. Clustering of fixation points is only one option, however, and as the authors of [154] discuss, it can be difficult to determine the clustering algorithm parameters. Other options include obtaining statistics on measures such as the number of eye movements, saccades, distances between fixations, the order of fixations, and so on.
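As a concrete example of the first step of such an analysis, reducing raw gaze samples to fixation points before clustering them into areas of interest, here is a hedged sketch of a dispersion-threshold fixation detector, one common approach; the thresholds are typical orders of magnitude, not values taken from this survey or from [154].

```python
# Illustrative dispersion-threshold fixation detection: consecutive
# gaze samples that stay within a small bounding box long enough are
# collapsed into one fixation centroid.
import numpy as np

def fixations(samples, max_dispersion=30.0, min_samples=5):
    """samples: (n, 2) array of gaze points (e.g., in pixels)."""
    samples = np.asarray(samples, dtype=float)
    out, start = [], 0
    for end in range(1, len(samples) + 1):
        win = samples[start:end]
        dispersion = np.ptp(win[:, 0]) + np.ptp(win[:, 1])
        if dispersion > max_dispersion:         # window broke apart
            if end - 1 - start >= min_samples:  # long enough to count
                out.append(win[:-1].mean(axis=0))
            start = end - 1                     # start a new window
    if len(samples) - start >= min_samples:     # trailing fixation
        out.append(samples[start:].mean(axis=0))
    return np.array(out)
```

The resulting fixation centroids are what a clustering step, or the fixation statistics listed above, would then operate on.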
4. Affective human-computer interaction

Most current MMHCI systems do not account for the fact that human-human communication is always socially situated and that we use emotion to enhance our communication. However, since emotion is often expressed in a multimodal way, it is an important area for MMHCI, and we will discuss it in some detail. HCI systems
