An Assistive Low-Vision Platform that Augments Spatial Cognition through Proprioceptive Guidance: Point-to-Tell-and-Touch

Wenjun Gui, Bingyu Li, Shuaihang Yuan, John-Ross Rizzo, Lakshay Sharma, Chen Feng, Anthony Tzes and Yi Fang

Wenjun Gui, Bingyu Li, Shuaihang Yuan, Lakshay Sharma and Yi Fang are with the NYU Multimedia and Visual Computing Lab, NYU Tandon. Yi Fang and Anthony Tzes are with NYU Abu Dhabi. John-Ross Rizzo is with NYU Langone Medical Center. Chen Feng is with NYU Tandon. Wenjun Gui, Bingyu Li, Shuaihang Yuan and John-Ross Rizzo contributed equally to this paper. Yi Fang is the corresponding author.

Abstract -- Spatial cognition, as gained through the sense of vision, is one of the most important capabilities of human beings. However, for the visually impaired (VI), the lack of this perceptual capability poses great challenges in daily life. We have therefore designed Point-to-Tell-and-Touch, a wearable system with an ergonomic human-machine interface that assists the VI with active environmental exploration, with a particular focus on spatial intelligence and navigation to objects of interest in an unfamiliar environment. Our key idea is to link visual signals, as decoded synthetically, to the VI's proprioception for more intelligible guidance, in addition to vision-to-audio assistance: the finger pose indicated by pointing is used as a "proprioceptive laser pointer" to target an object in that line of sight. The system consists of two features, Point-to-Tell and Point-to-Touch, which can work independently or cooperatively. The Point-to-Tell feature combines a camera with a novel one-stage neural network tailored for blind-centered object detection and recognition, and a headphone that tells the VI the semantic label of, and the distance to, the pointed object. Point-to-Touch, the second feature, leverages a vibrating wristband to create a haptic feedback tool that supplements the initial vectorial guidance provided by the first stage (hand pose giving the direction and the distance giving the extent, offered through audio cues). Both platform features utilize proprioception, or joint position sense: through hand pose, the VI end user knows where he or she is pointing relative to their egocentric coordinate system, and we build spatial intelligence on this foundation. Our successful indoor experiments demonstrate that the proposed system is effective and reliable in helping the VI gain spatial cognition and explore the world in a more intuitive way.

I. INTRODUCTION

It is estimated that more than 253 million people suffer from visual impairment [1], which results in a host of social, emotional, and health-related problems, as increasingly limited vision leads to increasingly difficult mobility, resulting in falls, injuries, and comorbidities [2]-[5]. These decrements in vision and mobility are correlated with significant unemployment and often lead to severe compromises in quality of life [2]. Assistive low-vision platforms are designed to help the VI explore the surrounding 3D environment. Despite the active research in existing assistive low-vision platforms and some encouraging outcomes in academic settings, the uptake of current platforms by the visually impaired community has been low [6]-[8]. This is because 1) existing platforms [9]-[20] use the spatially informative senses of hearing and touch to assist the VI (e.g., vision-to-audio and vision-to-touch), which have limited transcoding ability to explicitly describe visuospatial information at the position level [21]-[23].
However, for dynamic and time-constrained navigation and interactive exploration, an effective and efficient platform needs to provide visuospatial information intuitively, giving users an improved spatial awareness of their surroundings as explicitly as possible. 2) Platforms that use spatially informative senses (e.g., vision-to-audio) to assist the VI are one-way [24], [25], so VI users can only passively receive translated information. Such one-way platforms consequently prevent VI users from interactively exploring 3D space to actively acquire information around a user-specified location [26]. Therefore, an effective and efficient design should provide a two-way platform that augments spatial cognition for more intelligible guidance and enables VI users to interactively explore and discover 3D space.

Fig. 1: Two types of VI assistance of the proposed system.

To develop an effective two-way platform that enables VI users to interactively explore and discover 3D space, we propose an assistive low-vision platform, named Point-to-Tell-and-Touch, that uses the VI's proprioception, by which people can intuitively perceive the position and movement of body parts without sight [27]-[29], as a second channel besides the spatially informative senses (i.e., hearing and touch), in order to map the surrounding scene objects to the VI user's egocentric coordinate system in an intuitive manner. The system is composed of two modules, Point-to-Tell and Point-to-Touch. As Figure 1 shows, while the VI scan the environment with their finger, used as a "proprioceptive laser pointer" to explicitly target an object in that line of sight, Point-to-Tell announces the distance and category of the pointed object via voice output. Our Point-to-Tell system directly outputs the location and classification of the pointed object and the location of the fingertip. The VI can activate Point-to-Touch by drawing a circle with their finger. During their movement toward the object, the system tracks the located object with a Kalman filter [30], and haptic feedback is given to the VI based on the deviation from an idealized trajectory from the initial point to the destination. There are two motor control-based feedback loops, as shown in Figure 2: Point-to-Tell and Point-to-Touch. Our platform offers the novel opportunity to close the open loop that vision loss creates by connecting existing sensory channels with computer vision-based spatial intelligence.

Fig. 2: The pipeline of motor control-based feedback loops.
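As a simplified illustration of how these two loops could be driven in software, the sketch below shows one possible orchestration of the pointing and circle gestures; all names here (Detection, detect_pointing_and_object, is_circle_gesture, run_once) are hypothetical placeholders for illustration only, not the authors' implementation.

```python
# A minimal, self-contained sketch of the two interaction modes described above.
# Detection and gesture recognition are stubbed out; in the real system they are
# produced by the Point-to-Tell net and the camera/gesture pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    label: str          # semantic category of the pointed object
    distance_m: float   # metric distance from the stereo camera
    bbox: tuple         # (x, y, w, h) of the pointed object in the image

def detect_pointing_and_object(frame) -> Optional[Detection]:
    """Placeholder for the Point-to-Tell net (Sec. II-B)."""
    return Detection("monitor", 1.2, (100, 80, 60, 40))

def is_circle_gesture(frames) -> bool:
    """Placeholder for the circle-gesture trigger of Point-to-Touch."""
    return True

def run_once(frame, announce, vibrate):
    det = detect_pointing_and_object(frame)
    if det is None:
        return  # no pointing gesture in view
    # Point-to-Tell loop: speak category and distance of the pointed object.
    announce(f"{det.label}, {det.distance_m:.1f} meters away")
    # Point-to-Touch loop: a circle gesture engages tracked haptic guidance (Sec. II-C).
    if is_circle_gesture([frame]):
        vibrate("start guidance toward " + det.label)

run_once(frame=None, announce=print, vibrate=print)
```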
The key contributions of our work are as follows:
- We introduce a natural assistive interface for the VI: the VI only need gestures to interact with the system and obtain information about their surroundings via audio and haptic feedback, which extends the VI's biological proprioception range to their immediate surroundings and enables more intuitive assistance.
- We design a novel single-shot object detection neural network that simultaneously detects the fingertip position and the object being pointed at, helping the VI proprioceptively understand and pinpoint the objects around them.
- We implement the system on common hardware (an NVIDIA Jetson TX2 box, a ZED camera, an Arduino controller board and a headset) and conduct experiments with groups of VI (performed by volunteers) in real environments to demonstrate its effectiveness and robustness.

Fig. 3: The overview of the system structure.

II. POINT-TO-TELL-AND-TOUCH

A. Hardware Devices

Figure 3 shows how the hardware devices support the system. The ZED Mini stereo camera mimics the way humans perceive the world. Frames from the left camera serve as the input to the Point-to-Tell net for detecting pointed objects, and the distance between an object and the left camera can be computed from the stereo pair. The category of the pointed object from Point-to-Tell and the distance obtained from the stereo camera are returned to the VI through the headphone. The wristband assists Point-to-Touch in navigating the VI to the target object: it contains four vibration motors representing four directions (up, down, left and right), and the corresponding one vibrates to steer the VI back toward the correct direction when they are off-route. The NVIDIA Jetson TX2 microprocessor (256 CUDA cores, dual Denver and quad ARM CPU cores, and 8 GB of 128-bit LPDDR4 memory) and a MAXOAK power bank support the whole system. When the VI are equipped with the system to actively explore their surroundings, they perform the "pointing" gesture to activate Point-to-Tell or circle the object of interest to activate Point-to-Touch.

B. Point-to-Tell

1) Network Architecture: We propose a novel end-to-end neural network that predicts the location of the fingertip together with the bounding box and category of the pointed object. Figure 4 shows the structure of this network. Our architecture is inspired by SSD [31], an end-to-end object detection and classification network. Similar to SSD, we adopt VGG16 [32] as our base network, inheriting its first 13 layers and converting its last two fully connected layers into convolutional layers; several additional convolutional feature layers are then appended to the truncated base network. The resulting Point-to-Tell net is a 23-layer fully convolutional neural network whose prediction feature maps are taken from layers conv6-2, conv7-2, conv8-2 and conv9-2. The input to the Point-to-Tell net is the modified image together with its ground truth. Once the feature maps are generated, a 3x3 convolution kernel produces the predictions: for each cell in a feature map, the network predicts the offsets of the object's bounding box relative to the default boxes of that cell, per-class confidence scores for the object, the offset of the fingertip coordinate relative to the center of the cell, and the probability that a fingertip is present.

Fig. 4: Dataflow of Point-to-Tell-and-Touch. The frame captured by the left camera serves as the input to the Point-to-Tell net; the detection and classification result of the Point-to-Tell net is conveyed both to the VI via the headphone and to Point-to-Touch as its input.
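For illustration, the following is a minimal sketch of such a per-cell prediction head (PyTorch assumed; channel counts, box counts and class counts are placeholders rather than the authors' configuration): for B default boxes per cell and C object classes, each cell regresses B*4 box offsets and B*C class scores as in SSD, plus a 2-D fingertip offset and a fingertip presence score.

```python
# Illustrative per-cell prediction head in the spirit of the Point-to-Tell net
# (not the authors' code). Applied to a feature map such as conv6-2 ... conv9-2.
import torch
import torch.nn as nn

class PointToTellHead(nn.Module):
    def __init__(self, in_channels: int, num_boxes: int, num_classes: int):
        super().__init__()
        # SSD-style object branches: box offsets and per-class scores per default box.
        self.loc = nn.Conv2d(in_channels, num_boxes * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_boxes * num_classes, kernel_size=3, padding=1)
        # Added fingertip branches: offset from the cell center and presence score.
        self.finger_loc = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1)
        self.finger_conf = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        # feat: (N, in_channels, H, W) multi-scale feature map
        return {
            "box_offsets": self.loc(feat),          # (N, B*4, H, W)
            "class_scores": self.cls(feat),         # (N, B*C, H, W)
            "finger_offset": self.finger_loc(feat), # (N, 2, H, W)
            "finger_score": self.finger_conf(feat), # (N, 1, H, W)
        }

# Example: a 10x10 feature map with 6 default boxes per cell and 21 classes.
head = PointToTellHead(in_channels=512, num_boxes=6, num_classes=21)
out = head(torch.randn(1, 512, 10, 10))
```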
2) Multi-scale Feature Layers: To detect objects of different scales, we follow a strategy similar to [31] and make predictions on multi-scale feature maps. This architecture not only improves detection precision for objects of different scales, but also ensures that both the fingertip and the desired object can be perceived by the same map cell. The smaller a feature map is, the larger its receptive field on the original image, so even when the desired object occupies a large area of the frame, we can still ensure that the object and the finger are mapped to the same feature map cell. Each feature map is divided into map cells according to its size, and a set of default boxes is generated in each cell. When training the network, for every map cell on each feature map we evaluate the fingertip position and the object bounding box at the same time: the fingertip position is checked to determine which map cell contains the fingertip, and the IoU between the ground-truth box and default boxes of different aspect ratios is used to match the ground-truth box.

3) Loss Function: During training we optimize the following loss function. We adopt the loss from SSD [31] for object classification and location regression and add two more terms for fingertip detection and fingertip location regression:

Loss(x, y, p, g) = L_conf(x, c_1) + \alpha L_loc(x, p_1, g_1) + \beta L_loc(y, p_2, g_2) + L_conf(y, c_2)    (1)

The loss evaluates the classification loss (L_conf) and location loss (L_loc) of the pointed object (x) together with the classification and location losses of the fingertip (y), and the weighting coefficients \alpha and \beta are both set to 1.

4) Prediction Selection: For each frame recorded by the ZED camera, only one fingertip and one pointed object are expected in the frame, and their locations and category information are predicted by the Point-to-Tell net. Since predictions are produced for every map cell of all the feature maps, the final result is selected with the following rules. Based on the confidence of fingertip presence, a threshold is set and predictions with confidence below it are removed. Among the remaining predictions, those with the highest object-class confidence are kept as candidates for the pointed object. The depth of each candidate's bounding box is then computed, and the closest object, together with the fingertip location predicted by the same map cell, is chosen as the final detection result.
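A small sketch of these selection rules is given below (our illustration, not the authors' code); since the exact trade-off between the class-confidence and depth rules is not fully specified, the top_k cut-off used to form the candidate set is our own assumption.

```python
# Illustrative post-processing of per-cell predictions into a single detection.
from dataclasses import dataclass

@dataclass
class Candidate:
    finger_conf: float  # confidence that this cell contains the fingertip
    class_conf: float   # highest per-class confidence for the pointed object
    bbox: tuple         # (x, y, w, h) of the predicted object box
    fingertip: tuple    # (x, y) fingertip location predicted by the same cell
    depth_m: float      # stereo depth of the box (meters)

def select_prediction(candidates, finger_threshold=0.5, top_k=5):
    # Rule 1: discard cells whose fingertip confidence is below the threshold.
    kept = [c for c in candidates if c.finger_conf >= finger_threshold]
    if not kept:
        return None
    # Rule 2: keep the highest object-class confidences as candidate pointed objects.
    kept.sort(key=lambda c: c.class_conf, reverse=True)
    kept = kept[:top_k]
    # Rule 3: pick the closest object; its map cell also supplies the fingertip location.
    return min(kept, key=lambda c: c.depth_m)

best = select_prediction([
    Candidate(0.9, 0.8, (10, 10, 40, 30), (25, 20), depth_m=1.4),
    Candidate(0.7, 0.9, (60, 50, 30, 30), (70, 60), depth_m=2.1),
])
```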
C. Point-to-Touch

Fig. 5: Illustration of how the UKF tracks the object and the VI's fingertip. Here the red bounding box represents the position of the monitor, while the green one represents the VI's fingertip.

The Point-to-Touch system guides visually impaired people simply by letting them follow a vibration signal. The accuracy and reliability of this guidance mechanism rest on the accuracy of the location estimates of the object and of the VI's fingertip. We therefore reduce the guidance mechanism to a tracking problem and solve it with an Unscented Kalman Filter (UKF).

1) Object Motion Prediction: We first define the state space and the relevant notation of our Point-to-Touch system. We use the camera coordinate system to represent the VI's fingertip location f_p = (x_fingertip, y_fingertip) and the locations of objects; since the finger is pointed at the object, the two are roughly aligned along the Z-axis. For object motion prediction, we build a dynamical-system model that predicts where the object will appear in the next frame. The prediction is made along both the X-axis and the Y-axis, and the principle is the same for both, so we describe the X-axis. We define the state of the object as x_k = (p_k, v_k), where p_k denotes the position and v_k the velocity. Under the framework of the vanilla Kalman filter, the state transition model is x_k = A x_{k-1} + w_{k-1}, where A is the state transition matrix and w_{k-1} the system noise, and the observation model is y_k = C x_k + v_k, where C is the measurement matrix and v_k the measurement noise. The new state of the tracked object is obtained as follows:

x_k = A x_{k-1} + w_{k-1},    y_k = C x_k + v_k    (2)

However, since we track the target object and the fingertip while the VI is moving, the noise of the actual environment must be taken into account, so we adopt the UKF. The UKF treats the system from a rather different perspective, but it maintains the same "prediction + update" principle as the vanilla KF. It relies on the unscented transform [33], which handles the system nonlinearity directly and performs much better.

2) Object Motion Updating: We obtain the state transition matrix A and the measurement matrix C with the same method as in [33], and modify the prediction model of the vanilla KF using the unscented transform defined in [33], from which we obtain the predicted state of the object \hat{x}_k and the predicted measurement \hat{y}_k. Figure 5 illustrates such a coarse prediction of the locations of the object and the VI's fingertip in the next frame. The object motion is then updated with the estimation error: since we have both the real measurement y_k and the estimated measurement \hat{y}_k, the estimation error y_k - \hat{y}_k is used to refine the estimate of x:

x_k | y_k = \hat{x}_k + K_k (y_k - \hat{y}_k)    (3)

As defined in [33], K_k is the Kalman gain of the UKF, computed from sigma points that are sampled according to the state and its covariance and combined with the sigma-point weights w_k. The notation x_k | y_k denotes the estimate of x_k given y_k. After this update step we obtain the final estimate of x_k. As shown in Figure 5, the update step corrects the previously estimated locations of the object and of the VI's fingertip to more precise positions.
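As a concrete illustration of the constant-velocity model in Eq. (2) and the predict/update cycle of Eq. (3), the following sketch uses the filterpy library's UnscentedKalmanFilter; the frame rate, noise covariances and initial state are our own assumptions, not values reported in the paper, and one such filter would be run per tracked point (the object and the fingertip each get their own).

```python
# Minimal UKF tracking sketch for one 2-D image point with state x = [px, vx, py, vy]
# and measurements z = [px, py], in the spirit of Sec. II-C (illustrative only).
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

dt = 1.0 / 30.0  # assumed camera frame period (30 fps)

def fx(x, dt):
    """Constant-velocity state transition along X and Y."""
    A = np.array([[1.0, dt,  0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    return A @ x

def hx(x):
    """Measurement model: only the pixel position (px, py) is observed."""
    return np.array([x[0], x[2]])

points = MerweScaledSigmaPoints(n=4, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=4, dim_z=2, dt=dt, hx=hx, fx=fx, points=points)
ukf.x = np.array([320.0, 0.0, 240.0, 0.0])  # initial guess: image centre, at rest
ukf.P *= 50.0                               # initial state uncertainty
ukf.R = np.diag([5.0, 5.0])                 # detector noise in pixels (assumed)
ukf.Q = np.eye(4) * 0.1                     # process noise (assumed)

def track(detections):
    """Smooth a sequence of (px, py) detections, yielding filtered positions."""
    for z in detections:
        ukf.predict()                          # propagate the state to the next frame
        ukf.update(np.array(z, dtype=float))   # correct with the measured position
        yield float(ukf.x[0]), float(ukf.x[2])

# Example: three noisy detections of the object centre.
print(list(track([(300.0, 250.0), (305.0, 248.0), (311.0, 246.0)])))
```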
3) User-feedback and Guidance: We provide guidance based on the deviation between the fingertip and the target object during the movement. Using the unscented Kalman filter, we obtain for each frame both the estimated object coordinate obj_p = (c_x, c_y) and the VI's fingertip coordinate f_p = (x_f, y_f) in the camera coordinate system. The goal of Point-to-Touch is to guide the VI to reach the desired object or destination: once the two coordinates are close enough to almost overlap, the VI is considered to have arrived at the target location and reached the target object. Throughout the VI's movement, a corrective guidance signal is issued whenever they are led astray. By computing the two-tuple d = (x_f - c_x, y_f - c_y), the signed coordinate offsets between the fingertip and the object in the left-right and up-down directions, the haptic wristband informs the VI through vibration to move in the opposite direction, guiding him or her closer to the target. Figure 5 illustrates a complete example of a user successfully reaching the desired location and carrying out the interaction under the guidance of Point-to-Touch.

III. EXPERIMENTS

As a functional proof of our system, we carried out two controlled experiments with a group of blindfolded people, one for each of the system's two features. To validate that our Point-to-Tell function could help visually impaired people efficiently gain spatial in
