Complex Event Detection in Large-Scale Video Data (lecture slides)

Outline
- Introduction
- Standard pipeline
- MED with few exemplars
- A discriminative CNN representation for MED
- A new pooling method for MED

Introduction
Challenge 1: An event is usually characterized by a longer video clip.
- 10 years ago: constrained videos, e.g., news videos
- Now: unconstrained videos

- The length of videos in the TRECVID MED dataset varies from one minute to one hour.
- The videos are unconstrained.

Introduction (Contd)
Challenge 2: Multimedia events are higher-level descriptions, e.g., "landing a fish".

Introduction (Contd)
Challenge 3: Huge intra-class variations.
[Figure: two very different videos (Video 1, Video 2) of the same event, "marriage proposal".]

Outline
- Introduction
- Standard pipeline
- MED with few exemplars
- A Discriminative CNN representation for MED
- A new pooling method for MED

Standard Components in CDR Pipeline
  Phase                        Process
  Visual Analysis              SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH),
                               Motion SIFT (MoSIFT), STIP, Dense Trajectory, CNN
  Audio Analysis               MFCC, Acoustic Unit Descriptors (AUDs)
  Text Analysis                OCR, ASR
  High-Level Concept Analysis  SIN 11 Concepts, Object Bank

[Diagram: a video feeds Visual Analysis, Audio Analysis, Text Analysis, and High-Level Concept Analysis, each producing low-level feature vectors for the CDR pipeline.]

Outline
- Introduction
- Standard pipeline
- MED with few exemplars
- A Discriminative CNN representation for MED
- A new pooling method for MED

Motivation
There are three tasks in MED:
- EK 100: 100 positive exemplars per event
- EK 10: 10 positive exemplars per event
- EK 0: no positive exemplars, only text descriptions
Solutions for event detection with few (i.e., 10) exemplars:
- Knowledge adaptation
- Related exemplars

Leveraging related videos
- A video related to "marriage proposal": a girl plays music, dances down a hallway in school, and asks a boy to prom.
- A video related to "marriage proposal": a large crowd cheers after a boy asks his girlfriend to go to prom with him with a bouquet of

flowers and a huge sign.

Our solution
Automatically assess the relatedness of each related video for event detection.

Experiment Results
[Figure: frames sampled from two video sequences marked by NIST as related exemplars for the event "birthday party".]

Experiment Results
[Figure: frames sampled from two video sequences marked by NIST as related to the event "town hall meeting".]

Take home messages
- Exact positive training exemplars are difficult to obtain, but related samples are easier to obtain.
- Appropriately leveraging related samples helps event detection.
- The improvement is more significant when exact positive exemplars are few.
- There are many other cases where related samples are widely available.
For details, refer to our paper: "How Related Exemplars Help Complex Event Detection in Web Videos?" Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan and Alexander Hauptmann. ICCV 2013.

Outline
- Introduction
- Standard CDR
- MED with few exemplars
- A Discriminative CNN representation for MED
- A new pooling method for MED

Video analysis costs a lot
- Dense Trajectories and its enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection, with superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT. (Credits: Heng Wang)
- Parallelized over 1,000 cores, it takes about one week to extract IDT features for the 200,000 videos (8,000 hours of footage) in the TRECVID MEDEval 14 collection.
- Because of this unaffordable computation cost (a cluster with 1,000 cores), it is extremely difficult for a smaller research group with limited computational resources to process large-scale MED datasets.
- It is therefore important to propose an efficient representation for complex event detection that needs only affordable computational resources, e.g., a single machine.

Turn to CNN?
One instinctive idea is to utilize deep learning, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and fast

processing speed, achieved by leveraging the massive parallel processing power of GPUs.

Turn to CNN? (Contd)
However, it has been reported that the event detection performance of CNN-based video representations was worse than improved Dense Trajectories in TRECVID MED 2013.

Technical problems of utilizing CNNs for MED
- First, CNNs require a large amount of labeled video data to train good models from scratch; TRECVID MED datasets have only 100 positive examples per event.
- Second, fine-tuning from ImageNet to video data requires changing the network structure, e.g., the convolutional pooling layer proposed in "Beyond Short Snippets: Deep Networks for Video Classification".
- Finally, average pooling over frames to generate the video representation is not effective for CNN features.

Average Pooling for Videos
- Winning solution for the TRECVID MED 2013 competition.
- Average pooling of CNN frame features: CNNs with the standard approach (average pooling) to generate video representations from frame-level features.

                               MEDTest 13   MEDTest 14
  Improved Dense Trajectories     34.0         27.6
  CNN in CMU MED 2013             29.0         N.A.
  CNN from VGG-16                 32.7         24.8

Video Pooling on CNN Descriptors
- Video pooling computes the video representation over the entire video by pooling all the descriptors from all the frames in a video.
- For local descriptors like HOG, HOF, and MBH in improved Dense Trajectories, the Fisher vector and the Vector of Locally Aggregated Descriptors (VLAD) are applied to generate the video representation.
- To our knowledge, this is the first work on video pooling of CNN descriptors; we broaden these encoding methods from local descriptors to CNN descriptors in video analysis.

Discriminative Ability Analysis on the Training Set of TRECVID MEDTest 14
                     fc6    fc6_relu   fc7    fc7_relu
  Average pooling    19.8     24.8     18.8     23.8
  Fisher vector      28.3     28.4     27.4     29.1
  VLAD               33.1     32.6     33.2     31.5
Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex.
Figure: Performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex.

Latent Concept Descriptors (LCD)
- Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.
- From this interpretation, a pool5 layer of size a × a × M can be converted into a² latent concept descriptors with M dimensions. Each latent concept descriptor represents the responses

from the M filters for a specific pooling location.

Latent Concept Descriptors (LCD) Encoding

LCD Results on pool5
                     100Ex   10Ex
  Average pooling     31.2   18.8
  LCD + VLAD          38.2   25.0
  LCD + VLAD + SPP    40.3   25.6
Table 1: Performance comparisons for pool5 on MEDTest 13.

                     100Ex   10Ex
  Average pooling     24.6   15.3
  LCD + VLAD          33.9   22.8
  LCD + VLAD + SPP    35.7   23.2
Table 2: Performance comparisons for pool5 on MEDTest 14.

Representation Compression
- We utilize Product Quantization (PQ) to compress the video representation.
- Without PQ compression, the storage size of the features for 200,000 videos would be 48.8 GB, which severely compromises the

execution time due to the I/O cost.
- With PQ, we can store the features of the whole collection in less than 1 GB, which a normal SSD can read in a few seconds.
- Fast predictions can be made with an efficient look-up table.

Comparisons with the previous best feature (IDT)
                      Ours   IDT    Relative improvement
  MEDTest 13 100Ex    44.6   34.0   31.2%
  MEDTest 13 10Ex     29.8   18.0   65.6%
  MEDTest 14 100Ex    36.8   27.6   33.3%
  MEDTest 14 10Ex     24.5   13.9   76.3%

Notes
- The proposed representation is extensible: performance can be further improved by better CNN models, appropriate fine-tuning techniques, or better descriptor encoding

techniques.
- The proposed representation is very generic for video analysis, not limited to multimedia event detection; we tested on MED datasets since they are the largest available video analysis datasets in the world.
- The proposed representation is simple yet very effective: it is easy to generate using the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and vlfeat/Yael (for the encoding).

Take Home Messages
- Utilize VLAD/FV encoding techniques to generate video representations from frame-level CNN features: simple but effective.
- Formulate the intermediate convolutional features into latent concept descriptors (LCD).
- Apply Product Quantization to compress the generated CNN representation.
For details, please refer to our paper: "A Discriminative CNN Video Representation for Event Detection." Zhongwen Xu, Yi Yang and Alexander G. Hauptmann. CVPR 2015.

Outline
- Introduction
- Standard CDR
- MED with few exemplars
- A Discriminative CNN representation for MED
- A new pooling method for MED

Motivations
- Only some shots in a long video are relevant to the event, while others are less relevant or even useless.
- Representative approaches (average pooling / max pooling) largely ignore this difference.

Our solution
- Define a novel notion of semantic saliency that evaluates the relevance of each shot to the event of interest, and re-order the shots according to their semantic saliency.
- Propose a new isotonic regularizer that respects the order information, leading to a nearly-isotonic SVM that enjoys more discriminative power.
- Develop an efficient implementation using the proximal gradient algorithm, enhanced with newly proven, exact closed-form proximal steps.
- Extensive experiments on three real-world large-scale video datasets confirm the effectiveness of the proposed approach.

Re-Ordering according to Semantic Saliency

Our Method
1. Each input video is divided into multiple shots, and each event has a short textual description.
2. CNN is used to extract features.
3. Semantic concept names and a skip-gram model are used to derive a probability vector and a relevance vector, which are combined to yield the new semantic saliency an
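The VLAD video pooling described above can be sketched in a few lines of numpy. This is a minimal illustration, not the implementation from the slides: in practice the codebook is learned with k-means over frame descriptors (e.g., via vlfeat/Yael), whereas here `centers` is random, and the random `frames` array stands in for per-frame fc6/fc7 activations.

```python
import numpy as np

def vlad_pool(frames, centers):
    """VLAD-pool frame-level descriptors against a codebook.

    frames:  (n_frames, d) descriptors (stand-in for CNN frame features)
    centers: (k, d) codebook centers (learned with k-means in practice)
    returns: (k * d,) power- and L2-normalized VLAD vector
    """
    # Assign each frame descriptor to its nearest center.
    dists = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)

    # Accumulate residuals (descriptor minus its center) per codebook cell.
    k, d = centers.shape
    vlad = np.zeros((k, d))
    for i, c in enumerate(assign):
        vlad[c] += frames[i] - centers[c]

    vlad = vlad.ravel()
    # Signed square-root (power) normalization, then global L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 64))   # stand-in for per-frame CNN features
centers = rng.normal(size=(8, 64))    # stand-in codebook (k = 8)
v = vlad_pool(frames, centers)
print(v.shape)                        # (512,)
```

Unlike average pooling, the resulting k·d vector keeps first-order residual statistics per cell, which is what gives the mAP gains in the tables above.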
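Forming latent concept descriptors from a pool5 map is essentially a transpose and reshape. A minimal sketch, assuming a VGG-16-style pool5 map of M = 512 filters on an a × a = 7 × 7 grid; the function name is mine.

```python
import numpy as np

def latent_concept_descriptors(pool5):
    """Turn a pool5 map of shape (M, a, a) into a*a descriptors of dim M.

    Each spatial location collects the responses of all M filters,
    i.e. one latent concept descriptor per pooling location.
    """
    M, a, b = pool5.shape
    assert a == b, "expects a square feature map"
    # (M, a, a) -> (a, a, M) -> (a*a, M)
    return pool5.transpose(1, 2, 0).reshape(a * a, M)

# VGG-16 pool5 on a 224x224 frame gives 512 filters over a 7x7 grid.
frame_pool5 = np.random.default_rng(1).normal(size=(512, 7, 7))
lcd = latent_concept_descriptors(frame_pool5)
print(lcd.shape)  # (49, 512)
```

These a² descriptors per frame are then encoded exactly like any local descriptor, e.g., with the VLAD pooling above.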
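The Product Quantization compression step can be sketched as follows. This is an illustrative sketch, not the paper's code: real sub-codebooks are learned with k-means on each sub-vector, while here they are random, and `pq_encode`/`pq_decode` are hypothetical names.

```python
import numpy as np

def pq_encode(X, codebooks):
    """Encode rows of X with product quantization.

    X:         (n, d) vectors
    codebooks: (m, 256, d // m), one 256-entry codebook per sub-vector
    returns:   (n, m) uint8 codes -- d float32 values shrink to m bytes
    """
    m, k, ds = codebooks.shape
    codes = np.empty((len(X), m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        # Nearest codeword per sub-vector, stored as one byte.
        d2 = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct approximate vectors from PQ codes."""
    parts = [codebooks[j][codes[:, j]] for j in range(codes.shape[1])]
    return np.hstack(parts)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 64))
# Toy codebooks: in practice each is learned by k-means on its sub-vectors.
codebooks = rng.normal(size=(8, 256, 8))
codes = pq_encode(X, codebooks)
X_hat = pq_decode(codes, codebooks)
print(codes.shape, codes.dtype)  # (100, 8) uint8
```

The fast predictions mentioned in the slides come from precomputing, per query, a table of distances to all 256 codewords of each sub-codebook, so scoring a stored vector is m table look-ups instead of a full d-dimensional product.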
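The semantic-saliency scoring in the last section reduces to a dot product followed by a sort. A toy sketch with made-up concept names and numbers: in the paper, the relevance vector comes from skip-gram similarities between concept names and the event description, and the re-ordered shots feed the nearly-isotonic SVM.

```python
import numpy as np

def semantic_saliency_order(shot_probs, concept_relevance):
    """Score and re-order shots by semantic saliency.

    shot_probs:        (n_shots, n_concepts) concept probabilities per shot
    concept_relevance: (n_concepts,) relevance of each concept to the event
    returns: saliency scores and shot indices sorted most-relevant first
    """
    saliency = shot_probs @ concept_relevance
    order = np.argsort(-saliency)  # descending saliency
    return saliency, order

# Toy example: 4 shots, 3 hypothetical concepts ("cake", "candles", "car")
# for a "birthday party" event; all numbers are made up.
shot_probs = np.array([
    [0.1, 0.0, 0.9],   # mostly "car"  -> low relevance to the event
    [0.8, 0.7, 0.0],   # cake + candles -> high relevance
    [0.3, 0.2, 0.1],
    [0.0, 0.9, 0.1],
])
relevance = np.array([0.9, 0.8, 0.1])  # "cake", "candles", "car"
scores, order = semantic_saliency_order(shot_probs, relevance)
print(order[0])  # shot 1 is ranked as the most event-relevant
```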
