Complex Event Detection in Large-Scale Video Data

Outline
- Introduction
- Standard pipeline
- MED with few exemplars
- A discriminative CNN representation for MED
- A new pooling method for MED

Introduction
Challenge 1: An event is usually characterized by a longer video clip.
- 10 years ago: constrained videos, e.g., news videos.
- Now: unconstrained videos. The length of the videos in the TRECVID MED dataset varies from one minute to one hour.

Challenge 2: Multimedia events are higher-level descriptions, e.g., "landing a fish".

Challenge 3: Huge intra-class variations, e.g., two very different videos (Video 1, Video 2) both showing "marriage proposal".

Standard Pipeline
Standard components in the CDR pipeline:

Phase | Process
Visual analysis | SIFT, Color SIFT (CSIFT), Transformed Color Histogram (TCH), Motion SIFT (MoSIFT), STIP, Dense Trajectory, CNN
Audio analysis | MFCC, Acoustic Unit Descriptors (AUDs)
Text analysis | OCR, ASR
High-level concept analysis | SIN 11 concepts, Object Bank

[Diagram: a video is fed into visual, audio, text, and high-level concept analysis, which produce the low-level feature vectors used by the CDR.]

MED with Few Exemplars
Motivation
- There are three tasks in MED:
  - EK 100 (100 positive exemplars per event)
  - EK 10 (10 positive exemplars per event)
  - EK 0 (no positive exemplars, only text descriptions)
- Solutions for event detection with few (i.e., 10) exemplars: knowledge adaptation and related exemplars.

Leveraging related videos
- A video related to "marriage proposal": a girl plays music, dances down a hallway in school, and asks a boy to prom.
- A video related to "marriage proposal": a large crowd cheers after a boy asks his girlfriend to go to prom with him with a bouquet of flowers and a huge sign.

Our solution
- Automatically assess the relatedness of each related video for event detection.

Experiment results
- Frames sampled from two video sequences marked by NIST as related exemplars for the event "birthday party".
- Frames sampled from two video sequences marked by NIST as related to the event "town hall meeting".

Take-home messages
- Exact positive training exemplars are difficult to obtain, but related samples are much easier to collect.
- Appropriately leveraging related samples helps event detection.
- The gain is more significant when the exact positive exemplars are few.
- There are many other settings where related samples are largely available. For details, refer to our paper: "How Related Exemplars Help Complex Event Detection in Web Videos?" Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan and Alexander Hauptmann. ICCV 2013.

A Discriminative CNN Representation for MED
Video analysis costs a lot
- Dense Trajectories and their enhanced version, improved Dense Trajectories (IDT), have dominated complex event detection, with superior performance over other features such as the motion feature STIP and the static appearance feature Dense SIFT. (Credits: Heng Wang)
- Even parallelized over 1,000 cores, it takes about one week to extract IDT features for the 200,000 videos (8,000 hours of footage) in the TRECVID MEDEval 14 collection.
- Because of this unaffordable computation cost (a cluster with 1,000 cores), it is extremely difficult for a smaller research group with limited computational resources to process large-scale MED datasets.
- It therefore becomes important to propose an efficient representation for complex event detection that requires only affordable computational resources, e.g., a single machine.

Turn to CNN?
- One instinctive idea is to utilize deep learning, especially Convolutional Neural Networks (CNNs), given their overwhelming accuracy in image analysis and their fast processing speed, achieved by leveraging the massive parallel processing power of GPUs.
- However, it has been reported that the event detection performance of CNN-based video representations was worse than that of improved Dense Trajectories in TRECVID MED 2013.

Technical problems of utilizing CNNs for MED
- First, CNNs require a large amount of labeled video data to train good models from scratch, while the TRECVID MED datasets have only 100 positive examples per event.
- Second, fine-tuning from ImageNet to video data requires changing the network structure, e.g., the convolutional pooling layer proposed in "Beyond Short Snippets: Deep Networks for Video Classification".
- Finally, average pooling over frames to generate the video representation is not effective for CNN features.

Average pooling of CNN frame features
CNNs with the standard approach (average pooling over frame-level features to generate the video representation) versus the winning solution of the TRECVID MED 2013 competition:

mAP (%) | MEDTest 13 | MEDTest 14
Improved Dense Trajectories | 34.0 | 27.6
CNN in CMU MED 2013 | 29.0 | N.A.
CNN from VGG-16 | 32.7 | 24.8

Video pooling on CNN descriptors
- Video pooling computes the video representation over the entire video by pooling all the descriptors from all the frames.
- For local descriptors such as HOG, HOF, and MBH in improved Dense Trajectories, the Fisher vector and the Vector of Locally Aggregated Descriptors (VLAD) are applied to generate the video representation.
- To our knowledge, this is the first work on video pooling of CNN descriptors; it broadens these encoding methods from local descriptors to CNN descriptors in video analysis.

Results (discriminative ability analysis on the training set of TRECVID MEDTest 14):

mAP (%) | fc6 | fc6_relu | fc7 | fc7_relu
Average pooling | 19.8 | 24.8 | 18.8 | 23.8
Fisher vector | 28.3 | 28.4 | 27.4 | 29.1
VLAD | 33.1 | 32.6 | 33.2 | 31.5
Table: Performance comparison (mAP in percentage) on MEDTest 14 100Ex.
[Figure: performance comparisons on MEDTest 13 and MEDTest 14, both 100Ex and 10Ex.]

Latent Concept Descriptors (LCD)
- Convolutional filters can be regarded as generalized linear classifiers on the underlying data patches, and each convolutional filter corresponds to a latent concept.
- Under this interpretation, a pool5 layer of size a × a × M can be converted into a² latent concept descriptors with M dimensions each; every latent concept descriptor collects the responses of the M filters at one specific pooling location.

LCD encoding results on pool5:

mAP (%) | 100Ex | 10Ex
Average pooling | 31.2 | 18.8
LCD + VLAD | 38.2 | 25.0
LCD + VLAD + SPP | 40.3 | 25.6
Table 1: Performance comparisons for pool5 on MEDTest 13.

mAP (%) | 100Ex | 10Ex
Average pooling | 24.6 | 15.3
LCD + VLAD | 33.9 | 22.8
LCD + VLAD + SPP | 35.7 | 23.2
Table 2: Performance comparisons for pool5 on MEDTest 14.

Representation compression
- We utilize Product Quantization (PQ) to compress the video representation.
- Without PQ compression, the features for the 200,000 videos would occupy 48.8 GB of storage, which severely compromises execution time due to the I/O cost.
- With PQ, the features of the whole collection fit in less than 1 GB and can be read from a normal SSD in a few seconds.
- Fast predictions can be made via an efficient look-up table.

Comparisons with the previous best feature (IDT):

mAP (%) | Ours | IDT | Relative improvement
MEDTest 13 100Ex | 44.6 | 34.0 | 31.2%
MEDTest 13 10Ex | 29.8 | 18.0 | 65.6%
MEDTest 14 100Ex | 36.8 | 27.6 | 33.3%
MEDTest 14 10Ex | 24.5 | 13.9 | 76.3%

Notes
- The proposed representation is extensible: performance can be further improved by better CNN models, appropriate fine-tuning techniques, or better descriptor encodings.
- The representation is generic for video analysis and not limited to multimedia event detection; we tested on the MED datasets because they are the largest available video analysis datasets.
- The representation is simple yet very effective, and it is easy to generate with the Caffe/cxxnet/cuda-convnet toolkits (for the CNN features) and vlfeat/Yael (for the encoding).

Take-home messages
- Utilize VLAD/Fisher vector encoding to generate video representations from frame-level CNN features: simple but effective.
- Formulate the intermediate convolutional features into latent concept descriptors (LCD).
- Apply Product Quantization to compress the generated CNN representation.
- For details, please refer to our paper: "A Discriminative CNN Video Representation for Event Detection." Zhongwen Xu, Yi Yang and Alexander G. Hauptmann. CVPR 2015.

A New Pooling Method for MED
Motivations
- Only some shots in a long video are relevant to the event, while others are less relevant or even useless.
- Representative approaches (average pooling / max pooling) largely ignore this difference.

Our solution
- Define a novel notion of semantic saliency that evaluates the relevance of each shot to the event of interest, and re-order the shots according to their semantic saliency.
- Propose a new isotonic regularizer that respects the order information, leading to a nearly-isotonic SVM with more discriminative power.
- Develop an efficient implementation using the proximal gradient algorithm, enhanced with newly proven, exact closed-form proximal steps.
- Extensive experiments on three real-world large-scale video datasets confirm the effectiveness of the proposed approach.

Re-ordering according to semantic saliency — our method:
1. Each input video is divided into multiple shots, and each event has a short textual description.
2. A CNN is used to extract features.
3. Semantic concept names and a skip-gram model are used to derive a probability vector and a relevance vector, which are combined to yield the new semantic saliency.
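The video-pooling step the deck contrasts with average pooling can be sketched in a few lines of NumPy. This is a minimal toy illustration, not the paper's implementation (which uses vlfeat/Yael): random vectors stand in for frame-level fc6/fc7 activations, and a random codebook stands in for the k-means centers that would be learned on training descriptors.

```python
import numpy as np

def average_pool(frame_feats):
    """Baseline: mean of L2-normalized frame-level CNN features."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return feats.mean(axis=0)

def vlad_encode(frame_feats, centers):
    """VLAD: accumulate each descriptor's residual to its nearest center,
    then power- and L2-normalize the flattened result."""
    k, d = centers.shape
    # assign every frame descriptor to its nearest codebook center
    dists = ((frame_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i, c in enumerate(assign):
        vlad[c] += frame_feats[i] - centers[c]
    vlad = vlad.ravel()
    # signed square-root (power) normalization, then global L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

rng = np.random.default_rng(0)
frames = rng.standard_normal((30, 8))   # 30 frames, 8-D toy "fc7" features
centers = rng.standard_normal((4, 8))   # K=4 toy codebook (k-means in practice)
avg = average_pool(frames)              # one 8-D vector for the whole video
vlad = vlad_encode(frames, centers)     # one K*8-D vector for the whole video
```

Average pooling collapses all frames into a single mean vector, while VLAD keeps per-center residual statistics, which is the extra discriminative information behind the gap reported in the tables above.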
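The LCD construction described above is essentially a reshape. Assuming a Caffe-style pool5 tensor laid out as (M, a, a), the sketch below turns it into a² M-dimensional descriptors, one per spatial pooling location; the sizes M=256, a=6 (the CaffeNet pool5 shape) are used only as an example.

```python
import numpy as np

def latent_concept_descriptors(pool5):
    """Convert a pool5 map of shape (M, a, a) into a*a latent concept
    descriptors of dimension M: one response vector per pooling location."""
    m, a, a2 = pool5.shape
    assert a == a2, "expected a square spatial grid"
    return pool5.reshape(m, a * a).T   # shape (a*a, M)

rng = np.random.default_rng(1)
pool5 = rng.standard_normal((256, 6, 6))   # toy pool5 activations
lcd = latent_concept_descriptors(pool5)    # 36 descriptors, 256-D each
```

Each row of `lcd` can then be fed to the same VLAD/Fisher vector encoding as a frame-level descriptor, which is what the LCD + VLAD rows in Tables 1 and 2 measure.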
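The Product Quantization step used for compression can be sketched as follows. This is a self-contained toy version with a tiny k-means per subspace; a real system would use an optimized PQ library, and with 256 centroids per sub-quantizer each sub-code fits in one byte. Here k=8 and a 16-D vector keep the demo small: 4 sub-quantizers turn 64 bytes of float32 into 4 bytes of codes.

```python
import numpy as np

def pq_train(X, n_sub, k, iters=10, seed=0):
    """Train PQ codebooks: split vectors into n_sub sub-vectors and run a
    small k-means independently in each subspace."""
    rng = np.random.default_rng(seed)
    d_sub = X.shape[1] // n_sub
    books = []
    for s in range(n_sub):
        sub = X[:, s * d_sub:(s + 1) * d_sub]
        cent = sub[rng.choice(len(sub), k, replace=False)]  # init from data
        for _ in range(iters):
            assign = ((sub[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    cent[c] = pts.mean(0)
        books.append(cent)
    return books

def pq_encode(X, books):
    """Encode each vector as one centroid index per subspace."""
    d_sub = X.shape[1] // len(books)
    codes = np.empty((len(X), len(books)), dtype=np.uint8)
    for s, cent in enumerate(books):
        sub = X[:, s * d_sub:(s + 1) * d_sub]
        codes[:, s] = ((sub[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
    return codes

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 16)).astype(np.float32)  # toy video features
books = pq_train(X, n_sub=4, k=8)
codes = pq_encode(X, books)    # 4 bytes per video instead of 64
```

The look-up-table prediction mentioned in the deck follows the same split: the classifier weight vector is cut into the same subspaces, a table of partial dot products with each centroid is precomputed, and scoring a video reduces to summing table entries selected by its codes.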
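The semantic-saliency idea in the final section can be illustrated schematically. The sketch below is a hypothetical toy, not the paper's exact formulation: random vectors stand in for skip-gram embeddings of concept names and of the event description, Dirichlet samples stand in for per-shot concept probability vectors, and the combination rule (probabilities weighted by concept-to-event relevance) is one plausible reading of the slide.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def shot_saliency(shot_concept_probs, concept_vecs, event_vec):
    """Saliency of a shot: its concept probabilities weighted by each
    concept's skip-gram similarity to the event description."""
    rel = np.array([cosine(v, event_vec) for v in concept_vecs])
    return float(shot_concept_probs @ rel)

rng = np.random.default_rng(3)
concept_vecs = rng.standard_normal((5, 20))  # toy concept-name embeddings
event_vec = concept_vecs[2] + 0.1 * rng.standard_normal(20)  # event near concept 2
shots = rng.dirichlet(np.ones(5), size=4)    # 4 shots, concept probabilities
scores = np.array([shot_saliency(p, concept_vecs, event_vec) for p in shots])
order = np.argsort(-scores)                  # shots re-ordered by saliency
```

The re-ordered shots are what the nearly-isotonic SVM consumes: its regularizer penalizes classifier responses that violate this saliency ordering.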