Integrating senses: How AI is learning to see, hear, and interact
Roland Memisevic
Senior Director of Engineering at Qualcomm AI Research
Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others
September 24, 2024
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Agenda
• Key concept: streaming architecture
• Importance of datasets for end-to-end training
• Efficient human-AI interaction and video-based reasoning
• Improving streaming video LLMs using auxiliary tasks
• Q&A
Generative AI capabilities continue to increase
MODALITY AND USE CASE | CAPABILITY AND KPI
• Longer context window: Allows in-depth conversations
• Voice UI: Voice is a natural and intuitive interface for conversation
• Large multimodal models: Utilizing more sensing input modalities to better understand the world
• Personalization: Fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
• Higher resolution: Process higher fidelity images for better accuracy
• Video & 3D: Generating content for a richer and more realistic experience
• Agents: Execute multi-step tasks with reasoning autonomously to achieve a goal
LoRA: low-rank adaptation
Full-stack AI optimization for LVMs
• Runs completely on the device
• Significantly reduces runtime latency and power consumption
• Continuously improves the Qualcomm AI Stack
Designing an efficient diffusion model through knowledge distillation for high accuracy
• Knowledge distillation for pruning and removing attention blocks, resulting in an accurate model with improved performance and power efficiency
Qualcomm AI Engine Direct for improved performance and minimized memory spillage
• AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor
LVM: Language vision model
Hybrid AI
Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences
• Central cloud: Ease of development & deployment | Training | Very large models | Aggregation | Absolute performance
• Edge cloud (on-prem or nearby): Immediacy | Reliability | Personalization | Privacy | Security | Fine-tuning | Aggregation
• On device: Immediacy | Reliability | Personalization | Privacy | Security | Cost | Energy
To scale, the center of gravity of AI processing is moving to the edge
World's first large multimodal model (LMM) on an Android phone
LLMs can now see
• 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
• Multi-turn intuitive conversations about an image at a responsive token rate
• Full-stack AI optimization to achieve high performance at low power
• Enhanced privacy, reliability, personalization, and cost with on-device processing
LLM: Large Language Model; LLaVA: Large Language and Vision Assistant
Goal: Training AI models to see and interact with humans
SMART HOME | MOBILE | ROBOTICS
Visually-grounded LLM
[Diagram labels: Frontend, Vision, Action recognition, Orchestrator, LLM, TTS]
Situated vision-language models
• Process a live video stream in real time and dynamically interact with users
• Determine what to say and when to say it
• Enable the path to humanoids
Open-ended, asynchronous interaction with situated agents is an open challenge
• Limited to turn-based interactions about offline documents or images
• Limited to capturing momentary snapshots of reality in a VQA-style dialogue
Researching visually-grounded LLMs with the ability to reason and interact with the environment
"What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions" (2024); "OpenEQA: Embodied Question Answering in the Era of Foundation Models" (2024)
VQA: visual question answering
Neural networks have replaced increasingly complex computational pipelines
• 2010, SPEECH TO TEXT: Audio → Pipeline → Text becomes Audio → Neural network → Text
• 2012, OBJECT RECOGNITION: Pixels → Pipeline → Objects becomes Pixels → Neural network → Objects
• 2014, LANGUAGE TRANSLATION: English → Pipeline → French becomes English → Neural network → French
End-to-end backprop for agents
INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM
Key concept: Multi-modal streaming architecture
INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM (TRAINED END-TO-END)
EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS
• An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
• Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge
End-to-end learning requires a multi-modal streaming architecture
EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS
CONTEXT WINDOW: FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN
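The context window on this slide, where frames and language tokens share one sequence, can be illustrated with a small interleaving routine. This is only a sketch under stated assumptions, not the architecture's actual data pipeline: the `interleave` and `layout` helpers, the `("FRAME", i)` / `("TOKEN", s)` tuples, and the fixed tokens-per-frame ratio are all hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple

# A slot is either ("FRAME", frame_index) or ("TOKEN", token_string).
Slot = Tuple[str, object]


def interleave(num_frames: int, tokens: Iterable[str], tokens_per_frame: int) -> List[Slot]:
    """Build a context window in which each frame is followed by up to
    `tokens_per_frame` language tokens (a hypothetical fixed ratio)."""
    token_iter = iter(tokens)
    window: List[Slot] = []
    for frame_index in range(num_frames):
        window.append(("FRAME", frame_index))
        for _ in range(tokens_per_frame):
            token = next(token_iter, None)
            if token is None:
                break  # tokens exhausted; remaining frames stand alone
            window.append(("TOKEN", token))
    return window


def layout(window: List[Slot]) -> str:
    """Render only the slot kinds, mirroring the labels used on the slide."""
    return " ".join(kind for kind, _ in window)
```

With three frames, eight tokens, and three tokens per frame, `layout` reproduces the pattern shown on the slide. In a real streaming setting the ratio would not be fixed, which is exactly the frame-rate versus token-rate dependency the talk lists as a challenge.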
• Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common
• There are multiple ways to combine visual information with language model tokens, e.g.:
  • Cross-attention (e.g., Flamingo)
  • Dedicated vision tokens (e.g., LLaVA)
…good for applications like captioning and visual question answering
However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input
• Challenges:
  • Freely interleaved vision frames and language tokens
  • Dependencies between vision frame rate and token rate
  • Training data that allows a model to learn what to say and when
• Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, which I will present in the next slides
"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al., 2022; "Visual Instruction Tuning", Liu et al., 2023
Importance of datasets for end-to-end training
Datasets for end-to-end training of visual assistants
Key requirement for end-to-end training: aligned video feed (frames) + assistant's comments (tokens)
• "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al., 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
• "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al., 2023: 1st-person videos showing preparation of cupcakes
• "Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al., 2024: 3rd-person videos showing fitness exercises and their corrections
FIT-Coach benchmark and dataset
Fitness questions dataset
• 148 exercises
• 300k short-clip videos
• 470+ hours
• 1900 unique participants
• 1.1M+ high-level question-answer pairs
• 400k+ fine-grained question-answer pairs
Fitness feedback dataset
• 9+ hours of fitness coaching sessions
• 148 exercise sessions
• ~3.5-minute-long sessions with 5 to 6 exercises
• 21 unique participants
A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction
Aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain
"Live Fitness Coaching as a Testbed for Situated Interaction", Panchal, Bhattacharyya, et al., 2024
Fitness assistant dataset and benchmark
• Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of ~5-10 seconds each)
• Long-range videos showing the user exercising, along with aligned comments by the coach (~200 sessions across 5-6 exercises each)

 | SHORT CLIPS, Train | SHORT CLIPS, Test | LONG-RANGE, Train | LONG-RANGE, Test†
Number of videos | 290,775 | 16,429 | 153 | 69
Unique Participants | 1,800+ | 100 | 21 | 7
Average Duration (s) | 5.6±1.1 | 5.6±1.2 | 213.4±3.1 | 213.7±3.3
Exercises per Video | 1 | 1 | 5-6 | 5-6
Total Number of Exercises | 148 | 148 | 23 | 23
Total Classes | 1866 | 1690 | - | -
Fitness Questions
Total High-level Questions | 1,193,056 | 78,390 | - | -
Total Fine-grained Questions | 404,082 | 80,694 | - | -
Fitness Feedbacks
Average Feedbacks per Exercise | 2.0±10.1 | 2.4±6.9 | 5.0±1.3 | 5.0±1.2
Average Silence Period (s)†† | n/a | n/a | 5.2±1.4 | 5.3±1.2
Average Feedback Length (words) | 9.0±6.1 | 9.1±5.0 | 6.3±3.8 | 6.6±4.0
[Figures: Long fitness sessions dataset; Short fitness clips dataset]
Our dataset meets all the needs of interactive AI assistants

DATASET | DOMAIN | HUMAN ACTIONS | INTERACTIVE | MISTAKES | CORRECTIVE FEEDBACKS | DOMAIN EXPERTISE | LENGTH (hours)
Action Recognition Datasets
NTU RGB+D | Fitness | √ | x | x | x | √ |
FineGym | Fitness | √ | x | x | x | √ | 708
Procedural Activity Datasets
YouCook2 | Cooking | x | x | x | x | x | 176
Epic-Kitchens | Cooking | x | x | x | x | x | 100
HowTo100M | Daily-life | √ | x | x | x | x | 134k
Ego-4D | Daily-life | x | x | x | x | x | 3670
Ego-Exo4D | Daily-life | x | x | √ | x | x | 1422
Assembly-101 | Toy assm. | x | x | √ | x | x | 513
Interactive AI Assistant Datasets
WTAG | Cooking | x | x | √ | √ | x | 10
HoloAssist | Obj. manip. | x | x | √ | √ | x | 166
QEVD (Ours) | Fitness | √ | √ | √ | √ | √ | 474
Efficient human-AI interaction and video-based reasoning
Detailed architecture: Learning what to say and when to say it
[Architecture diagram. Labels: EXTERNAL INPUT (e.g., camera); visual stream through 3D CNN blocks; PROMPT; LANGUAGE BACKBONE (AUTO-REGRESSIVE LLM) built from SELF-ATTN layers with interleaved CROSS-ATTN layers; output LANGUAGE OR ACTIONS, including the tokens <next> and <feedback>]
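The `<next>` and `<feedback>` labels in the diagram suggest that the model decides at each step whether to keep watching or to speak. The loop below is a minimal sketch of that control flow only, assuming one decision per incoming frame; `stream_interaction`, `toy_decide`, and `toy_speak` are hypothetical stand-ins and do not reproduce the actual model, which makes this decision inside the LLM itself.

```python
from typing import Callable, Iterable, List, Tuple

NEXT, FEEDBACK = "<next>", "<feedback>"


def stream_interaction(
    frames: Iterable[float],
    decide: Callable[[List[float]], str],
    speak: Callable[[List[float]], str],
) -> List[Tuple[int, str]]:
    """Consume frames one at a time. After each frame, ask `decide` for an
    action token; on <feedback>, call `speak` to produce an utterance."""
    history: List[float] = []
    utterances: List[Tuple[int, str]] = []
    for t, frame in enumerate(frames):
        history.append(frame)
        action = decide(history)
        if action == FEEDBACK:
            utterances.append((t, speak(history)))
        elif action != NEXT:
            raise ValueError(f"unexpected action token: {action}")
    return utterances


def toy_decide(history: List[float]) -> str:
    """Stand-in policy: speak when the latest frame feature exceeds 0.5."""
    return FEEDBACK if history[-1] > 0.5 else NEXT


def toy_speak(history: List[float]) -> str:
    """Stand-in generator: report how many frames have been seen."""
    return f"feedback after {len(history)} frames"
```

Calling `stream_interaction([0.1, 0.9, 0.2, 0.7], toy_decide, toy_speak)` yields utterances at timesteps 1 and 3 and silence elsewhere, which is the "what to say and when to say it" behavior in miniature.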
Steppable causal 3D convolutions enable efficient streaming motion perception
• Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor
• This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns
• We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al., 2023)
• Efficient visual streaming at inference time can be enabled using steppable, causal convolutions
[Figure: Standard Conv, Causal Conv, and Steppable Conv, shown over previous timesteps and the new timestep]
Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense
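To make the difference between a causal and a steppable convolution concrete, here is a minimal sketch along the time axis only. It is a 1D simplification written for illustration, not the 3D CNN from the talk: `causal_conv1d` and `SteppableCausalConv1d` are hypothetical names, and the kernel is applied as a sliding dot product (cross-correlation), as is common in deep learning code.

```python
from collections import deque
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Offline causal convolution: output[t] depends only on x[t-K+1..t],
    with zeros assumed before the first timestep."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * v for w, v in zip(kernel, padded[t:t + k]))
        for t in range(len(x))
    ]


class SteppableCausalConv1d:
    """Streaming version: keeps the last K inputs in a buffer so that each
    new timestep costs one dot product instead of reprocessing the clip."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # Start with zeros, matching the zero padding of the offline version.
        self.window = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def step(self, x_t: float) -> float:
        self.window.append(x_t)  # the deque drops the oldest input
        return sum(w * v for w, v in zip(self.kernel, self.window))
```

Feeding the same signal one sample at a time through `step` gives the same outputs as the offline function, so a network trained with causal convolutions can be run frame by frame on a live camera feed without recomputing past timesteps.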
Improving streaming video LLMs using auxiliary tasks
Language generation is not only a useful task, but it also helps a model acquire a degree of "common sense"
Using a language decoder to provide surrogate tasks to the model at training time
• Pre-training a model on a difficult captioning task (Something-Something* by Goyal et al., 2017) allows us to improve prediction accuracy on a separate Home Cooking task: "On the effectiveness of task granularity for transfer learning" (Mahdisoltani et al., 2018)
[Bar chart. Categories: Generating complex textual descriptions; Generating simple textual descriptions; Classification on 178 class actions; Classification on 40 action groups; Baseline classification on images; Training from scratch. Values shown: 7.7, 34.3, 59.7, 55.8, 62.8, 47.1, 54.4]
* "The something-something video database for learning and evaluating visual common sense" (Goyal et al., 2017)
A vision-language model can learn low-level visual skills by encoding visual information as language
• Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc.
• The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks
"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al., 2024
Example: CATER (Girdhar et al., 2020):

Method | Static Camera, Top 1 | Static Camera, Top 5 | Moving Camera, Top 1 | Moving Camera, Top 5
ALOE (Ding et al.) | 74.0 | 94.0 | 59.7 | 90.1
TFC V3D (Zhang et al.) | 79.7 | 95.5 | - | -
LRR (w/o Surrogate Tasks) | 68.5 | 88.7 | 62.7 | 86.7
LRR (fine-tuned) | 84.1 | 97.2 | 80.4 | 96.7
LRR (joint) | 81.0 | 97.3 | 73.7 | 95.6

Example: Something-Else (Materzynska et al., 2020):

Method | Base, Top 1 | Base, Top 5 | Compositional, Top 1 | Compositional, Top 5
STIN+OIE+NL (Materzynska et al., 2020, MIT) | 78.1 | 94.5 | 56.2 | 81.3
Video-ChatGPT (Maaz et al., 2023) | 52.6 | 75.8 | 38.6 | 67.8
LRR (w/o Surrogate Tasks) | 52.6 | 75.8 | 50.1 | 70.8
LRR (fine-tuned) | 80.2 | 96.1 | 62.0 | 86.3
LRR (joint) | - | - | 61.1 | 85.4
Stochastic probing allows us to distill visual skills into the model
• Encoding the extracted low-level information as tokens grows the context window and can be inefficient
• Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness
• We therefore propose to distill low-level visual skills into the model using a process we refer to as stochastic probing
Stochastic probing: During training, prompt a model at random time-steps to perform low-level visual tasks

ACRE | Compositional | Systematic | Inference Speed* (sec)
ALOE (Ding et al.) | 91.7 | 93.9 | -
LRR | 99.3 | 99.5 | 1.415
LRR (Stochastic Probing) | 98.2 | 99.2 | 0.061
* Timing on an A100 GPU

• Stochastic probing boosts efficiency at inference time
• Training on visual skills can boost performance over classic approaches
A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023
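The one-sentence definition of stochastic probing can be sketched as a data-preparation step. The code below is an illustration under assumptions, not the published training recipe: `add_stochastic_probes`, the `<probe:...>` marker format, the probe names, and the per-frame probability `probe_prob` are all hypothetical, and the supervised answers to each probe are omitted.

```python
import random
from typing import List, Sequence


def add_stochastic_probes(
    frame_slots: Sequence[str],
    probe_prompts: Sequence[str],
    probe_prob: float,
    rng: random.Random,
) -> List[str]:
    """Return a training sequence in which, after each frame slot, a
    low-level visual probe is inserted with probability `probe_prob`."""
    sequence: List[str] = []
    for slot in frame_slots:
        sequence.append(slot)
        if rng.random() < probe_prob:
            # Hypothetical marker; a real setup would add the prompt text
            # and its target answer as tokens here.
            sequence.append(f"<probe:{rng.choice(list(probe_prompts))}>")
    return sequence
```

For example, `add_stochastic_probes(["f0", "f1", "f2"], ["locate_object"], 0.3, random.Random(0))` returns the three frame slots with probes after a random subset of them. Under this reading, setting `probe_prob` to 0 at inference time leaves the sequence free of probe tokens, so the context window does not grow.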
End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time
Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback
Question: Provide an appropriate feedback for the user
• Video-LLaMA: We see a young man standing in a kitchen, wearing a red shirt and white shorts.
• Video-ChatGPT: The user has successfully demonstrated the ability to perform a balancing act on a pair of stools.
• Coach-LLaMA: This is awesome. Let's keep the intensity high!
[Figure labels: Ground truth, Stream-VLM, LLaMA-VID, LLaVA-Next]
Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc.
InstructBLIP | 0.047 | 0.040 | 0.839 | 1.64
Video-LLaVA | 0.057 | 0.025 | 0.847 | 1.82
Video-ChatGPT | 0.098 | 0.078 | 0.850 | 2.27
Video-LLaMA | 0.101 | 0.077 | 0.859 | 2.28
LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.33
LLaVA-Next | 0.104 | 0.078 | 0.858 | 2.39

Fine-tuning results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score
Socratic-Llama-2-7B | 0.094 | 0.071 | 0.860 | 2.39 | 0.50†
Video-ChatGPT* | 0.108 | 0.093 | 0.863 | 2.42 | 0.50†
LLaMA-VID* | 0.106 | 0.090 | 0.860 | 2.40 | 0.50†
STREAM-VLM | 0.125 | 0.116 | 0.863 | 2.56 | 0.59
STREAM-VLM (w/o 3D CNN) | 0.090 | 0.083 | 0.857 | 2.17 | 0.51
STREAM-VLM (w/o Action-Tokens) | 0.125 | 0.110 | 0.861 | 2.56 | 0.50†
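For readers unfamiliar with the ROUGE-L column, the sketch below shows the usual idea behind it: score a generated feedback against a reference by their longest common subsequence of words. It is a simplified illustration, assuming whitespace tokenization, lowercasing, and an unweighted F1; the talk does not state which implementation or settings produced the numbers above, and the example sentences in the note are invented.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based precision and recall combined into an F1 score."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, the candidate "keep back straight" against the reference "keep your back straight" shares a three-word subsequence, giving precision 1, recall 0.75, and an F1 of 6/7. Short, varied coaching sentences rarely share long word sequences, which may help explain why the absolute ROUGE-L values in the tables are low.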
Outlook: CLEVRskills dataset for robotics foundation models

DATASET/SIMULATOR | # TASKS | LANGUAGE | MULTIMODAL PROMPTS | ACTION GRANULARITY | COMPOSITIONALITY | # DEMONSTRATIONS
Real
RoboTurk | 3 | x | x | Action Deltas | x | 111 hrs
BridgeData | 71 | x | x | Action Deltas | x | 7.2k
Open-X | | √ | x | Action Deltas | x | 1M
RH20T | | √ | x | Action Deltas | x | 100k
FMB | 7 | x | x | Action Deltas | √ | 22.5k
Simulated
CALVIN | 34 | √ | x | Action Deltas | √† | -
Behaviour-1K | 1000 | x | x | Action Deltas | x | -
Maniskill2 | 20 | x | x | Action Deltas | x | ≈70k
VIMA | 17 | √ | √ | Poses | x | 650k
ClevrSkill (ours) | 36 | √ | √ | Action Deltas + Poses | √ | 330k
Running AI on device saves memory