
Integrating senses: How AI is learning to see, hear, and interact

Roland Memisevic

Senior Director of Engineering at Qualcomm AI Research

Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others

September 24, 2024

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Agenda

• Key concept: streaming architecture
• Importance of datasets for end-to-end training
• Efficient human-AI interaction and video-based reasoning
• Improving streaming video LLMs using auxiliary tasks
• Q&A

Generative AI capabilities continue to increase

MODALITY AND USE CASE

Voice UI: Voice is a natural and intuitive interface for conversation

Large multimodal models: Utilizing more sensing input modalities to better understand the world

Video & 3D: Generating content for a richer and more realistic experience

Agents: Execute multi-step tasks with reasoning autonomously to achieve a goal

CAPABILITY AND KPI

Longer context window: Allows in-depth conversations

Personalization: Fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)

Higher resolution: Process higher fidelity images for better accuracy

LoRA: low-rank adaptation

Full-stack AI optimization for LVMs

Runs completely on the device

Significantly reduces runtime latency and power consumption

Continuously improves the Qualcomm AI Stack

Designing an efficient diffusion model through knowledge distillation for high accuracy: knowledge distillation for pruning and removal of attention blocks, resulting in an accurate model with improved performance and power efficiency

Qualcomm AI Engine Direct for improved performance and minimized memory spillage

AI acceleration on the Qualcomm Hexagon NPU of the Snapdragon 8 Gen 3 Mobile Processor

LVM: Language vision model

Hybrid AI

Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences

Central cloud: Ease of development & deployment | Training | Very large models | Aggregation | Absolute performance

Edge cloud (on-prem or nearby): Immediacy | Reliability | Personalization | Privacy | Security | Fine-tuning | Aggregation

On device: Immediacy | Reliability | Personalization | Privacy | Security | Cost | Energy

To scale, the center of gravity of AI processing is moving to the edge

World's first large multimodal model (LMM) on an Android phone

LLMs can now see

7+ billion parameter LMM, LLaVA, with text, speech, and image inputs

Multi-turn intuitive conversations about an image at a responsive token rate

Full-stack AI optimization to achieve high performance at low power

Enhanced privacy, reliability, personalization, and cost with on-device processing

LLM: Large Language Model; LLaVA: Large Language and Vision Assistant

Goal: Training AI models to see and interact with humans

SMART HOME | MOBILE | ROBOTICS

Researching visually-grounded LLMs with the ability to reason and interact with the environment

Situated vision-language models
• Process a live video stream in real time and dynamically interact with users
• Determine what to say and when to say it
• Enable the path to humanoids

Open-ended, asynchronous interaction with situated agents is an open challenge
• Limited to turn-based interactions about offline documents or images
• Limited to capturing momentary snapshots of reality in a VQA-style dialogue

[Diagram: visually-grounded LLM. Component labels: Vision, Action recognition, Orchestrator, LLM, Front end, TTS]

What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions (2024); OpenEQA: Embodied Question Answering in the Era of Foundation Models (2024)

VQA: visual question answering

Neural networks have replaced increasingly complex computational pipelines

2010, SPEECH TO TEXT: Audio → Pipeline, replaced by Neural network → Text

2012, OBJECT RECOGNITION: Pixels → Pipeline, replaced by Neural network → Objects

2014, LANGUAGE TRANSLATION: English → Pipeline, replaced by Neural network → French

End-to-end backprop for agents

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM

Key concept: Multi-modal streaming architecture

End-to-end learning requires a multi-modal streaming architecture

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM (TRAINED END-TO-END)

EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS

• An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
• Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge

[Figure: EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS. CONTEXT WINDOW: FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN]
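The idea of a single context window that mixes camera frames and language tokens can be sketched in a few lines of Python. This is only an illustration under assumed names: `interleave`, `trim_to_window`, and the `("frame", index)` / `("token", text)` item format are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical item format: ("frame", frame_index) or ("token", token_text).
Item = Tuple[str, object]


def interleave(num_frames: int, tokens_after_frame: Dict[int, List[str]]) -> List[Item]:
    """Build one sequence in which frames and tokens are freely interleaved."""
    context: List[Item] = []
    for t in range(num_frames):
        context.append(("frame", t))
        # Zero or more tokens may follow each frame.
        for tok in tokens_after_frame.get(t, []):
            context.append(("token", tok))
    return context


def trim_to_window(context: List[Item], max_len: int) -> List[Item]:
    """Keep only the most recent max_len items, like a fixed-size context window."""
    return context[-max_len:]


context = interleave(3, {0: ["Nice", "form", "!"], 2: ["Slow", "down"]})
kinds = [kind for kind, _ in context]
window = trim_to_window(context, 3)
```

Here frame 1 is followed by no tokens at all, which is the case a turn-based model cannot represent: the number of tokens per frame is not fixed in advance.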

• Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common
• There are multiple ways to combine visual information with language model tokens, e.g.:
  • Cross-attention (e.g., Flamingo)
  • Dedicated vision tokens (e.g., LLaVA)

… good for applications like captioning and visual question answering

However, …

… a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input

• Challenges:
  • Freely interleaved vision frames and language tokens
  • Dependencies between vision frame rate and token rate
  • Training data, allowing a model to learn what to say and when
• Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, which I will present in the next slides

"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al. 2022; "Visual Instruction Tuning", Liu et al. 2023

Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants

Key requirement for end-to-end training: aligned video feed (frames) + assistant's comments (tokens)
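To make the alignment requirement concrete, the sketch below maps timestamped assistant comments onto frame indices. It is an assumed data layout, not the format of any of the datasets named here: `Comment`, `align_comments_to_frames`, the frame rate, and whitespace tokenization are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Comment:
    time_s: float  # when the assistant spoke, in seconds from the start of the video
    text: str      # what the assistant said


def align_comments_to_frames(
    comments: List[Comment], fps: float, num_frames: int
) -> Dict[int, List[str]]:
    """Attach each comment's tokens to the frame that was on screen when it was spoken."""
    aligned: Dict[int, List[str]] = {}
    for c in comments:
        # Clamp so a comment at the very end still maps to the last frame.
        idx = min(int(c.time_s * fps), num_frames - 1)
        # Whitespace splitting stands in for a real tokenizer (assumption).
        aligned.setdefault(idx, []).extend(c.text.split())
    return aligned


comments = [Comment(0.6, "Keep your back straight"), Comment(2.4, "Great job")]
aligned = align_comments_to_frames(comments, fps=2.0, num_frames=6)
```

Frames with no entry in `aligned` are frames during which the assistant stays silent, which is itself a training signal for learning when to speak.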

"HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al. 2024: 1st person videos showing a variety of tasks (20 tasks across 16 objects)

"Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al. 2023: 1st person videos showing preparation of cupcakes

"Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al. 2024: 3rd person videos showing fitness exercises and their corrections

FIT-Coach benchmark and dataset

A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction

Aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain

Fitness questions dataset
• 148 exercises
• 300k short-clip videos
• 470+ hours
• 1900 unique participants
• 1.1M+ high-level question-answer pairs
• 400k+ fine-grained question-answer pairs

Fitness feedback dataset
• 9+ hours of fitness coaching sessions
• 148 exercise sessions
• ~3.5 minutes long sessions with 5 to 6 exercises
• 21 unique participants

Live Fitness Coaching as a Testbed for Situated Interaction, Panchal, Bhattacharyya, et al. 2024

Fitness assistant dataset and benchmark

Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of duration ~5-10 seconds each)

Long-range videos showing the user exercising, along with aligned comments by the coach (~200 sessions across 5-6 exercises each)

                                   SHORT CLIPS               LONG-RANGE
                                   Train        Test         Train        Test
Number of videos                   290,775      16,429       153          69
Unique participants                1,800+       100          21           7
Average duration (s)               5.6±1.1      5.6±1.2      213.4±3.1    213.7±3.3
Exercises per video                1            1            5-6          5-6
Total number of exercises          148          148          23           23
Total classes                      1866         1690

Fitness questions
Total high-level questions         1,193,056    78,390
Total fine-grained questions       404,082      80,694

Fitness feedbacks
Average feedbacks per exercise     2.0±10.1     2.4±6.9      5.0±1.3      5.0±1.2
Average silence period (s)         n/a          n/a          5.2±1.4      5.3±1.2
Average feedback length (words)    9.0±6.1      9.1±5.0      6.3±3.8      6.6±4.0

[Figures: long fitness sessions dataset; short fitness clips dataset]

Our dataset meets all the needs of interactive AI assistants

[Table: datasets compared on HUMAN ACTIONS, INTERACTIVE, MISTAKES, CORRECTIVE FEEDBACKS, and DOMAIN EXPERTISE. The column positions of the individual marks are flattened in this copy, so only the number of x marks per dataset is listed; QEVD is the only dataset with none.]

DATASET                 DOMAIN         X MARKS         LENGTH

Action Recognition Datasets
NTU RGB+D, FineGym      Fitness        6 (combined)    708

Procedural Activity Datasets
YouCook2                Cooking        5               176
Epic-Kitchens           Cooking        5               100
HowTo100M               Daily-life     4               134k
Ego-4D                  Daily-life     5               3670
Ego-Exo4D               Daily-life     4               1422
Assembly-101            Toy assm.      4               513

Interactive AI Assistant Datasets
WTAG                    Cooking        3               10
HoloAssist              Obj. manip.    3               166
QEVD (Ours)             Fitness        0               474

Efficient human-AI interaction and video-based reasoning

Detailed architecture: Learning what to say and when to say it

[Figure: the visual stream from the EXTERNAL INPUT (e.g., camera) is encoded by 3D CNN blocks. The AUTO-REGRESSIVE LLM receives a PROMPT; its LANGUAGE BACKBONE is drawn as stacks of SELF-ATTN and CROSS-ATTN layers. The output is LANGUAGE OR ACTIONS, shown as a token stream that includes the special tokens <next> and <feedback>.]
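One way to read the `<next>` and `<feedback>` tokens is as a per-frame decision between staying silent and speaking. The loop below is a minimal sketch of that reading, not the model from the talk: `stream_feedback` and `toy_step` are hypothetical, and `toy_step` is a stand-in rule where a real system would run the video-language model.

```python
from typing import Callable, Iterable, List, Tuple

NEXT = "<next>"          # emitted when the model chooses to stay silent
FEEDBACK = "<feedback>"  # emitted when the model chooses to speak


def stream_feedback(
    frames: Iterable[float],
    step: Callable[[float, List[str]], List[str]],
) -> List[Tuple[int, str]]:
    """Process frames one at a time and collect (frame_index, utterance) pairs."""
    history: List[str] = []
    spoken: List[Tuple[int, str]] = []
    for t, frame in enumerate(frames):
        out = step(frame, history)
        history.extend(out)  # everything emitted so far stays in the context
        if out and out[0] == FEEDBACK:
            spoken.append((t, " ".join(out[1:])))
        # Otherwise the output was <next>: say nothing and wait for the next frame.
    return spoken


def toy_step(frame: float, history: List[str]) -> List[str]:
    """Hypothetical stand-in for the model: speak only when a 'mistake score' is high."""
    if frame > 0.5:
        return [FEEDBACK, "straighten", "your", "back"]
    return [NEXT]


spoken = stream_feedback([0.1, 0.9, 0.2], toy_step)
```

With this toy rule, only the middle frame triggers an utterance; the other two frames produce `<next>` and the loop simply advances.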

Steppable causal 3D convolutions enable efficient streaming motion perception

Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor

This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns

We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al. 2023)

Efficient visual streaming at inference time can be enabled using steppable, causal convolutions

[Figure: Standard Conv, Causal Conv, and Steppable Conv compared over time steps; labels: Previous, New]

Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense
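The difference between a causal convolution run over a whole clip and a steppable one fed frame by frame can be shown on a one-dimensional signal. The sketch below is a simplification under stated assumptions: it convolves over time only, uses scalars in place of video frames, and the names `causal_conv1d` and `SteppableConv1d` are hypothetical rather than taken from the referenced code.

```python
from collections import deque
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Offline causal convolution: output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad the past, never the future
    return [
        sum(w * v for w, v in zip(kernel, padded[t:t + k]))
        for t in range(len(x))
    ]


class SteppableConv1d:
    """Streaming version: keeps the last k inputs and emits one output per step."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # The buffer starts as zeros, matching the zero padding used offline.
        self.buffer = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def step(self, value: float) -> float:
        self.buffer.append(value)  # the oldest input is dropped automatically
        return sum(w * v for w, v in zip(self.kernel, self.buffer))


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]
offline = causal_conv1d(signal, kernel)
conv = SteppableConv1d(kernel)
online = [conv.step(v) for v in signal]
```

Both paths give the same outputs, but the steppable version does a constant amount of work per new input and never revisits earlier time steps, which is what makes it suitable for a live camera feed.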

Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task, but it also helps a model acquire a degree of "common sense"

Using a language decoder to provide surrogate tasks to the model at training time

Pre-training a model on a difficult captioning task (Something-something by Goyal et al. 2017)* …

… allows us to improve prediction accuracy on a separate Home Cooking Task:

"On the effectiveness of task granularity for transfer learning" (Mahdisoltani, et al. 2018)

[Figure: chart comparing pre-training tasks. Task labels: generating complex textual descriptions; generating simple textual descriptions; classification on 178 action classes; classification on 40 action groups; baseline classification on images; training from scratch. Values in extraction order: 7.7, 34.3, 59.7, 55.8, 62.8, 47.1, 54.4]

* "The something-something video database for learning and evaluating visual common sense" (Goyal et al. 2017)

A vision-language model can learn low-level visual skills by encoding visual information as language

Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc.

The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks

"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya, et al. 2024

Example: CATER (Girdhar et al., 2020):

Method                                          Static Camera       Moving Camera
                                                Top 1    Top 5      Top 1    Top 5
ALOE (Ding et al.)                              74.0     94.0       59.7     90.1
TFC V3D (Zhang et al.)                          79.7     95.5       -        -
LRR (w/o Surrogate Tasks)                       68.5     88.7       62.7     86.7
LRR (fine-tuned)                                84.1     97.2       80.4     96.7
LRR (joint)                                     81.0     97.3       73.7     95.6

Example: Something-Else (Materzynska et al., 2020):

Method                                          Base                Compositional
                                                Top 1    Top 5      Top 1    Top 5
STIN+OIE+NL (Materzynska et al., 2020, MIT)     78.1     94.5       56.2     81.3
Video-ChatGPT (Maaz et al., 2023)               52.6     75.8       38.6     67.8
LRR (w/o Surrogate Tasks)                       52.6     75.8       50.1     70.8
LRR (fine-tuned)                                80.2     96.1       62.0     86.3
LRR (joint)                                     -        -          61.1     85.4

Stochastic probing allows us to distill visual skills into the model

• Encoding the extracted low-level information as tokens grows the context window and it can be inefficient
• Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness
• We therefore propose to distill low-level visual skills into the model using a process we refer to as Stochastic Probing:

Stochastic probing: During training, prompt a model at random time-steps to perform low-level visual tasks
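The sampling step in that definition can be sketched as follows. This covers only the choice of when and what to probe during training, under assumed details: the function `sample_probes`, the per-step probability, and the example prompts are hypothetical, and the actual training loss on the probe answers is not shown.

```python
import random
from typing import List, Tuple


def sample_probes(
    num_steps: int,
    probe_prompts: List[str],
    probe_prob: float,
    rng: random.Random,
) -> List[Tuple[int, str]]:
    """Pick random time-steps at which the model is asked a low-level visual question."""
    probes: List[Tuple[int, str]] = []
    for t in range(num_steps):
        # Each time-step is probed independently with probability probe_prob.
        if rng.random() < probe_prob:
            probes.append((t, rng.choice(probe_prompts)))
    return probes


# Hypothetical probe prompts; the source does not list the ones actually used.
prompts = ["Where is the object?", "Which action is shown?"]
probes = sample_probes(10, prompts, probe_prob=0.3, rng=random.Random(0))
```

Because probes are issued only at training time, nothing needs to be inserted into the context at inference time, so the context window does not grow with low-level tokens.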

[Table: ACRE results. Columns: Compositional, Systematic, Inference Speed* (sec). Rows: ALOE (Ding et al.), LRR, LRR (Stochastic Probing). Cell values in extraction order: 91.7, 99.3, 93.9, 99.5, 99.2, -, 0.061, 1.415, 98.2]

* timing on an A100 GPU

Stochastic probing boosts efficiency at inference time

Training on visual skills can boost performance over classic approaches

A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh, et al., 2023

End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real-time

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Question: Provide an appropriate feedback for the user

Video-LLaMA: We see a young man standing in a kitchen, wearing a red shirt and white shorts.

Video-ChatGPT: The user has successfully demonstrated the ability to perform a balancing act on a pair of stools.

Coach-LLaMA: This is awesome. Let's keep the intensity high!

[Figure panels: Ground truth, Stream-VLM, LLaMA-VID, LLaVA-Next]

Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

METHOD                            METEOR    ROUGE-L    BERT     LLM-Acc.
InstructBLIP                      0.047     0.040      0.839    1.64
Video-LLaVA                       0.057     0.025      0.847    1.82
Video-ChatGPT                     0.098     0.078      0.850    2.27
Video-LLaMA                       0.101     0.077      0.859    2.28
LLaMA-VID                         0.100     0.079      0.859    2.33
LLaVA-Next                        0.104     0.078      0.858    2.39

Fine-tuning results:

METHOD                            METEOR    ROUGE-L    BERT     LLM-Acc.    T-F-Score
Socratic-Llama-2-7B               0.094     0.071      0.860    2.39        0.50
Video-ChatGPT*                    0.108     0.093      0.863    2.42        0.50
LLaMA-VID*                        0.106     0.090      0.860    2.40        0.50
STREAM-VLM                        0.125     0.116      0.863    2.56        0.59
STREAM-VLM (w/o 3D CNN)           0.090     0.083      0.857    2.17        0.51
STREAM-VLM (w/o Action-Tokens)    0.125     0.110      0.861    2.56        0.50

Outlook: CLEVRskills dataset for robotics foundation models

[Table: the LANGUAGE and MULTIMODAL PROMPTS columns are shown together because the positions of their marks are flattened in this copy; an empty cell means no value or mark survived extraction.]

DATASET/SIMULATOR     #TASKS    LANGUAGE / MULTIMODAL PROMPTS    ACTION GRANULARITY       COMPOSITIONALITY    #DEMONSTRATIONS

Real
RoboTurk              3         x x                              Action Deltas            x                   111 hrs
BridgeData            71        x x                              Action Deltas            x                   7.2k
Open-X                          x                                Action Deltas            x                   1M
RH20T                           x                                Action Deltas            x                   100k
FMB                   7         x x                              Action Deltas                                22.5k

Simulated
CALVIN                34        x                                Action Deltas            √
Behaviour-1K          1000      x x                              Action Deltas            x
ManiSkill2            20        x x                              Action Deltas            x                   ≈70k
VIMA                  17                                         Poses                    x                   650k
ClevrSkills (ours)    36                                         Action Deltas + Poses                        330k

Running AI on device saves memory
