Parallel Processing and Architecture course slides: hitsz-lec05

Uploaded: 2023-02-06. Format: PPTX. Pages: 30.


Threading & Simultaneous Multithreading

Slides adapted from David Patterson, UC-Berkeley cs252-s06

Outline
- Thread Level Parallelism
- Multithreading
- Simultaneous Multithreading
- Power 4 vs. Power 5
- Head to Head: VLIW vs. Superscalar vs. SMT
- Commentary
- Conclusion

Performance beyond single thread ILP
- ILP for arbitrary code is now limited to 3 to 6 issues/cycle, but some applications (e.g., database or scientific codes) have much higher natural parallelism
- Explicit (specified by compiler) Thread Level Parallelism or Data Level Parallelism
- Thread: a process with its own instructions and data (or, much harder on the compiler: carefully selected code segments in the same process that rarely interact)
  - A thread may be one process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data Level Parallelism: perform identical (lock-step) operations on data when there is a lot of data

Thread Level Parallelism (TLP)
- ILP (last lectures) exploits implicitly parallel operations within a loop or straight-line code segment
- TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
- Goal: use many instruction streams to improve
  - Throughput of computers that run many programs
  - Execution time of multi-threaded programs
- TLP could be more cost-effective to exploit than ILP for many applications

New Approach: Multithreaded Execution
- Multithreading: multiple threads share the functional units of one processor via overlapped execution
  - The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, if running as independent programs, a separate page table
  - Memory is shared through the virtual memory mechanisms, which already support multiple processes
  - HW for fast thread switch (0.1 to 10 clocks) is much faster than a full process switch (100s to 1000s of clocks) that copies state (state = registers, memory, and file access tables)
- When to switch among threads?
  - Alternate instructions from new threads (fine grain)
  - When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
  - In cache-less multiprocessors, at the start of each memory access

For most applications, the processing unit(s) stall 80% or more of the time during "execution"
- From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995. (From U Wash.)
- Just 18% of issue slots OK for an 8-way superscalar
[Figure: 18% of CPU issue slots usefully busy]

Multithreading Categories
[Figure: issue slots over time (processor cycles) for pipes/FUs 1 to 4, with threads 1 to 5 and idle slots marked. Organizations shown, in order: Superscalar; Fine-Grained (new thread per cycle); Coarse-Grained (many cycles per thread); Multiprocessing (separate jobs); Simultaneous Multithreading. Issue-slot utilization values listed on the figure, in order: 16/48 = 33.3%, 27/48 = 56.3%, 27/48 = 56.3%, 29/48 = 60.4%, 42/48 = 87.5%]

Fine-Grained Multithreading
- Switches between threads on each instruction cycle, causing the execution of multiple threads to be interleaved
- Usually done in a round-robin fashion, skipping any stalled threads
- CPU must be able to switch threads every clock
- Advantage is that it can hide both short and long stalls, since instructions from other threads execute when one thread stalls
- Disadvantage is that it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
- Used on Sun's Niagara chip (with 8 cores, will see later)

Coarse-Grained Multithreading
- Switches threads for costly stalls, such as L2 cache misses (or on any data memory reference if there are no caches)
- Advantages
  - Relieves the need to have very fast thread-switching (if caches are used)
  - Does not slow down any thread, since instructions from other threads issue only when the active thread encounters a costly stall

- Disadvantage of coarse-grained multithreading: it is hard to overcome throughput losses from shorter stalls, because of pipeline start-up costs
  - Since the CPU normally issues instructions from just one thread, when a stall occurs the pipeline must be emptied or frozen
  - The new thread must fill the pipeline before instructions can complete
- Because of this start-up overhead, coarse-grained multithreading is efficient for reducing the penalty only of high-cost stalls, where stall time >> pipeline refill time
- Used in the IBM AS/400 (1988, for small to medium businesses)

Simultaneous Multi-threading (U Wash. => Intel "Hyper-threading")
[Figure: issue slots by cycle (1 to 9) for 8 functional units: M, M, FX, FX, FP, FP, BR, CC]
- One thread, 8 functional units: busy 13/72 = 18.1%
- Two threads, 8 units: busy 30/72 = 41.7%
- M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Use both ILP and TLP? (U Wash: "Yes")
- TLP and ILP exploit two different kinds of parallel structure in a program
- Could a processor oriented toward ILP be used to exploit TLP?
  - Functional units are often idle in datapaths designed for ILP because of either stalls or dependences in the code
- Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls?
- Could TLP be used to employ the functional units that would otherwise lie idle when insufficient ILP exists?
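The fine-grained and coarse-grained switching policies described in the slides can be compared with a small simulation. The following is a minimal sketch, not a model of any real processor: it assumes a single-issue pipeline, a hypothetical workload format in which `stalls[i]` is the number of cycles a thread is blocked after issuing instruction `i`, and a hypothetical `refill` penalty paid on every coarse-grained thread switch. All names (`Thread`, `fine_grained`, `coarse_grained`) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    stalls: list        # stalls[i] = extra cycles the thread is blocked after instruction i
    pc: int = 0         # index of the next instruction to issue
    ready_at: int = 0   # first cycle in which the thread may issue again

    def done(self):
        return self.pc >= len(self.stalls)

    def can_issue(self, cycle):
        return not self.done() and self.ready_at <= cycle

    def issue(self, cycle):
        self.ready_at = cycle + 1 + self.stalls[self.pc]
        self.pc += 1


def fine_grained(workloads):
    """Round-robin every cycle, skipping stalled threads. Returns (issued, cycles)."""
    threads = [Thread(list(w)) for w in workloads]
    n, cycle, issued, nxt = len(threads), 0, 0, 0
    while not all(t.done() for t in threads):
        for k in range(n):
            i = (nxt + k) % n
            if threads[i].can_issue(cycle):
                threads[i].issue(cycle)
                issued += 1
                nxt = (i + 1) % n   # next cycle starts with the following thread
                break
        cycle += 1                  # an idle slot if no thread could issue
    return issued, cycle


def coarse_grained(workloads, refill=2):
    """Stay on one thread; switch only on a costly stall and pay `refill` idle cycles."""
    threads = [Thread(list(w)) for w in workloads]
    n, cycle, issued, cur = len(threads), 0, 0, 0
    while not all(t.done() for t in threads):
        t = threads[cur]
        if t.can_issue(cycle):
            t.issue(cycle)
            issued += 1
            cycle += 1
            continue
        # Assumption: a stall is "costly" when it outlasts the refill penalty.
        costly = t.done() or (t.ready_at - cycle) > refill
        ready = [(cur + k) % n for k in range(1, n)
                 if threads[(cur + k) % n].can_issue(cycle)]
        if costly and ready:
            cur = ready[0]
            cycle += refill         # pipeline is emptied and refilled by the new thread
        else:
            cycle += 1              # wait out a short stall, or nobody is ready
    return issued, cycle


if __name__ == "__main__":
    work = [[0, 4, 0], [0, 4, 0]]   # two identical threads, one 4-cycle stall each
    for name, (issued, cycles) in [
        ("single thread", fine_grained(work[:1])),
        ("fine-grained", fine_grained(work)),
        ("coarse-grained", coarse_grained(work, refill=2)),
    ]:
        print(f"{name}: {issued}/{cycles} issue slots busy = {issued / cycles:.1%}")
```

In this toy workload the stalls are only slightly longer than the refill penalty, so the coarse-grained policy recovers little of the lost time while the fine-grained policy recovers more. Lengthening the stalls relative to `refill` narrows that gap, which matches the slides' condition that stall time should be much greater than pipeline refill time.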

Simultaneous Multithreading (SMT)
- Simultaneous multithreading (SMT): insight that a dynamically scheduled processor already has many HW mechanisms to support multithreading
  - Large set of virtual registers that can be used to hold the register sets of independent threads
  - Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  - Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
- Just need to add a per-thread renaming table and keep separate PCs
  - Independent commitment can be supported by "logically" keeping a separate reorder buffer for each thread
Source: Microprocessor Report, December 6, 1999, "Compaq Chooses SMT for Alpha"

Design Challenges in SMT
- Since SMT makes sense only with a fine-grained implementation, what is the impact of fine-grained scheduling on single-thread performance?
  - Does designating a preferred thread allow sacrificing neither throughput nor single-thread performance?
  - Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls
- Larger register file is needed to hold multiple contexts
- Try not to affect clock cycle time, especially in
  - Instruction issue: more candidate instructions need to be considered
  - Instruction completion: choosing which instructions to commit may be challenging
- Ensure that cache and TLB conflicts generated by SMT do not degrade performance

Multithreading Categories
[Figure repeated: the same five organizations and issue-slot utilization values as on the earlier Multithreading Categories slide]

Power 4
- Single-threaded predecessor to Power 5
- Eight execution units in an out-of-order engine; each unit may issue one instruction each cycle
[Figure: instruction pipeline. IF: instruction fetch, IC: instruction cache, BP: branch predict, D0: decode stage 0, Xfer: transfer, GD: group dispatch, MP: mapping, ISS: instruction issue, RF: register file read, EX: execute, EA: compute address, DC: data caches, F6: six-cycle floating-point execution pipe, Fmt: data format, WB: write back, CP: group commit]

Power 4 (1 thread) vs. Power 5 (2 threads)
- Power 5: 2 fetch (PC), 2 initial decodes
- Power 5: 2 completes (architected register sets)
- See /servers/eserver/pseries/news/related/2004/m2040.pdf
[Figure: Power5 instruction pipeline, with the same stage abbreviations as the Power4 pipeline. Page 43.]

Power 5 data flow ...
- Why only 2 threads? With 4, some shared resource (physical registers, cache, memory bandwidth) would often bottleneck
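The SMT mechanism described earlier (a per-thread renaming table over a shared pool of physical registers) and the remark that physical registers are a shared resource that can bottleneck can be illustrated together. The sketch below is a toy model under stated assumptions: the class name `SMTRenamer`, the register names, and the tiny register counts are all hypothetical, and freeing of old mappings at commit is deliberately left out.

```python
class SMTRenamer:
    """Toy renamer: one rename table per thread, one shared physical register pool."""

    def __init__(self, num_threads, num_arch, num_physical):
        if num_threads * num_arch > num_physical:
            raise ValueError("not enough physical registers for every thread's architected state")
        ids = iter(range(num_physical))
        # Each thread's architected registers start out mapped to distinct physical registers.
        self.tables = [
            {f"r{a}": next(ids) for a in range(num_arch)} for _ in range(num_threads)
        ]
        # Whatever is left over is the shared pool used for renaming.
        self.free = list(ids)

    def rename(self, thread, dest, srcs):
        """Rename one instruction of `thread`.

        Returns (physical_dest, physical_srcs), or None when the shared pool
        is empty and the instruction would have to stall.
        """
        table = self.tables[thread]
        phys_srcs = [table[s] for s in srcs]   # sources use this thread's mapping only
        if not self.free:
            return None
        phys_dest = self.free.pop(0)
        table[dest] = phys_dest                # later readers in this thread see the new mapping
        return phys_dest, phys_srcs


if __name__ == "__main__":
    two = SMTRenamer(num_threads=2, num_arch=4, num_physical=12)
    # The same architectural instruction "r1 = r2 op r3" in both threads
    # ends up with disjoint physical registers.
    print(two.rename(0, "r1", ["r2", "r3"]))
    print(two.rename(1, "r1", ["r2", "r3"]))

    three = SMTRenamer(num_threads=3, num_arch=4, num_physical=12)
    # With a third context the architected state alone fills the register file,
    # so the very first rename has to stall.
    print(three.rename(0, "r1", ["r2", "r3"]))
```

Because each thread has its own table, identical architectural register names never collide in the datapath. Adding contexts shrinks the pool left over for renaming, which is one way a shared resource can become the bottleneck.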

Legend for the Power 5 data flow figure: LSU = load/store unit, FXU = fixed-point execution unit, FPU = floating-point unit, BXU = branch execution unit, and CRL = condition register logical execution unit.

Power 5 thread performance ...
- Relative priority of each thread is controllable in hardware
- For balanced operation, both threads run slower than if they "owned" the machine

Changes in Power 5 to support SMT
- Increased associativity of L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Initial Performance of SMT
- Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running on Pentium 4 with each of 26 SPEC benchmarks paired with every other (26*26 runs) gave speed-ups from 0.90 to 1.58; average was 1.20
- Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
- Power 5 running 2 "same" copies of each application gave speedups from 0.89 to 1.41, compared to 1.01 and 1.07 averages for Pentium 4
  - Most gained some
  - Floating-point applications had the most cache conflicts and the least gains

Head to Head ILP competition

| Processor | Microarchitecture | Fetch / Issue / Execute | Funct. Units | Clock Rate (GHz) | Transistors | Die size | Power |
|---|---|---|---|---|---|---|---|
| Intel Pentium 4 Extreme | Speculative dynamically scheduled; deeply pipelined; SMT | 3/3/4 | 7 int. 1 FP | 3.8 | 125 M | 122 mm2 | 115 W |
| AMD Athlon 64 FX-57 | Speculative dynamically scheduled | 3/3/4 | 6 int. 3 FP | 2.8 | 114 M | 115 mm2 | 104 W |
| IBM Power5 (1 CPU only) | Speculative dynamically scheduled; SMT; 2 CPU cores/chip | 8/4/8 | 6 int. 2 FP | 1.9 | 200 M | 300 mm2 (est.) | 80 W (est.) |
| Intel Itanium 2 | Statically scheduled VLIW-style | 6/5/11 | 9 int. 2 FP | 1.6 | 592 M | 423 mm2 | 130 W |

[Figure: Performance on SPECint2000]

[Figure: Performance on SPECfp2000]

Normalized Performance: Efficiency

| Rank | Itanium 2 | Pentium 4 | Athlon | Power5 |
|---|---|---|---|---|
| Int/Trans | 4 | 2 | 1 | 3 |
| FP/Trans | 4 | 2 | 1 | 3 |
| Int/area | 4 | 2 | 1 | 3 |
| FP/area | 4 | 2 | 1 | 3 |
| Int/Watt | 4 | 3 | 1 | 2 |
| FP/Watt | 2 | 4 | 3 | 1 |

No Silver Bullet for ILP
- No obvious over-all leader in performance
- The AMD Athlon leads on SPECInt performance, followed by the Pentium 4, Itanium 2, and Power5
- Itanium 2 and Power5, which perform similarly on SPECFP, clearly dominate the Athlon and Pentium 4 on SPECFP
- Itanium 2 is the most inefficient processor both for floating-point and integer code for all but one efficiency measure (SPECFP/Watt)
- Athlon and Pentium 4 both make good use of transistors and area in terms of efficiency
- IBM Power5 is the most effective user of energy on SPECFP and essentially tied on SPECINT

Limits to ILP
- Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
  - issue 3 or 4 data memory accesses per cycle,
  - resolve 2 or 3 branches per cycle,
  - rename and access more than 20 registers per cycle, and
  - fetch 12 to 24 instructions per cycle
- The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate

  - E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most electrical power!

Limits to ILP (continued)
- Most techniques for increasing performance increase power consumption
- The key question is whether a technique is energy efficient: does it increase performance faster than it increases power consumption?
- Multiple-issue processor techniques are all energy inefficient:
  - Issuing multiple instructions incurs some overhead in logic that grows faster (I^2) than the issue rate grows
  - Growing gap between peak issue rates and sustained performance
- Number of transistors switching = f(peak issue rate), and performance = f(sustained rate): growing gap between peak and sustained performance => increasing energy p…
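The energy-efficiency question posed above (does performance grow faster than power?) can be written down as a small check. This is a rough sketch under loudly stated assumptions: power is taken to be directly proportional to the peak issue rate and performance directly proportional to the sustained rate, which is only one possible choice for the unspecified functions f in the slides. The function names and the sample numbers are hypothetical and are not measurements of any processor.

```python
def is_energy_efficient(perf_gain, power_gain):
    """A technique is energy efficient if performance grows faster than power."""
    return perf_gain > power_gain


def relative_energy_per_instruction(peak_issue_rate, sustained_ipc):
    """Energy per completed instruction in arbitrary units.

    Assumption: power ~ peak issue rate (transistors switching) and
    throughput ~ sustained instructions per clock, both strictly linear.
    """
    if peak_issue_rate <= 0 or sustained_ipc <= 0:
        raise ValueError("rates must be positive")
    return peak_issue_rate / sustained_ipc


if __name__ == "__main__":
    # Hypothetical designs: (peak issue rate, sustained instructions per clock).
    narrow = (4, 2.0)
    wide = (8, 3.0)

    perf_gain = wide[1] / narrow[1]    # sustained rate grows 1.5x
    power_gain = wide[0] / narrow[0]   # peak rate, and so modeled power, grows 2x

    print("energy efficient:", is_energy_efficient(perf_gain, power_gain))
    print("narrow:", relative_energy_per_instruction(*narrow))
    print("wide:  ", relative_energy_per_instruction(*wide))
```

Under these assumptions, widening the machine raises the modeled energy per instruction whenever the sustained rate grows more slowly than the peak rate.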
