Computer Architecture: A Quantitative Approach
Chapter 4 (2): Instruction-Level Parallelism, Software Approaches
王奕 (Estelle.ywang@)

Lecture for ILP: Software Approaches

- Basic compiler technique for exposing ILP: loop unrolling
- Static branch prediction
- Static multiple issue: VLIW
- Advanced compiler support for exposing and exploiting ILP
  - Software pipelining
  - Global code scheduling
- Hardware support for exposing more parallelism at compile time
  - Conditional or predicated instructions
  - Compiler speculation with hardware support

FP Loop: Where Are the Hazards?

    Loop: LD    F0,0(R1)    ; F0 = vector element
          ADDD  F4,F0,F2    ; add scalar from F2
          SD    0(R1),F4    ; store result
          SUBI  R1,R1,8     ; decrement pointer by 8 bytes (one doubleword)
          BNEZ  R1,Loop     ; branch if R1 != zero
          NOP               ; delayed branch slot

Assumed latencies of the FP operations:

    Instruction producing result   Instruction using result   Latency in cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0
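
The latency table above can be turned into a small cycle counter. The following is a minimal sketch, not a model of any real pipeline: it assumes a single-issue, in-order machine, counts the delay-slot NOP as one cycle, and adds one latency that is not in the table (an integer op feeding a branch costs 1 stall cycle). The names `LATENCY`, `count_cycles`, `ORIGINAL_LOOP`, and `SCHEDULED_LOOP` are illustrative only.

```python
# Stall cycles required between a producer and a dependent consumer.
# All entries come from the latency table except ("int", "branch"),
# which is an assumption made for this sketch.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumed, not in the table
}


def count_cycles(instructions):
    """Return the issue cycle of the last instruction (single issue, in order).

    Each instruction is a tuple (kind, dest_register_or_None, source_registers).
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in instructions:
        earliest = cycle + 1  # at most one instruction issues per cycle
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, produced_at + 1 + gap)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


ORIGINAL_LOOP = [
    ("load", "F0", ["R1"]),         # LD   F0,0(R1)
    ("fpalu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("store", None, ["R1", "F4"]),  # SD   0(R1),F4
    ("int", "R1", ["R1"]),          # SUBI R1,R1,8
    ("branch", None, ["R1"]),       # BNEZ R1,Loop
    ("nop", None, []),              # delayed branch slot
]

# Same work, reordered so that SD fills the branch delay slot.
SCHEDULED_LOOP = [
    ("load", "F0", ["R1"]),         # LD   F0,0(R1)
    ("int", "R1", ["R1"]),          # SUBI R1,R1,8
    ("fpalu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("branch", None, ["R1"]),       # BNEZ R1,Loop
    ("store", None, ["R1", "F4"]),  # SD   8(R1),F4
]

print(count_cycles(ORIGINAL_LOOP))   # 10
print(count_cycles(SCHEDULED_LOOP))  # 6
```

Under these assumptions the loop as written takes 10 cycles per iteration, and the reordered version takes 6.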
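Where are the stalls?

Reducing Stalls by Scheduling Within the Basic Block and Filling the Delayed Branch Slot

Pipeline stages: F = fetch, D = decode, X = execute, M = memory, W = write-back, A1-A4 = FP add stages, s = stall.

Unscheduled loop:

    Loop: LD    F0,0(R1)     F D X M W
          ADDD  F4,F0,F2     F D s A1 A2 A3 A4 W
          SD    0(R1),F4     F s D s s X M W
          SUBI  R1,R1,#8     F s s D X M W
          BNEZ  R1,Loop      F s D X M W

    10 clock cycles per iteration

Scheduled loop, with SD moved into the branch delay slot:

    Loop: LD    F0,0(R1)     F D X M W
          SUBI  R1,R1,#8     F D X M W
          ADDD  F4,F0,F2     F D A1 A2 A3 A4 W
          BNEZ  R1,Loop      F D X M W
          SD    +8(R1),F4    F D s X M W

    6 clock cycles per iteration

Unroll Loop Four Times (straightforward way)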
Rewrite loop to minimize stalls?

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4       ; drop SUBI & BNEZ
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    -8(R1),F8      ; drop SUBI & BNEZ
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    -16(R1),F12    ; drop SUBI & BNEZ
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SUBI  R1,R1,#32      ; altered to 4*8
          SD    +8(R1),F16
          BNEZ  R1,Loop
          NOP

- Each ADDD waits 1 cycle for its LD, and each SD waits 2 cycles for its ADDD.
- 15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration.
- Assumes the iteration count is a multiple of 4 (so R1 is a multiple of 32).

Unrolled Loop That Minimizes Stalls

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    0(R1),F4
          SD    -8(R1),F8
          SUBI  R1,R1,#32
          SD    +16(R1),F12    ; 16 - 32 = -16
          BNEZ  R1,Loop
          SD    8(R1),F16      ; 8 - 32 = -24

- 14 clock cycles, or 3.5 per iteration.
- What assumptions were made when the code was moved?
  - OK to move a store past SUBI even though SUBI changes the register, provided the store offset is adjusted.
  - OK to move loads before stores: do they still get the right data?
- When is it safe for the compiler to make such changes?

Using Loop Unrolling and Scheduling with Static Multiple Issue

    Integer instruction          FP instruction         Clock cycle
    Loop: L.D    F0,0(R1)                               1
          L.D    F6,-8(R1)                              2
          L.D    F10,-16(R1)     ADD.D F4,F0,F2         3
          L.D    F14,-24(R1)     ADD.D F8,F6,F2         4
          L.D    F18,-32(R1)     ADD.D F12,F10,F2       5
          S.D    F4,0(R1)        ADD.D F16,F14,F2       6
          S.D    F8,-8(R1)       ADD.D F20,F18,F2       7
          S.D    F12,-16(R1)                            8
          DADDUI R1,R1,#-40                             9
          S.D    F16,16(R1)                             10
          BNE    R1,R2,Loop                             11
          S.D    F20,8(R1)                              12

Static Branch Prediction

- Static branch predictors are used in processors where branch behavior is expected to be highly predictable at compile time.
- Several different methods:
  - Always predict a branch as taken, or always as untaken.
  - Predict on the basis of branch direction:
    - backward-going branches are predicted taken;
    - forward-going branches are predicted not taken.
  - Profile-based prediction, using profile information collected from earlier runs.
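
The direction-based rule above can be written in a few lines. This is a minimal sketch under stated assumptions: branches are identified by their own address and their target address, and the trace is invented for illustration (it is not measured data). The names `predict_taken` and `accuracy` are hypothetical.

```python
def predict_taken(branch_pc, target_pc):
    """Direction-based static prediction: backward taken, forward not taken."""
    return target_pc < branch_pc


def accuracy(trace):
    """Fraction of (branch_pc, target_pc, actually_taken) records predicted correctly."""
    hits = sum(predict_taken(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)


# Hypothetical trace: a loop-closing branch at 0x40 that jumps back to 0x20
# nine times and then falls through, plus a forward branch at 0x30 that
# skips ahead to 0x38 on two of its ten executions.
trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
trace += [(0x30, 0x38, True)] * 2 + [(0x30, 0x38, False)] * 8

print(accuracy(trace))  # 0.85
```

On this made-up trace the rule is right 17 times out of 20. The misses are the loop exit and the two forward branches that happen to be taken, which is the kind of case profile-based prediction is meant to catch.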

Static Multiple Issue: VLIW

- VLIW: Very Long Instruction Word.
- Each “instruction” has explicit coding for multiple operations.
  - In EPIC, the grouping is called a “packet”.
  - In Transmeta, the grouping is called a “molecule” (with “atoms” as ops).
- Trades instruction space for simple decoding.
  - The long instruction word has room for many operations.
  - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch.
    - 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide.
  - Needs a compiling technique that schedules across several branches.
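
The two constraints on a long instruction word, slot limits and independence, can be checked mechanically. The sketch below is illustrative only: it uses the 7-slot example from the slide, assumes operations are listed in original program order, and treats a read of a register written by an earlier operation in the same word as a violation. `SLOTS`, `word_bits`, and `is_valid_bundle` are hypothetical names, not part of any real toolchain.

```python
# Slot template from the slide: 2 integer ops, 2 FP ops, 2 memory refs, 1 branch.
SLOTS = {"int": 2, "fp": 2, "mem": 2, "branch": 1}


def word_bits(bits_per_field):
    """Width of one long instruction word with one field per slot."""
    return sum(SLOTS.values()) * bits_per_field


def is_valid_bundle(ops):
    """Check slot limits and independence.

    ops is a list of (kind, dest_register_or_None, source_registers),
    given in original program order (an assumption of this sketch).
    """
    used = {}
    written = set()
    for kind, dest, sources in ops:
        used[kind] = used.get(kind, 0) + 1
        if used[kind] > SLOTS.get(kind, 0):
            return False  # more ops of this kind than the word has slots
        if any(src in written for src in sources):
            return False  # needs a result produced earlier in the same word
        if dest is not None:
            if dest in written:
                return False  # two ops in the word write the same register
            written.add(dest)
    return True


independent = [
    ("mem", "F0", ["R1"]),         # LD   F0,0(R1)
    ("mem", "F6", ["R1"]),         # LD   F6,-8(R1)
    ("fp", "F12", ["F10", "F2"]),  # ADDD F12,F10,F2
]
dependent = [
    ("mem", "F0", ["R1"]),         # LD   F0,0(R1)
    ("fp", "F4", ["F0", "F2"]),    # ADDD F4,F0,F2 needs the loaded F0
]

print(word_bits(16), word_bits(24))  # 112 168
print(is_valid_bundle(independent))  # True
print(is_valid_bundle(dependent))    # False
```

A store that reads R1 followed in the same word by a SUBI that writes R1 passes this check, because the store only needs the old value.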

Loop Unrolling in VLIW

    Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op / branch   Clock
    LD F0,0(R1)          LD F6,-8(R1)                                                                  1
    LD F10,-16(R1)       LD F14,-24(R1)                                                                2
    LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                         3
    LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                       4
                                              ADDD F20,F18,F2    ADDD F24,F22,F2                       5
    SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                          6
    SD -16(R1),F12       SD -24(R1),F16                                                                7
    SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#56     8
    SD 8(R1),F28                                                                    BNEZ R1,LOOP       9

- Unrolled 7 times to avoid delays.
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X).
- Average: 2.5 ops per clock, 50% efficiency.
- Note: needs more registers in VLIW (15 vs. 6 in SS).

Problems for VLIW

- Technical problems
  - Increase in code size
    - Loop unrolling
    - Unused function slots
  - Limitations of lockstep operation
    - A stall in any functional unit may cause the entire processor to stall.
- Logistical problem
  - Binary code compatibility
- Major challenge for all multiple-issue processors
  - Exploit large amounts of ILP

Advanced Compiler Support for Exploiting ILP

- Detecting and enhancing loop-level parallelism
- Eliminating dependent computations
- Software pipelining: symbolic loop unrolling
- Global code scheduling
  - Trace scheduling: focus on the critical path
  - Superblocks

Detecting and Enhancing Loop-Level Parallelism

- Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations.
- A loop is parallel if it can be written without a cycle in the dependences.
- An assumption: all array indices are affine. A one-dimensional array index is affine if it can be written in the form ai + b, where a and b are constants and i is the loop index.
- A dependence exists if two conditions hold:
  - There are two iteration indices, j and k, both within the limits of the loop.
  - The loop stores into an array element indexed by aj + b and later fetches from that same element when it is indexed by ck + d; that is, aj + b = ck + d.
- GCD (greatest common divisor) test: if a loop-carried dependence exists, then GCD(c, a) must divide (d - b).
  - For example, if a loop stores into X[2i + 3] and loads X[2i], then a = 2, b = 3, c = 2, d = 0; GCD(2, 2) = 2 does not divide d - b = -3, so no dependence is possible.

Eliminating Dependent Computations

Back substitution:

    Before:  DADDUI R1,R2,#4
             DADDUI R1,R1,#4
    After:   DADDUI R1,R2,#8

Tree height reduction, computing R8 = R2 + R3 + R6 + R7 (swap the positions of R1 and R7):

    Before:  ADD R1,R2,R3
             ADD R4,R1,R6
             ADD R8,R4,R7
    After:   ADD R1,R2,R3
             ADD R4,R6,R7
             ADD R8,R1,R4

Recurrences:

    Original:     SUM = SUM + X
    Unrolled:     SUM = SUM + X1 + X2 + X3 + X4 + X5
    Regrouped:    SUM = ((SUM + X1) + (X2 + X3)) + (X4 + X5)

Software Pipelining

- Observation: if iterations from loops are independent, then more ILP can be obtained by taking instructions from different iterations.
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (roughly Tomasulo's algorithm in software), giving a multiple-issue processor a continuous stream of instructions.

Software Pipelining Example

Before: unrolled 3 times

    LD    F0,0(R1)
    ADDD  F4,F0,F2
    SD    0(R1),F4
    LD    F6,-8(R1)
    ADDD  F8,F6,F2
    SD    -8(R1),F8
    LD    F10,-16(R1)
    ADDD  F12,F10,F2
    SD    -16(R1),F12
    SUBI  R1,R1,#24
    BNEZ  R1,LOOP

After: software pipelined

    SD    0(R1),F4      ; stores M[i]
    ADDD  F4,F0,F2      ; adds to M[i-1]
    LD    F0,-16(R1)    ; loads M[i-2]
    SUBI  R1,R1,#8
    BNEZ  R1,LOOP

- Symbolic loop unrolling
  - Maximizes result-use distance
  - Less code space than unrolling
  - Fills and drains the pipe only once per loop, vs. once per unrolled iteration in loop unrolling
- 5 cycles per iteration

[Figure: overlapped ops over time, software-pipelined loop vs. unrolled loop]

Trace Scheduling (for VLIW)

- Parallelism across IF branches vs. LOOP branches
- Two steps:
  - Trace selection: find a likely sequence of basic blocks (the trace), statically predicted or profile predicted, that forms a long sequence of straight-line code.
  - Trace compaction: squeeze the trace into a few VLIW instructions. Bookkeeping code is needed in case the prediction is wrong.
- This is a form of compiler-generated speculation.
  - The compiler must generate “fixup” code to handle cases in which the trace is not the taken branch.
  - Needs extra registers: undoes a bad guess by discarding.

Example of Trace Scheduling

[Figure: original code and the code after trace scheduling]

Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation

- HW advantages:
  - HW is better at memory disambiguation since it knows actual addresses.
  - HW is better at branch prediction since it has lower overhead.
  - HW maintains a precise exception model.
  - HW does not execute bookkeeping instructions.
  - The same software works across multiple implementations.
  - Smaller code size (not as many no-ops filling blank instructions).
- SW advantages:
  - The window of instructions examined for parallelism is much larger.
  - Much less hardware is involved in VLIW (unless you are Intel…!).
  - More involved types of speculation can be done more easily.
  - Speculation can be based on large-scale program behavior, not just local information.

Superscalar v. VLIW

- Superscalar:
  - Smaller code size
  - Binary compatibility across generations of hardware
- VLIW:
  - Simplified hardware for decoding and issuing instructions
  - No interlock hardware (compiler checks?)
  - More registers, but simplified hardware for register ports (multiple independent register files?)

Hardware Support for Exploiting ILP at Compile Time: Conditional (Predicated) Instructions

A conditional instruction refers to a condition that is evaluated as part of the instruction execution.

Example: if (A == 0) { S = T; }, with A, S, and T held in R1, R2, and R3.

    With a branch:
            BNEZ  R1,L
            ADDU  R2,R3,R0
        L:  ...

    With a conditional move:
            CMOVZ R2,R3,R1

The CPU always executes the instruction but writes the result only if the condition is met.
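
The equivalence of the two code sequences can be checked with a small emulation. This is only a sketch: registers are modeled as plain Python integers, and the function names `with_branch` and `with_conditional_move` are invented for illustration.

```python
def with_branch(r1, r2, r3):
    """BNEZ R1,L / ADDU R2,R3,R0 / L: returns the final value of R2."""
    if r1 != 0:      # BNEZ R1,L: branch taken, the ADDU is skipped
        return r2
    return r3 + 0    # ADDU R2,R3,R0 (R0 is always zero)


def with_conditional_move(r1, r2, r3):
    """Conditional move: always executed, R2 written only if R1 == 0."""
    candidate = r3            # the move is evaluated unconditionally
    write_enable = (r1 == 0)  # condition checked as part of execution
    return candidate if write_enable else r2


for a in (0, 1, -5):
    assert with_branch(a, 10, 20) == with_conditional_move(a, 10, 20)

print(with_conditional_move(0, 10, 20), with_conditional_move(1, 10, 20))  # 20 10
```

The second function has no control flow that depends on a branch outcome; the condition only selects whether the result is written.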
Converting a conditional branch into a conditional instruction changes a control dependence into a data dependence.

Conditional Instructions

- The execution of every instruction is controlled by a predicate.
- When the predicate is false, the instruction becomes a no-op.
- Simply convert small blocks of code that are branch dependent.
- Eliminate non-loop branches.
- Can be used to speculatively move an instruction that is time critical.

Compiler Speculation with Hardware Support

- Move speculated instructions not only before the branch, but before the condition evaluation.
- Four methods for supporting ambitious speculation:
  - Hardware and OS cooperatively ignore exceptions for speculative instructions.
  - Speculative instructions that never raise exceptions are used.
  - Poison bits are attached to the result registers written by speculative instructions.
  - A mechanism is provided to indicate that an instruction is speculative; the hardware buffers the result until the instruction is no longer speculative.
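
The poison-bit method can be illustrated with a toy register file. This is a sketch under stated assumptions, not a description of any particular processor: memory is a Python dict, an access to a missing address stands in for a faulting load, and `RegisterFile`, `speculative_load`, `use`, and `PoisonedRegisterUse` are hypothetical names.

```python
class PoisonedRegisterUse(Exception):
    """Raised when a non-speculative instruction reads a poisoned register."""


class RegisterFile:
    def __init__(self):
        self.values = {}
        self.poison = set()  # registers whose poison bit is set

    def speculative_load(self, dest, memory, address):
        """Speculative load: a faulting access poisons dest instead of trapping."""
        if address in memory:
            self.values[dest] = memory[address]
            self.poison.discard(dest)
        else:
            self.values[dest] = None
            self.poison.add(dest)  # the exception is deferred, not raised

    def use(self, source):
        """Non-speculative read: a deferred exception surfaces here."""
        if source in self.poison:
            raise PoisonedRegisterUse(source)
        return self.values[source]


memory = {100: 7}
regs = RegisterFile()

regs.speculative_load("R1", memory, 100)  # valid address
print(regs.use("R1"))                     # 7

regs.speculative_load("R2", memory, 999)  # would fault; R2 is poisoned instead
try:
    regs.use("R2")
except PoisonedRegisterUse as exc:
    print("deferred exception on", exc)
```

If the branch resolves so that R2 is never read, the poisoned value is simply overwritten later and no exception is ever seen, which is the point of deferring it.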