Parallel Computer Architecture

Lecture 13, May 18, 2009
Wujunmin (jmwu@)

Overview
  Review of Lec 11
  Synchronization in SMP
  MPP
  Introduction to current high-performance computers
  The future of high-performance computers

Preliminary Design Issues
  Design of cache controller and tags
    Both processor and bus need to look up
  How and when to present snoop results on bus
  Dealing with write-backs
  Overall set of actions for memory operation not atomic
    Can introduce race conditions
  Atomic operations
  New issues: deadlock, livelock, starvation, serialization, etc.

Contention for Cache Tags
  Cache controller must monitor bus and processor
    Can view as two controllers: bus-side and processor-side
    With single-level cache: dual tags (not data) or dual-ported tag RAM
      must reconcile when updated, but usually only looked up
    Respond to bus transactions

Reporting Snoop Results: How?
  Collective response from caches must appear on bus
  Example: in MESI protocol, need to know
    Is block dirty; i.e. should memory respond or not?
    Is block shared; i.e. transition to E or S state on read miss?
  Three wired-OR signals
    Shared: asserted if any cache has a copy
    Dirty: asserted if some cache has a dirty copy
      needn't know which, since it will do what's necessary
    Snoop-valid: asserted when OK to check other two signals
      actually inhibit until OK to check
  Illinois MESI requires priority scheme for cache-to-cache transfers
    Which cache should supply data when in shared state?
    Commercial implementations allow memory to provide data

Reporting Snoop Results: When?
  Memory needs to know what, if anything, to do
  Fixed number of clocks from address appearing on bus
    Dual tags required to reduce contention with processor
    Still must be conservative (update both on write: E -> M)
    Pentium Pro, HP servers, Sun Enterprise
  Variable delay
    Memory assumes cache will supply data till all say "sorry"
    Less conservative, more flexible, more complex
    Memory can fetch data and hold just in case (SGI Challenge)
  Immediately: bit-per-block in memory
    Extra hardware complexity in commodity main memory system

Basic design

Non-Atomic State Transitions
  Memory operation involves many actions by many entities, incl. bus
    Look up cache tags, bus arbitration, actions by other controllers, ...
    Even if bus is atomic, overall set of actions is not
    Can have race conditions among components of different operations
  Suppose P1 and P2 attempt to write cached block A simultaneously
    Each decides to issue BusUpgr to allow S -> M
  Issues
    Must handle requests for other blocks while waiting to acquire bus
    Must handle requests for this block A
      e.g. if P2 wins, P1 must invalidate copy and modify request to BusRdX

Handling Non-atomicity: Transient States
  Increases complexity
    e.g. don't use BusUpgr, rather other mechanisms to avoid data transfer
  Two types of states
    Stable (e.g. MESI)
    Transient or Intermediate

Multilevel Cache Hierarchies
  Independent snoop hardware for each level?
    processor pins for shared bus
    contention for processor cache access?
  Snoop only at L2 and propagate relevant transactions
  Inclusion property
    (1) contents of L1 is a subset of L2
    (2) any block in modified state in L1 is in modified state in L2
    1 => all transactions relevant to L1 are relevant to L2
    2 => on BusRd, L2 can wave off memory access and inform L1
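The two inclusion conditions can be read as a predicate over the contents of L1 and L2. The following is a minimal sketch of that check, not part of the lecture: the `cache_block_t` record, its fields and the function names are hypothetical, and a real cache would index by set rather than scan linearly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block record; field names are illustrative only. */
typedef struct {
    unsigned long tag;   /* block address */
    bool valid;
    bool modified;
} cache_block_t;

/* Return the valid block with this tag, or NULL if the cache does not hold it. */
static const cache_block_t *find_block(const cache_block_t *cache, size_t n,
                                       unsigned long tag) {
    for (size_t i = 0; i < n; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return &cache[i];
    return NULL;
}

/* True when (1) every valid L1 block is present in L2 and
 * (2) every modified L1 block is also marked modified in L2. */
bool inclusion_holds(const cache_block_t *l1, size_t n1,
                     const cache_block_t *l2, size_t n2) {
    for (size_t i = 0; i < n1; i++) {
        if (!l1[i].valid)
            continue;
        const cache_block_t *b = find_block(l2, n2, l1[i].tag);
        if (b == NULL)
            return false;                 /* violates condition (1) */
        if (l1[i].modified && !b->modified)
            return false;                 /* violates condition (2) */
    }
    return true;
}
```

A checker like this is useful only in a simulator or a test bench; in hardware the property is maintained by construction rather than verified after the fact.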

  [Figure: processor chip containing P, L1 and L2; snooping is drawn at L2 and marked as an open question at L1.
   L1: associativity a1, block size b1, number of sets n1, capacity S1 = a1 * b1 * n1
   L2: associativity a2, block size b2, number of sets n2]

Maintaining Inclusion
  The two caches (L1, L2) may choose to replace different blocks
    Differences in reference history
      set-associative first-level cache with LRU replacement
      example: blocks m1, m2, m3 fall in same set of L1 cache...
    Split higher-level caches
      instruction, data blocks go in different caches at L1, but may collide in L2
      what if L2 is set-associative?
    Differences in block size
  But a common case works automatically
    L1 direct-mapped, fewer sets than in L2, and block size same

Preserving Inclusion Explicitly
  Propagate lower-level (L2) replacements to higher-level (L1)
    Invalidate or flush (if dirty) messages
  Propagate bus transactions from L2 to L1
    Propagate all L2 transactions?
    use inclusion bits?
  Propagate modified state from L1 to L2 on writes?
    if L1 is write-through, just invalidate
    if L1 is write-back
      add extra state to L2 (dirty-but-stale)
      request flush from L1 on BusRd

Role of Synchronization
  "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
  Types of Synchronization
    Mutual exclusion
    Event synchronization
      point-to-point
      group
      global (barriers)
  How much hardware support?
    high-level operations?
    atomic instructions?
    specialized interconnect?

Mini-Instruction Set Debate
  atomic read-modify-write instructions
    IBM 370: included atomic compare&swap for multiprogramming
    x86: read-modify-write instructions on memory can be prefixed with a lock modifier
    High-level language advocates want hardware locks/barriers
      but it goes against the "RISC" flow, and has other problems
    SPARC: atomic register-memory ops (swap, compare&swap)
    MIPS, IBM Power: no atomic operations but pair of instructions
      load-locked, store-conditional
      later used by PowerPC and DEC Alpha too
  Rich set of tradeoffs

Other Forms of Hardware Support
  Separate lock lines on the bus
  Lock locations in memory
  Lock registers (Cray X-MP)
  Hardware full/empty bits (Tera)
  Bus support for interrupt dispatch

Components of a Synchronization Event
  Acquire method
    Acquire right to the synch
      enter critical section, go past event
  Waiting algorithm
    Wait for synch to become available when it isn't
      busy-waiting, blocking, or hybrid
  Release method
    Enable other processors to acquire right to the synch
  Waiting algorithm is independent of type of synchronization
    makes no sense to put in hardware

Strawman Lock
lock:   ld   register, location   /* copy location to register */
        cmp  location, #0         /* compare with 0 */
        bnz  lock                 /* if not 0, try again */
        st   location, #1         /* store 1 to mark it locked */
        ret                       /* return control to caller */
unlock: st   location, #0         /* write 0 to location */
        ret                       /* return control to caller */
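One way to see the weakness of the strawman acquire is to replay a single bad interleaving by hand. The sketch below is an illustration under stated assumptions, not code from the lecture: `cpu_t` and the helper names are hypothetical, and the two "processors" are simulated sequentially so that both loads happen before either store.

```c
#include <stdbool.h>

/* One simulated processor; `reg` stands for the register in the listing. */
typedef struct {
    int reg;
    bool acquired;   /* set once this processor believes it holds the lock */
} cpu_t;

/* ld register, location */
static void do_load(cpu_t *cpu, const int *location) {
    cpu->reg = *location;
}

/* cmp / bnz / st: proceed only if the value loaded earlier was 0 */
static void do_test_then_store(cpu_t *cpu, int *location) {
    if (cpu->reg == 0) {
        *location = 1;
        cpu->acquired = true;
    }
}

/* Replay the interleaving load(P1), load(P2), store(P1), store(P2).
 * Returns true if both processors end up believing they own the lock. */
bool strawman_allows_two_owners(void) {
    int location = 0;                 /* lock starts free */
    cpu_t p1 = {0, false};
    cpu_t p2 = {0, false};
    do_load(&p1, &location);
    do_load(&p2, &location);          /* P2 still sees 0 */
    do_test_then_store(&p1, &location);
    do_test_then_store(&p2, &location);
    return p1.acquired && p2.acquired;
}
```

Because the read and the write are separate steps, nothing stops a second processor from reading the old value in between; the test and the set have to happen as one indivisible operation.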

Busy-Wait
  Why doesn't the acquire method work? Release method?

Atomic Instructions
  Specifies a location, register, & atomic operation
    Value in location read into a register
    Another value (function of value read or not) stored into location
  Many variants
    Varying degrees of flexibility in second part
  Simple example: test&set
    Value in location read into a specified register
    Constant 1 stored into location
    Successful if value loaded into register is 0
    Other constants could be used instead of 1 and 0

Simple Test&Set Lock
lock:   t&s  register, location
        bnz  lock                 /* if not 0, try again */
        ret                       /* return control to caller */
unlock: st   location, #0         /* write 0 to location */
        ret                       /* return control to caller */

  Other read-modify-write primitives
    Swap
    Fetch&op
    Compare&swap
      Three operands: location, register to compare with, register to swap with
      Not commonly supported by RISC instruction sets
  cacheable or uncacheable

Performance Criteria for Synch. Ops
  Latency (time per op)
    especially when light contention
  Bandwidth (ops per sec)
    especially under high contention
  Traffic
    load on critical resources
    especially on failures under contention
  Storage
  Fairness

Enhancements to Simple Lock
  Reduce frequency of issuing test&sets while waiting
    Test&set lock with backoff
    Don't back off too much or will be backed off when lock becomes free
    Exponential backoff works quite well empirically: i-th time = k * c^i
  Busy-wait with read operations rather than test&set
    Test-and-test&set lock
    Keep testing with ordinary load
      cached lock variable will be invalidated when release occurs
    When value changes (to 0), try to obtain lock with test&set
      only one attemptor will succeed; others will fail and start testing again

Improved Hardware Primitives: LL-SC
  Goals:
    Test with reads
    Failed read-modify-write attempts don't generate invalidations
    Nice if single primitive can implement range of r-m-w operations
  Load-Locked (or -linked), Store-Conditional
    LL reads variable into register
    Follow with arbitrary instructions to manipulate its value
    SC tries to store back to location
      succeed if and only if no other write to the variable since this processor's LL
        indicated by condition codes
  If SC succeeds, all three steps happened atomically
  If fails, doesn't write or generate invalidations
    must retry acquire

Simple Lock with LL-SC
lock:   ll   reg1, location       /* LL location to reg1 */
        bnz  reg1, lock           /* other operations */
        sc   location, reg2       /* SC reg2 into location */
        beqz lock                 /* if failed, start again */
        ret
unlock: st   location, #0         /* write 0 to location */
        ret

  Can do more fancy atomic ops by changing what's between LL & SC
    But keep it small so SC likely to succeed
    Don't include instructions that would need to be undone (e.g. stores)
  SC can fail (without putting transaction on bus) if:
    Detects intervening write even before trying to get bus
    Tries to get bus but another processor's SC gets bus first
  LL, SC are not lock, unlock respectively
    Only guarantee no conflicting write to lock variable between them
    But can use directly to implement simple operations on shared variables

Implementing LL-SC
  Lock flag and lock address register at each processor
  LL reads block, sets lock flag, puts block address in register
  Incoming invalidations checked against address: if match, reset flag
    Also if block is replaced and at context switches
  SC checks lock flag as indicator of intervening conflicting write
    If reset, fail; if not, succeed
  Livelock considerations
    Don't allow replacement of lock variable between LL and SC
      split or set-assoc. cache, and don't allow memory accesses between LL, SC
      (also don't allow reordering of accesses across LL or SC)
    Don't allow failing SC to generate invalidations (not an ordinary write)
  Performance: both LL and SC can miss in cache
    Prefetch block in exclusive state at LL
    But exclusive request reintroduces livelock possibility: use backoff

Trade-offs So Far
  Latency?
  Bandwidth?
  Traffic?
  Storage?
  Fairness?
  What happens when several processors spinning on lock and it is released?
    traffic per P lock operations?

Ticket Lock
  Only one r-m-w per acquire
  Two counters per lock (next_ticket, now_serving)
    Acquire: fetch&inc next_ticket;
      wait for now_serving == next_ticket (the value the fetch&inc returned)
        atomic op when arrive at lock, not when it's free (so less contention)
    Release: increment now_serving
  Performance
    low latency for low contention, if fetch&inc cacheable
    O(p) read misses at release, since all spin on same variable
    FIFO order
      like simple LL-SC lock, but no inval when SC succeeds, and fair
    Backoff?
  Wouldn't it be nice to poll different locations...

Array-based Queuing Locks
  Waiting processes poll on different locations in an array of size p
    Acquire
      fetch&inc to obtain address on which to spin (next array element)
      ensure that these addresses are in different cache lines or memories
    Release
      set next location in array, thus waking up process spinning on it
    O(1) traffic per acquire with coherent caches
    FIFO ordering, as in ticket lock, but O(p) space per lock
    Not so great for non-cache-coherent machines with distributed memory
      array location I spin on not necessarily in my local memory (solution later)

Lock Performance on SGI Challenge
  Loop: lock; delay(c); unlock; delay(d);
  [Figure: time (μs) versus number of processors for five locks: array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff. Panels: (a) Null (c = 0, d = 0); (b) Critical-section (c = 3.64 μs, d = 0); (c) Delay (c = 3.64 μs, d = 1.29 μs).]

Point to Point Event Synchronization
  Software methods:
    Interrupts
    Busy-waiting: use ordinary variables as flags
    Blocking: use semaphores
  Full hardware support: full-empty bit with each word in memory
    Set when word is "full" with newly produced data (i.e. when written)
    Unset when word is "empty" due to being consumed (i.e. when read)
    Natural for word-level producer-consumer synchronization
      producer: write if empty, set to full; consumer: read if full, set to empty
    Hardware preserves atomicity of bit manipulation with read or write
  Problem: flexibility
    multiple consumers, or multiple writes before consumer reads?
    needs language support to specify when to use
    composite data structures?

Barriers
  Software algorithms implemented using locks, flags, counters
  Hardware barriers
    Wired-AND line separate from address/data bus
      Set input high when arrive, wait for output to be high to leave
    In practice, multiple wires to allow reuse
    Useful when barriers are global and very frequent
    Difficult to support arbitrary subset of processors
      even harder with multiple processes per processor
    Difficult to dynamically change number and identity of participants
      e.g. latter due to process migration
    Not common today on bus-based machines

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag;                      /* initially 0 */
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;         /* reset flag if first to reach */

    mycount = ++bar_name.counter;  /* mycount is private; includes this arrival */
    UNLOCK(bar_name.lock);
    if (mycount == p) {            /* last to arrive */
        bar_name.counter = 0;      /* reset for next barrier */
        bar_name.flag = 1;         /* release waiters */
    }
    else while (bar_name.flag == 0) {};   /* busy wait for release */
}

A Simple Centralized Barrier
  Shared counter maintains number of processes that have arrived
    increment when arrive (lock), check until reaches numprocs
  Problem?

A Working Centralized Barrier
  Consecutively entering the same barrier doesn't work
    Must prevent process from entering until all have left previous instance
    Could use another counter, but increases latency and contention
  Sense reversal: wait for flag to take different value consecutive times
    Toggle this value only when all processes reach

BARRIER (bar_name, p) {
    local_sense = !(local_sense);  /* toggle private sense variable */
    LOCK(bar_name.lock);

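    mycount = bar_name.counter++;  /* mycount is private */
    if (bar_name.counter == p) {   /* last to arrive */
        bar_name.counter = 0;      /* reset for next barrier */
        UNLOCK(bar_name.lock);
        bar_name.flag = local_sense;   /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}

Centralized Barrier Performance
  Latency
    Centralized has critical path length at least proportional to p
  Traffic
    About 3p bus transactions
  Storage cost
    Very low: centralized counter and flag
  Fairness
    Same processor should not always be last to exit barrier
    No such bias in centralized
  Key problems for centralized barrier are latency and traffic
    Especially with distributed memory, traffic goes to same node

Barrier Performance on SGI Challenge
  Centralized does quite well
    Will discuss fancier barrier algorithms for distributed machines
  Helpful hardware support: piggybacking of read misses on bus
    Also for spinning on highly contended locks

Synchronization Summary
  Rich interaction of hardware-software tradeoffs
  Must evaluate hardware primitives and software algorithms together
    primitives determine which algorithms perform well
  Evaluation methodology is challenging
    Use of delays, microbenchmarks
    Should use both microbenchmarks and real workloads
  Simple software algorithms with common hardware primitives do well on bus
    Will see more sophisticated techniques for distributed machines
    Hardware support still subject of debate
  Theoretical research argues for swap or compare&swap, not fetch&op
    Algorithms that ensure constant-time access, but complex

Chapter 5: Massively Parallel Processor (MPP) Systems

MPP Overview
  A massively parallel processor (MPP) usually refers to a large-scale parallel computer system with the following characteristics:
    nodes use commodity microprocessors, with one or more microprocessors per node;
    memory is physically distributed across the nodes;
    an interconnection network with high communication bandwidth and low latency couples the nodes tightly;
    the system can scale to hundreds or thousands of processors;
    it is an asynchronous multiple-instruction multiple-data (MIMD) machine
  Intel Paragon, IBM SP2, Intel TFLOPS and China's Dawning-1000 are all MPPs
  Two implementation approaches
    NCC-NUMA architecture: Cray T3E
    NORMA architecture: Intel/Sandia ASCI Option Red
  The boundary with clusters is blurry
    The differences are shrinking
    The key difference lies in communication between nodes
  [Figure: MPP structure diagram]

MPP Characteristics
  Scalability:
    architecture with physically distributed main memory
    balanced processing and memory capability
    balanced computation and parallel-interaction capability
  System cost:
    use existing commodity CMOS microprocessors
      physical address space not large enough
      TLB not large enough
      non-blocking caches
      exception handling and bounds protection
    use a stable architecture to support scalability across generations: the shell architecture
    use an architecture with physically distributed main memory
    use SMP nodes

MPP Characteristics (cont'd)
  Generality and usability:
    support the general asynchronous MIMD mode;
    support popular standard programming models such as message passing (PVM, MPI) and data parallel (HPF);
    nodes are assigned to several "pools" that serve different jobs;
    the internal interconnect topology is transparent to users, who see only a fully connected set of nodes;
    support a single system image (SSI): tightly coupled MPPs usually run a distributed operating system and provide a single system image at the hardware and OS levels;
    high-availability techniques must be used
  Main memory and I/O performance
    very large total main memory and disk capacity
    commercial MPPs pay particular attention to high-speed I/O systems
    provide a scalable I/O subsystem

Comparison of MPPs
  MPP model | Intel/Sandia ASCI Option Red | IBM SP2 | SGI/Cray Origin 2000
  Configuration of one large installation | 9072 processors, 1.8 Tflop/s (NSL) | 400 processors, 100 Gflop/s (MHPCC) | 128 processors, 51 Gflop/s (NCSA)
  Date introduced | December 1996 | September 1994 | October 1996
  Processor type | 200 MHz, 200 Mflop/s Pentium Pro | 67 MHz, 267 Mflop/s POWER2 | 200 MHz, 400 Mflop/s MIPS R10000
  Node architecture and data storage | 2 processors, 32 to 256 MB main memory, shared disk | 1 processor, 64 MB to 2 GB local main memory, 1 GB to 14.5 GB local disk | 2 processors, 64 MB to 256 MB distributed shared main memory and shared disk
  Interconnection network and memory model | split 2-D mesh, NORMA | multistage network, NORMA | fat hypercube network, CC-NUMA
  Node operating system | light-weight kernel (LWK) | full AIX (IBM UNIX) | microkernel Cellular IRIX
  Native programming mechanism | MPI based on PUMA Portals | MPI and PVM | Power C, Power Fortran
  Other programming models | Nx, PVM, HPF | HPF, Linda | MPI, PVM

Main Problems Facing MPP Systems
  Poor delivered performance: the performance actually available from an MPP is usually far below its peak performance;
  Programmability: parallel programs are hard to develop, automatic conversion of sequential programs into parallel ones works poorly, and porting parallel programs efficiently between platforms is also difficult.
  High power consumption, with demanding cooling and ventilation requirements
  Large floor space

Case Study 1: Cray T3E Architecture
  Performance characteristics
    A multiprocessor with distributed shared main memory (NCC-NUMA).
    Multiple processing elements (PEs) are interconnected by a 3-D bidirectional torus
    GigaRing channels provide the connections to I/O devices
  Architectural attributes of the T3E. The T3E-900 is an enhanced T3E released at the end of 1996.
  Attribute | T3E | T3E-900
  Processor clock frequency (MHz) | 300 | 450
  Peak processor speed (Mflops) | 600 | 900
  Number of processors | 6~2048 | 6~2048
  Peak system speed (Gflops) | 3.6~1228 | 5.4~1843
  Physical main memory capacity (GB) | 1~4096 | 1~4096
  Total peak main memory bandwidth (GB/s) | 7.2~2450 | 7.2~2450
  Maximum number of I/O channels | 1~128 | 1~128
  Total peak I/O bandwidth (GB/s) | 1~128 | 1~128
  Peak 3-D torus link bandwidth (MB/s) | 600 | 600

ASCI/MPP Systems
  ASCI (Accelerated Strategic Computing Initiative): DOE, 1994
    a ten-year, one-billion-dollar program
    to build Tflop/s supercomputer systems
  Advanced Simulation and Computing Program
    Lawrence Livermore, Los Alamos, and Sandia national laboratories
    Shift from test-based confidence to simulation-based confidence.
    Computer manufacturers: Intel, IBM, SGI/Cray, HP
    Five universities: CalTech / Stanford / University of Chicago / University of Illinois at Urbana-Champaign / University of Utah
  The Los Alamos ASCI Q, dedicated in May 2002
  Rank | Manufacturer | Computer / processors | Rmax / Rpeak | Installation site | Country / year
       | Hewlett-Packard | ASCI Q - AlphaServer SC ES45/1.25 GHz / 4096 | 7727.00 / 10240.00 | Los Alamos National Laboratory | USA / 2002
  3    | Hewlett-Packard | ASCI Q - AlphaServer SC ES45/1.25 GHz / 4096 | 7727.00 / 10240.00 | Los Alamos National Laboratory | USA / 2002
  4    | IBM | ASCI White, SP Power3 375 MHz / 8192 | 7226.00 / 12288.00 | Lawrence Livermore National Laboratory | USA / 2000
  [Photos: ASCI Red, ASCI Blue Pacific, ASCI Blue Mountain]

ASCI Scalable Design Strategy
  Accelerated development
    a 1 Tflop/s system in 1996, a 10 to 30 Tflop/s system in 2000 and a 100 Tflop/s system in 2004, all at similar cost.
    The target is not only peak speed: total sustained application performance is to reach 10^5 times the 1994 level
  Balanced scalable design
    focus on high-end platforms for scientific computing applications rather than on mass-market platforms and popular applications;
    use as many commercial off-the-shelf (COTS) hardware and software components as possible, and concentrate on developing key technologies that mainstream computer companies do not effectively provide;
    use a massively parallel architecture, with emphasis on scaling and integration technologies that bring thousands of COTS nodes into one efficient platform with a single system image
  [Figure: ASCI platform performance roadmap]

Balanced Design Strategy
  End-to-end performance
  Balanced scalable hardware
    One balanced design rule: 1 Gflop/s of peak speed should be matched with 1 GB of main memory, 50 GB of disk, 10 TB of archival storage, 16 GB/s of cache bandwidth, 3 GB/s of main memory bandwidth, 0.1 GB/s of I/O disk bandwidth and 1 MB/s of archival storage bandwidth;
  Balanced scalable software
    ASCI expects new software development to improve performance by a factor of 10 to 100
  Attribute | 1996 | 1997 | 1998 | 2003
  Application performance (multiple) | 1 | 1000 | 100,000
  Peak computing speed (Gflops) | 100 | 1000 | 10,000 | 100,000
  Main memory capacity (TB) | 0.05 | 0.5 | 5 | 50
  Disk capacity (TB) | 0.1~1 | 1~10 | 10~100 | 100~1000
  Archival storage capacity (PB) | 0.13 | 1.3 | 13 | 130
  I/O speed (GB/s) | 5 | 50 | 500 | 5000
  Network speed (GB/s) | 0.13 | 1.3 | 13 | 130

Hardware Requirements
  The requirements on the processors, memory architecture and I/O subsystem of ASCI supercomputers are all specified in detail. For example, ASCI's memory requirements are listed in the following table.
  Memory level | Effective latency (CPU cycles) | Read/write bandwidth* | Capacity**
  On-chip cache, L1 | 2~3 | 16~32 B/cycle | 10^-4 B/flop/s
  Off-chip cache, L2 | 5~6 | 16 B/cycle | 10^-2 B/flop/s
  Local main memory | 30~80 (15~30) | 2~8 B/flop peak (2~8 B/flop sustained) | 1 B/flop/s
  Nearby node | 300~500 (30~50) | 1~8 B/flop (8 B/flop) | 1 B/flop/s
  Distant node | 1000 (100~200) | 1 B/flop | 1 B/flop/s
  I/O speed (main memory to disk) | 10 ms | 0.01~0.1 B/flop | 10~100 B/flop/s
  Archive (disk to tape) | seconds | 0.001 B/flop (0.01~0.1 B/flop) | 100 B/flop/s (10^4 B/flop/s)
  User access time | 0.1 s (1/60 s) | OC3/desktop (OC12~48/desktop) | 100 users
  Multiple sites | 0.1 s | unknown | unknown
  Note: bold entries mark requirements that industry could not meet in 1997; entries in regular type are the opposite. Most requirements are to be met in 1998; those in parentheses are scheduled for 2000.
  * Bandwidth per unit of workload or per CPU clock.
  ** Capacity per unit of speed (flop/s).

Software Requirements
  The software industry lags far behind the requirements.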

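  ASCI specifies its software requirements in detail:
    Human/machine interface: visualization and Internet technologies;
    Application environment: mathematical algorithms, mesh generation, domain decomposition and scientific data management;
    Programming environment: programming models, libraries, compilers, debuggers, performance tools and object technologies;
    Distributed operating software: I/O, file and storage systems, reliability, communication, system management, distributed resource management;
    Diagnostic performance monitors: system health status and monitoring
  Software requirement | Security | Scalability | Functionality | Portability
  Human/machine interface | ↑Δ | ↓Δ | visualization ↓Δ, Internet ↑Δ | ↑●
  Application environment | ↑● | ↓Δ | ↓Δ | ↑Δ
  Programming environment | ↓Δ | ↓Δ | ↓Δ | ↓Δ
  Distributed operating software | ↓Δ | ↓Δ | ↓Δ | ↓Δ
  Diagnostic performance monitors | ↑● | ↓Δ | ↑● | ↓●
  Note: ↑ means industry can meet the requirement; ↓ means industry cannot meet it; Δ means the requirement rises over time; ● means the requirement stays unchanged.

Contracted ASCI/MPP Platforms
  MPP systems including Option Red, Blue Pacific, Blue Mountain, Option White and ASCI Q have been installed at the 3 national laboratories
  Intel Option Red
    a typical MPP system
  SGI Blue Mountain
    The system is a cluster of 48 nodes, each of which is an Origin 2000 CC-NUMA system with 128 processors. The interconnect inside a node is a fat hypercube. The 48 Origin 2000 systems are linked into a cluster by 4-megabit HiPPI-800 switches, with a bidirectional peak bandwidth of 1.6 Gb/s per link
  The 2 IBM systems are both high-end SP systems
  HP ASCI Q
  ASCI Red Storm, ASCI Purple, IBM BlueGene/L/P

Comparison of Four ASCI Systems
  Attribute | Option Red | Option Blue (Blue Pacific) | Option Blue (Blue Mountain) | Option White
  Manufacturer | Intel | IBM | SGI | IBM
  Installation site | Sandia | Livermore | Los Alamos | Livermore
  Completion date | June 1997 | December 1998 | December 1998 | December 2000
  Cost (million US dollars) | 55 | 94 | <110 | 85
  Processor | Pentium Pro, 200 MHz, 200 Mflop/s | PowerPC 604, 332 MHz, 664 Mflop/s | MIPS 10000, 250 MHz, 500 Mflop/s | POWER3, 311 MHz, 1244 Mflop/s
  System architecture | NORMA-MPP | SMP cluster, 4 CPUs/node, 1464 nodes | CC-NUMA cluster, 128 CPUs/node, 48 nodes | SMP cluster, 16 CPUs/node, 512 nodes
  Intra-node connection | bus | crossbar | fat hypercube | crossbar
  Inter-node connection | split 2-D mesh | Omega switch | gigabit switch | Omega switch
  Number of processors | 9216 | 5856 | 6144 | 8192
  Peak speed | 1.8 Tflop/s | 3.888 Tflop/s | 3.072 Tflop/s | 10.2 Tflop/s
  Main memory capacity | 594 GB | 2.5 TB | 1.5 TB | 4 TB
  Disk capacity | 1 TB | 75 TB | 75 TB | 150 TB
  [Photos: ASCI Option Red, ASCI Blue-Pacific, ASCI Blue-Mountain, ASCI White]

Case Study 2: Intel/Sandia ASCI Option Red
  Option Red architecture
    4608 nodes in total (each with two 200 MHz Pentium Pro processors) and 594 GB of main memory; peak speed is 1.8 Tflop/s and peak cross-section bandwidth is 51 GB/s.
    4536 compute nodes, which perform the parallel computation
    32 service nodes, which support login, software development and other interactive work
    24 I/O nodes, which access disks, tapes, networks (Ethernet, FDDI, ATM, etc.) and other I/O devices
    2 system nodes, which support the system's RAS capability: the boot node handles initial system boot and provides service; the node station supports the single system image
    Spare nodes.
    1540 power supplies, 616 interconnect backplanes and 640 disks (more than 1 TB of capacity)
  Node architecture
    Compute nodes and service nodes are implemented identically
      Two nodes share one motherboard. The two SMP nodes are connected to each other through network interface components (NICs), and only one NIC connects to the interconnect backplane.
    The local I/O of each node consists of:
      a serial port called the node maintenance port, which connects to the system's internal Ethernet and is used for system boot, diagnostics and RAS;
      an expansion connector used for node testing;
      boot support hardware, including a flash ROM that holds the node confidence test, the BIOS and other code needed to diagnose node failures and load the operating system.
    The motherboard of the I/O and system nodes carries only 2 processors (1 node), 1 local bus and a single NIC.
      Main memory per node can be raised to between 64 MB and 1 GB.
      The number of 133 MB/s PCI cards can be raised to 3.
      Each I/O node motherboard carries basic I/O devices such as RS232, Ethernet (10 Mbps) and Fast-Wide SCS
  [Figure: node structure diagram]

System Interconnect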

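  Nodes are connected by an internal interconnection facility (ICF)
  The ICF uses a two-plane mesh topology. Each node motherboard connects through its on-board NIC to a mesh routing component (MRC). The MRC has six bidirectional ports, each able to transfer data at a unidirectional peak rate of 400 MB/s, or 800 MB/s in full duplex. 4 ports serve the left, right, up and down mesh links within a plane, and one more port serves the link between planes.
  A message sent from any node can travel through either plane to another node using wormhole routing, which lowers latency and improves system availability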
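The port figures above give a quick upper bound on the traffic one MRC could carry. The following back-of-the-envelope sketch uses only the numbers quoted in the text; the constant and function names are illustrative, and the results are peak bounds with every port saturated in both directions, not measured throughput.

```c
/* Figures quoted in the text. */
enum {
    MRC_PORTS = 6,             /* bidirectional ports per MRC */
    PORT_ONE_WAY_MBS = 400,    /* peak MB/s in one direction */
    IN_PLANE_PORTS = 4         /* left, right, up, down */
};

/* Full-duplex peak of one port: both directions active at once. */
int port_full_duplex_mbs(void) {
    return 2 * PORT_ONE_WAY_MBS;
}

/* Upper bound for one MRC with all six ports saturated both ways. */
int mrc_aggregate_peak_mbs(void) {
    return MRC_PORTS * port_full_duplex_mbs();
}

/* Same bound restricted to the four in-plane mesh links. */
int mrc_in_plane_peak_mbs(void) {
    return IN_PLANE_PORTS * port_full_duplex_mbs();
}
```

Under these assumptions a single port peaks at 800 MB/s, the four in-plane links together at 3200 MB/s, and the whole component at 4800 MB/s.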

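Option Red System Software
  ASCI Option Red system software: the system, service and I/O nodes all run the Paragon operating system, a distributed Unix system based on OSF. Compute nodes run a light-weight kernel (LWK) called Cougar. Support is also provided for the interfaces between the two systems, including high-speed communication, the Unix programming interface and a parallel file system
  The light-weight kernel derives from the PUMA system
    The LWK design puts performance first: it can efficiently support MPPs with up to several thousand nodes, and it provides only the functions needed for parallel computation rather than general operating system services;
    because the TFLOPS system has several thousand compute nodes, Cougar is designed to occupy less than 0.5 MB of main memory, which keeps the aggregate memory used by the LWK from growing too fast;
    the design assumes that the communication network is trusted and controlled by the kernel, so protection checks and message authentication are not needed;
    the LWK provides an open architecture that allows efficient development of user-level library routines
  LWK
    Process control thread (PCT): provides process management, naming services and group protection.
    Quintessential kernel (Q-Kernel): the only software that can directly access the address mappings and the communication hardware. It provides basic computation, communication and address-space protection.
    Each node runs some user processes, one PCT and one Q-Kernel.

Message Passing
  The ASCI Option Red system supports MPI, NX and message-passing portals. MPI is the system's standard library, while NX provides backward compatibility with the Paragon.
  Portals provide the most efficient low-level message-passing library. The portal concept was first proposed in the PUMA operating system, and using portals reduces the memory-copy overhead of message passing.
  Message passing through portals is not a user-level communication mechanism; it still has to cross the kernel.
  A portal is a part of the destination process's address space that is opened to other processes so that they can send messages to it. To send a message, the sending process executes the following kernel routine: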

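send_user_msg {
    void *buf            /* start of the send message buffer */
    size_t len           /* size of the message to send */
    int tag              /* message tag */
    proc_id dest         /* destination process ID */
    portal_id portal     /* index of the destination portal */
    int *flag            /* increment flag for message sending */
}

Performance Evaluation of Three Typical MPP Systems
  IBM SP2, Intel Paragon, Cray T3D
  Node architecture: of the three MPPs, the SP2 has the best speed and utilization, thanks to its 267 Mflop/s peak speed and the good optimizing compiler designed for the POWER2 microprocessor.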

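  The Alpha 21064 has a higher clock rate but lower ILP. Another advantage of the SP2 is that it allows a very large node memory, whereas the Paragon has only 16 MB. The kernel and servers use more than 6.5 MB of main memory and the NX message buffers take another 1 MB, leaving less than 8 MB for data storage.
  Performance and scalability of the switching network
    Communication in an MPP is quite expensive. Point-to-point message passing on the T3D gives the lowest latency, 2 μs; the SP2 and the Paragon have similar latencies of under 40 μs. The Paragon's 2-D mesh shows the highest size scalability. Next comes the 3-D torus, which scales to 1024 nodes.
  Parallel I/O on the three platforms
    On the Paragon, file I/O is provided by I/O nodes, which usually sit in the outer columns of the 2-D mesh. Each I/O node is attached to a 4.8 GB RAID 3 disk array. Intel's Parallel File System (PFS) provides parallel access to files. Every disk access by a compute node requires one message exchange with an I/O node, so I/O performance depends heavily on network traffic.
    Each SP2 node is attached to a local disk, so there is no need to distinguish I/O nodes from compute nodes. On the SP2 every node runs a complete IBM AIX operating system, and disks are attached directly to each node. I/O nodes are defined dynamically in software, and PFS lets users create files that span many SP2 nodes.
    On the T3D, disks are attached only to the host, a Cray C90 or Cray YMP. I/O nodes connect to the host through I/O gateways. Each I/O gateway contains two nodes, each with a single Alpha processor, 4 M words of main memory (half that of a compute processor) and special communication hardware. One node handles I/O in one direction, for system calls and file access

MPP Summary
  Developed rapidly in the late 1980s and the early-to-mid 1990s
    Thinking Machines' CM5, Intel's Paragon, IBM's SP2 and Cray's T3D
    used mainly for scientific computing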

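  In the late 1990s, as some companies that specialized in parallel machines closed or were acquired, MPP systems gradually withdrew from the mainstream parallel processing market
  Because message-passing systems are easier to implement than shared-memory systems, they remain an important way to achieve very-large-scale parallel processing; however, for reasons of price and application area, the development of message-passing MPP systems has increasingly become a government undertaking
  The great majority of newly emerging high-performance computing systems will be clusters of symmetric multiprocessors (SMPs) built from commodity microprocessors and connected by scalable high-speed interconnection networks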

The State of High-Performance Computing as Seen from the Top500
  Fastest high-performance computer: 1.1 Pflop/s (IBM Roadrunner)
  Fastest high-performance computer made in China: 180 Tflop/s (Dawning 5000A)
  The most common high-performance computing
