




Lesson 3 Utilisation of the GPU Architecture for HPC
Vocabulary / Important Sentences / Questions and Answers / Problems
1 Introduction
Graphics Processing Units (GPUs), which commonly accompany standard Central Processing Units (CPUs) in consumer PCs, are special purpose processors designed to efficiently perform the calculations necessary to generate visual output from program data. Video games have particularly high rendering demands, and this market has driven the development of GPUs, which, in comparison to CPUs, offer extremely high performance for the monetary cost.
Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.[1] In particular, there is potential to use GPUs to boost the performance of the types of simulations commonly done on traditional HPC (High Performance Computing) systems such as HPCx. There are challenges to be overcome, however, to realise this potential.
The demands placed on GPUs by their native applications are, however, usually quite distinctive, and as such the GPU architecture is quite different from that of the CPU. Graphics processing is inherently extremely parallel, so it can be highly threaded and performed on the large numbers (typically hundreds) of processing cores found in the GPU chip. The GPU memory system is quite different from the standard CPU equivalent. Furthermore, the GPU architecture reflects the fact that graphics processing typically does not require the same level of accuracy and precision as scientific simulation. Specialised software development is currently required to enable applications to efficiently utilise the GPU architecture.
This report first gives a discussion of scientific computing on GPUs. Then, we describe the porting of an HPC benchmark application to the NVIDIA TESLA GPU architecture, and give performance results comparing it with the use of a standard CPU.
2 Background
2.1 GPUs
The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, GPUs typically contain 100 or more basic cores. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The GPU clock frequency is typically lower than that of a CPU, but this gap has been closing over the last few years. Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelisation, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.[2]
This section introduces the architectural design of GPUs. NVIDIA’s products are focused on here, but offerings from other GPU manufacturers, such as ATI, are similar. Fig. 1 illustrates the layout of a GPU. It can be seen that there are many processing cores (processors) to perform computation, grouped into multiprocessors. There are several levels of memory, which differ in terms of access speed and scope. The Registers have processor scope; the Shared Memory, Constant Cache and Texture Cache have multiprocessor scope; and the Device (or Global) memory can be accessed by all cores on a chip. Note that the GPU memory address space is separate from that of the CPU, and copying of data between the devices must be managed in software. Typically, the CPU will run the program skeleton, and offload one or more computationally demanding code sections to the GPU. Thus, the GPU effectively accelerates the application. The CPU is referred to as the Host and the GPU as the Device. Functions that run on the Device are called kernels.
Fig. 1 Architectural layout of NVIDIA GPU chip and memory
On the GPU, operations are performed by threads that are grouped into blocks, which are in turn arranged on a grid. Each block is executed by a single multiprocessor; however, if there are enough resources available, several blocks can be active at the same time on a multiprocessor. The multiprocessor will time-slice the blocks to improve performance, one block performing calculations while another is waiting for a memory read, for example. Some of the memory available to the GPU exhibits considerable latency; however, by using this method of time-slicing, this latency can be hidden for applications that are suitable.
A group of 32 threads is called a warp, and 16 threads a half-warp. GPUs achieve best performance when half-warps of threads perform the same operation, because in this situation the threads can be executed in parallel. Conditionals can mean that threads do not perform the same operations, and so they must be serialised. Such threads are said to be divergent. This also applies to Global Memory accesses: if the threads of a half-warp access Global Memory together and obey certain rules to qualify as being coalesced, then they access the memory in parallel, and it takes only the time of a single access for all threads of the half-warp to access the memory.
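The coalescing rules themselves are not spelled out in the text. As a rough host-side illustration in Python (a model, not device code), the sketch below checks one simplified rule set that is assumed here: thread k of a half-warp reads the k-th 4-byte word of a segment aligned to 64 bytes. The function name is_coalesced and the constants are hypothetical, and real hardware rules vary between GPU generations.

```python
HALF_WARP = 16   # threads per half-warp, as stated in the text
WORD_BYTES = 4   # assumed 32-bit word size

def is_coalesced(byte_addresses):
    """Return True if a half-warp's addresses satisfy the simplified rule.

    Assumed rule (a simplification, not the full hardware specification):
    thread k reads the k-th word of a segment aligned to
    HALF_WARP * WORD_BYTES bytes.
    """
    if len(byte_addresses) != HALF_WARP:
        raise ValueError("expected one address per thread of a half-warp")
    base = byte_addresses[0]
    if base % (HALF_WARP * WORD_BYTES) != 0:
        return False  # the segment is not aligned
    return all(addr == base + k * WORD_BYTES
               for k, addr in enumerate(byte_addresses))

# Consecutive words starting at an aligned address: one combined access.
sequential = [256 + 4 * k for k in range(HALF_WARP)]
# A stride of two words breaks the pattern, so the model reports non-coalesced.
strided = [256 + 8 * k for k in range(HALF_WARP)]
```

Under this model, the usual remedy for a strided pattern is to rearrange the data so that neighbouring threads touch neighbouring words.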
Global Memory is located in the graphics card’s GDDR3 memory. This can be accessed by all threads, although it is usually slower than on-chip memory. Memory access is significantly improved if memory accesses are coalesced, as this allows all the threads of a half-warp to access the memory simultaneously.
Shared Memory can only be accessed by threads in the same block. Because it is on chip, the Shared Memory space is much faster than the Local and Global Memory spaces. Approximately 16 KB of Shared Memory is available on each MP (multiprocessor); however, to permit each MP to have several blocks active at a time (which improves performance), it is advisable to use as little Shared Memory as possible per block. Slightly less than 16 KB is effectively available, owing to the storage of internal variables.
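The trade-off between Shared Memory use per block and the number of active blocks can be made concrete with a little arithmetic. The Python sketch below is only an estimate under stated assumptions: it considers Shared Memory alone (registers and any hardware cap on blocks are ignored), and reserved_bytes is a hypothetical allowance for the internal variables mentioned above.

```python
SHARED_BYTES_PER_MP = 16 * 1024  # approximate figure quoted in the text

def blocks_limited_by_shared_memory(shared_bytes_per_block, reserved_bytes=0):
    """Upper bound on simultaneously active blocks per multiprocessor.

    Only Shared Memory is considered; other limits are ignored.
    reserved_bytes is a hypothetical allowance for internal variables.
    """
    if shared_bytes_per_block <= 0:
        raise ValueError("shared_bytes_per_block must be positive")
    usable = SHARED_BYTES_PER_MP - reserved_bytes
    return max(usable // shared_bytes_per_block, 0)
```

For instance, a block that requests 4 KB leaves room for at most four blocks by this estimate, and for three once a small reserved amount is subtracted.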
Shared Memory consists of 16 memory banks. When Shared Memory is allocated, each consecutive 32-bit word is placed on a different memory bank. To achieve maximum memory performance, bank conflicts (two threads trying to access the same bank at the same time) must be avoided. In the case of a bank conflict, the conflicting memory accesses are serialised; otherwise, memory access by each half-warp is done in parallel.
Constant Memory is read-only memory that is cached. It is located in Global Memory; however, there is a cache located on each multiprocessor. If the requested memory is in the cache, then access is as fast as Shared Memory; if it is not, then the access costs the same as a Global Memory access.
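Returning to the Shared Memory banks: the mapping of consecutive 32-bit words to 16 banks can be modelled on the host in a few lines of Python. This is a simplified sketch rather than a description of the hardware; the helper names are hypothetical, and the special case in which several threads read the very same word is ignored.

```python
from collections import Counter

NUM_BANKS = 16  # number of Shared Memory banks stated in the text

def bank_of(word_index):
    """Consecutive 32-bit words map to consecutive banks, wrapping around."""
    return word_index % NUM_BANKS

def conflict_degree(word_indices):
    """Largest number of threads in a half-warp that hit the same bank.

    1 means conflict-free; n means the accesses take n serialised steps
    in this simplified model.
    """
    counts = Counter(bank_of(w) for w in word_indices)
    return max(counts.values())
```

In this model a half-warp reading words 0 to 15 is conflict-free, a stride of 2 words gives a two-way conflict, and a stride of 16 words sends every thread to the same bank.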
Texture Memory is read-only memory that is cached and is optimized for 2D spatial locality. This means that accessing [a][b] and [a+1][b], say, will probably be faster than accessing [a][b] and [a+54][b].[3] The Texture Cache is 16 KB per multiprocessor. This is a different 16 KB from that of the Shared Memory, so using the Texture Cache does not reduce available Shared Memory.
Registers are also available, with access speed similar to that of Shared Memory. Each thread in a block has its own independent version of the register variables declared. Variables that are too large will be placed in Local Memory, which is located in Global Memory. The Local Memory space is not cached, so accesses to it are as expensive as normal accesses to Global Memory.
2.2 CUDA
CUDA (Compute Unified Device Architecture) is a programming language developed by NVIDIA to facilitate writing programs that run on CUDA-enabled GPUs. It is an extension of C and is compiled using the nvcc compiler. The most commonly used extensions are cudaMalloc* to allocate memory on the device; cudaMemcpy* to copy data between the host and device and between different locations on the device; kernelname<<<grid dimensions, block dimensions>>>(parameters) to launch a kernel; and threadIdx.x, blockIdx.x, blockDim.x, and gridDim.x to identify the thread, block, block dimension, and grid dimension in the x direction.
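To show how the thread and block identifiers combine, the following Python sketch imitates a one-dimensional kernel launch on the host. It is a sequential stand-in, not CUDA code: the launch helper and the vector_add kernel are hypothetical names, and a real GPU would execute the threads in parallel rather than in nested loops.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for kernelname<<<grid_dim, block_dim>>>(args)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Kernel body: each thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global index in the x direction
    if i < len(out):                        # guard the last, partial block
        out[i] = a[i] + b[i]

n = 10
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # round up so all n elements are covered
a = list(range(n))
b = [10 * x for x in a]
out = [None] * n
launch(vector_add, grid_dim, block_dim, a, b, out)
```

The index expression mirrors the common CUDA idiom blockIdx.x * blockDim.x + threadIdx.x, and the bounds check is needed because the grid is rounded up to whole blocks.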
CUDA addressed a number of issues that affected developing programs for GPUs, which previously required much specialist knowledge. CUDA is quite simple, so it will not take much time for a programmer already familiar with C to begin using it. CUDA also possesses a number of other benefits over previous methods of GPU programming. One of these is that it permits threads to access any location in the GPU memory and to read and write to as many memory locations as necessary. These were previously quite limiting constraints, and so easing them represents a significant advantage for CUDA. Another major benefit is permitting access to Shared Memory, which was previously not possible.
To make adoption of CUDA as easy as possible, NVIDIA has created CUDA U, which contains a well-written tutorial with exercises, as well as links to course notes and videos of CUDA courses taught at the University of Illinois. A Reference Manual and Programming Guide are also available.
The CUDA SDK contains many example codes that can be used to test the installation of a GPU and, as the source codes are provided, demonstrate CUDA programming techniques. One of the provided codes is a template, providing the basic structure on which programs can be based.
One of the main features of CUDA is the provision of a Linear Algebra library (CUBLAS) and an FFT library (CUFFT). These greatly ease the implementation of many scientific codes on a GPU.
2.3 Review of GPU Successes
In this section, some recent work involving the use of GPUs for scientific computing is highlighted.
· The Theoretical and Computational Biophysics group at the University of Illinois at Urbana-Champaign has used GPUs to achieve accelerations of between 20 and 100 times for molecular modelling applications.
· Professor Mike Giles of Oxford University achieved a 100 times speed-up for a LIBOR Monte Carlo application and a 50 times speed-up for a 3D Laplace Solver. The Laplace Solver was implemented on the GPU using only Global and Shared Memory. It uses a Jacobi iteration of a Laplace discretisation on a uniform 3D grid. The LIBOR Monte Carlo code used was quite similar to the original CPU code. It uses Global and Constant Memory.
· Many other UK researchers are also experimenting with GPUs. NVIDIA has a showcase of applications reported to them. GPGPU.org also maintains a list of researchers using GPUs.
· RapidMind achieved a 2.4 times speed-up for BLAS SGEMM, 2.7 times for FFT, and 32.2 times for Black-Scholes.
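The Jacobi iteration mentioned for the 3D Laplace Solver can be sketched in plain Python. This is a generic textbook form of the method, not the code used in the work cited above; the function name jacobi_sweep and the tiny test grid are illustrative assumptions.

```python
def jacobi_sweep(u):
    """One Jacobi iteration for Laplace's equation on a uniform 3D grid.

    Interior points become the average of their six neighbours;
    boundary values are held fixed. Returns a new grid.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k] +
                                u[i][j - 1][k] + u[i][j + 1][k] +
                                u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return new

# 3x3x3 grid: boundary fixed at 1.0, single interior point starts at 0.0.
grid = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 0.0
updated = jacobi_sweep(grid)
```

Because every interior point is computed only from the previous grid, the updates are independent of one another, which is what makes the method a natural fit for one thread per grid point.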
2.4 GPU Disadvantages and Alternative Acceleration Technologies
In this section, some disadvantages of the GPU architecture are discussed, and some alternative acceleration technologies are briefly described. The key limitation of GPUs is the requirement for a high level of parallelism to be inherent to the application to enable exploitation of the many cores. Furthermore, graphics processing typically does not require the same level of accuracy and precision as scientific simulation, and this is reflected in the fact that GPUs typically lack both error correction functionality and double precision computational functionality. This is expected to improve with future GPU architectures.
Another common criticism of GPUs is their large power consumption. The NVIDIA Tesla C870 uses up to 170 W peak, and 120 W typical. The amount of heat produced would make it difficult to cluster large numbers of GPUs together.
GPUs also place greater constraints on programmers than CPUs. To avoid significant performance degradation, it is necessary to avoid conditionals inside kernels. Avoiding non-coalesced Global Memory accesses, which can also severely degrade performance, is very difficult for many applications. The lack of any inter-block communication functionality means that it is not possible for threads in a block to determine when the threads in another block have completed their calculation. This means that if results of computation from other blocks are required, then the only solution is for the kernel to exit and another to be launched, which guarantees that all of the blocks have completed.
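The exit-and-relaunch pattern can be illustrated with a sum that needs results from every block. The Python sketch below is a host-side model with hypothetical function names: each function stands for one kernel launch, and the return from the first call plays the role of the guarantee that all blocks have finished.

```python
def partial_sums_kernel(data, block_dim):
    """First launch: each block sums its own slice of the input."""
    return [sum(data[start:start + block_dim])
            for start in range(0, len(data), block_dim)]

def final_sum_kernel(partials):
    """Second launch: runs only after the first has finished,
    so every block's partial result is available."""
    return sum(partials)

data = list(range(1, 11))  # 1 + 2 + ... + 10 = 55
partials = partial_sums_kernel(data, block_dim=4)
total = final_sum_kernel(partials)
```

Splitting the work this way trades an extra launch for correctness, since no block ever reads another block's result while it may still be in progress.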
Finally, GPUs suffer from large latency in CPU-GPU communication. This bottleneck can mean that unless the amount of processing that is done on the GPU is great enough, it may be faster to simply perform calculations on the CPU.
There are other alternative acceleration technologies available, some of which are briefly described below.
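Before turning to those alternatives, the CPU-GPU communication cost noted above can be put into a simple break-even estimate. The Python sketch below assumes a linear cost model (fixed latency plus data volume divided by bandwidth); the function names and all parameter values are illustrative assumptions rather than measurements.

```python
def offload_time(bytes_moved, gpu_compute_s, latency_s, bandwidth_bytes_per_s):
    """Estimated wall time for offloading: fixed transfer latency,
    plus time to move the data, plus GPU compute time."""
    return latency_s + bytes_moved / bandwidth_bytes_per_s + gpu_compute_s

def worth_offloading(cpu_s, bytes_moved, gpu_compute_s,
                     latency_s, bandwidth_bytes_per_s):
    """True if the estimated offload time beats running on the CPU."""
    return offload_time(bytes_moved, gpu_compute_s,
                        latency_s, bandwidth_bytes_per_s) < cpu_s
```

With a model of this kind, a small job can lose to the CPU even when the GPU computes faster, because the transfer terms do not shrink with the amount of computation.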
Clearspeed
One alternative to GPUs is the use of processors designed especially for HPC applications, such as those offered by Clearspeed. These products are usually quite similar to GPUs, with a few modifications that make them more suitable for HPC applications. One of these differences is that all internal and external memory contains ECC (Error Correction Code) to detect and correct ‘soft errors’. ‘Soft errors’ are random one-bit errors that are caused by external factors such as cosmic rays.
In the graphics market such errors are tolerable, and so GPUs do not contain ECC; however, for HPC applications it is often desirable or required. Clearspeed products also have more cores than GPUs, but they run at a slower clock speed to reduce heat output. Double precision is also available.
Specialised products such as Clearspeed processors have a much smaller market than that of GPUs. This gives GPUs a number of advantages, such as economies of scale, greater availability, and more money spent on R&D.
Intel Larrabee
Another alternative that is likely to generate much interest when it is released in 2009-2010 is Intel’s Larrabee processor. This will be a many-core x86 processor with vector capability. It has the significant advantage over GPUs of making inter-processor communication possible. It should also solve a number of other problems that affect GPUs, such as the latency of CPU-GPU communication. It will initially be aimed at the graphics market, although specialised HPC products based on it are possible in the future. It is likely that it will also contain ECC to minimise ‘soft errors’. AMD is also developing a similar product, currently named ‘AMD Fusion’; however, few details have been released yet.
Cell Processor
A Cell chip contains one Power Processor Element (PPE) and several Synergistic Processing Elements (SPEs). The PPE acts mainly to control the SPEs, which do most of the calculations. Cell processors are quite similar to GPUs. For some applications GPUs outperform Cell Processors, while for others the opposite is true.
FPGAs
Field Programmable Gate Arrays (FPGAs) are programmable semiconductor devices that are based around a matrix of configurable logic blocks connected via programmable interconnects. As opposed to normal microprocessors, where the design of the device is fixed by the manufacturer, FPGAs can be programmed to compute the exact algorithm required by a given application. This makes them very powerful and versatile. The main disadvantages are that they are usually quite difficult to program, and they are also slow if high precision is required. For certain tasks they are popular, however. For example, several time-consuming algorithms in astronomy where only 4-bit precision is necessary are very suitable for FPGAs.
3 GPU Acceleration of an HPC Benchmark (Omitted)
4 Conclusions
GPUs, originally designed to satisfy the rendering computational demands of video games, potentially offer performance benefits for more general purpose applications, including HPC simulations. The differences between the GPU and standard CPU architectures mean that significant effort must be invested to enable efficient use of the GPU architecture for such applications.
We described the GPU architecture and methods used for software development, and reported that there is potential for the use of GPUs in HPC: there have been notable successes in several research areas. We described the porting of an HPC benchmark application to the GPU architecture, where several degrees of optimisation were performed, and benchmarked the resulting codes against code run on a standard CPU. The GPU was seen to offer up to a factor of 7.5 performance improvement.
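Vocabulary
1. render vt. to repay, return, give; to submit, provide, issue; to perform (a play or a piece of music); to translate; to cause to be or become, to put into a certain state; to hand over, present; to plaster (a wall); to melt down (fat); to express in another language; to give up, surrender (with "up"); to give back (with "back"); to pay (tribute); to provide (help, a service); to depict, portray; to reword, translate (often with "in" or "into"). vi. to make compensation; to melt down fat. n. in computer graphics, a renderer.
2. harness n. harness, tack (for a horse); safety harness (to prevent falling). vt. to put a harness on (a horse); to bring under control and make use of.
3. susceptible adj. easily influenced, impressionable; sensitive, allergic; liable to be infected by; capable of undergoing; emotional, sentimental; admitting of, allowing.
4. scope n. room or opportunity (for action or ability); range (of matters dealt with or studied); -scope (a viewing instrument); field of view; breadth of outlook or understanding; extent (of an activity or influence); ability, capacity; length.
5. thread n. thread, fine strand; line of thought, train of reasoning; thread-like object; thin line; screw thread; clothes. vt. to pass (a needle, thread, etc.) through; to load (film) into a projector; to string together; to fit (film, cord) into; to sew with thread; to weave thread into.
6. warp n. bend, twist; warp (the lengthwise threads in weaving); warp yarn. vt. & vi. to bend, to become twisted. vt. to distort (behaviour, etc.); to make perverse.
7. divergent adj. differing, in disagreement; branching apart; diverging, spreading.
8. texture n. feel, tactile quality; mouthfeel; (of music or literature) overall harmonious character.
9. coalesce vi. to unite, merge.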
Important Sentences
[1] Naturally, interest has been generated as to whether the processing power which GPUs offer can be harnessed for more general purpose calculations.
In other words, people have naturally become interested in whether the processing power that GPUs provide can be put to work for more general-purpose computation. "As to" means "concerning". "Harness" means to bring under control and direct the force of; here it refers to directing and controlling the graphics processing power of the GPU so that it serves general-purpose computation.
[2] Applications such as rendering are highly parallel in nature, and can keep the cores busy, resulting in a significant performance improvement over use of a standard CPU. For applications less susceptible to such high levels of parallelization, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into softw