
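High-Performance Multicore and Many-Core Processor Chip Technology Development
Prof. Li Sanli, Tsinghua University

Slide 1. Introduction
Processors have always been a key driving force of computer technology and of the computer industry.
Further development of petaflops-class high-performance computers depends on progress in multicore and many-core chips; most new techniques in computer architecture are embodied in high-performance multicore and many-core chips.
We hope everyone will follow the development of high-performance computing technology; in today's computer architecture the whole "system" is built "on a chip" (SoC).
We hope that teachers and students of the "Computer Organization" and "Computer Architecture" courses in our computer science schools will add this material to their teaching and study, and that teachers will give this research direction more weight when applying for Natural Science Foundation grants and other research funding. We hope that young teachers and students will take an interest in this field and push China's processor chip technology forward.

Slide 2. CPUs of China's 10-petaflops supercomputers are expected to be entirely domestically produced
The world's No. 1 supercomputer system, "Tianhe-1", uses the "FeiTeng-1000" high-performance multicore microprocessor.
"Tianhe-1": peak speed 4,700 trillion operations per second (4.7 petaflops); sustained speed 2,566 trillion operations per second (2.566 petaflops). 1,000 trillion operations per second = 1 petaflops.
Source: Huanqiu.com report of 2019-3-8 on remarks by Zhang Yulin, president of the National University of Defense Technology.

Slide 3. China's Tianhe-1 petaflops supercomputer: No. 1 on the TOP500 list, mentioned specifically by President Obama

Slide 4. Tianhe-1, No. 1 on the TOP500 list: plug-in board

Slide 5. Outline
1. The need for multicore and many-core processor chip technology
2. Development of processor chips with multicore and many-core architectures
3. Heterogeneous multicore and many-core chips
4. Development of on-chip interconnection networks for systems on chip (SoC)
5. Further development of microelectronics process technology
6. Predictions for future exaFlops high-performance computer chips
7. Conclusion

Slide 6. (I) The need for multicore and many-core processor chip technology

Slide 7. Application requirements of high-performance computing
[Figure, courtesy of Erik P. DeBenedictis (source label: cpeg421-2019-F/Topic-3-I): required system performance versus year for a set of applications. Vertical axis "System Performance": 100 Teraflops, 1 Petaflops, 10 Petaflops, 100 Petaflops, 1 Exaflops, 10 Exaflops, 100 Exaflops, 1 Zettaflops. Horizontal axis: years 2000 to 2020, plus a column "No schedule provided by source".]
Applications plotted in the figure:
Plasma fusion simulation [Jardin 03]
Full global climate [Malone 03]
Geodata Earth Station Range [NASA 02]
Compute as fast as the engineer can think [NASA 99]
100 to 1000 [SCaLeS 03]
Simulation of medium biomolecular structures (µs scale); simulation of large biomolecular structures (ms scale); protein folding; simulation of more complex biomolecular structures
50 TFLOPS, 250 TFLOPS, 1 PFLOPS [HEC04]
Annotations added on the slide: plasma; global climate model; massive Earth data; simulation of more complex biomolecular structures; protein structure; biomolecular structure; system performance; applications; 10 petaflops; 1 exaflops; 10 exaflops.
References cited in the figure:
[Jardin 03] S. C. Jardin, "Plasma Science Contribution to the SCaLeS Report," Princeton Plasma Physics Laboratory, PPPL-3879 UC-70, available on Internet.
[Malone 03] Robert C. Malone, John B. Drake, Philip W. Jones, Douglas A. Rotman, "High-End Computing in Climate Modeling," contribution to SCaLeS report.
[NASA 99] R. T. Biedron, P. Mehrotra, M. L. Nelson, F. S. Preston, J. J. Rehder, J. L. Rogers, D. H. Rudy, J. Sobieski, and O. O. Storaasli, "Compute as Fast as the Engineers Can Think!" NASA/TM-1999-209715, available on Internet.
[NASA 02] NASA Goddard Space Flight Center, "Advanced Weather Prediction Technologies: NASA's Contribution to the Operational Agencies," available on Internet.
[SCaLeS 03] Workshop on the Science Case for Large-scale Simulation, June 24-25, proceedings on Internet at /scales/.
[DeBenedictis 04] Erik P. DeBenedictis, "Matching Supercomputing to Progress in Science," July 2004. Presentation at Lawrence Berkeley National Laboratory, also published as Sandia National Laboratories SAND report SAND2004-3333P. Sandia technical reports are available through the Sandia technical library.
[HEC04] Federal Plan for High-End Computing, May 2004.

Slide 8. Growth in transistor count (Intel): 32 billion transistors

Slide 9. On-chip clock frequency cannot keep rising: it has stalled because of the power problem

Slide 10. Visual illustration of the heat caused by power consumption

Slide 11. Water cooling and air cooling of CPUs
[Figure labels: water-cooling system; air-cooling system.]

Slide 12. Resolving the conflict between power growth and transistor growth
Solutions: new fabrication materials; new cooling technologies; multicore and many-core architectures.

Slide 13. Effect of multicore and many-core development on performance
[Chart: performance versus year; three years of multicore change. Note on the slide: Intel concentrates on PC development.]

Slide 14. Architecture progress: single core, multicore, many-core, on-chip interconnect
Single core with increased performance: 1993, Pentium; 1997, Pentium MMX; 1997, Pentium II; 1999, Pentium III; 2001, Tualatin; 2002, Pentium 4 Northwood.
Multicore processor with more and more cores: 2005, Pentium D; 2006, Core 2 Duo (Conroe); 2006, Core 2 Quad (Kentsfield); 2007, TeraScale 80-core prototype.
Key for multicore: interconnection.

Slide 15. Internal structure of an AMD general-purpose single core
[Block diagram labels: Fetch; Branch Prediction; L1 Icache 64KB; L1 Dcache 64KB; Scan/Align/Decode; Fastpath; Microcode Engine; µops; Instruction Control Unit (72 entries); Int Decode & Rename; FP Decode & Rename; Res (x3); 36-entry FP scheduler; AGU, ALU, MULT; FADD, FMUL, FMISC; 44-entry Load/Store Queue. Annotations added on the slide: instruction fetch; branch prediction; microcode; hardwired; micro-operations; data cache; instruction cache.]

Slide 16. Layout of the AMD dual-core chip
Dual-core AMD Opteron processor: 199 mm², 90 nm process.
Single-core AMD Opteron processor: 193 mm², 130 nm process.

Slide 17. Multicore architecture of the AMD Opteron

Slide 18. Intel's roadmap for multicore and many-core processors
[Chart: number of cores (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024) versus year (2004 to 2020), with a commercial path and a research path.]
Commercial path: Pentium D; Core Duo; Core 2 Duo (Conroe, Allendale, Wolfdale, Merom, Penryn); Core 2 Quad (Kentsfield, Yorkfield); Nehalem; Core i7; Sandy Bridge.
Research path: Polaris TeraScale, 80 cores / 80 threads; Single-chip Cloud Computer, 48 cores / 48 threads; Knights Corner, 50 cores / 200 threads.

Slide 19. Intel's Nehalem multicore structure
[Figure notes: includes a graphics core; QuickPath interface.]

Slide 20. Layout of Intel's Nehalem quad-core chip
[Figure labels: QuickPath link, 96 GB/s; QuickPath link, 96 GB/s.]

Slide 21. Hierarchical memory structure of the Intel Nehalem multicore processor
Each CPU core: 32KB L1 D$, 32KB L1 I$, 256KB L2$.
4 to 8 cores share an 8MB L3$.
DDR3 DRAM memory controllers: each DRAM channel is 64/72 b wide at up to 1.33 Gb/s.
QuickPath System Interconnect: each direction is 20 b at 6.4 Gb/s.
QPI is an important feature.

Slide 22. Single-core structure of the general-purpose Intel Nehalem
[Block diagram labels: prefetch buffer; predecode; instruction queue; alignment; branch prediction; loop stream; decode; QuickPath memory access (QPI); out-of-order execution buffer; level-3 cache.]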
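The link figures on the Nehalem memory-hierarchy slide (a 20-bit QuickPath link at 6.4 Gb/s per wire, a 64/72-bit DRAM channel at up to 1.33 Gb/s per wire) can be turned into bytes per second with one multiplication. The sketch below is a minimal illustration in Python, not a statement from the deck: the function name is hypothetical, and the split of the widths into payload bits (16 of the 20 QuickPath bits, 64 of the 72 DRAM bits) is an assumption labelled in the comments.

```python
def link_bandwidth_gb_per_s(width_bits: int, gbit_per_s_per_wire: float) -> float:
    """Bandwidth of a parallel link in gigabytes per second.

    width_bits: number of wires that carry bits in one direction.
    gbit_per_s_per_wire: transfer rate of each wire in Gb/s.
    """
    return width_bits * gbit_per_s_per_wire / 8.0


# Raw QuickPath figure from the slide: 20 bits per direction at 6.4 Gb/s.
QPI_RAW_GBS = link_bandwidth_gb_per_s(20, 6.4)

# Assumption (not on the slide): only 16 of the 20 bits carry payload data.
QPI_PAYLOAD_GBS = link_bandwidth_gb_per_s(16, 6.4)

# DRAM channel from the slide: 64/72 bits wide at up to 1.33 Gb/s.
# Assumption: 64 bits are data, the extra 8 bits of the 72-bit variant are ECC.
DRAM_CHANNEL_GBS = link_bandwidth_gb_per_s(64, 1.33)

if __name__ == "__main__":
    print(f"QuickPath raw, per direction:     {QPI_RAW_GBS:.2f} GB/s")
    print(f"QuickPath payload, per direction: {QPI_PAYLOAD_GBS:.2f} GB/s")
    print(f"One DDR3 channel:                 {DRAM_CHANNEL_GBS:.2f} GB/s")
```

Under these assumptions a single DRAM channel and a single QuickPath direction deliver bandwidth of the same order, which is one way to read the slide's remark that QPI is an important feature of the design.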

Slide 23. Multicore and many-core development plans of other companies
[Timeline chart (source label: JPL-Dec-01-2009): months J to D of each year along the horizontal axis, vendors IBM, SUN / ORACLE, AMD and INTEL along the vertical axis; a marker distinguishes chips with 8 physical cores or more.]
Columns: name (year as printed); clock; processors; cores; threads.
IBM
Power4 (2019); 1.1 to 1.3 GHz; 1; 2; 2
Power4+ (2019); 1.9 GHz; 1; 2; 2
Power5 (2019); 1.5-1.9 GHz; 1; 2; 4
Power5+ (2019); 1.5-2.26 GHz; 1; 2; 4
CBE (2019); 3.2 GHz; 1; 9; 10
PowerXCell8i (2019); 3.2 GHz; 1; 9; 10
Xenon (2019); 3.2 GHz; 1; 3; 6
Power6; 3.5-4.7 GHz; 1; 2; 4
Power6+; 5 GHz; 1; 2; 4
INTEL
Pentium D; 3.8 GHz; 1; 2; 4
Core 2; 1.8-3.2 GHz; 1; 4; 8
Dual Core Atom; 0.8-2.06 GHz; 1; 2; 2
Sandy Bridge; 4.6 GHz; 1; 8; 16
Xeon; 2.863.56 GHz; 1; 2; 2
Xeon Quad Core; 2.133.56 GHz; 1; 4; 8
Xeon Beckton; 2.83.56 GHz; 1; 8; 16
Core i7; 2.663.33 GHz; 1; 4; 8
AMD
Opteron Denmark; 1.6-2.8 GHz; 1; 2; 2
Opteron Barcelona; 1.76-2.6 GHz; 1; 4; 4
Opteron Istanbul; 2.26-2.66 GHz; 1; 6; 6
Opteron Sao Paulo; ?; 1; 6; 6
Opteron Magny Cours; ?; 1; 12; 12
Opteron Interlagos; ?; 1; 16; 16
SUN / ORACLE
Ultra SPARC IV; 1-1.356 GHz; 1; 2; 2
Ultra SPARC IV+; 1.5-2.16 GHz; 1; 2; 2
Ultra SPARC T1; 1-1.46 GHz; 1; 4; 32
Ultra SPARC T2; 1-1.66 GHz; 1; 8; 64
Ultra SPARC VII; 2.4-2.56 GHz; 1; 4; 16
Ultra SPARC VIIIfx; 2.4-2.56 GHz; 1; 8; 16

Slide 24. Summary: overall trends in 35 years of processor development
[Chart series: number of transistors (thousands); single-thread performance (SpecINT); frequency (MHz); typical power (watts); number of cores.]

Slide 25. (II) Development of processor chips with multicore and many-core architectures

Slide 26. Why multicore?
[Figure: one large core with its cache compared with two cores, each with its own cache, in the same process technology.]
One core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1.
Two cores: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = 1.8.

Slide 27. Going further: multicore heterogeneous chips (SoC)
[Figure "Heterogeneous Multi-Core Platform (SOC)": general-purpose cores (GP); special-purpose hardware (SP); blocks labelled C; interconnect fabric.]

Slide 28. Multicore technology will diversify!
Multiple parallel general-purpose processors (GPPs); multiple application-specific processors (ASPs).
Sun Niagara: 8 GPP cores (32 threads).
Intel Network Processor (IXP2800): 1 GPP core, 16 ASPs (128 threads).
[IXP2800 block diagram labels: Intel XScale Core with 32K IC and 32K DC; microengines MEv2 1 to MEv2 16; Rbuf 64 @ 128B; Tbuf 64 @ 128B; Hash 48/64/128; Scratch 16KB; QDR SRAM 1 to 4 with E/D Q; RDRAM 1 to 3; GASKET; PCI (64b) 66 MHz; SPI4 or CSIX; Stripe; CSRs: Fast_wr, UART, Timers, GPIO, BootROM/SlowPort.]
IBM Cell: 1 GPP (2 threads), 8 ASPs.
Picochip DSP: 1 GPP core, 248 ASPs.
Cisco CRS-1: 188 Tensilica GPPs.
A single processor chip now carries thousands of threads; the processor has become what the transistor is in Moore's law: "The Processor is the new Transistor" [Rowen].

Slide 29. Structure of AMD's multicore SIMD GPU chip

Slide 30. Multicore accompanied by instruction-set extensions for acceleration

Slide 31. Many-core processor structures
Intel Terascale 80-core processor; Tilera 64-core processor.
[Figure labels: cloud storage server; wireless network.]

Slide 32. Structure of Fermi, NVIDIA's new many-core GPU chip
NVIDIA's Fermi GPU architecture consists of 16 streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and multiple DRAM interfaces.
[Figure label: SM, 32 cores.]

Slide 33. The Fermi SM
Each Fermi SM includes 32 cores, 16 load/store units, four special-function units, a 32K-word register file, 64K of configurable RAM, and thread control logic. Each core has both floating-point and integer execution units.
[Figure labels: register file, 32K words; floating point; integer; each CUDA core.]

Slide 34. Design considerations for on-chip and off-chip memory access speed in multicore chips (data access speed, the "Memory Wall")
Processing unit: 64 registers.
On-chip cache, 16MB/32KB: Load 1, Store 1, 1.92 TB/s; Load 2, Store 1, 640 GB/s.
Off-chip static cache (SRAM), 2.5MB: Load 20 cycles, Store 10 cycles, 320 GB/s (going off chip is 6 times slower).
Off-board dynamic memory (DRAM), 16GB: Load 36 cycles, Store 18 cycles, 16 GB/s (going off board is 120 times slower).

Slide 35. (III) Heterogeneous multicore chips

Slide 36. Why heterogeneous many-core chips must be developed
1. A petaflops high-performance computer cannot be built from general-purpose homogeneous many-core chips from Intel or AMD alone; accelerators are required.
2. Homogeneous many-core chips also run into the power problem, because every core needs its own cache and other supporting hardware; accelerators should therefore use a fairly large number of "small cores".
3. When separate CPU and GPU chips are used together, the GPU needs large amounts of data, so transferring that data between chips becomes the bottleneck and peak performance is hard to reach.
4. The CPU and GPU should therefore be built on one chip, where the bandwidth for data transfer is far higher. Going further, GPUs are still hard to program, so small cores aimed at specific uses, for which algorithms and programming can be simplified, are more suitable. Another approach is to extend the instruction set of the many-core chip to provide acceleration.
5. High-performance computers are diverging in two directions. General-purpose HPC systems can be built quickly from existing blade servers plus InfiniBand; they are cheap and fast to develop. HPC systems of several petaflops built on custom board-level designs can usually target only one or two applications, so they tend toward specialization.

Slide 37. Three eras of processor performance
Single-Core Era (single-thread performance). Enabled by: Moore's Law, voltage scaling, micro-architecture. Constrained by: power, complexity.
Multi-Core Era (throughput performance). Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture. Constrained by: power, parallel SW availability, scalability.
Heterogeneous Systems Era (performance targeted at the application). Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs. Currently constrained by: programming models, communication overheads.
[Chart: one performance curve per era, each marked "We are here"; eras labelled single core, multicore, heterogeneous.]

Slide 38. IBM's heterogeneous Cell with a network on chip (NoC): eight 64-bit vector units (SXU) and one scalar unit (PXU)
[Figure: Cell processor.]

Slide 39. Detailed structure of the IBM Cell heterogeneous multicore processor
Observed clock speed: a wide range of operating frequencies are supported to optimize for power and yield.
Peak performance (single precision): 256 GFlops.
Peak performance (double precision): 26 GFlops.
[Figure labels: double precision; single precision; SIMD vector unit; scalar unit; interconnection network.]

Slide 40. Next step: how do we reach petaflops high-performance computers?
No number of Intel or AMD general-purpose processors will get there; only heterogeneous many-core processor chips with accelerator functions can.
The hardware can reach this level, but the software is not fully ready. (Our universities need not build HPC machines themselves in future; they can work on software, and on software combined with algorithms.)

Slide 41. GPUs are not ideal for supercomputers
GPU programming is poorly suited to high-performance computing; the remedy is to combine the CPU and the GPU.
Jack Dongarra: "The obvious upside of GPUs is that they provide compelling performance for modest prices. The downside is that they are more difficult to program, since at the very least you will need to write one program for the CPUs and another program for the GPUs. Another problem that GPUs present pertains to the movement of data. Any machine that requires a lot of data movement will never come close to achieving its peak performance. The CPU-GPU link is a thin pipe, and that becomes the strangle-point for the effective use of GPUs. In the future this problem will be addressed by having the CPU and GPU integrated in a single socket."

Slide 42. The Cell processor is dead for high-performance computing (Cell is Dead for HPC)
Chips that contain both x86 general processing cores as well as graphics processing cores are essentially heterogeneous multi-core processors, which AMD calls Fusion. The vast majority of multi-core chips today are homogeneous chips that contain a number of similar processing engines. There are processors with different types of cores, such as the Cell chips jointly developed by IBM, Sony Corp. and Toshiba Corp., which originally promised to redefine the market of multimedia chips as well as CPUs for the HPC market. However, since all three companies have ceased to develop Cell, it has no future.
Jack Dongarra: "The Cell architecture is no longer being developed, so it is effectively dead. No new supercomputers will use Cell."

Slide 43. A likely future trajectory: collision or convergence?
[Chart after Justin Rattner, Intel, ISC 2019. Axes: parallelism and programmability. CPU path: multi-threading, multi-core, many-core. GPU path: fixed function, partially programmable, fully programmable. The two paths meet at "future processor by 2019?".]
Combining generality with parallelism leads to heterogeneous many-core chips.

Slide 44. Architecture of the IBM Cyclops-64 (C64) chip
On-chip bisection BW = 0.38 TB/s; total BW to 6 neighbors = 48 GB/sec.
80 cores.

Slide 45. Assembling a 1.1 PetaFlops supercomputer from heterogeneous processors

Slide 46. Other multi-purpose heterogeneous multicore chips
Combination of different cores. Two main options:
Different types: microcontroller + DSP, processor + accelerator, and so on.
Different performance: big processor + small processor.
Advantages: processors can be optimized for different tasks (operating system, multimedia, graphics, low-power apps); processors are decoupled (independent SW development).
Disadvantages: different architectures (more to learn); different tools; more complex SW.

Slide 47. Texas Instruments' heterogeneous multicore chip for mobile terminals
The cores execute different tasks in parallel, which suits mobile terminals.

Slide 48. (IV) Development of on-chip interconnection networks for systems on chip (SoC)

Slide 49. Development of the NoC
On-chip interconnect evolves as process technology advances, and it must inevitably evolve into a NoC (Network on Chip).
[Figure: 80386; Pentium; multicore.]

Slide 50. Interconnection networks for on-chip many-core systems (1)
Many cores on chip + channels.
[Figure: on the SoC, P denotes a processor core.]

Slide 51. Interconnection networks for on-chip many-core systems (2)
Many cores on chip + channels + routers (R).
[Figure: router structure diagram; switch.]

Slide 52. Two typical topologies of on-chip interconnection networks
Torus topology; mesh topology.

Slide 53. Clocking: the on-chip clocks of a NoC-based SoC are distributed
[Figure: 4 x 4 array of NoC routers (R); each colored block represents one clock domain.]
Two research areas:
Asynchronous routers: simple design, low power.
Asynchronous interconnect: high bandwidth, low power.

Slide 54. Future exa-scale networks on chip (NoC)
Parallelism replaces clock frequency scaling and core complexity.
Resulting challenges: scalability; programming; power.

Slide 55. Future exa-scale networks on chip (NoC)
Conventional NoC system (number of cores: 10^2) compared with an exa-scale micro-networking system (number of cores: 10^2 to 10^4).
Problems when scaling up: unpredictable traffic load (Application 1 and Application 2 over time); unbalanced resource allocation; scalability (good performance only on small-scale networks); faulty routers and links; complex design and verification.
NoC features: regular architecture; packet-based transmission; flexible bandwidth utilization.

Slide 56. MIT: analysis and considerations for many-core structures
An array of thousands of small cores can solve the chip-area and scalability problems, but programming will become a barrier that is very hard to overcome.
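The scaling concern raised for networks on chip with hundreds to thousands of cores can be made concrete by counting router hops on the two topologies the deck names, mesh and torus. The sketch below is only an illustration under stated assumptions: a square k x k network, minimal (shortest-path) routing, one hop per link, and uniform traffic between all pairs of distinct nodes. The function `hop_stats` and its parameters are hypothetical names, not taken from the deck.

```python
from itertools import product


def hop_stats(k: int, wraparound: bool) -> tuple[int, float]:
    """Return (diameter, average hops) of a k x k mesh or torus.

    wraparound=False models a mesh; wraparound=True models a torus,
    where each row and column is closed into a ring.
    Assumes minimal routing and uniform traffic between distinct nodes.
    """
    def ring_or_line_distance(a: int, b: int) -> int:
        d = abs(a - b)
        # On a torus a packet may go around the ring the short way.
        return min(d, k - d) if wraparound else d

    nodes = list(product(range(k), repeat=2))
    hops = [
        ring_or_line_distance(x1, x2) + ring_or_line_distance(y1, y2)
        for (x1, y1) in nodes
        for (x2, y2) in nodes
        if (x1, y1) != (x2, y2)
    ]
    return max(hops), sum(hops) / len(hops)


if __name__ == "__main__":
    # 16, 64 and 256 cores; larger k works too but takes longer (all pairs).
    for k in (4, 8, 16):
        mesh_diameter, mesh_avg = hop_stats(k, wraparound=False)
        torus_diameter, torus_avg = hop_stats(k, wraparound=True)
        print(f"{k * k:4d} cores: mesh diameter {mesh_diameter}, avg {mesh_avg:.2f}; "
              f"torus diameter {torus_diameter}, avg {torus_avg:.2f}")
```

In this model the worst-case distance grows with the side length k of the array for both topologies, with the torus roughly halving it. That growth is the effect behind the latency and contention problems of long-distance communication on large arrays.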

Parallelizing an application across thousands of cores is very hard:
1. Tasks and data must be partitioned.
2. Communication increases latency.
3. Communication over longer distances causes contention for resources along the path, which lowers performance and raises power consumption.
4. There is no efficient broadcast communication (the metal wires on the silicon die are too long).

Slide 57. MIT: analysis and considerations for many-core structures
To raise the performance of chips with thousands of cores, communication and locality must be managed effectively. Both tasks and data need optimized partitioning and placement:
Communication patterns are analyzed so that latency is minimized.
Data must be placed close to the execution units that use it most often.
Some frequently used routines must be placed close to DRAM and I/O.
Dynamic and unpredictable communication is very hard to optimize.
For this reason MIT proposes replacing the array-style communication over electrical wires with broadcast optical communication:
Broadcast communication makes a shared-memory model easy to implement, which makes programming easier.
It reduces the need to manage locality.
It is inexpensive and consumes little power.
This is a good topic for basic technology research.

Slide 58. ATAC Architecture
ATAC: broadcast optical communication on chips with thousands of cores, proposed by MIT (Massachusetts Institute of Technology).
[Figure: array of tiles, each with a processor (p), a switch and a memory (m). Two networks: Electrical Mesh Interconnect (array-style interconnection network over electrical wires) and Optical Broadcast WDM Interconnect (broadcast optical interconnection network).]

Slide 59. Advantages of the broadcast optical communication MIT proposes for many-core chips
The optical waveguide passes through every core of the many-core chip.
Using different wavelengths on the waveguide can eliminate resource contention completely.
A signal reaches all of the thousands of cores within 2 ns.
All cores can receive the same signal, which gives true broadcast.
[Figure: topology of the broadcast optical interconnect.]

Slide 60. (V) Further development of microelectronics process technology

Slide 61. Terascale Integration Capacity
[Chart: total transistors on a 300 mm² die; 1.5B logic transistors; 100MB cache.]
On-chip integration is heading toward hundreds of billions of transistors.

Slide 62. Scaling problems of frequency, voltage and power
Freq scaling will slow down. Vdd scaling will slow down. Power will be too high (300 mm² die).
[Chart series: frequency; voltage; power.]

Slide 63. Wiring: problems caused by the finer line widths of chip processes
Finer wires affect clock distribution, delay design, interconnect structure and more.
[Figure labels: metal layer 4; metal layer 3; metal layer 2; metal layer 1.]

Slide 64. The packaging problem: System in a Package
[Figure: two Si chips in one package.]
Limited pins: 10 mm / 50 micron = 200 pins.
Signal distance is large (10 mm), so power is higher.
Complex package.

Slide 65. SoC from two dimensions to three dimensions
20 stacked chips (TSV).

Slide 66. The heat dissipation problem: Anatomy of a Silicon Chip
[Figure labels: Si chip; heat-sink; package; heat; power; signals.]

Slide 67. DRAM at the Bottom
[Figure labels: heat-sink; CPU; DRAM; package.]
Power and IO signals go through DRAM to CPU. Thin DRAM die. Through-DRAM vias.
The most promising solution to feed the beast.

Slide 68. (VI) Predictions for future exaFlops high-performance computer chips

Slide 69. Progress beyond PetaFlops
The first 10 to 20 petaflop/s supercomputers should be in service by 2019 and after that comes a machine in the 100 petaflop/s range (2019). Scientists are moderately optimistic that exaflop/s (1000 petaflop/s) mainframes can be constructed by 2018 - 2020. However, are some of these expectations just plain irrational?
(2019: 10 to 20 petaflops); (2019: 100 petaflops); (2018-2020: 1 exaflops)
Number of cores per chip will double every two years.
Clock speed will not increase (possibly decrease).
Need to deal with systems with millions of concurrent threads.
Need to deal with inter-chip parallelism as well as intra-chip parallelism.
... the future machine's architecture. At best, it will require 20 Megawatts to run. So getting to the exaflop/s level or beyond may be extremely difficult.
500 x performance (peak); 100 x memory; 5000 x concurrency; 3 x power.
Specialized software will be needed to best make use of the massive parallelism.
Argonne's Leadership Computing Facility (ALCF) will install Mira, a next generation Blue Gene system (BG/Q), in 2019. The ALCF's stated requirements for the 10 petaflops system include approximately 0.75 million cores and 0.75 petabytes of memory, with 16 cores and 16 gigabytes of memory per node.

Slide 70. An exaFlops high-performance computer with $200M, 20 MW and 64 PB of RAM
"The current memory paradigm is hierarchical, based on registers, L1 and L2 caches, local memory, shared memory, and distributed memory among nodes. That is a potential model for exaFLOPS systems. However, we want exaFLOPS systems to be designed to be relatively easy to program. We therefore want a globally shared address space, and explicit methods to pass data between the processors in order to orchestrate the unfolding computation. That paradigm may be necessary for a machine that has a billion threads."

Slide 71. Two predicted routes to exaFLOPS HPC
"There are two models that we can use to get to an exaflop while staying within a 20 MW budget. The first model employs huge numbers of lightweight processors, such as IBM Blue Gene Processor running at 1.0 GHz. If we use 1 million chips, and each chip has 1000 cores, then we can get to a potential billion threads of execution. The other approach is a hybrid that makes extensive use of coprocessors or GPUs. It would use a 1.0 GHz processor and 10 000 floating point units per socket, and 100 000 sockets per system."

Slide 72. IBM Mira, a 10-petaflops supercomputer
Scientists will have to scale their current computer codes to more than 750,000 individual computing cores, providing them preliminary experience on how scalability might be achieved on an exascale-class system with 100s of millions of cores. Despite a popular trend to use both central processing units (CPUs) and graphics processing units (GPUs), Mira will be based only on IBM's PowerPC chips.
The IBM Blue Gene/Q supercomputer design is based on a sixteen-core IBM PowerPC A2 chip with 4-way simultaneous multi-threading technology. Each processor has at least 1GB of DDR3 memory. Featuring 750 thousand processing cores, the new supercomputer will be cooled using a special water-cooling system.
IBM Blue Gene/Q: US Department of Energy's (DOE) Argonne National Laboratory.
IBM is to build the 20-PetaFlops Sequoia for Lawrence Livermore National Laboratory, and IBM will extend the Blue Gene architecture to 50 Petaflops and 100 Petaflops.

Slide 73. The PowerPC A2 processor of the 10-PetaFlops Mira
The PowerPC A2 is a 64-bit Power Architecture processor with a high degree of multicore and multithreading capability. IBM calls it a "wire-speed processor". It was designed as a hybrid of a traditional network processor, which does switching and routing, and a typical server processor, which processes and packages data. Processor versions built on the A2 core range from 16 cores at 2.3 GHz and 65 W down to 4 cores at 1.4 GHz and 20 W. Each A2 core can execute 4 threads simultaneously (for comparison, Intel's Hyper-Threading runs two). The chip has 8 MB of cache and, in addition to the general-purpose cores, a set of task-specific engines, for example for XML, encryption and decryption, compression and regular-expression acceleration, as well as four 10G Ethernet ports and two PCIe links. Up to four chips can be linked into an SMP (symmetric multiprocessor) system without any additional support chips. The chips are said to be extremely complex: 1.43 billion transistors on a 428 mm² die in a 45 nm process.
Note: "wire-speed processor" means that the data throughput of the processor matches the data rate of the communication standard. IBM explains the concept this way: the processor is no longer a place where data is digested, that is, where data comes to rest, but a place where data is filtered or modified and then sent on.

Slide 74. Architecture of the IBM PowerPC A2
[Block diagram labels: PLL (x10); Pattern Engine; Access; x8 PHY (x3); x4 PHY; EI3 (x3); Misc I/O; 4x 10GE MAC or 4x 1GE MAC; Pervasive; PCI Exp Gen 2 (x2); Host Ethernet Controller / Packet Processor; Root Engine; Root/EP Engine; PBus Macro; PBus External Controller; PBIC (x4); PBus; Comp / Decomp; Crypto; XML; MC (x2); Mem PHY (x2); AT3 2MB L2; AT2 2MB L2; AT1 2MB L2; AT0 2MB L2. Annotation added on the slide: accelerators.]

Slide 75. Acceleration and interconnect of the IBM PowerPC A2
Four chips are interconnected into an SMP. 4 channels, 800-1600 MHz.
Technology: IBM 45nm SOI
Core Frequency: 2.3GHz at 0.97V (Worst Case Process)
Chip size: 428 mm² (including kerf)
Chip Power (4-AT node): 65W at 2.0GHz, 0.85V, Max Single Chip
Chip Power (1-AT node): 20W at 1.4GHz, 0.77V, Min Single Chip
Main Voltage (VDD): 0.7V to 1.1V
Metal Layers: 11 Cu (3-1x, 2-1.3x, 3-2x, 1-4x, 2-10x)
Latch Count: 3.2M
Transistor Count: 1.43B
A2 Cores / Threads: 16 / 64
L1 I & D Cache: 16 x (16KB + 16KB) SRAM
L2 Cache: 4 x 2MB eDRAM
Hardware Accelerators: Crypto, Compression, Re
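Several figures quoted in the exascale slides and in the PowerPC A2 table can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check under stated assumptions, not a model of any real machine: all function names are hypothetical, decimal units are used (1 PB = 10^6 GB), and the hybrid model assumes each floating-point unit completes one operation per clock cycle, which the quoted text does not state.

```python
def mira_nodes_and_memory_pb(cores: int, cores_per_node: int, gb_per_node: int):
    """Node count and total memory (decimal petabytes) from per-node figures."""
    nodes = cores // cores_per_node
    memory_pb = nodes * gb_per_node / 1_000_000  # 1 PB = 10^6 GB (decimal)
    return nodes, memory_pb


def lightweight_model_cores(chips: int, cores_per_chip: int) -> int:
    """First exaFLOPS model: many lightweight chips; one thread per core assumed."""
    return chips * cores_per_chip


def hybrid_model_flops(clock_hz: int, fpus_per_socket: int, sockets: int) -> int:
    """Second exaFLOPS model. Assumption: one operation per FPU per cycle."""
    return clock_hz * fpus_per_socket * sockets


def gflops_per_watt(flops: int, watts: int) -> float:
    """Energy efficiency needed to sustain `flops` within a power budget."""
    return flops / watts / 1e9


if __name__ == "__main__":
    # Mira: about 0.75 million cores, 16 cores and 16 GB per node.
    nodes, memory_pb = mira_nodes_and_memory_pb(750_000, 16, 16)
    print(f"Mira: {nodes} nodes, {memory_pb} PB of memory")

    # Model 1: 1 million chips with 1000 cores each.
    print(f"Model 1: {lightweight_model_cores(1_000_000, 1_000):.1e} cores")

    # Model 2: 1.0 GHz, 10 000 FPUs per socket, 100 000 sockets.
    flops = hybrid_model_flops(1_000_000_000, 10_000, 100_000)
    print(f"Model 2: {flops:.1e} flop/s")

    # An exaflop inside a 20 MW budget.
    print(f"Required efficiency: {gflops_per_watt(10**18, 20_000_000):.0f} GFLOPS/W")

    # PowerPC A2: 16 cores with 4-way multithreading; 4 x 2MB L2.
    print(f"A2: {16 * 4} threads, {4 * 2} MB of L2 cache")
```

The per-node figures reproduce the 0.75 petabytes quoted for Mira, the first model yields the "billion threads" of the quote, the second reaches 10^18 operations per second only under the one-operation-per-cycle assumption, and 16 cores with 4 threads each match the 16 / 64 row of the A2 table.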
