A 0.11 pJ/op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology

NVIDIA RESEARCH OVERVIEW
Research teams: Graphics, Deep Learning, Robotics, Computer Vision, Parallel Architectures, Programming Systems, Circuits, VLSI, Networks
Recent works: RTX, NVSwitch, cuDNN, CNN image inpainting, noise-to-noise denoising, Progressive GAN
Research testchips: RC12, RC13, RC16, RC17, RC18
Research testchip goals: develop and demonstrate underlying technologies for efficient DL inference
This work: a scalable architecture for DL inference acceleration and a high-productivity design methodology

VAST WORLD OF AI INFERENCE
Creating a massive market opportunity across general-purpose computers, embedded computers, and embedded devices

TARGET APPLICATIONS
Image classification with convolutional neural networks
Deep learning models (AlexNet, DriveNet, ResNet) run across different MCM configurations
Ref: Krizhevsky et al., NeurIPS 2012; Bojarski et al., CoRR 2016; He et al., CoRR 2015

SCALABLE DEEP LEARNING INFERENCE ACCELERATOR

MULTI-CHIP-MODULE (MCM) ARCHITECTURE
Demonstrates: low-effort scaling to high inference performance; Ground-Referenced Signaling (GRS) as an MCM interconnect; a Network-on-Package architecture
Advantages: overcomes reticle limits, higher yield, lower design cost, ability to mix process technologies, agility in changing product SKUs
Challenges: area and power overheads for the inter-chip interfaces
Ref: Zimmer et al., VLSI 2019

HIERARCHICAL COMMUNICATION ARCHITECTURE
Network-on-Package (NoP) and Network-on-Chip (NoC)
Network-on-Package (NoP):
- A 6x6 mesh topology connects the 36 chips in the package
- A single NoP router per chip, with 4 interface ports into the NoC
- Configurable routing to avoid bad links or chips
- 20 ns per hop, 100 Gbps per link (at maximum)
Network-on-Chip (NoC):
- A 4x5 mesh topology connects 16 PEs, one Global PE, and one RISC-V core
- Cut-through routing with multicast support
- 10 ns per hop, 70 Gbps per link (at 0.72 V)
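To put the per-hop figures in perspective, the short sketch below (not from the original deck) estimates a worst-case corner-to-corner path across the package, assuming minimal dimension-ordered XY routing; the actual routing is configurable, so the hop counts here are an illustrative assumption only.

    #include <cstdio>

    // Worst-case hop count for dimension-ordered (XY) routing on a W x H mesh:
    // at most (W-1) + (H-1) hops (assumption: minimal XY routing; the real
    // NoP routing tables are configurable and may take different paths).
    static int max_hops(int w, int h) { return (w - 1) + (h - 1); }

    int main() {
        const double nop_hop_ns = 20.0;  // per-hop latency on the Network-on-Package
        const double noc_hop_ns = 10.0;  // per-hop latency on the Network-on-Chip

        int nop_hops = max_hops(6, 6);   // 6x6 NoP mesh: 10 hops corner-to-corner
        int noc_hops = max_hops(4, 5);   // 4x5 NoC mesh: 7 hops corner-to-corner

        // Corner-to-corner traversal: source NoC -> NoP -> destination NoC.
        double worst_ns = noc_hops * noc_hop_ns + nop_hops * nop_hop_ns + noc_hops * noc_hop_ns;
        printf("Worst-case NoP traversal: %d hops = %.0f ns\n", nop_hops, nop_hops * nop_hop_ns);
        printf("Worst-case chip-to-chip path (NoC+NoP+NoC): ~%.0f ns\n", worst_ns);
        return 0;
    }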

GROUND-REFERENCED SIGNALING (GRS)
High-bandwidth, energy-efficient inter-chip communication
- High speed: 11-25 Gbps per pin
- High energy efficiency: low voltage swing (200 mV), 0.82-1.75 pJ/bit
- High area efficiency: single-ended links, 4 data bumps + 1 clock bump per GRS link
Ref: Poulton et al., JSSC 2019
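The per-link bandwidth quoted for the NoP follows directly from these per-pin figures; the sketch below works through that arithmetic and, as a purely hypothetical example, the energy to move a 1 MB tile across one GRS link.

    #include <cstdio>

    int main() {
        // Per-pin and per-link GRS figures from the slide.
        const double gbps_per_pin_max = 25.0;  // 11-25 Gbps per pin
        const int data_pins_per_link  = 4;     // 4 data bumps + 1 clock bump per link
        const double pj_per_bit_min   = 0.82;  // 0.82-1.75 pJ/bit
        const double pj_per_bit_max   = 1.75;

        double link_gbps = gbps_per_pin_max * data_pins_per_link;  // 100 Gbps per link (max)

        // Hypothetical example: energy to move a 1 MB tile across one GRS link.
        double bits = 1.0e6 * 8;
        printf("Peak link bandwidth: %.0f Gbps\n", link_gbps);
        printf("Energy per 1 MB transfer: %.1f-%.1f uJ\n",
               bits * pj_per_bit_min * 1e-6, bits * pj_per_bit_max * 1e-6);
        return 0;
    }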

SCALABLE DL INFERENCE ACCELERATOR
Tiled architecture with distributed memory
Each chip contains 16 processing elements (PEs) connected by the NoC. Each PE comprises eight vector MAC units, a distributed weight buffer, an input buffer, an accumulation buffer, address generators, a manager, a router interface with serdes, and a post-processing unit (PPU) for truncation, ReLU, pooling, and bias.
Ref: Zimmer et al., VLSI 2019; Sijstermans et al., HotChips 2018

SCALABLE DL INFERENCE ACCELERATOR
CNN layer execution (dimensions H, W, C for input activations; R, S, C, K for weights; P, Q, K for output activations):
- Distribute weights across the PEs
- Load input activations into the Global PE
- The RISC-V core configures the control registers
- Stream input activations to the PEs
- Store output activations back to the Global PE
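For reference, the dimension letters above follow the usual convolution notation (H/W/C: input height, width, channels; R/S: filter height, width; K: output channels; P/Q: output height, width). A plain, unoptimized C++ loop nest over those dimensions is sketched below (stride 1, no padding); it is purely illustrative and is not the accelerator's dataflow.

    #include <cstdio>
    #include <vector>

    // Naive stride-1, unpadded convolution over the dimensions named on the slide:
    //   input  activations: C x H x W
    //   weights:            K x C x R x S
    //   output activations: K x P x Q, where P = H - R + 1, Q = W - S + 1
    void conv_layer(const std::vector<float>& in, const std::vector<float>& wt,
                    std::vector<float>& out,
                    int C, int H, int W, int K, int R, int S) {
        int P = H - R + 1, Q = W - S + 1;
        out.assign((size_t)K * P * Q, 0.0f);
        for (int k = 0; k < K; ++k)                 // output channels
          for (int p = 0; p < P; ++p)               // output rows
            for (int q = 0; q < Q; ++q)             // output columns
              for (int c = 0; c < C; ++c)           // input channels
                for (int r = 0; r < R; ++r)         // filter rows
                  for (int s = 0; s < S; ++s)       // filter columns
                    out[((size_t)k * P + p) * Q + q] +=
                        in[((size_t)c * H + (p + r)) * W + (q + s)] *
                        wt[(((size_t)k * C + c) * R + r) * S + s];
    }

    int main() {
        int C = 2, H = 5, W = 5, K = 3, R = 3, S = 3;  // tiny example sizes
        std::vector<float> in(C * H * W, 1.0f), wt((size_t)K * C * R * S, 1.0f), out;
        conv_layer(in, wt, out, C, H, W, K, R, S);
        std::printf("out[0] = %.1f\n", out[0]);  // 2*3*3 = 18 with all-ones inputs and weights
        return 0;
    }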

SCALING DL INFERENCE ACROSS NOP/NOC
Tiling a convolutional layer across chips and processing elements
The input-channel (C) and output-channel (K) dimensions of a layer are first partitioned into per-chip tiles (Cchip x Kchip, e.g. across Chip 0 through Chip 3), and each chip's tile is further partitioned into per-PE tiles (Cpe x Kpe) across its 16 PEs.
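A minimal sketch of this two-level C/K partitioning follows; the 2x2 chip grid, the 4x4 per-chip PE split, and even divisibility are illustrative assumptions. Note that splitting C means partial sums must afterwards be accumulated across the units that share an output channel.

    #include <cstdio>

    // Two-level tiling of a convolutional layer's channel dimensions:
    // C (input channels) and K (output channels) are first split across chips,
    // then each chip's slice is split again across its processing elements (PEs).
    struct Tile { int c_begin, c_end, k_begin, k_end; };

    int main() {
        const int C = 256, K = 512;          // layer dimensions (example values)
        const int CHIPS_C = 2, CHIPS_K = 2;  // assumed 2x2 chip grid splitting C and K
        const int PES_C = 4, PES_K = 4;      // assumed 4x4 PE split within each chip (16 PEs)

        int Cchip = C / CHIPS_C, Kchip = K / CHIPS_K;   // per-chip slice
        int Cpe = Cchip / PES_C, Kpe = Kchip / PES_K;   // per-PE slice

        for (int cc = 0; cc < CHIPS_C; ++cc)
          for (int ck = 0; ck < CHIPS_K; ++ck)
            for (int pc = 0; pc < PES_C; ++pc)
              for (int pk = 0; pk < PES_K; ++pk) {
                Tile t = { cc * Cchip + pc * Cpe, cc * Cchip + (pc + 1) * Cpe,
                           ck * Kchip + pk * Kpe, ck * Kchip + (pk + 1) * Kpe };
                // Each PE computes partial sums for output channels [k_begin, k_end)
                // using only input channels [c_begin, c_end); partial sums from PEs and
                // chips sharing the same K range are accumulated afterwards.
                printf("chip(%d,%d) pe(%d,%d): C[%d,%d) K[%d,%d)\n",
                       cc, ck, pc, pk, t.c_begin, t.c_end, t.k_begin, t.k_end);
              }
        return 0;
    }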

FABRICATED MCM-BASED ACCELERATOR
NVIDIA Research prototype: 36 chips on a package in TSMC 16 nm technology
- Core area: 111.6 mm2; voltage: 0.52-1.1 V; frequency: 0.48-1.8 GHz
- High-speed interconnect using Ground-Referenced Signaling (GRS): 100 Gbps per link
- Efficient compute tiles: 9.5 TOPS/W, 128 TOPS
- Low design effort: spec-to-tapeout in 6 months with 10 researchers
Ref: Zimmer et al., VLSI 2019
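The headline figures are mutually consistent: 9.5 TOPS/W corresponds to roughly 0.105 pJ per operation, i.e. the 0.11 pJ/op quoted in the title. A one-line check:

    #include <cstdio>

    int main() {
        // 9.5 TOPS/W = 9.5e12 ops per joule; invert to get energy per operation in pJ.
        const double tops_per_watt = 9.5;
        double pj_per_op = 1.0e12 / (tops_per_watt * 1.0e12);
        printf("%.3f pJ/op\n", pj_per_op);  // ~0.105 pJ/op, matching the title's 0.11 pJ/op
        return 0;
    }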

HIGH-PRODUCTIVITY DESIGN METHODOLOGY

HIGH-PRODUCTIVITY DESIGN APPROACH
Enables faster time-to-market and more features in each SoC
Raise the hardware design level of abstraction:
- Use high-level languages, e.g. C++ instead of Verilog
- Use automation, e.g. high-level synthesis (HLS)
- Use libraries and generators, e.g. MatchLib
Agile VLSI design:
- Small teams working jointly on architecture, implementation, and VLSI
- Continuous integration with automated tool flows
- Agile project management techniques
- 24-hour spins from C++ to layout

OBJECT-ORIENTED HIGH-LEVEL SYNTHESIS
Leverage HLS tools to design with C++ and SystemC models
MatchLib (Modular Approach To Circuits and Hardware Library): an "STL/Boost" for hardware design
- Synthesizable hardware library developed by NVIDIA Research
- Highly parameterized, high-QoR implementation
- Available open source at /NVlabs/matchlib
Latency-insensitive (LI) channels:
- Enable modularity in the design process
- Decouple the computation and communication architectures
"Push-button" C++-to-gates flow: MatchLib (C++/SystemC), LI channels (SystemC), the architectural model (C++/SystemC), and the verification testbench (SystemC) feed C++ simulation for functional and performance verification, then HLS produces the RTL.
6 months from spec to tapeout with 10 engineers
Ref: Khailany et al., DAC 2018
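To make the latency-insensitive channel idea concrete, below is a minimal module sketch in the style of MatchLib's Connections ports (blocking Push/Pop on ready/valid channels). The header path and exact class and method names are assumptions based on the public repository and should be checked against it; this is a sketch, not the deck's actual PE code.

    // Sketch of a latency-insensitive pipeline stage in the MatchLib/Connections style.
    // Header path and API details are assumptions; see the NVlabs/matchlib repository.
    #include <systemc.h>
    #include <nvhls_connections.h>   // assumed MatchLib header providing Connections::In/Out

    SC_MODULE(Scale2x) {
        sc_in<bool> clk;
        sc_in<bool> rst;
        Connections::In<int>  in;    // latency-insensitive input channel (valid/ready + data)
        Connections::Out<int> out;   // latency-insensitive output channel

        SC_CTOR(Scale2x) {
            SC_THREAD(run);
            sensitive << clk.pos();
            async_reset_signal_is(rst, false);
        }

        void run() {
            in.Reset();
            out.Reset();
            wait();
            while (true) {
                int x = in.Pop();    // blocks until the producer presents valid data
                out.Push(2 * x);     // blocks until the consumer is ready
            }
        }
    };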

PROCESSING ELEMENT IMPLEMENTATION
Reuse, modularity, hierarchical design

AGILE VLSI DESIGN TECHNIQUES
Daily "C++ to layout" spins: C++ → RTL (HLS + Verilog generation, 3 hrs) → gates (synthesis, 2 hrs) → GDS (place & route, 12 hrs) → QoR feedback

Example C++ input to HLS (a simple counter):

    void dut() {
        if (rst.read() == 1) {
            count = 0;
            counter_out.write(count);
        } else if (enable.read() == 1) {
            count = count + 1;
            counter_out.write(count);
        }
    }

- Agile, incremental approach to design closure during the march-to-tapeout phase
- Small, abutting partitions for fast place-and-route iterations
- Globally asynchronous, locally synchronous pausible adaptive clocking scheme: fast, error-free clock domain crossings and "correct by construction" top-level timing closure
- RTL bugs, performance, and VLSI constraints converge together

Ref: Fojtik et al., ASYNC 2019

EXPERIMENTAL RESULTS

MEASUREMENT SETUP
- Measurements begin after weights and activations are loaded from FPGA DRAM: weights are loaded into PE memory, activations are loaded into the Global PE
- Operating points: maximum performance at 1.1 V; minimum energy at 0.55 V

RESULTS: WEAK SCALING WITH DRIVENET
[Figure: performance (throughput in thousands of images/sec and latency in ms, at 1.1 V) and energy efficiency (core energy and GRS energy in uJ/image, at 0.55 V) versus number of chips, for batch sizes 1, 4, and 32.]
- Scaling to 32 chips achieves a 27X improvement in performance over 1 chip
- Energy proportionality in core energy consumption with weak scaling
- GRS energy can be reduced with a sleep mode
Ref: Bojarski et al., CoRR 2016

RESULTS: STRONG SCALING WITH RESNET-50
[Figure: performance and energy efficiency versus number of chips.]
- Scaling to 32 chips achieves a 12X improvement in performance over 1 chip at batch = 1
- Communication and synchronization overheads limit the speed-up
- High energy efficiency at different scales, with only a small overhead from communication as the number of chips grows
Ref: He et al., CoRR 2015
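Read together, the two scaling results can be summarized as parallel efficiency (speedup divided by chip count): the weak-scaling DriveNet case stays near-linear, while the strong-scaling ResNet-50 case at batch 1 is limited by the communication and synchronization overheads noted above. A small sketch of that calculation from the quoted numbers:

    #include <cstdio>

    int main() {
        // Speedups over a single chip, as quoted on the results slides.
        const double chips = 32.0;
        const double weak_speedup   = 27.0;  // DriveNet, weak scaling
        const double strong_speedup = 12.0;  // ResNet-50, strong scaling, batch = 1

        printf("Weak scaling efficiency:   %.0f%%\n", 100.0 * weak_speedup / chips);   // ~84%
        printf("Strong scaling efficiency: %.0f%%\n", 100.0 * strong_speedup / chips); // ~38%
        return 0;
    }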
