




Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Naveen Suda, Vikas Chandra, Ganesh Dasika*, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu Cao
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, USA
School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA
ARM Inc., San Jose, USA; *ARM Inc., Austin, USA
E-mail: naveen.suda, abinash.mohanty, yufeima, vrudhula, jaesun.seo, vikas.chandra

ABSTRACT
Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operations, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on the P395-D8 board.

Categories and Subject Descriptors
C.3 [SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS]: Signal processing systems.

Keywords
FPGA, OpenCL, Convolutional Neural Networks, Optimization.

FPGA'16, February 21-23, 2016, Monterey, CA, USA. (c) 2016 ACM. ISBN 978-1-4503-3856-1/16/02. DOI: 10.1145/2847263.2847276
1. INTRODUCTION
Convolutional Neural Networks (CNNs), inspired by the visual cortex of the brain, are a category of feed-forward artificial neural networks. CNNs, which are primarily employed in computer vision applications such as character recognition [1], image classification [2, 9, 16, 17], video classification [3], face detection [4], and gesture recognition [5], are also being used in a wide range of fields including speech recognition [6], natural language processing [7] and text classification [8]. Over the past decade, the accuracy and performance of CNN-based algorithms have improved significantly, mainly due to the enhanced network structures enabled by massive training datasets, and increased raw computational power aided by CMOS scaling that allows the models to be trained in a reasonable amount of time.

A typical CNN architecture has multiple convolutional layers that extract features from the input data, followed by classification layers. The operations in CNNs are computationally intensive, with over a billion operations per input image [9], thus requiring high-performance server CPUs and GPUs to train the models. However, these are not energy efficient, and hence various hardware accelerators have been proposed based on FPGAs [10]-[13], SoCs (CPU + FPGA) [14] and ASICs [15]. FPGA-based hardware accelerators have gained momentum owing to their reconfigurability and fast development time, especially with the availability of high-level synthesis (HLS) tools from FPGA vendors. Moreover, FPGAs provide the flexibility to implement CNNs with limited data precision, which reduces the memory footprint and bandwidth requirements, resulting in better energy efficiency (e.g., GOPS/Watt).

Previous FPGA-based CNN accelerator designs primarily focused on optimizing the computational resources without considering the impact of external memory transfers [10, 11], or on optimizing the external memory transfers through data reuse [12, 13]. The authors of [13] proposed a design space exploration methodology for a CNN accelerator by optimizing both the computation resources and the external memory accesses, but implemented only the convolution layers. In this work, we present a systematic methodology for maximizing the throughput of an FPGA-based accelerator for an entire CNN model consisting of all CNN layers: convolution, normalization, pooling and classification layers.

The key contributions of this work are summarized as follows:
- A CNN with fixed-point operations is implemented on an FPGA using the OpenCL framework. Critical design variables that impact the throughput are identified for optimization.
- The execution time of each CNN layer is analytically modeled as a function of these design variables and validated on the FPGA.
- Logic utilization is empirically modeled using FPGA synthesis data for each CNN layer as a function of the design variables.
- A systematic methodology is proposed to minimize the total execution time of a given CNN algorithm, subject to the FPGA hardware constraints of logic utilization, computational resources, on-chip memory and external memory bandwidth.
- The new methodology is demonstrated by maximizing the throughput of two large-scale CNNs, AlexNet [16] and VGG [17] (which achieved top accuracies in the ImageNet challenges of 2012 and 2014, respectively), on two Altera FPGA platforms with different hardware resources.
The rest of the paper is organized as follows. Section 2 briefly describes the operations of CNNs using AlexNet as an example. Section 3 presents the challenges in implementing a large-scale CNN on FPGAs; it also studies the impact of the precision of the weights on the accuracy of the AlexNet and VGG models. Section 4 briefly presents the OpenCL implementation of the CNN layers and describes the design variables used for acceleration. Section 5 describes our proposed methodology for design space exploration to maximize the throughput of the CNN accelerator with limited FPGA resources. Section 6 presents the experimental results of the two CNNs optimized on two Altera FPGA platforms and compares them with prior work. Section 7 concludes the paper.

Figure 1: Architecture of the AlexNet CNN [16].
2. BASIC OPERATIONS OF CNN
A typical CNN is comprised of multiple convolutional layers, interspersed with normalization, pooling and non-linear activation functions. The convolution layers decompose the input image into different feature maps, varying from low-level features such as edges, lines and curves in the initial layers to high-level/abstract features in the deeper layers. The extracted features are classified into output classes by fully-connected classification layers that are similar to multi-layer perceptrons. For example, Figure 1 shows the architecture of the AlexNet CNN [16], which won the ImageNet challenge in 2012. It consists of 5 convolutional layers, each with a Rectified Linear Unit (ReLU) based activation function, interspersed with 2 normalization layers and 3 pooling layers, and concluded by 3 fully-connected layers which classify the input 224x224 color images into 1,000 output classes. ImageNet-based models are characterized by their top-1 and top-5 accuracies, which indicate whether the input image label matches the top-1 and top-5 predictions, respectively.
2.1 Convolution
Convolution is the most critical operation of CNNs and constitutes over 90% of the total operations in the AlexNet model [13]. It involves a 3-dimensional multiply-and-accumulate operation of Nif input features with K x K convolution filters to obtain an output feature neuron value, as shown in Equation (1):

out(f_o, x, y) = \sum_{f_i=0}^{N_{if}} \sum_{k_x=0}^{K} \sum_{k_y=0}^{K} wt(f_o, f_i, k_x, k_y) \cdot in(f_i, x + k_x, y + k_y)    (1)

where out(f_o, x, y) and in(f_i, x, y) represent the neurons at location (x, y) in the feature maps f_o and f_i, respectively, and wt(f_o, f_i, k_x, k_y) is the weight at position (k_x, k_y) that gets convolved with the input feature map f_i to produce the output feature map f_o.
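As an illustration only (not the accelerator code of this work), the 3-D multiply-accumulate of Equation (1) can be written as a plain nested loop in C. The function name conv_layer, the row-major array layouts, and the unit stride without padding are assumptions made for this sketch; grouped convolutions, as used in some AlexNet layers, are not handled.

#include <stddef.h>

/* Sketch of Equation (1): each output neuron accumulates weight * input products
 * over all Nif input feature maps and a KxK window.
 * Assumed layouts: in[Nif][H][W], wt[Nof][Nif][K][K], out[Nof][H-K+1][W-K+1]. */
void conv_layer(const float *in, const float *wt, float *out,
                size_t Nif, size_t Nof, size_t H, size_t W, size_t K)
{
    size_t oh = H - K + 1, ow = W - K + 1;   /* output size for stride 1, no padding */
    for (size_t fo = 0; fo < Nof; fo++)
        for (size_t y = 0; y < oh; y++)
            for (size_t x = 0; x < ow; x++) {
                float acc = 0.0f;
                for (size_t fi = 0; fi < Nif; fi++)
                    for (size_t ky = 0; ky < K; ky++)
                        for (size_t kx = 0; kx < K; kx++)
                            acc += wt[((fo * Nif + fi) * K + ky) * K + kx]
                                 * in[(fi * H + (y + ky)) * W + (x + kx)];
                out[(fo * oh + y) * ow + x] = acc;   /* ReLU, if any, is applied afterwards */
            }
}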
2.2 Normalization
Local Response Normalization (LRN), or simply normalization, implements a form of lateral inhibition [16] by normalizing each neuron value by a factor that depends on the neighboring neurons. LRN across neighboring features and within the same feature can be computed as shown in Equations (2) and (3), respectively:

out(f_o, x, y) = in(f_o, x, y) / ( 1 + (a/K) \sum_{f_i = f_o - K/2}^{f_o + K/2} in(f_i, x, y)^2 )^b    (2)

out(f_o, x, y) = in(f_o, x, y) / ( 1 + (a/K^2) \sum_{k_x = x - K/2}^{x + K/2} \sum_{k_y = y - K/2}^{y + K/2} in(f_o, k_x, k_y)^2 )^b    (3)

where K in Equation (2) is the number of feature maps used for the LRN computation, K in Equation (3) is the number of neurons in the x and y directions within the same feature map, and a and b are constants.
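A minimal C sketch of the cross-feature LRN of Equation (2) follows. The function name lrn_across_features and the clamping of the window at the first and last feature maps are assumptions of this sketch, and the constants a and b are passed in as generic parameters.

#include <math.h>
#include <stddef.h>

/* Sketch of Equation (2): each neuron is divided by
 * (1 + (a/K) * sum of squares of the K neighboring feature maps)^b.
 * Assumed layout: in[Nf][H][W] -> out[Nf][H][W]. */
void lrn_across_features(const float *in, float *out,
                         size_t Nf, size_t H, size_t W,
                         size_t K, float a, float b)
{
    for (size_t fo = 0; fo < Nf; fo++)
        for (size_t y = 0; y < H; y++)
            for (size_t x = 0; x < W; x++) {
                /* clamp the window [fo - K/2, fo + K/2] to valid feature indices */
                size_t lo = (fo >= K / 2) ? fo - K / 2 : 0;
                size_t hi = (fo + K / 2 < Nf - 1) ? fo + K / 2 : Nf - 1;
                float sum = 0.0f;
                for (size_t fi = lo; fi <= hi; fi++) {
                    float v = in[(fi * H + y) * W + x];
                    sum += v * v;
                }
                out[(fo * H + y) * W + x] =
                    in[(fo * H + y) * W + x] / powf(1.0f + (a / K) * sum, b);
            }
}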
2.3 Pooling
Spatial pooling or subsampling is used to reduce the feature dimensions as we traverse deeper into the network. As shown in Equation (4), pooling computes the maximum or the average of the neighboring K x K neurons in the same feature map, which also provides a form of translational invariance [18]. Although max-pooling is the most popular choice, average pooling is also used in some CNN models [18]. By reducing the dimensionality of the lower-level features while preserving the important information, the pooling layer helps abstract higher-level features without redundancy.

out(f_o, x, y) = max/average_{0 <= (k_x, k_y) < K} ( in(f_o, x + k_x, y + k_y) )    (4)
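For concreteness, a C sketch of Equation (4) with the max operator is shown below. The function name max_pool, the row-major layout and the explicit stride parameter are assumptions of this sketch (AlexNet, for instance, pools 3x3 windows with a stride of 2).

#include <stddef.h>

/* Sketch of Equation (4), "max" variant: each output neuron is the maximum over a
 * KxK window of the same feature map, with the window moved by the given stride.
 * Assumed layout: in[Nf][H][W] -> out[Nf][oh][ow]. */
void max_pool(const float *in, float *out,
              size_t Nf, size_t H, size_t W, size_t K, size_t stride)
{
    size_t oh = (H - K) / stride + 1;
    size_t ow = (W - K) / stride + 1;
    for (size_t f = 0; f < Nf; f++)
        for (size_t y = 0; y < oh; y++)
            for (size_t x = 0; x < ow; x++) {
                float m = in[(f * H + y * stride) * W + x * stride];
                for (size_t ky = 0; ky < K; ky++)
                    for (size_t kx = 0; kx < K; kx++) {
                        float v = in[(f * H + y * stride + ky) * W + x * stride + kx];
                        if (v > m) m = v;
                    }
                out[(f * oh + y) * ow + x] = m;
            }
}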
2.4 Activation Functions
The commonly used activation functions in traditional neural networks are non-linear functions such as tanh and sigmoid, which require a longer training time in CNNs [16]. Hence, the Rectified Linear Unit (ReLU), defined as y = max(x, 0), has become the popular activation function among CNN models, as it converges faster during training. Moreover, ReLU has a lower computational complexity than the exponential functions in tanh and sigmoid, which also benefits hardware design.
2.5 Fully Connected Layer
The fully-connected layer, or inner-product layer, is the classification layer in which all the input features (Nif) are connected to all of the output features (Nof) through synaptic weights (wt). Each output neuron is the weighted summation of all the input neurons, as shown in Equation (5):

out(f_o) = \sum_{f_i=0}^{N_{if}} wt(f_o, f_i) \cdot in(f_i)    (5)

The outputs of the inner-product layer pass through a ReLU-based activation function to the next inner-product layer, or directly to a Softmax function that converts them to probabilities in the range (0, 1). The final accuracy layer compares the labels of the top probabilities from the softmax layer with the actual label and gives the accuracy of the CNN model.
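The inner product of Equation (5) and the subsequent Softmax can be sketched in C as follows. The names fully_connected and softmax are illustrative, and the max-subtraction in the softmax is a standard numerical-stability trick rather than something specified by the paper.

#include <math.h>
#include <stddef.h>

/* Sketch of Equation (5): out(fo) = sum over fi of wt(fo, fi) * in(fi),
 * followed by the Softmax that maps the Nof outputs to probabilities in (0, 1).
 * Assumed layout: wt[Nof][Nif]. */
void fully_connected(const float *in, const float *wt, float *out,
                     size_t Nif, size_t Nof)
{
    for (size_t fo = 0; fo < Nof; fo++) {
        float acc = 0.0f;
        for (size_t fi = 0; fi < Nif; fi++)
            acc += wt[fo * Nif + fi] * in[fi];
        out[fo] = acc;
    }
}

void softmax(float *v, size_t n)
{
    float max = v[0], sum = 0.0f;
    for (size_t i = 1; i < n; i++)          /* find the max for numerical stability */
        if (v[i] > max) max = v[i];
    for (size_t i = 0; i < n; i++) {
        v[i] = expf(v[i] - max);
        sum += v[i];
    }
    for (size_t i = 0; i < n; i++)
        v[i] /= sum;
}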
Table 1. Operations in the AlexNet CNN model [16]. (The table lists, for each layer of AlexNet (the input image, Convolution-1 through Convolution-5 with ReLU, Normalization-1 and -2, Pooling-1, -2 and -5, and Fully connected-6 through -8), the numbers of output and input features, the feature dimensions, the stride, the kernel/weight dimensions and the number of operations; the total is approximately 1.46 billion operations.)
a. Normalization across 5 neighboring channels.
b. Max-pooling across a 3x3 window.
c. Convolution performed in 2 groups.

3. CNN MODEL STUDY AND FPGA DESIGN DIRECTIONS
3.1 FPGA Implementation Challenges
While CNNs have proven indispensable in many computer vision applications, they consume a significant amount of storage, external memory bandwidth, and computational resources, which makes them difficult to implement on an embedded platform. The challenges in implementing a large-scale CNN on FPGAs are illustrated using the AlexNet model as an example. The different layers in AlexNet, along with the number of features in each layer, the feature dimensions, the number of synaptic weights and the total number of operations in each layer, are summarized in Table 1. The model has over 60 million parameters, which require about 250 MB of memory to store the weights in a 32-bit floating-point representation; hence the weights cannot fit in the on-chip memory of commercially available FPGAs. They need to be kept in external memory and transferred to the FPGA during computation, which can become a performance bottleneck.
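As a quick sanity check on this figure (an approximation, not a value taken from Table 1): storing slightly more than 60 million parameters as 32-bit (4-byte) floating-point values requires

60 \times 10^6 \times 4\ \text{bytes} \approx 240\ \text{MB},

which is consistent with the roughly 250 MB quoted above.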
The AlexNet model consists of 5 convolution layers, 2 LRN layers, 3 pooling layers and 3 fully-connected layers, where each layer has a different number of features and different dimensions. If these layers are implemented independently without resource sharing, the design would either be hardware-inefficient or may not fit on the FPGA due to the limited logic resources. The problem is exacerbated in state-of-the-art models such as VGG [17] and GoogLeNet [9], which have a larger number of layers. To efficiently share hardware resources, repeated computations (e.g., convolution) should be implemented with scalable hardware modules [13], such that the same hardware is reused by iterating the data through them. The performance limitation due to the external memory bandwidth can be alleviated by using reduced-precision model weights. Hence, we performed a precision study by sweeping the model weights and chose the precision values that have minimal impact on the classification accuracy.
3.2 Precision Study for FPGA Primitives
Traditionally, CNN models are trained in CPU/GPU environments using 32-bit floating-point data. Such high precision is not necessarily required in the testing or classification phase, owing to the redundancy in over-parameterized CNN models [19]. Reducing the data precision of the weights/data without any impact on the accuracy directly reduces the storage requirement as well as the energy spent on memory transfers.

Using the AlexNet and VGG models, we explored the precision requirements of the convolution and fully-connected layer weights using the Caffe framework [20]. We obtained the pre-trained models from Caffe, rounded the convolution weights and the inner-product weights separately, and tested the models using the ImageNet-2012 validation dataset of 50K images. Although the data precision is reduced, the Caffe tool still performs the CNN operations in 32-bit floating-point precision using the rounded-off weights. Figure 2 shows the top-1 and top-5 accuracies of the model for a precision sweep of the weights. It shows that the accuracy drops steeply if the weight precision is reduced below 8 bits. We use a common precision for the weights in all convolution layers, as the same hardware block will be reused for all the convolution layer iterations. We choose 8-bit precision for the convolution weights and 10-bit precision for the inner-product weights, which degrades the accuracy by only 1% compared to full-precision weights. Similarly, we choose 16-bit precision for the intermediate layer data by performing the precision study.

Figure 2: Top-1 and top-5 accuracies of the AlexNet model versus the precision of the convolution weights and the inner-product weights.
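The paper does not spell out its exact rounding scheme, but the idea of the sweep can be illustrated with a small C routine that rounds a weight to a signed fixed-point value of a chosen width and converts it back to float, which is then used in floating-point arithmetic as Caffe does with the rounded-off weights. The name quantize_weight and the choice of fractional bits are assumptions of this sketch.

#include <math.h>
#include <stdio.h>

/* Illustrative rounding of a weight to a signed n-bit fixed-point value and back
 * to float, mimicking a precision sweep in which the rounded weights are still
 * used in floating-point arithmetic. frac_bits fractional bits are assumed;
 * the paper does not specify its exact quantization scheme. */
float quantize_weight(float w, int total_bits, int frac_bits)
{
    float scale = (float)(1 << frac_bits);
    long  q     = lroundf(w * scale);
    long  qmax  = (1L << (total_bits - 1)) - 1;   /* e.g. +127 for 8 bits */
    long  qmin  = -(1L << (total_bits - 1));      /* e.g. -128 for 8 bits */
    if (q > qmax) q = qmax;
    if (q < qmin) q = qmin;
    return (float)q / scale;
}

int main(void)
{
    /* Example: an 8-bit convolution-weight format with 6 fractional bits. */
    float w = 0.3172f;
    printf("original %.4f -> quantized %.4f\n", w, quantize_weight(w, 8, 6));
    return 0;
}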
3.3 FPGA Accelerator Design Directions
In our FPGA design, we first developed computing primitives of CNNs using the OpenCL framework. A scalable convolution module is designed based on a matrix-multiplication operation in OpenCL, so that it can be reused for all convolution layers with different input and output dimensions. Similarly, we developed scalable hardware modules for the normalization, pooling, and fully-connected layers. We identified the key design variables that impact the throughput, to be used for optimization.
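One common way to realize a convolution layer as a matrix multiplication, shown below as a plain-C sketch, is to flatten each Nif x K x K input patch into a column and multiply it by the weight matrix. This is only an illustration of the general idea; the paper does not detail how its OpenCL convolution module arranges the data, and the names im2col and matmul are our own.

#include <stddef.h>

/* Flatten each KxK x Nif input patch into one column of a (Nif*K*K) x (oh*ow)
 * matrix, so that the convolution of Equation (1) becomes the matrix product
 * out[Nof][oh*ow] = wt[Nof][Nif*K*K] * col[Nif*K*K][oh*ow]. */
void im2col(const float *in, float *col,
            size_t Nif, size_t H, size_t W, size_t K)
{
    size_t oh = H - K + 1, ow = W - K + 1;
    for (size_t fi = 0; fi < Nif; fi++)
        for (size_t ky = 0; ky < K; ky++)
            for (size_t kx = 0; kx < K; kx++) {
                size_t row = (fi * K + ky) * K + kx;
                for (size_t y = 0; y < oh; y++)
                    for (size_t x = 0; x < ow; x++)
                        col[row * (oh * ow) + y * ow + x] =
                            in[(fi * H + (y + ky)) * W + (x + kx)];
            }
}

/* Straightforward matrix multiply: C[MxN] = A[MxKdim] * B[KdimxN]. */
void matmul(const float *A, const float *B, float *C,
            size_t M, size_t Kdim, size_t N)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < Kdim; k++)
                acc += A[i * Kdim + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}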
Figure 3: Design flow of the OpenCL-based FPGA accelerator for CNN.
The OpenCL-based flow abstracts the designer/user from the intricacies of the traditional FPGA design flow, such as RTL coding, integration with interfacing IPs and timing closure, which considerably reduces the design time while achieving performance comparable to the traditional flow, although possibly at the expense of higher on-chip memory utilization [22]. The design flow of the OpenCL-based FPGA accelerator for CNN used in this work is shown in Figure 3. It consists of