Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Naveen Suda, Vikas Chandra, Ganesh Dasika*, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu Cao
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, USA
School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA
ARM Inc., San Jose, USA; *ARM Inc., Austin, USA
E-mail: naveen.suda, abinash.mohanty, yufeima, vrudhula, jaesun.seo, vikas.chandra

ABSTRACT
Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to the multiple convolution and fully-connected layers that are compute- and memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as their fast turn-around time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, the DE5-Net and P395-D8 boards, which have different hardware resources. We achieve a peak performance of 136.5 GOPS for the convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on the P395-D8 board.

Categories and Subject Descriptors
C.3 [SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS]: Signal processing systems.

Keywords
FPGA, OpenCL, Convolutional Neural Networks, Optimization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
FPGA'16, February 21-23, 2016, Monterey, CA, USA. © 2016 ACM. ISBN 978-1-4503-3856-1/16/02...$15.00. DOI: 10.1145/2847263.2847276
1. INTRODUCTION
Convolutional Neural Networks (CNNs), inspired by the visual cortex of the brain, are a category of feed-forward artificial neural networks. CNNs are primarily employed in computer vision applications such as character recognition [1], image classification [2][9][16][17], video classification [3], face detection [4] and gesture recognition [5], and are also being used in a wide range of other fields including speech recognition [6], natural language processing [7] and text classification [8]. Over the past decade, the accuracy and performance of CNN-based algorithms have improved significantly, mainly due to enhanced network structures enabled by massive training datasets, and to increased raw computational power aided by CMOS scaling, which allows the models to be trained in a reasonable amount of time.

A typical CNN architecture has multiple convolutional layers that extract features from the input data, followed by classification layers. The operations in CNNs are computationally intensive, with over a billion operations per input image [9], and thus require high-performance server CPUs and GPUs to train the models. However, these are not energy efficient, and hence various hardware accelerators have been proposed based on FPGAs [10]-[13], SoCs (CPU + FPGA) [14] and ASICs [15]. FPGA-based hardware accelerators have gained momentum owing to their reconfigurability and fast development time, especially with the availability of high-level synthesis (HLS) tools from FPGA vendors. Moreover, FPGAs provide the flexibility to implement CNNs with limited data precision, which reduces the memory footprint and bandwidth requirements, resulting in better energy efficiency (e.g., GOPS/Watt).

Previous FPGA-based CNN accelerator designs primarily focused either on optimizing the computational resources without considering the impact of external memory transfers [10][11], or on optimizing the external memory transfers through data reuse [12][13]. The authors of [13] proposed a design space exploration methodology for a CNN accelerator that optimizes both computation resources and external memory accesses, but implemented only the convolution layers. In this work, we present a systematic methodology for maximizing the throughput of an FPGA-based accelerator for an entire CNN model consisting of all CNN layers: convolution, normalization, pooling and classification layers.

The key contributions of this work are summarized as follows:
• CNNs with fixed-point operations are implemented on FPGA using the OpenCL framework. Critical design variables that impact the throughput are identified for optimization.
• The execution time of each CNN layer is analytically modeled as a function of these design variables and validated on the FPGA.
• Logic utilization is empirically modeled using FPGA synthesis data for each CNN layer as a function of the design variables.
• A systematic methodology is proposed to minimize the total execution time of a given CNN algorithm, subject to the FPGA hardware constraints of logic utilization, computational resources, on-chip memory and external memory bandwidth.
• The new methodology is demonstrated by maximizing the throughput of two large-scale CNNs, AlexNet [16] and VGG [17] (which achieved top accuracies in the ImageNet challenges of 2012 and 2014, respectively), on two Altera FPGA platforms with different hardware resources.
The rest of the paper is organized as follows. Section 2 briefly describes the operations of CNNs using AlexNet as an example. Section 3 presents the challenges in implementing a large-scale CNN on FPGAs, and studies the impact of the precision of the weights on the accuracy of the AlexNet and VGG models. Section 4 briefly presents the OpenCL implementation of the CNN layers and describes the design variables used for acceleration. Section 5 describes our proposed design space exploration methodology for maximizing the throughput of the CNN accelerator with limited FPGA resources. Section 6 presents the experimental results of the two CNNs optimized on two Altera FPGA platforms and compares them with prior work. Section 7 concludes the paper.
2. BASIC OPERATIONS OF CNN
A typical CNN is comprised of multiple convolutional layers, interspersed with normalization, pooling and non-linear activation functions. The convolutional layers decompose the input image into different feature maps, varying from low-level features such as edges, lines and curves in the initial layers to high-level/abstract features in the deeper layers. The extracted features are then classified into output classes by fully-connected classification layers, which are similar to multi-layer perceptrons. For example, Figure 1 shows the architecture of AlexNet [16], which won the ImageNet challenge in 2012. It consists of 5 convolutional layers, each with a Rectified Linear Unit (ReLU) based activation function, interspersed with 2 normalization layers and 3 pooling layers, and concluded by 3 fully-connected layers that classify the input 224x224 color images into 1,000 output classes. ImageNet-based models are characterized by their top-1 and top-5 accuracies, which represent whether the input image label matches the top-1 and top-5 predictions, respectively.

[Figure 1: Architecture of the AlexNet CNN [16].]
2.1 Convolution
Convolution is the most critical operation of CNNs: it constitutes over 90% of the total operations in the AlexNet model [13]. It involves a 3-dimensional multiply-and-accumulate operation of N_if input feature maps with KxK convolution filters to get one output feature neuron value, as shown in Equation (1):

$$out(f_o, x, y) = \sum_{f_i=0}^{N_{if}} \sum_{k_x=0}^{K} \sum_{k_y=0}^{K} wt(f_o, f_i, k_x, k_y) \times in(f_i, x + k_x, y + k_y) \quad (1)$$

where $out(f_o, x, y)$ and $in(f_i, x, y)$ represent the neurons at location $(x, y)$ in the feature maps $f_o$ and $f_i$, respectively, and $wt(f_o, f_i, k_x, k_y)$ is the weight at position $(k_x, k_y)$ that gets convolved with input feature map $f_i$ to contribute to output feature map $f_o$.
2.2 Normalization
Local Response Normalization (LRN) implements a form of lateral inhibition [16] by normalizing each neuron value by a factor that depends on its neighboring neurons. LRN across neighboring feature maps and within the same feature map can be computed as shown in Equations (2) and (3), respectively:

$$out(f_o, x, y) = \frac{in(f_o, x, y)}{\left(1 + \frac{a}{K} \sum_{f_i = f_o - K/2}^{f_o + K/2} in(f_i, x, y)^2 \right)^{b}} \quad (2)$$

$$out(f_o, x, y) = \frac{in(f_o, x, y)}{\left(1 + \frac{a}{K^2} \sum_{k_x = x - K/2}^{x + K/2} \; \sum_{k_y = y - K/2}^{y + K/2} in(f_o, k_x, k_y)^2 \right)^{b}} \quad (3)$$

where K in Equation (2) is the number of feature maps used for the LRN computation, K in Equation (3) is the number of neurons in the x, y directions within the same feature map, and a and b are constants.
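A minimal sketch of Equation (2) in OpenCL C follows, computing the across-map LRN for one neuron per work-item. The float data type, buffer layout and argument names are our own illustrative assumptions; a hardware implementation would typically use fixed-point arithmetic and an approximation of the power function.

```c
// OpenCL C sketch of Equation (2): LRN across K neighboring feature maps.
// One work-item normalizes one neuron; maps outside [0, n_f) are skipped.
__kernel void lrn_across_maps(__global const float *in,   // [n_f][h][w]
                              __global float       *out,  // [n_f][h][w]
                              int n_f, int h, int w,
                              int k, float a, float b)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int f = get_global_id(2);

    float sum = 0.0f;
    for (int fi = f - k / 2; fi <= f + k / 2; ++fi)
        if (fi >= 0 && fi < n_f) {             // clamp at the map boundaries
            float v = in[(fi * h + y) * w + x];
            sum += v * v;                      // sum of squared neighbors
        }

    int idx = (f * h + y) * w + x;
    out[idx] = in[idx] / pow(1.0f + (a / k) * sum, b);
}
```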
2.3 Pooling
Spatial pooling or subsampling is utilized to reduce the feature dimensions as we traverse deeper into the network. As shown in Equation (4), pooling computes the maximum or average of the neighboring KxK neurons in the same feature map, which also provides a form of translational invariance [18]:

$$out(f_o, x, y) = \underset{0 \le (k_x, k_y) < K}{\max / \mathrm{avg}} \; in(f_o, x + k_x, y + k_y) \quad (4)$$

Although max-pooling is the more popular choice, average pooling is also used in some CNN models [18]. By reducing the dimensionality of the lower-level features while preserving the important information, the pooling layer helps abstract higher-level features without redundancy.
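The max variant of Equation (4) maps directly onto a simple kernel. The sketch below, with an assumed layout and a pooling stride (e.g., a 3x3 window with stride 2, as in Table 1), shows one work-item producing one pooled output; the names are illustrative, not the paper's actual module interface.

```c
// OpenCL C sketch of Equation (4): max-pooling over a K x K window with a
// stride. One work-item computes one pooled output neuron.
__kernel void maxpool_layer(__global const short *in,   // [n_f][h][w]
                            __global short       *out,  // [n_f][h_out][w_out]
                            int h, int w,
                            int k, int stride,
                            int h_out, int w_out)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int f = get_global_id(2);

    short best = SHRT_MIN;
    for (int ky = 0; ky < k; ++ky)
        for (int kx = 0; kx < k; ++kx) {
            short v = in[(f * h + y * stride + ky) * w + x * stride + kx];
            best = (v > best) ? v : best;      // running window maximum
        }
    out[(f * h_out + y) * w_out + x] = best;
}
```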
2.4 Activation Functions
The commonly used activation functions in traditional neural networks are non-linear functions such as tanh and sigmoid, which lead to longer training times in CNNs [16]. Hence, the Rectified Linear Unit (ReLU), defined as y = max(x, 0), has become the popular activation function in CNN models, as it makes training converge faster. Moreover, ReLU has a much lower computational complexity than the exponential functions in tanh and sigmoid, which also aids hardware design.
2.5 Fully Connected Layer
The fully-connected layer or inner-product layer is the classification layer, in which all the input features (N_if) are connected to all of the output features (N_of) through synaptic weights (wt). Each output neuron is the weighted summation of all the input neurons, as shown in Equation (5):

$$out(f_o) = \sum_{f_i=0}^{N_{if}} wt(f_o, f_i) \times in(f_i) \quad (5)$$

The outputs of the inner-product layer traverse through a ReLU-based activation function to the next inner-product layer, or directly to a Softmax function that converts them into probabilities in the range (0, 1). The final accuracy layer compares the labels of the top probabilities from the softmax layer with the actual label and gives the accuracy of the CNN model.
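Equation (5) is a matrix-vector product; a minimal OpenCL C sketch is given below, with one work-item per output neuron. As before, the buffer layout and names are illustrative assumptions, and the rescaling of the fixed-point accumulator (e.g., for the 10-bit inner-product weights chosen in Section 3.2) is omitted.

```c
// OpenCL C sketch of Equation (5): each work-item computes one output
// neuron as the dot product of the input vector with one weight row.
__kernel void fc_layer(__global const short *in,   // [n_if]
                       __global const short *wt,   // [n_of][n_if]
                       __global short       *out,  // [n_of]
                       int n_if)
{
    int fo = get_global_id(0);

    int acc = 0;
    for (int fi = 0; fi < n_if; ++fi)
        acc += wt[fo * n_if + fi] * in[fi];   // weighted summation

    out[fo] = (short)acc;                     // fixed-point rescaling omitted
}
```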
Table 1. Operations in the AlexNet CNN model [16]

Layer                      | #Features | Feature dims (X x Y) | Stride | Kernel/weight dims (#out x #in x X x Y) | #Operations
---------------------------|-----------|----------------------|--------|------------------------------------------|--------------
Input image                | 3         | 224 x 224            |        |                                          |
Convolution-1 / ReLU-1     | 96        | 55 x 55              | 4      | 96 x 3 x 11 x 11                         | 211,120,800
Normalization-1            | 96        | 55 x 55              |        | 5 (a)                                    | 3,194,400
Pooling-1                  | 96        | 27 x 27              | 2      | 3 x 3 (b)                                | 629,856
Convolution-2 / ReLU-2     | 256       | 27 x 27              | 1      | 256 x 48 x 5 x 5 (c)                     | 448,084,224
Normalization-2            | 256       | 27 x 27              |        | 5 (a)                                    | 2,052,864
Pooling-2                  | 256       | 13 x 13              | 2      | 3 x 3 (b)                                | 389,376
Convolution-3 / ReLU-3     | 384       | 13 x 13              | 1      | 384 x 256 x 3 x 3                        | 299,105,664
Convolution-4 / ReLU-4     | 384       | 13 x 13              | 1      | 384 x 192 x 3 x 3 (c)                    | 224,345,472
Convolution-5 / ReLU-5     | 256       | 13 x 13              | 1      | 256 x 192 x 3 x 3 (c)                    | 149,563,648
Pooling-5                  | 256       | 6 x 6                | 2      | 3 x 3 (b)                                | 82,944
Fully connected-6 / ReLU-6 | 4096      |                      |        | 4096 x 9216                              | 75,501,568
Fully connected-7 / ReLU-7 | 4096      |                      |        | 4096 x 4096                              | 33,558,528
Fully connected-8          | 1000      |                      |        | 1000 x 4096                              | 8,192,000
Total Operations           |           |                      |        |                                          | 1,455,821,344

(a) Normalization across 5 neighboring channels. (b) Max-pooling across a 3x3 window. (c) Convolution performed in 2 groups.
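As a sanity check of the #Operations column, counting each multiply and each add as one operation, plus one bias-add per output neuron (which appears to be the convention behind the table's totals), reproduces the Convolution-1 entry:

```latex
% Worked check of the Convolution-1 row of Table 1:
% 55 x 55 x 96 output neurons, each needing 11 x 11 x 3 MACs
% (2 operations per MAC) plus one bias-add.
\[
\underbrace{55 \times 55 \times 96}_{\text{output neurons}}
\times \bigl(\underbrace{2 \times 11 \times 11 \times 3}_{\text{multiply+add}} + 1\bigr)
= 290{,}400 \times 727
= 211{,}120{,}800
\]
```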
3. CNN MODEL STUDY AND FPGA DESIGN DIRECTIONS

3.1 FPGA Implementation Challenges
While CNNs have proven indispensable in many computer vision applications, they consume significant amounts of storage, external memory bandwidth, and computational resources, which makes it difficult to implement them on an embedded platform. The challenges in implementing a large-scale CNN on FPGAs are illustrated using the AlexNet model as an example. The different layers in AlexNet, along with the number of features in each layer, the feature dimensions, the number of synaptic weights and the total number of operations in each layer, are summarized in Table 1. The model has over 60 million parameters, which require about 250MB of memory to store the weights in 32-bit floating-point representation; hence, they cannot be stored in the on-chip memory of commercially available FPGAs. They need to be stored in external memory and transferred to the FPGA during computation, which can become a performance bottleneck. The AlexNet model consists of 5 convolution layers, 2 LRN layers, 3 pooling layers and 3 fully-connected layers, where each layer has a different number of features and different feature dimensions. If the layers were implemented independently without resource sharing, the design would either be hardware-inefficient or would not fit on the FPGA due to the limited logic resources. The problem is exacerbated in state-of-the-art models such as VGG [17] and GoogLeNet [9], which have a larger number of layers. To efficiently share hardware resources, repeated computation (e.g., convolution) should be implemented as a scalable hardware block [13], such that the same hardware is reused by iterating the data through it in software. The performance limitation due to the external memory bandwidth can be alleviated by using reduced-precision model weights. Hence, we performed a precision study by sweeping the precision of the model weights and chose the precision values that have minimal impact on the classification accuracy.

3.2 Precision Study for FPGA Primitives
Traditionally, CNN models are trained in CPU/GPU environments using 32-bit floating-point data. Such high precision is not necessarily required in the testing or classification phase, owing to the redundancy in over-parameterized CNN models [19]. Reducing the data precision of the weights/data without any impact on the accuracy directly reduces the storage requirement as well as the energy for memory transfers.

Using the AlexNet and VGG models, we explored the precision requirements of the convolution and fully-connected layer weights using the Caffe framework [20]. We obtained the pre-trained models from Caffe, rounded the convolution weights and the inner-product weights separately, and tested the models on the ImageNet-2012 validation dataset of 50K images. Although the data precision is reduced, the Caffe tool still performs the CNN operations in 32-bit floating-point precision using the rounded-off weights. Figure 2 shows the top-1 and top-5 accuracies of the model for a precision sweep of the weights. It shows that the accuracy drops steeply if the weight precision is reduced below 8 bits. We use a common precision for the weights in all convolution layers, as the same hardware block is reused for all convolution layer iterations. We choose 8-bit precision for the convolution weights and 10-bit precision for the inner-product weights, which degrades the accuracy by only 1% compared to full-precision weights. Similarly, we choose 16-bit precision for the intermediate layer data by performing the same precision study.

[Figure 2: Top-1 and top-5 accuracies of the AlexNet model over a precision sweep of the convolution weights and the inner-product weights.]
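The rounding step of such a precision sweep can be mimicked with a small helper that quantizes a floating-point weight to a signed fixed-point grid and converts it back to float, so the rounded-off weights can still be evaluated in a 32-bit floating-point framework such as Caffe. This is only a sketch: the function name, its parameters and the split between integer and fractional bits are our own assumptions, not taken from the paper.

```c
// Quantize a float weight to `bits` total bits with `frac` fractional bits
// (signed two's-complement grid), then convert back to float so the model
// can still be tested in floating point with the rounded-off weights.
#include <math.h>

float quantize_weight(float w, int bits, int frac)
{
    float scale = (float)(1 << frac);      // quantization step = 2^(-frac)
    long  q     = lroundf(w * scale);      // round to the nearest level
    long  qmax  = (1L << (bits - 1)) - 1;  // e.g. +127 for 8-bit weights
    long  qmin  = -(1L << (bits - 1));     // e.g. -128 for 8-bit weights

    if (q > qmax) q = qmax;                // saturate on overflow
    if (q < qmin) q = qmin;
    return (float)q / scale;               // dequantized value
}
```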
53、t have minimal impact on the classification accuracy.18Layer#FeaturesFeature dimensionsStrideKernel/weight dimensions#Operations#Output features#Input featuresXYXYInput image3224224Convolution-1/ReLU-196555559631111211120800Normalization-19655555a3194400Pooling-1962727233b629856Convolution-2/ReLU-22

54、562727125648c55448084224Normalization-225627275a2052864Pooling-22561313133b389376Convolution-3/ReLU-33841313138425633299105664Convolution-4/ReLU-438413131384192c224345472Convolution-5/ReLU-525613131384192c149563648Pooling-525666233b82944Fully connected-6/ReLU-640964096921675501568Fully connected-7/R

55、eLU-740964096409633558528Fully connected-81000100040968192000Total Operations1455821344808060 604040AlexNet model Top-1 Accuracy Top-5 Accuracy2020C001614 12 1086416 14 12 10864Precision of convolution weightsPrecision of Innerproduct weightsPCIe10010080806060Figure 3: Design flow of OpenCL based FP

56、GA accelerator for CNN.designer/user from the intricacies of traditional FPGA design flow such as RTL coding, integration with interfacing IPs and timing closure, which considerably reduces the design time, while achieving performance comparable to the traditional flow, but possibly at the expense of higher on-chip memory utilization 22. The design flow of the OpenCL based FPGA accelerator for CNN used in this work is shown in Figure 3. It consists of
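For context, a generic host-side sequence for an OpenCL FPGA flow of this kind is sketched below: the kernels are compiled offline into an FPGA bitstream, which the host then loads with the standard clCreateProgramWithBinary call before enqueueing the layer kernels. The bitstream file name, kernel name and the omitted buffer setup are illustrative assumptions, not the paper's actual host code.

```c
// Generic OpenCL host flow: load a pre-compiled FPGA bitstream, create a
// kernel, and (elided) run the CNN layers. Error checking is omitted.
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ACCELERATOR, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Load the offline-compiled FPGA bitstream (hypothetical file name). */
    FILE *f = fopen("cnn_kernels.aocx", "rb");
    fseek(f, 0, SEEK_END); size_t len = (size_t)ftell(f); rewind(f);
    unsigned char *bin = malloc(len);
    fread(bin, 1, len, f); fclose(f);

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len,
                          (const unsigned char **)&bin, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel conv = clCreateKernel(prog, "conv_layer", NULL);

    /* ... create cl_mem buffers for weights/features, set kernel args,
       enqueue the kernel once per CNN layer, and read back the result ... */

    clReleaseKernel(conv); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    free(bin);
    return 0;
}
```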
