Bridging High-Level Synthesis and Application-Specific Arithmetic: The Case Study of Floating-Point Summations

Yohann Uguen, Univ Lyon, INSA Lyon, Inria, CITI, F-69621 Villeurbanne, France, Yohann.Uguen@insa-lyon.fr
Florent de Dinechin, Univ Lyon, INSA Lyon, Inria, CITI, F-69621 Villeurbanne, France, Florent.de-Dinechin@insa-lyon.fr
Steven Derrien, University Rennes 1, IRISA, Rennes, France, Steven.Derrien@univ-rennes1.fr

Abstract: FPGAs are well known for their ability to perform non-standard computations not supported by classical microprocessors. Many libraries of highly customizable application-specific IPs have exploited this capability. However, using such IPs usually requires handcrafted HDL, hence significant design effort. High-Level Synthesis (HLS) lowers the design effort thanks to the use of C/C++ dialects for programming FPGAs. However, the high-level C language becomes a hindrance when one wants to express non-standard computations: this language was designed for programming microprocessors and carries with it many restrictions due to this paradigm. This is especially true when computing with floating point, whose data types and evaluation semantics are defined by the IEEE-754 and C11 standards. If the high-level specification was a computation on the reals, then HLS imposes a very restricted implementation space. This work attempts to bridge FPGA application-specific efficiency and HLS ease of use. It specifically targets the ubiquitous floating-point summation-reduction pattern. A source-to-source compiler transforms selected floating-point additions into sequences of simpler operators using non-standard arithmetic formats. This improves performance and accuracy for several benchmarks, while keeping the ease of use of a high-level C description.

I. INTRODUCTION

Many case studies have demonstrated the potential of Field-Programmable Gate Arrays (FPGAs) as accelerators for a wide range of applications, from scientific or financial computing to signal processing and cryptography. FPGAs offer massive parallelism and programmability at the bit level. These characteristics enable programmers to exploit a range of techniques that avoid many bottlenecks of classical von Neumann computing: dataflow operation without the need of instruction decoding; massive register and memory bandwidth, without contention on a register file and a single memory bus; operators and storage elements tailored to the application in nature, number and size.

However, to unleash this potential, development costs for FPGAs are orders of magnitude higher than classical programming. High performance and high design costs are the two faces of the same coin.

Hardware design flow and high-level synthesis: To address this, languages such as C or Java are increasingly being considered as hardware description languages. This has many advantages. The language itself is more widely known than any HDL. The sequential execution model makes designing and debugging much easier. One can even use software execution on a processor for simulation. All this drastically reduces development time. The process of compiling a software program into hardware is called High-Level Synthesis (HLS), with tools such as Vivado HLS [11] or Catapult C¹, among others [18]. These tools are in charge of turning a C description into a circuit. This task requires extracting parallelism from sequential program constructs (e.g. loops) and exposing this parallelism in the target design. Today's HLS tools are reasonably efficient at this task, and can automatically synthesize highly efficient pipelined dataflow architectures.

They however miss one important feature: they are not able to tailor operators to the application in size, and even less in nature. This comes from the C language itself: its high-level datatypes and operators are limited to a small number (more or less matching the hardware operators present in mainstream processors). Indeed, such high-level languages were designed to be compiled and run on hardware, not to describe hardware. However, HLS tools know a lot about the context of each operator. This should allow them to transform these simple operators into application-specific ones, thus exploiting FPGAs to their full potential. The broader objective of this work is to demonstrate this opportunity. For this purpose, we envision a compilation flow involving one or several source-to-source transformations, as illustrated by Figure 1.

Arithmetic in HLS: To better exploit the freedom offered by hardware and FPGAs, HLS vendors have enriched the C language with integer and fixed-point types of arbitrary size². However, the operations on these types remain limited to the basic arithmetic and logic ones. Exotic or complex operators (for instance for finite fields) may be encapsulated in a C function that is called to instantiate the operator.

¹Catapult C Synthesis, Mentor Graphics, 2011, /en/products/catapult/overview/
²Arbitrary-size floating point should follow some day; it is well supported by mature libraries and tools.

Figure 1: The proposed compilation flow. A high-level C/C++ description goes through the GeCoS source-to-source compiler (with an arithmetic optimization plugin), producing C/C++ with a low-level description of context-specific arithmetic operators, which an HLS tool (Vivado HLS) then turns into hardware.

The case study in this work is a program transformation that applies to floating-point additions on a loop's critical path.

It decomposes them into elementary steps, resizes the corresponding sub-components to guarantee some user-specified accuracy, and merges and reorders these components to improve performance. The result of this complex sequence of optimizations could not be obtained from an operator generator, since it involves global loop information.

Before detailing it, we must digress a little on the subtleties of the management of floating-point arithmetic by compilers.

HLS faithful to the floats: Most recent compilers, including the HLS ones [10], attempt to follow established standards, in particular C11 and, for floating-point arithmetic, IEEE-754. This brings the huge advantage of almost bit-exact reproducibility: the hardware will compute exactly the same results as the software. However, it also greatly reduces the freedom of optimization by the compiler. For instance, as floating-point addition is not associative, C11 mandates that code written a+b+c+d should be executed as ((a+b)+c)+d, although (a+b)+(c+d) would have a shorter latency. This also prevents the parallelization of loops implementing reductions. A reduction is an associative computation which reduces a set of input values into a reduction location. Listing 1 provides the simplest example of a reduction, where acc is the reduction location.

The first column of Table I shows how Vivado HLS synthesizes Listing 1 on a Kintex-7. The floating-point addition takes 7 cycles, and the adder is only active one cycle out of 7 due to the loop-carried dependency. Listing 2 shows a different version of Listing 1 that we coded such that Vivado HLS expresses more parallelism. Vivado HLS will not transform Listing 1 into Listing 2, because they are not semantically equivalent³ (the floating-point additions are reordered as if they were associative). However, the tool is able to exploit the parallelism in Listing 2 (second column of Table I): the main adder is now active at each cycle, on a different sub-sum. Note that Listing 2 is only here as an example and might need more logic if N was not a multiple of 10.

Listing 1: Naive reduction

    #define N 100000
    float acc = 0;
    for(int i=0; i<N; i++)
        acc += in[i];

Listing 2: Parallel reduction

    #define N 100000
    float acc = 0, tmp1 = 0, ..., tmp10 = 0;
    for(int i=0; i<N; i+=10) {
        tmp1 += in[i];
        ...
        tmp10 += in[i+9];
    }
    acc = tmp1 + ... + tmp10;

³A parallel execution with the sequential semantics is also possible, but very expensive [13].
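In software terms, the rewrite from Listing 1 to Listing 2 can be sketched as follows (a minimal model for illustration; the in[] argument, the K constant, and the epilogue handling are ours, not the paper's):

```c
/* Naive reduction, as in Listing 1: each iteration depends on the
 * previous one through acc (a loop-carried dependency), so a pipelined
 * adder is only busy one cycle out of its latency. */
float sum_naive(const float *in, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += in[i];
    return acc;
}

/* Partial-sums reduction in the spirit of Listing 2: K independent
 * accumulators hide the adder latency. The final combination reorders
 * the additions as if they were associative, so the result may differ
 * from sum_naive in the last bits. */
#define K 10
float sum_partial(const float *in, int n) {
    float tmp[K] = {0};
    int i = 0;
    for (; i + K <= n; i += K)
        for (int j = 0; j < K; j++)
            tmp[j] += in[i + j];
    for (; i < n; i++)          /* epilogue when n is not a multiple of K */
        tmp[0] += in[i];
    float acc = 0.0f;
    for (int j = 0; j < K; j++)
        acc += tmp[j];
    return acc;
}
```

On exactly representable data the two functions agree; on general floating-point data they need not, which is precisely why a standards-compliant compiler may not perform this rewrite by itself.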

Towards HLS faithful to the reals: Another view, chosen in this work, is to assume that the floating-point C program is intended to describe a computation on real numbers when the user specifies it. In other words, the floats are interpreted as real numbers in the initial C, thus recovering the freedom of associativity (among others). Indeed, most programmers will perform the kind of non-bit-exact optimizations illustrated by Listing 2 (sometimes assisted by source-to-source compilers or "unsafe" compiler optimizations). In a hardware context, we may also assume they wish they could tailor the precision (hence the cost) to the accuracy requirements of the application, a classical concern in HLS [9], [2]. In this case, a pragma should specify the accuracy of the computation with respect to the exact result. A high-level compiler is then in charge of determining the best way to ensure the prescribed accuracy.

The proposed approach uses number formats that are larger or smaller than the standard ones. These, and the corresponding operators, are presented in Section II. The contributions of this paper, which are compiler transformations to generate C descriptions of these operators in an HLS workflow, are presented in Section III. Section IV evaluates our approach on the FPMark benchmark suite.

II. THE ARITHMETIC SIDE: AN APPLICATION-SPECIFIC ACCUMULATOR IN VIVADO HLS

The accumulator that we used for this paper is based on a more general idea developed by Kulisch. He advocated a very large floating-point accumulator [14] whose 4288 bits would cover the entire range of double-precision floating point. Such an accumulator would remove rounding errors from all the possible floating-point additions and sums of products, with the added bonus that addition would become associative.

So far, Kulisch's full accumulator has proven too costly to appear in mainstream processors. However, in the context of application acceleration with FPGAs, it can be tailored to the accuracy requirements of applications. Its cost then becomes comparable to classical floating-point operators, although it vastly improves accuracy [6]. This operator can be found in the FloPoCo [5] generator and in Altera DSP Builder Advanced. Its core idea, illustrated on Figure 2, is to use a large fixed-point register into which the mantissas of incoming floating-point summands are shifted (top), then accumulated (middle).

A third component (bottom) converts the content of the accumulator back to the floating-point format. The sub-blocks visible on this figure (shifter, adder, and leading-zero counter) are essentially the building blocks of a classical floating-point adder.

Figure 3: The bits of a fixed-point format, here (MSBA, LSBA) = (7, −8); the bit weights run from 2^7 down to 2^−8.

Note that we could have implemented any other non-standard operator performing a reduction, such as [16], [12].

A. The parameters of a large accumulator

The main feature of this approach is that the internal fixed-point representation is configurable in order to control accuracy. It has two parameters:
MSBA is the weight of the most significant bit of the accumulator. For example, if MSBA = 20, the accumulator can accommodate values up to a magnitude of 2^20 ≈ 10^6.
LSBA is the weight of the least significant bit of the accumulator. For example, if LSBA = −50, the accumulator can hold data accurate to 2^−50 ≈ 10^−15.
Such a fixed-point format is illustrated in Figure 3. The accumulator width wA is then computed as MSBA − LSBA + 1, for instance 71 bits in the previous example.

71 bits represents a wide range and high accuracy, and still, additions on this format will have one-cycle latency for practical frequencies on recent FPGAs. If this is not enough, the frequency can be improved thanks to partial carry save [6], but this was not useful in the present work. For comparison, for the same frequency, a floating-point adder has a latency of 7 to 10 cycles, depending on the target. In the following, the latency of a circuit denotes the number of cycles needed for the entire application to complete.

B. Implementation within an HLS tool

This accumulator has been implemented in C, using arbitrary-precision fixed-point types (ap_int). The leading-zero count, bit-range selections and other operations are implemented using Vivado HLS built-in functions. For modularity purposes, the FloatToFix and FixToFloat steps are wrapped into C functions (of respectively 28 and 22 lines of code). Their calls are inlined to enable HLS optimizations.

Because the internal accumulation is performed on a fixed-point integer representation, the combinational delay between two accumulations is lower compared to a full floating-point addition. HLS tools can take advantage of this delay reduction by more aggressive loop pipelining (with a shorter Initiation Interval), resulting in a design with a shorter overall latency.

Figure 2: The conversion from float to fixed point (top), the fixed-point accumulation (middle) and the conversion from the fixed-point format to a float (bottom).

The accumulator used here slightly improves the one offered by FloPoCo [6]: it supports subnormal numbers [17]. In FloPoCo, FloatToFix and the Accumulator form a single component, which restricts its application to simple accumulations similar to Listing 1. The two components of Figure 2 enable a generalization to arbitrary summations within a loop, as Section III will show.

C. Validation

To evaluate and refine this implementation, we used Listing 3, which we compared to Listings 1 and 2. In the latter, the loop was unrolled by a factor 7, as it is the latency of a floating-point adder on our target FPGA (Kintex-7).

Table I: Synthesis results of different accumulators using Vivado HLS for Kintex-7.

For test data, we use, as in Muller et al. [17], the input values ci = (float)cos(i), where i is the input array's index. Therefore the accumulation computes Σi ci. The parameters chosen for the accumulator are: MSBA = 17 (indeed, as we are adding cos(i) 100K times, an upper bound is 100K, which can be encoded in 17 bits); MaxMSBX = 1 (the maximum input value is 1); LSBA = −50 (the accumulator itself will be accurate to the 50th fractional bit). Note that a float input will see its mantissa rounded by FloatToFix only if its exponent is smaller than 2^−25, which is very rare. In other words, this accumulator is much more accurate than the data that is thrown at it.

The results are reported in Table I for simple and double precision. The Accuracy line of the table reports the number of correct bits of each implementation, after the result has been rounded to a float. All the data in this table was obtained by generating VHDL from C synthesis using Vivado HLS, followed by place and route from Vivado v2015.4, build 1412921. This table also reports synthesis results for the corresponding FloPoCo operators.

Using this implementation method, we also created an exact floating-point multiplier with the final rounding removed, as in [6]. This function is called ExactProduct and represents 44 lines of code. Due to lack of space we do not present it in detail. As the output of this multiplier is not standard, we also created an adapted float-to-fix block called ExactProductFloatToFix (21 lines of code).

III. THE COMPILER SIDE: GECOS SOURCE-TO-SOURCE TRANSFORMATIONS

The previous section has shown that Vivado HLS can be used to synthesize very efficient specialized floating-point operators which rival in quality those generated by FloPoCo. Our goal is now to study how such optimizations can be automated. More precisely, we aim at automatically optimizing Listing 1 into Listing 3, and generalizing this transformation to many more situations.

For convenience, this optimization was developed as a source-to-source transformation implemented within the open-source GeCoS compiler framework [8]. It is publicly available with GeCoS. Source-to-source compilers are very convenient in an HLS context, since they can be used as optimization front-ends on top of closed-source commercial tools.

This work focuses on two computational patterns, namely the accumulation and the sum of products. Both are specific instances of the reduction pattern, which can be optimized by many compilers or parallel run-time environments. Reduction patterns are exposed to the compiler/runtime either through user directives (e.g. #pragma reduce in OpenMP), or automatically inferred using static analysis techniques [19], [7].

As the problem of detecting reductions is not the main focus of this work, our tool uses a straightforward solution to the problem, using a combination of user directives and (simple) program analysis. More specifically, the user must identify a target accumulation variable through a pragma, and provide additional information such as the dynamic range of the accumulated data along with the target accuracy. In the future, we expect to improve the program analysis, so that the two latter parameters could be omitted in some situations.

We found this approach easier, more general and less invasive than those attempting to convert a whole floating-point program into a fixed-point implementation [20].

A. Proposed compiler directive

In imperative languages such as C, reductions are implemented using for or while loop constructs. Our compiler directive must therefore appear right outside such a construct. Listing 4 illustrates its usage on the code of Listing 1.
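Listing 4 itself is not reproduced in this excerpt, but its shape follows from the text: a directive just outside the loop naming the accumulation variable, its dynamic range, and the target accuracy. A hypothetical rendering (the pragma name and parameter spellings are ours, not the tool's actual syntax; unknown pragmas are simply ignored by ordinary C compilers):

```c
#define N 100000

float accumulate(const float *in) {
    float acc = 0.0f;
    /* Hypothetical directive, per Section III-A: identifies acc as the
     * reduction location and supplies the dynamic range and target
     * accuracy from which the compiler derives MSBA and LSBA.
     * Illustrative only. */
    #pragma FPacc var(acc) range(100000) accuracy(1e-15)
    for (int i = 0; i < N; i++)
        acc += in[i];
    return acc;
}
```

Because the annotation is a pragma, the same source still compiles and runs unchanged as ordinary software, which preserves the software-simulation workflow that makes HLS attractive in the first place.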
