High Level Design For High Speed FPGA Devices

Man. Ng (mcn99)
Department of Computing
Imperial College
June 13, 2002

Acknowledgement

Before starting the report, I would like to thank the following people for helping me throughout the project. Without their help, it would have been impossible for me to finish it.

I would like to thank my supervisor, Dr. Wayne Luk, for giving me a lot of useful advice and encouragement throughout the project. He also guided me towards the problems I should focus on during the implementation.

I would like to thank Professor Yang for letting me implement his gel image processing algorithm in hardware. He also gave me references and example sources to help me understand the underlying theory. I would also like to thank him for teaching his excellent multimedia course, which conveyed many concepts that helped me understand gel image processing.

I would like to thank Altaf and Shay, two Ph.D. research students who helped me a lot throughout the implementation of the application.

Abstract

In this project, I have developed a systematic approach to high-level hardware design. With this approach, I successfully implemented sophisticated gel image processing on high-speed hardware. In the report, I also introduce a new technique that can automate high-level hardware performance optimization by rearranging the code sequence so that it runs in the minimum number of clock cycles. The report is split into five chapters:

Chapter 1 is Introduction. It includes the background, the related work and my contribution to the project.

Chapter 2 is Optimization. In this chapter, I focus on the techniques for optimization. I also demonstrate some techniques which can automate the optimization process.

Chapter 3 is Hardware Development. In this chapter, I generalize the steps of converting a software program into hardware.
These include several techniques which can improve performance or save hardware resources.

Chapter 4 is Case Study: Gel Image Processing. In this chapter, I use gel image processing as an example to show the effect on resources and performance of the techniques discussed in Chapter 2. I also compare the performance of the application on two devices, Pilchard and RC1000, with that of the software version.

Chapter 5 is Conclusion. It includes an assessment of the achievements and the expected future work.

An online version of this report is also available at:
http://www.doc.ic.ac.uk/mcn99/project/report.pdf

Chapter 1
Introduction

Since the emergence of Handel-C [5], a C-like hardware language, a complete high-level FPGA design approach has become feasible. However, most developers still stick to lower-level languages such as VHDL when they aim to design high-performance hardware, because a low-level approach gives them greater control over the actual circuit implementation. But low-level design will probably reach its limit as FPGA chips grow bigger and bigger: developers will not be able to develop new applications quickly enough with low-level designs consisting of billions of gates. A high-level approach will then be the answer. The purpose of this project is to introduce a systematic way of developing high-performance hardware with a high-level approach.

1.1 Background and Related Works

In this section, I present the material necessary to understand the content of this report.

1.1.1 Field Programmable Gate Arrays (FPGAs) [1]

Like a Programmable Logic Device (PLD), an FPGA is a piece of programmable hardware. However, while the size of PLDs is limited by power consumption and time delay, an FPGA can easily implement designs with millions of gates on a single IC.
The re-programmable nature of FPGAs allows developers to implement designs with shorter development times and lower cost than equivalent custom VLSI chips. It is worth mentioning that FPGA development is outpacing Moore's Law, with capacity doubling every year. With millions of gates available on the newest chips, the FPGA is an ideal platform for developing reconfigurable systems capable of executing complicated applications at high performance. Therefore, the FPGA is the chip I am developing applications for.

1.1.2 Pilchard [2]

Pilchard is a reconfigurable computing platform employing a field programmable gate array (FPGA) which plugs into a standard personal computer's 133 MHz synchronous dynamic RAM Dual In-line Memory Module (DIMM) slot. Compared to traditional FPGA devices that use the PCI interface, Pilchard allows data to be transferred to and from the host computer in a much shorter time, due to the higher bandwidth and lower latency of the DIMM interface. However, as the DIMM interface was not originally designed for input/output (I/O), extra control signals are needed for Pilchard to indicate the start and the end of data processing. As a result, a high-level behavioral design approach is preferable to a low-level structural design approach for developing applications for Pilchard. This shows why it is vital to have a systematic way of high-level development for high-performance FPGAs.

1.1.3 RC1000 [3]

RC1000 is a 32-bit PCI card designed for reconfigurable computing applications. It has a full board support package in Handel-C, with libraries which ease circuit design for this device. It also features 4 SRAM banks (2 Mbytes each) on the board which can be accessed by the FPGA or the host CPU. The board can be configured to run between 4000 KHz and 100 MHz. This device is very different from Pilchard in many aspects.
In the report, I will show that the development steps introduced in this project are general and applicable to application development on different devices.

1.1.4 VHDL [4]

VHDL was one of the first high-level languages to emerge on the market for designing applications with programmable logic devices. VHDL provides high-level language constructs that enable designers to describe large circuits and bring products to market rapidly. It supports the creation of design libraries in which to store components for reuse in subsequent designs.

Because it is a standard language (IEEE standard 1076), VHDL provides portability of code between synthesis and simulation tools, as well as device-independent design. It also facilitates converting a design from programmable logic to an ASIC implementation. The disadvantage of this language is that it is not completely high level: it still expects the user to know the hardware behavior of the components. Therefore, I decided to use an even higher-level hardware language, Handel-C.

1.1.5 Handel-C [5]

Handel-C is a high-level, C-like programming language designed for compiling programs into hardware images for FPGAs or ASICs. Handel-C provides some extra features, not present in C, to support a few hardware optimizations. One of these is the ability to specify the width of each signal, so that the Handel-C compiler can allocate exactly the resources needed. Handel-C compilers target hardware directly by mapping the program into hardware at the netlist level in xnf or edif format. The advantage of Handel-C over VHDL is that it does not expect users to know much about the low-level hardware, which VHDL does. It is a completely high level language! Figure 1.1 shows the design flow I adopted for converting a Handel-C program to hardware. Although several tools are involved in the different steps, users do not need to worry about the hardware details.
All users need to do is click a few buttons to launch the program that converts the file for the next step; it is as simple as that.

1.1.6 Extending the Handel-C language [7]

Dong U Lee, a Ph.D. student, has invented a language which supports both hardware and software. His approach is to combine the C and Handel-C languages. In this language, the user can specify which part is done by software and which part is done by hardware. In his project, he also developed a friendlier interface for communication between the host and the FPGA device. However, the number of devices currently supported by this language is limited, which is why I finally gave up on using it.

1.2 Contribution

- I have developed an easy but efficient optimization method which can rearrange code so that it runs in the minimum number of cycles.
- I have developed a systematic design flow for high-level hardware design targeting high-speed devices.
- I have implemented the complicated 2D gel image processing in hardware.

Chapter 2
Optimization

In this chapter, I discuss various methods for optimizing high-level code. Optimization is the main stage at which we exploit parallelism to achieve a speed-up that PC software normally cannot achieve because of the limited resources of the CPU. The main focus of this chapter is on how to automate these optimization processes. We will also discuss some evaluation equations for measuring the speed-up achieved after optimization.

2.1 Performance Optimization

Performance optimization means exploiting the potential parallelism of the program and running different non-conflicting operations in the same clock cycle to obtain a speed-up. In normal applications, tens or even hundreds of operations which run sequentially on a PC's CPU may be able to run in parallel. However, a PC cannot run them in parallel because of its limited hardware resources.
By designing specific hardware that runs as many operations as possible in parallel, however, a significant speed-up can be obtained. This is the main reason why an application on an FPGA can sometimes run faster than the corresponding software version even though the FPGA hardware runs at a much slower clock speed (of course we also need to take account of the cycles per instruction (CPI), but even then a PC CPU can still perform the same individual operation much faster). There are several techniques we can apply:

2.1.1 Balance The Delay Of Each Path

Balancing the delay of each path is important because the hardware clock speed is limited by the path with the longest delay. Therefore, if the delay of one particular path is much longer than the others, we have wasted resources, as the other paths are capable of running at a much higher speed. Balancing the delay ensures that the parallel optimization in the later stages will be optimal. The delay of a path can be defined as:

$T_{delay} = T_{logic} + T_{routing}$    (2.1)

where
$T_{delay}$ is the total delay of the path,
$T_{logic}$ is the delay due to logic,
$T_{routing}$ is the delay due to routing.

Therefore, the delay is reduced by reducing $T_{logic}$, $T_{routing}$, or both. There are 2 main steps to achieve this:

- Break up complex operations into simple operations;
- Use components with pre-defined placement and routing constraints.

Breaking Up Complex Operation

The simplest step is to break a complicated operation into several simpler operations. This step effectively reduces the logic in each operation and thus reduces $T_{logic}$ in equation 2.1. In a software program, a complicated operation will normally run as quickly as, or even more quickly than, the equivalent simpler operations, because the compiler optimizes the instruction execution for us. In hardware, however, a complex operation needs a longer clock period to finish, while the other, simpler operations do not need that long.
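To make equation 2.1 and the effect of breaking up an operation concrete, here is a minimal Python sketch. All delay values are hypothetical numbers chosen for illustration and are not measurements from this project; the function names are also illustrative. It assumes, as stated above, that the clock period can be no shorter than the slowest path.

```python
def path_delay(t_logic_ns, t_routing_ns):
    """Equation 2.1: total path delay = logic delay + routing delay (in ns)."""
    return t_logic_ns + t_routing_ns


def max_clock_mhz(path_delays_ns):
    """The slowest path bounds the clock period, hence the clock frequency."""
    return 1000.0 / max(path_delays_ns)


# Hypothetical design: two simple paths and one complex operation.
before = [
    path_delay(4.0, 2.0),    # simple operation: 6 ns
    path_delay(5.0, 3.0),    # simple operation: 8 ns
    path_delay(22.0, 8.0),   # complex operation: 30 ns
]

# The complex operation is split into two simpler stages, with a register
# holding the intermediate result (hypothetical delays again).
after = before[:2] + [
    path_delay(11.0, 5.0),   # first half: 16 ns
    path_delay(11.0, 4.0),   # second half: 15 ns
]

print(f"before: {max_clock_mhz(before):.1f} MHz")  # before: 33.3 MHz
print(f"after:  {max_clock_mhz(after):.1f} MHz")   # after:  62.5 MHz
```

In this made-up case the split costs one extra cycle for the former complex operation, but every other operation in the design can now be clocked almost twice as fast.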
Figure 2.1 shows an example of breaking up a complex operation into simple operations. In this example, we can see that extra registers are sometimes needed to store the intermediate results of the calculation in order to make each operation simple enough.

Predefined Placement and Routing Component

If the result of the first method is not satisfactory, or the timing constraints are still not met, we can use a timing analyzer provided by the FPGA chip vendor to find the longest paths. For example, we can use the Timing Analyzer that comes with Xilinx ISE Foundation 4 for timing analysis of Xilinx FPGA chips. Once the longest path is found, we know which operation runs too slowly. At this point, we can try 2 methods to increase the speed of this operation.

The first is to write a constraint file ourselves which explicitly specifies the placement and routing of this component. This can improve the timing, as the tools which perform placement and routing automatically are normally not very smart. We then include this constraint file before the placement and routing process. However, this approach requires the developer to have a fair amount of knowledge of the FPGA chip being used. This includes knowledge of the primitive components supported by the chip and of the relative placement and routing of these primitives which achieves the minimum delay.

The second approach is easier: use the macros for components with predefined placement and routing supplied by the chip vendor. Put simply, the chip vendor has done the job for us, and we should just use it! Xilinx has a program called Core Generator which does exactly this. How to include these components in a Handel-C program is specified in the Handel-C manual. However, users must be aware that the timing of the inputs and outputs of these components requires extra care: because of a language limitation of Handel-C, the input signal always arrives at the component one cycle late.
This step reduces $T_{routing}$ in equation 2.1: when the logic blocks are placed well, the routing of the signals is much shorter, and so the delay due to routing is significantly reduced.

Possibility of Automating This Process

Above, we have discussed 2 methods to achieve this step. This step is difficult to automate, because it is difficult to define which operations are complex and which are not. This depends on the device and chip we use as well as on the functionality of the program. For a device with a very strict timing constraint, a 16-bit multiplication may be considered complex, while for a device which cannot run at high speed, it may be a waste to break the operations up into very simple ones which could run at very high speed, as the device will not be run at that speed anyway.

However, we can borrow the idea from Lee again to make automation of this process possible. While compiling the source, we can include information about the device we are using as well as the timing constraint. The library of the device will include the delay of each logic unit. It will also include some information on how the FPGA development tools will route the signals. The compiler will then be able to approximate $T_{logic}$ and $T_{routing}$, and thus $T_{delay}$, for each path. The result is then compared with the specified timing constraints. If a violation is detected, the compiler uses the 2 methods mentioned above to balance the delay of each path until the constraint is met.

2.1.2 Basic Parallelism

This is the first step of the actual performance optimization process, and the easiest. The following rule can be applied to automate the process.

Scan through the program sequentially. Group as many operations as possible into one clock cycle until there is a violation of data dependency. Then repeat the process for the next cycle. As we have discussed how to detect data dependency in an earlier section, this process can be done automatically.
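The grouping rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the project: each operation is modelled as a hypothetical `(target, sources)` pair, and a data dependency is assumed to exist when an operation reads or writes a variable that was already written earlier in the same cycle. The eight operations are made up for the example and are not the code of Figure 2.2.

```python
def group_into_cycles(ops):
    """Greedily pack operations into clock cycles.

    ops is a list of (target, sources) pairs in program order.
    A new cycle starts as soon as an operation depends on a variable
    written earlier in the current cycle (assumed dependency rule).
    """
    cycles = []
    current, written = [], set()
    for target, sources in ops:
        conflict = target in written or any(s in written for s in sources)
        if conflict:
            cycles.append(current)         # close the current cycle
            current, written = [], set()   # and start the next one
        current.append((target, sources))
        written.add(target)
    if current:
        cycles.append(current)
    return cycles


# Hypothetical program: four independent initializations, then four sums.
ops = [
    ("a", ()), ("b", ()), ("c", ()), ("d", ()),
    ("e", ("a", "b")), ("f", ("c", "d")),
    ("g", ("a", "c")), ("h", ("b", "d")),
]

cycles = group_into_cycles(ops)
for number, cycle in enumerate(cycles, start=1):
    print(number, [target for target, _ in cycle])
# 1 ['a', 'b', 'c', 'd']
# 2 ['e', 'f', 'g', 'h']
```

Executed sequentially, these eight operations would take eight cycles; the greedy grouping packs them into two.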
Figure 2.2 is a simple example of how to achieve this. Without parallel execution, it takes 8 cycles to complete the 8 operations; with parallel execution, it takes only 2 cycles.

2.1.3 Re-arrange Code Sequence

Sometimes a program has high parallelism, but the execution sequence of its operations prevents the method just described from achieving that level. For example, the code in Figure 2.3 has the same effect as the example shown in Figure 2.2, but if we use the "basic parallelism" method we will need 4 cycles to finish the operations instead of 2.

We can solve the problem by re-arranging the code sequence so that the code runs in as few cycles as possible. For example, the code above can first be changed to the same sequence as in Figure 2.2, and then the "basic parallelism" method is applied again. For this process to be automated by a compiler, the compiler needs a fair amount of knowledge of, and reasoning about, the program. I have discovered a method to automate this process:

1. Firstly, choose a group of code to start with, preferably in the innermost loop.
2. Create an empty table with variable names as index and labels as values. A label has the format var:n, where var is the name of a variable and n is a number specifying the operation sequence.
3. Scan through the code sequentially. For each variable assignment (either modification or initialization), assign a label to the operation following the rules listed below:

step 1    Search the table to find the label of the variable being assigned to.
step 2a   If no entry is found, the variable is being assigned for the first time.
Add an entry to the table, with the content (label) specified as follows:
step 3ai  If the variable is assigned a constant value or a signal from outside the block we are working with, specify the label as varname:1, where varname is the name of the variable being assigned.
step 3aii If the variable's value depends on other variables, get the labels of these variables from the table. Give the variable the same label as the one with the biggest order among them, but with the order incremented by 1. E.g., for a = b + c, if the label for b is d:3 and the label for c is e:4, then the label for a should be e:5.
step 2b   If an entry is found, the variable has been assigned before. Get the label of that variable.
step 3b   Update the label of the variable as specified in step 3ai and step 3aii, but with a change that when finding the big
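The labelling rules of steps 3ai and 3aii can be sketched in Python as follows. This is a hedged illustration, not the project's code: the function names and the tuple representation of labels are hypothetical, source variables missing from the table are assumed to be signals from outside the block, and, because the description of step 3b is cut off, the sketch simply applies the same rule to re-assigned variables, which is an assumption.

```python
def next_label(target, sources, table):
    """Compute and store the label for one assignment `target = f(sources)`.

    Labels are (name, order) pairs standing for the text form name:order.
    Sources missing from the table are treated as external signals
    (assumption). Re-assignment reuses the same rule because the source
    description of step 3b is incomplete (assumption).
    """
    dep_labels = [table[s] for s in sources if s in table]
    if not dep_labels:
        # Step 3ai: constant or external signal, so the label is varname:1.
        label = (target, 1)
    else:
        # Step 3aii: take the label with the biggest order, increment it.
        name, order = max(dep_labels, key=lambda lab: lab[1])
        label = (name, order + 1)
    table[target] = label
    return label


def fmt(label):
    """Render a label in the var:n form used in the text."""
    return f"{label[0]}:{label[1]}"


# The example from step 3aii: b is labelled d:3 and c is labelled e:4.
table = {"b": ("d", 3), "c": ("e", 4)}
label_a = fmt(next_label("a", ["b", "c"], table))   # a = b + c
label_k = fmt(next_label("k", [], table))           # k = constant

print(label_a)  # e:5
print(label_k)  # k:1
```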