版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,1,2008 Taiwan CUDA CourseProgramming Massively Parallel Processors:the CUDA experience Lecture 8: Application Case Study - Quantitative MRI Reconstruction, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,2,Ackno
2、wledgements,Sam S. Stone, Haoran Yi, Justin P. Haldar, Wen-mei W. Hwu, Bradley P. Sutton, Zhi-Pei Liang, Keith Thulburin*,Center for Reliable and High-Performance Computing, Beckman Institute for Advanced Science and Technology,Department of Electrical and Computer Engineering University of Illinois
3、 at Urbana-Champaign * University of Illinois, Chicago Medical Center, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,3,Overview,Magnetic resonance imaging Least-squares (LS) reconstruction algorithm Optimizing the LS reconstruction on the G80 Overcoming bottlenecks Performance
4、tuning Summary,1, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,4,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,LS,2,Cartesian scan data + FFT: Slow scan, fast reconstruction, images may be poor, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30
5、- July 2, 2008,5,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding1,FFT,LS,Spiral scan data + Gridding + FFT: Fast scan, fast reconstruction, better images,2,1 Based on Fig 1 of Lustig et al, Fast Spiral Fourier Transform for Iterative MR Image Reconstruction, IEEE Intl Symp. on
6、 Biomedical Imaging, 2004, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,6,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,Least-Squares (LS),Spiral scan data + LS Superior images at expense of significantly more computation,2, David Kirk/NVIDIA and W
7、en-mei W. Hwu Taiwan, June 30 - July 2, 2008,7,An Exciting Revolution - Sodium Map of the Brain,Images of sodium in the brain Requires powerful scanner (9.4 Tesla) Very large number of samples for increased SNR Requires high-quality reconstruction Enables study of brain-cell viability before anatomi
8、c changes occur in stroke and cancer treatment within days!,Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,8,Least-Squares Reconstruction,Compute Q = FHF,Acquire Data,Compute
9、FHd,Find ,Q depends only on scanner configuration FHd depends on scan data found using linear solver Accelerate Q and FHd on G80 Q: 1-2 days on CPU FHd: 6-7 hours on CPU : 1.5 minutes on CPU,5, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,9,Algorithms to Accelerate,for (m = 0;
10、 m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Compute Q,FHd is nearly identical Scan data M = # scan points kx, ky, kz = 3D scan data Pixel data N = # pixels x, y, z = input 3D pixel data Q = output pi
11、xel data Complexity is O(MN) Inner loop 10 FP MUL or ADD ops 2 FP trig ops 10 loads,6, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,10,From C to CUDA: Step 1What unit of work is assigned to each thread?,7,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
12、+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) , David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,11,8,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += p
13、him*sin(exp) ,How does loop interchange help?,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei
14、 W. Hwu Taiwan, June 30 - July 2, 2008,12,9,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,How does loop fission help?,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
15、+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,13,10,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhi
16、m for (n = 0; n N; n+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?,phi kernel Each thread computes phi at one scan point (each thread corresponds to one loop iteration),Q ker
17、nel Each thread computes Q at one pixel (each thread corresponds to one outer loop iteration), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,14,Tiling of Scan Data,LS recon uses multiple grids Each grid operates on all pixels Each grid operates on a distinct subset of scan data
18、 Each thread in the same grid operates on a distinct pixel,for (m = 0; m M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,12,Thread n operates on pixel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,15,LS recon uses multiple grids Ea
19、ch grid operates on all pixels Each grid operates on a distinct subset of scan data Each thread in the same grid operates on a distinct pixel,Tiling of Scan Data,12,for (m = 31M/32; m 32M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Thread n operates on pi
20、xel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,16,13,Q(float* x,y,z,rQ,iQ,kx,ky,kz,phi, int startM,endM) n = blockIdx.x*TPB + threadIdx.x for (m = startM; m endM; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim * cos(exp) iQn += phim * sin(exp) ,From C to CUDA: Ste
21、p 2Where are the potential bottlenecks?,Bottlenecks Memory BW Trig ops Overheads (branches, addr calcs), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,17,Step 3: Overcoming bottlenecks,14,LS recon on CPU (SP) Q: 45 hours, 0.5 GFLOPS FHd: 7 hours, 0.7 GFLOPS Counting each trig o
22、p as 1 FLOP, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,18,Step 3: Overcoming Bottlenecks (Mem BW),15,Register allocate pixel data Inputs (x, y, z); Outputs (rQ, iQ) Exploit temporal and spatial locality in access to scan data Constant memory + constant caches Shared memory,
23、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,19,Step 3: Overcoming Bottlenecks (Mem BW),Register allocation of pixel data Inputs (x, y, z); Outputs (rQ, iQ) FP arithmetic to off-chip loads: 2 to 1 Performance 5.1 GFLOPS (Q), 5.4 GFLOPS (FHd) Still bottlenecked on memory BW,16
24、, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,20,Step 3: Overcoming Bottlenecks (Mem BW),Old bottleneck: off-chip BW Solution: constant memory FP arithmetic to off-chip loads: 284 to 1 Performance 18.6 GFLOPS (Q), 22.8 GFLOPS (FHd) New bottleneck: trig operations,17, David Ki
25、rk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,21,Sidebar: Estimating Off-Chip Loads with Const Cache,How can we approximate the number of off-chip loads when using the constant caches? Given: 128 tpb, 4 blocks per SM, 256 scan points per grid Assume no evictions due to cache conflicts
26、7 accesses to global memory per thread (x, y, z, rQ x 2, iQ x 2) 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global mem accesses 4 accesses to constant memory per scan point (kx, ky, kz, phi) 256 scan points * 4 loads/point = 1,024 constant mem accesses Total off-chip memory accesses
27、 = 3,584 + 1,024 = 4,608 Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720 FP arithmetic to off-chip loads: 284 to 1,18, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,22,Step 3: Overcoming Bottlenecks (Trig),Old bottleneck: t
28、rig operations Solution: SFUs Performance 98.2 GFLOPS (Q), 92.2 GFLOPS (FHd) New bottleneck: overhead of branches and address calculations,19, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,23,Sidebar: Effects of Approximations,Avoid temptation to measure only absolute error (I0
29、 I) Can be deceptively large or small Metrics PSNR: Peak signal-to-noise ratio SNR: Signal-to-noise ratio Avoid temptation to consider only the error in the computed value Some apps are resistant to approximations; others are very sensitive,20,A.N. Netravali and B.G. Haskell, Digital Pictures: Repre
30、sentation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995)., David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,24,Step 3: Overcoming Bottlenecks (Overheads),Old bottleneck: Overhead of branches and address calculations Solution: Loop unrolling and experimenta
31、l tuning Performance 179 GFLOPS (Q), 145 GFLOPS (FHd),21, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,25,Experimental Tuning: Tradeoffs,In the Q kernel, three parameters are natural candidates for experimental tuning Loop unrolling factor (1, 2, 4, 8, 16) Number of threads pe
32、r block (32, 64, 128, 256, 512) Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048) Cant optimize these parameters independently Resource sharing among threads (register file, shared memory) Optimizations that increase a threads performance often increase the threads resource consumpt
33、ion, reducing the total number of threads that execute in parallel Optimization space is not linear Threads are assigned to SMs in large thread blocks Causes discontinuity and non-linearity in the optimization space,22, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,26,Experimen
34、tal Tuning: Example,Increase in per-thread performance, but fewer threads: Lower overall performance,23, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,27,Experimental Tuning: Scan Points Per Grid,24, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,28,Sidebar
35、: Cache-Conscious Data Layout,kx, ky, kz, and phi components of same scan point have spatial and temporal locality Prefetching Caching Old layout does not fully leverage that locality New layout does fully leverage that locality,25, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008
36、,29,Experimental Tuning: Scan Points Per Grid (Improved Data Layout),26, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,30,Experimental Tuning: Loop Unrolling Factor,27, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,31,Sidebar: Optimizing the CPU Implementa
37、tion,Optimizing the CPU implementation of your application is very important Often, the transformations that increase performance on CPU also increase performance on GPU (and vice-versa) The research community wont take your results seriously if your baseline is crippled Useful optimizations Data ti
38、ling SIMD vectorization (SSE) Fast math libraries (AMD, Intel) Classical optimizations (loop unrolling, etc) Intel compiler (icc, icpc),28, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,32,Summary of Results,29,8X, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,33,Summ
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 航空航天模擬器安裝合同樣本
- 2024年美容院工作人員聘用合同樣本一
- 市場(chǎng)細(xì)分與定位工藝管理辦法
- 電商企業(yè)運(yùn)營(yíng)總監(jiān)聘用協(xié)議
- 2024年通信工程建設(shè)項(xiàng)目進(jìn)度管理合同3篇
- 供應(yīng)商關(guān)系管理最佳實(shí)踐
- 勞動(dòng)保護(hù)協(xié)議
- 制造業(yè)高層管理聘用合同范本
- 鐵路設(shè)施維護(hù)管理指南
- 財(cái)務(wù)合規(guī)與審計(jì)指導(dǎo)
- 北京朝陽(yáng)醫(yī)院RICU進(jìn)修匯報(bào)
- 魯教版初三物理-質(zhì)量和密度復(fù)習(xí)題及答案
- 阿托品化課件
- 2021年貴安新區(qū)產(chǎn)業(yè)發(fā)展控股集團(tuán)有限公司招聘筆試試題及答案解析
- 《休閑學(xué)概論》課后習(xí)題參考答案
- 安全文化培訓(xùn) (注冊(cè)安工再培訓(xùn))課件
- 色粉-MSDS物質(zhì)安全技術(shù)資料
- 骨科學(xué)研究生復(fù)試真題匯總版
- 石油化工鋼結(jié)構(gòu)工程施工及驗(yàn)收規(guī)范
- 遼海版六年級(jí)音樂上冊(cè)第8單元《3. 演唱 姐妹們上場(chǎng)院》教學(xué)設(shè)計(jì)
- (新版教材)蘇教版三年級(jí)上冊(cè)科學(xué)全冊(cè)單元測(cè)試卷
評(píng)論
0/150
提交評(píng)論