Matrix Multiplication.ppt_第1頁(yè)
Matrix Multiplication.ppt_第2頁(yè)
Matrix Multiplication.ppt_第3頁(yè)
Matrix Multiplication.ppt_第4頁(yè)
Matrix Multiplication.ppt_第5頁(yè)
已閱讀5頁(yè),還剩58頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 11,Matrix Multiplication,Outline,Sequential algorithms Iterative, row-oriented Recursive, block-oriented Parallel algorithms Rowwise block striped decomposition Cannons algorithm,Iterative, Row-oriented Algorithm,Series of inner p

2、roduct (dot product) operations,Performance as n Increases,Reason:Matrix B Gets Too Big for Cache,Computing a row of C requires accessing every element of B,Block Matrix Multiplication,Replace scalar multiplication with matrix multiplication Replace scalar addition with matrix addition,Recurse Until

3、 B Small Enough,Comparing Sequential Performance,First Parallel Algorithm,Partitioning Divide matrices into rows Each primitive task has corresponding rows of three matrices Communication Each task must eventually see every row of B Organize tasks into a ring,First Parallel Algorithm (cont.),Agglome

4、ration and mapping Fixed number of tasks, each requiring same amount of computation Regular communication among tasks Strategy: Assign each process a contiguous group of rows,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,

5、B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Complexity Analysis,Algorithm has p iterations During each iteration a process multiplies(n / p) (n / p) block of A by (n / p) n block of B: (n3 / p2) Total computation time: (n3 / p) Each process ends up passing(p-1)n2/p

6、 = (n2) elements of B,Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2)Isoefficiency relation: n3 Cpn2 n Cp This system does not have good scalability,Weakness of Algorithm 1,Blocks of B being manipulated have p times more columns than rows Each process must access every ele

7、ment of matrix B Ratio of computations per communication is poor: only 2n / p,Parallel Algorithm 2(Cannons Algorithm),Associate a primitive task with each matrix element Agglomerate tasks responsible for a square (or nearly square) block of C Computation-to-communication ratio rises to n / p,Element

8、s of A and B Needed to Compute a Processs Portion of C,Algorithm 1,Cannons Algorithm,Blocks Must Be Aligned,Before,After,Blocks Need to Be Aligned,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Each triangle represents

9、a matrix block Only same-color triangles should be multiplied,Rearrange Blocks,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Block Aij cycles left i positions Block Bij cycles up j positions,Consider Process P1,2,B02,A

10、10,A11,A12,B12,A13,B22,B32,Step 1,Consider Process P1,2,B12,A11,A12,A13,B22,A10,B32,B02,Step 2,Consider Process P1,2,B22,A12,A13,A10,B32,A11,B02,B12,Step 3,Consider Process P1,2,B32,A13,A10,A11,B02,A12,B12,B22,Step 4,Complexity Analysis,Algorithm has p iterations During each iteration process multip

11、lies two (n / p ) (n / p ) matrices: (n3 / p 3/2) Computational complexity: (n3 / p) During each iteration process sends and receives two blocks of size (n / p ) (n / p ) Communication complexity: (n2/ p),Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2) Isoefficiency relati

12、on: n3 C pn2 n C p This system is highly scalable,Summary,Considered two sequential algorithms Iterative, row-oriented algorithm Recursive, block-oriented algorithm Second has better cache hit rate as n increases Developed two parallel algorithms First based on rowwise block striped decomposition Se

13、cond based on checkerboard block decomposition Second algorithm is scalable, while first is not,Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 12,Solving Linear Systems,Outline,Terminology Back substitution Gaussian elimination Jacobi method Conjugate gradient method,Terminolo

14、gy,System of linear equations Solve Ax = b for x Special matrices Symmetrically banded Upper triangular Lower triangular Diagonally dominant Symmetric,Symmetrically Banded,4,2,-1,0,0,0,3,-4,5,6,0,0,1,6,3,2,4,0,0,2,-2,0,9,2,0,0,7,3,8,7,0,0,0,4,0,2,Semibandwidth 2,Upper Triangular,4,2,-1,5,9,2,0,-4,5,

15、6,0,-4,0,0,3,2,4,6,0,0,0,0,9,2,0,0,0,0,8,7,0,0,0,0,0,2,Lower Triangular,4,0,0,0,0,0,0,0,0,0,0,0,5,4,3,0,0,0,2,6,2,3,0,0,8,-2,0,1,8,0,-3,5,7,9,5,2,Diagonally Dominant,19,0,2,2,0,6,0,-15,2,0,-3,0,5,4,22,-1,0,4,2,3,2,13,0,-5,5,-2,0,1,16,0,-3,5,5,3,5,-32,Symmetric,3,0,2,2,0,6,0,7,4,3,-3,5,5,4,0,-1,0,4,2

16、,3,-1,9,0,-5,0,-3,0,0,5,5,6,5,4,-5,5,-3,Back Substitution,Used to solve upper triangular systemTx = b for x Methodology: one element of x can be immediately computed Use this value to simplify system, revealing another element that can be immediately computed Repeat,Back Substitution,1x0,+1x1,1x2,+4

17、x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,+4x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,x3 = 2,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,x2 = 3,Back Substitution,1x0,+1x1,3,=, 2

18、x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,3,=, 2x1,12,=,2x2,6,=,2x3,4,=,x1 = 6,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,x0 = 9,Pseudocode,for i n 1 down to 1 do x i b i / a i, i for j 0 to i 1 do b j b j x i a j, i endfor endfor

19、,Time complexity: (n2),Data Dependence Diagram,We cannot execute the outer loop in parallel. We can execute the inner loop in parallel.,Row-oriented Algorithm,Associate primitive task with each row of A and corresponding elements of x and b During iteration i task associated with row j computes new

20、value of bj Task i must compute xi and broadcast its value Agglomerate using rowwise interleaved striped decomposition,Interleaved Decompositions,Rowwise interleaved striped decomposition,Columnwise interleaved striped decomposition,Complexity Analysis,Each process performs about n / (2p) iterations

21、 of loop j in all A total of n -1 iterations in all Computational complexity: (n2/p) One broadcast per iteration Communication complexity: (n log p),Column-oriented Algorithm,Associate one primitive task per column of A and associated element of x Last task starts with vector b During iteration i ta

22、sk i computes xi, updates b, and sends b to task i -1 In other words, no computational concurrency Agglomerate tasks in interleaved fashion,Complexity Analysis,Since b always updated by a single process, computational complexity same as sequential algorithm: (n2) Since elements of b passed from one process to another each iteration, communication complexity is (n2),Comparison,Message- passing time dominates,Computation time dominates,Gaussian Elimination,Used to solve Ax = b when A is dense Reduces Ax = b to upper triangular system Tx = c Back su

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論