Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

Chapter 5: The Sieve of Eratosthenes

Chapter Objectives
- Analysis of block allocation schemes
- Function MPI_Bcast
- Performance enhancements

Outline
- Sequential algorithm
- Sources of parallelism
- Data decomposition options
- Parallel algorithm development and analysis
- MPI program
- Benchmarking
- Optimizations

Sequential Algorithm
[Figure: the sieve applied to the integers 2 through 61, successively striking out the multiples of 2, 3, 5, and 7.]
Complexity: Θ(n ln ln n)

Pseudocode
1. Create a list of unmarked natural numbers 2, 3, ..., n
2. k ← 2
3. Repeat
   (a) Mark all multiples of k between k^2 and n
   (b) k ← smallest unmarked number > k
   until k^2 > n
4. The unmarked numbers are primes

(A C rendering of this pseudocode follows.)
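The pseudocode translates almost line for line into C. The sketch below is not from the original slides; the array layout (marked[i] stands for the integer i + 2) and the function name are my own choices.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sequential Sieve of Eratosthenes: returns the number of primes <= n.
       marked[i] represents the integer i + 2. */
    int sieve (int n)
    {
        char *marked = calloc (n - 1, 1);       /* integers 2..n, all unmarked */
        if (marked == NULL) return -1;

        for (int k = 2; k * k <= n; k++) {      /* step 3: loop while k^2 <= n */
            if (marked[k - 2]) continue;        /* skip marked k: the smallest
                                                   unmarked number becomes k */
            for (int j = k * k; j <= n; j += k) /* step 3(a): mark k^2, k^2+k, ... */
                marked[j - 2] = 1;
        }

        int count = 0;                          /* step 4: unmarked = prime */
        for (int i = 0; i <= n - 2; i++)
            if (!marked[i]) count++;
        free (marked);
        return count;
    }

    int main (void)
    {
        printf ("%d\n", sieve (100));           /* prints 25 */
        return 0;
    }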
Sources of Parallelism
- Domain decomposition
  - Divide data into pieces
  - Associate computational steps with data
- One primitive task per array element

Making 3(a) Parallel
Mark all multiples of k between k^2 and n:

    for all j where k^2 <= j <= n do
        if j mod k = 0 then
            mark j (it is not a prime)
        endif
    endfor

Making 3(b) Parallel
Find the smallest unmarked number > k:
- Min-reduction (to find the smallest unmarked number > k)
- Broadcast (to get the result to all tasks)

Agglomeration Goals
- Consolidate tasks
- Reduce communication cost
- Balance computations among processes

Data Decomposition Options
- Interleaved (cyclic)
  - Easy to determine the "owner" of each index
  - Leads to load imbalance for this problem
- Block
  - Balances loads
  - More complicated to determine the owner if n is not a multiple of p

Block Decomposition Options
- Want to balance the workload when n is not a multiple of p
- Each process gets either ⌈n/p⌉ or ⌊n/p⌋ elements
- Seek simple expressions
  - Find low, high indices given an owner
  - Find owner given an index

Method #1
- Let r = n mod p
- If r = 0, all blocks have the same size
- Else
  - First r blocks have size ⌈n/p⌉
  - Remaining p − r blocks have size ⌊n/p⌋

Example: 17 elements among 7 processes gives r = 3, so the first 3 blocks have size 3 and the remaining 4 blocks have size 2.

Method #1 Calculations
- First element controlled by process i: i⌊n/p⌋ + min(i, r)
- Last element controlled by process i: (i + 1)⌊n/p⌋ + min(i + 1, r) − 1
- Process controlling element j: max(⌊j / (⌊n/p⌋ + 1)⌋, ⌊(j − r) / ⌊n/p⌋⌋)

Method #2
- Scatters larger blocks among processes
- First element controlled by process i: ⌊in/p⌋
- Last element controlled by process i: ⌊(i + 1)n/p⌋ − 1
- Process controlling element j: ⌊(p(j + 1) − 1)/n⌋

Example: 17 elements among 7 processes gives block boundaries ⌊17i/7⌋ = 0, 2, 4, 7, 9, 12, 14, i.e. block sizes 2, 2, 3, 2, 3, 2, 3, with the larger blocks scattered among the processes.

Comparing Methods
- Assuming no operations are needed for the "floor" function, Method #2's expressions for the low index, high index, and owner each take fewer arithmetic operations than Method #1's, which is why the program adopts Method #2.
Pop Quiz
Illustrate how block decomposition Method #2 would divide 13 elements among 5 processes.

    ⌊13·0/5⌋ = 0   ⌊13·1/5⌋ = 2   ⌊13·2/5⌋ = 5   ⌊13·3/5⌋ = 7   ⌊13·4/5⌋ = 10

so the five blocks are {0,1}, {2,3,4}, {5,6}, {7,8,9}, {10,11,12}.

Block Decomposition Macros

    #define BLOCK_LOW(id,p,n)      ((id)*(n)/(p))
    #define BLOCK_HIGH(id,p,n)     (BLOCK_LOW((id)+1,p,n)-1)
    #define BLOCK_SIZE(id,p,n)     (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))
    #define BLOCK_OWNER(index,p,n) (((p)*((index)+1)-1)/(n))

Local vs. Global Indices (13 elements among 5 processes)

    Local (L):  0 1 | 0 1 2 | 0 1 | 0 1 2 | 0 1  2
    Global (G): 0 1 | 2 3 4 | 5 6 | 7 8 9 | 10 11 12

Looping over Elements
Sequential program:

    for (i = 0; i < n; i++) {
        ...
    }

Parallel program:

    size = BLOCK_SIZE (id,p,n);
    for (i = 0; i < size; i++) {
        gi = i + BLOCK_LOW(id,p,n);   /* gi is the global index of local element i */
        ...
    }

(A standalone check of these macros follows.)
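To sanity-check the macros, a small standalone program (my own, not from the slides) can reproduce the pop-quiz decomposition of 13 elements among 5 processes:

    #include <stdio.h>

    #define BLOCK_LOW(id,p,n)      ((id)*(n)/(p))
    #define BLOCK_HIGH(id,p,n)     (BLOCK_LOW((id)+1,p,n)-1)
    #define BLOCK_SIZE(id,p,n)     (BLOCK_LOW((id)+1,p,n)-BLOCK_LOW(id,p,n))
    #define BLOCK_OWNER(index,p,n) (((p)*((index)+1)-1)/(n))

    int main (void)
    {
        int p = 5, n = 13;
        for (int id = 0; id < p; id++)          /* one line per block */
            printf ("process %d: elements %d..%d (size %d)\n", id,
                    BLOCK_LOW(id,p,n), BLOCK_HIGH(id,p,n), BLOCK_SIZE(id,p,n));
        for (int j = 0; j < n; j++)             /* owner of each element */
            printf ("element %2d -> process %d\n", j, BLOCK_OWNER(j,p,n));
        return 0;
    }

The first loop prints the block boundaries 0..1, 2..4, 5..6, 7..9, 10..12, matching the local/global table above.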
Decomposition Affects Implementation
- Largest prime used to sieve is ⌊√n⌋
- First process has ⌊n/p⌋ elements
- It has all the sieving primes if p < √n
- First process always broadcasts the next sieving prime
- No reduction step needed

Fast Marking
Block decomposition allows the same marking as the sequential algorithm,

    j, j + k, j + 2k, j + 3k, ...

instead of

    for all j in block
        if j mod k = 0 then mark j (it is not a prime)

(A sketch of how each process locates its first multiple appears below.)
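How does a process find where the marking starts inside its own block? The helper below is my own illustration (the names low_value and prime anticipate the program that follows); it returns the first local index to mark for a given sieving prime.

    /* First local index to mark, where low_value is the integer represented
       by local index 0 of this process's block. */
    int first_multiple (int low_value, int prime)
    {
        if (prime * prime > low_value)
            return prime * prime - low_value;    /* start marking at prime^2 */
        else if (low_value % prime == 0)
            return 0;                            /* block begins on a multiple */
        else
            return prime - (low_value % prime);  /* next multiple past low_value */
    }

Marking is then simply: for (i = first; i < size; i += prime) marked[i] = 1;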
Parallel Algorithm Development
1. Create a list of unmarked natural numbers 2, 3, ..., n
2. k ← 2
3. Repeat
   (a) Mark all multiples of k between k^2 and n
   (b) k ← smallest unmarked number > k
   (c) Process 0 broadcasts k to the rest of the processes
   until k^2 > n
4. The unmarked numbers are primes
5. Reduction to determine the number of primes

Function MPI_Bcast

    int MPI_Bcast (
       void         *buffer,    /* Addr of 1st element */
       int           count,     /* # elements to broadcast */
       MPI_Datatype  datatype,  /* Type of elements */
       int           root,      /* ID of root process */
       MPI_Comm      comm)      /* Communicator */

For example, broadcasting the sieving prime k from process 0:

    MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

Task/Channel Graph
[Figure: task/channel graph of the parallel sieve.]

Analysis
- χ is the time needed to mark a cell
- Sequential execution time: χ n ln ln n
- Number of broadcasts: approximately √n / ln √n
- Broadcast time: λ ⌈log p⌉, where λ is the message latency
- Expected parallel execution time:

    χ (n ln ln n) / p + (√n / ln √n) λ ⌈log p⌉
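Before the full program, a minimal self-contained MPI_Bcast demo may help; this example is mine, not from the slides.

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int id, k = 0;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);
        if (!id) k = 17;                           /* only the root knows k... */
        MPI_Bcast (&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf ("Process %d has k = %d\n", id, k); /* ...now every process does */
        MPI_Finalize ();
        return 0;
    }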
Code (1/4)

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include "MyMPI.h"                     /* BLOCK_LOW, BLOCK_HIGH, BLOCK_SIZE */
    #define MIN(a,b)  ((a)<(b)?(a):(b))

    int main (int argc, char *argv[])
    {
       int    count;                       /* local prime count */
       double elapsed_time;                /* parallel execution time */
       int    first;                       /* index of first multiple */
       int    global_count;                /* global prime count */
       int    i, id, index, n, p;
       int    low_value, high_value;       /* lowest/highest value in block */
       char  *marked;                      /* portion of 2, ..., n */
       int    prime, proc0_size, size;

       MPI_Init (&argc, &argv);
       MPI_Barrier (MPI_COMM_WORLD);
       elapsed_time = -MPI_Wtime();
       MPI_Comm_rank (MPI_COMM_WORLD, &id);
       MPI_Comm_size (MPI_COMM_WORLD, &p);

       if (argc != 2) {
          if (!id) printf ("Command line: %s <m>\n", argv[0]);
          MPI_Finalize();
          exit (1);
       }

Code (2/4)

       n = atoi (argv[1]);
       low_value  = 2 + BLOCK_LOW(id,p,n-1);
       high_value = 2 + BLOCK_HIGH(id,p,n-1);
       size = BLOCK_SIZE(id,p,n-1);
       proc0_size = (n-1)/p;
       if ((2 + proc0_size) < (int) sqrt((double) n)) {
          if (!id) printf ("Too many processes\n");
          MPI_Finalize();
          exit (1);
       }
       marked = (char *) malloc (size);
       if (marked == NULL) {
          printf ("Cannot allocate enough memory\n");
          MPI_Finalize();
          exit (1);
       }

Code (3/4)

       for (i = 0; i < size; i++) marked[i] = 0;
       if (!id) index = 0;
       prime = 2;
       do {
          if (prime * prime > low_value)
             first = prime * prime - low_value;
          else if (!(low_value % prime))
             first = 0;
          else
             first = prime - (low_value % prime);
          for (i = first; i < size; i += prime) marked[i] = 1;
          if (!id) {
             while (marked[++index]);     /* advance to next unmarked cell */
             prime = index + 2;
          }
          MPI_Bcast (&prime, 1, MPI_INT, 0, MPI_COMM_WORLD);
       } while (prime * prime <= n);

Code (4/4)

       count = 0;
       for (i = 0; i < size; i++)
          if (!marked[i]) count++;
       MPI_Reduce (&count, &global_count, 1, MPI_INT, MPI_SUM,
                   0, MPI_COMM_WORLD);
       elapsed_time += MPI_Wtime();
       if (!id) {
          printf ("%d primes are less than or equal to %d\n", global_count, n);
          printf ("Total elapsed time: %10.6f\n", elapsed_time);
       }
       MPI_Finalize ();
       return 0;
    }
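Assuming the book's MyMPI.h is on the include path and the source is saved as sieve.c (both assumptions of mine), a typical build and run would be:

    mpicc -O2 -o sieve sieve.c -lm
    mpirun -np 8 ./sieve 100000000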
Benchmarking
- Execute the sequential algorithm to determine χ = 85.47 nanoseconds
- Execute a series of broadcasts to determine λ = 250 microseconds

Execution Times (sec)
[Table: predicted and measured execution times of the parallel sieve.]

Improvements
- Delete even integers
  - Cuts the number of computations in half
  - Frees storage for larger values of n
- Each process finds its own sieving primes
  - Replicates the computation of the primes up to √n
  - Eliminates the broadcast step
- Reorganize loops (see the sketch below)
  - Increases the cache hit rate
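A sketch of the loop reorganization, with my own names and chunk size; the idea from the slides is to apply every sieving prime to one cache-sized chunk of the array before moving on, instead of sweeping the whole array once per prime.

    /* Reorganized marking: process the local block in cache-sized chunks.
       primes[0..nprimes-1] are the sieving primes up to sqrt(n), found
       locally by each process (improvement #2 above). */
    #define CHUNK 32768                          /* tuning choice, not from the slides */

    void mark_chunked (char *marked, long size, long low_value,
                       long *primes, int nprimes)
    {
        for (long base = 0; base < size; base += CHUNK) {
            long limit = (base + CHUNK < size) ? base + CHUNK : size;
            for (int t = 0; t < nprimes; t++) {
                long k = primes[t];
                long v = low_value + base;       /* value of marked[base] */
                long first;                      /* first index to mark */
                if (k * k >= v)
                    first = k * k - low_value;   /* begin at k^2 */
                else
                    first = (v % k) ? base + k - v % k : base;
                for (long i = first; i < limit; i += k)
                    marked[i] = 1;               /* strike multiples in this chunk */
            }
        }
    }

Each chunk stays in cache while all the primes are applied to it, which is what raises the hit rate in the figure below.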
Reorganize Loops
[Figure: striking one prime at a time across the whole array yields a lower cache hit rate; striking all primes within one chunk at a time yields a higher one.]

Comparing 4 Versions
[Table: execution times of the original program and the three improved versions.]

Summary
- Sieve of Eratosthenes: the parallel design uses domain decomposition
- Compared two block distributions; chose the one with simpler formulas
- Introduced MPI_Bcast
- The optimizations reveal the importance of maximizing single-processor performance
Parallel Programming in C with MPI and OpenMP
Michael J. Quinn

Chapter 6: Floyd's Algorithm

Chapter Objectives
- Creating 2-D arrays
- Thinking about "grain size"
- Introducing point-to-point communications
- Reading and printing 2-D matrices
- Analyzing performance when computations and communications overlap

Outline
- All-pairs shortest path problem
- Dynamic 2-D arrays
- Parallel algorithm design
- Point-to-point communication
- Block row matrix I/O
- Analysis and benchmarking

All-pairs Shortest Path Problem
[Figure: a directed graph on vertices A, B, C, D, E with edge weights 4, 6, 1, 3, 5, 3, 1, 2.]

Floyd's Algorithm

    for k ← 0 to n-1
        for i ← 0 to n-1
            for j ← 0 to n-1
                a[i,j] ← min (a[i,j], a[i,k] + a[k,j])
            endfor
        endfor
    endfor

(A C version follows.)
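A direct sequential C rendering of the pseudocode; this is my own sketch, not from the slides, and the INF sentinel for "no edge" is an assumption.

    #define N   5          /* matrix order; 5 matches the A..E example */
    #define INF 1000000    /* "no edge" sentinel, large but overflow-safe */

    /* Floyd's algorithm: on return, a[i][j] is the length of the
       shortest path from vertex i to vertex j. */
    void floyd (int a[N][N])
    {
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (a[i][k] + a[k][j] < a[i][j])
                        a[i][j] = a[i][k] + a[k][j];  /* relax through k */
    }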
Why It Works
[Figure: vertices i, k, j, showing the shortest path from i to k through 0, 1, ..., k-1, the shortest path from k to j through 0, 1, ..., k-1, and the shortest path from i to j through 0, 1, ..., k-1.]
When vertex k is considered, the best i-to-j path either avoids k (already recorded in a[i,j]) or passes through k once, splitting into a best i-to-k path and a best k-to-j path, both using only intermediate vertices 0, ..., k-1.

Dynamic 1-D Array Creation
[Figure: a 1-D array allocated on the heap, with a pointer to it on the run-time stack.]

Dynamic 2-D Array Creation
[Figure: the matrix elements in one heap block Bstorage, a heap array B of row pointers into it, and pointers to both on the run-time stack. See the sketch below.]
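The Bstorage/B picture corresponds to one contiguous heap block for the elements plus a heap array of row pointers. A sketch using those names (the function signature and error handling are mine):

    #include <stdlib.h>

    /* Allocate an m x n int matrix: Bstorage holds all elements
       contiguously, and B[i] points at the start of row i. */
    int **alloc_matrix (int m, int n, int **Bstorage_out)
    {
        int  *Bstorage = malloc ((size_t) m * n * sizeof(int));
        int **B        = malloc ((size_t) m * sizeof(int *));
        if (Bstorage == NULL || B == NULL) {
            free (Bstorage);
            free (B);
            return NULL;
        }
        for (int i = 0; i < m; i++)
            B[i] = &Bstorage[(size_t) i * n];   /* row i starts at offset i*n */
        *Bstorage_out = Bstorage;               /* caller frees Bstorage, then B */
        return B;
    }

Keeping the elements contiguous pays off later: a process can read, send, or receive its whole block of rows with a single call instead of one call per row.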
Designing the Parallel Algorithm
- Partitioning
- Communication
- Agglomeration and Mapping

Partitioning
- Domain or functional decomposition? Look at the pseudocode
- The same assignment statement is executed n^3 times
- No functional parallelism
- Domain decomposition: divide matrix A into its n^2 elements

Communication
[Figure: grid of primitive tasks; updating a[3,4] when k = 1.]
- Iteration k: every task in row k broadcasts its value within its task column
- Iteration k: every task in column k broadcasts its value within its task row

Agglomeration and Mapping
- Number of tasks: static
- Communication among tasks: structured
- Computation time per task: constant
- Strategy:
  - Agglomerate tasks to minimize communication
  - Create one task per MPI process

Two Data Decompositions
[Figure: rowwise block striped vs. columnwise block striped decomposition of the matrix.]

Comparing Decompositions
- Columnwise block striped
  - Broadcast within columns eliminated
- Rowwise block striped
  - Broadcast within rows eliminated
  - Reading the matrix from a file is simpler
- Choose the rowwise block striped decomposition (see the sketch below)
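With the rowwise decomposition chosen, iteration k of Floyd's algorithm reduces to one broadcast of row k followed by purely local updates. The sketch below is my own condensation, assuming the BLOCK_* macros from Chapter 5 and a caller-supplied buffer tmp of n ints.

    #include <mpi.h>
    #include "MyMPI.h"   /* BLOCK_LOW, BLOCK_SIZE, BLOCK_OWNER */

    /* One iteration k of parallel Floyd with rowwise block striping.
       a[] holds this process's rows; tmp receives a copy of row k. */
    void floyd_iteration (int **a, int *tmp, int n, int k, int id, int p)
    {
        int root = BLOCK_OWNER(k, p, n);          /* which process owns row k? */
        if (id == root) {
            int local_k = k - BLOCK_LOW(id, p, n);
            for (int j = 0; j < n; j++)
                tmp[j] = a[local_k][j];           /* copy row k into the buffer */
        }
        MPI_Bcast (tmp, n, MPI_INT, root, MPI_COMM_WORLD);

        for (int i = 0; i < BLOCK_SIZE(id, p, n); i++)
            for (int j = 0; j < n; j++)
                if (a[i][k] + tmp[j] < a[i][j])
                    a[i][j] = a[i][k] + tmp[j];   /* relax local rows through k */
    }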
File Input
[Figure: how the matrix file is read and distributed among the processes.]

Pop Quiz
Why don't we input the entire file at once and then scatter its contents among the processes, allowing concurrent message passing?

Point-to-point Communication
- Involves a pair of processes
- One process sends a message
- The other process receives the message

Send/Receive Not Collective
Unlike MPI_Bcast, MPI_Send and MPI_Recv involve only the two processes named in the call, not the whole communicator.

Function MPI_Send

    int MPI_Send (
       void         *message,
       int           count,
       MPI_Datatype  datatype,
       int           dest,
       int           tag,
       MPI_Comm      comm
    )

Function MPI_Recv

    int MPI_Recv (
       void         *message,
       int           count,
       MPI_Datatype  datatype,
       int           source,
       int           tag,
       MPI_Comm      comm,
       MPI_Status   *status
    )

Coding Send/Receive

    ...
    if (ID == j) {
        ...
        Receive from i
        ...
    }
    ...
    if (ID == i) {
        ...
        Send to j
        ...
    }
    ...

The Receive appears before the Send in the program text. Why does this work? Each process executes only its own branch; the receiver simply blocks inside MPI_Recv until the sender reaches MPI_Send.

Inside MPI_Send and MPI_Recv
[Figure: the message travels from the sending process's program memory, possibly through the sending and receiving system buffers, into the receiving process's program memory.]

Return from MPI_Send
- The function blocks until the message buffer is free
- The message buffer is free when
  - the message has been copied to a system buffer, or
  - the message has been transmitted
- Typical scenario:
  - the message is copied to a system buffer
  - transmission overlaps computation

Return from MPI_Recv
- The function blocks until the message has arrived and been copied into the program memory of the receiving process
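A runnable version of the Coding Send/Receive sketch, with process 0 sending one int to process 1 (the ranks, tag, and payload are my choices); run it with at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int id, value;
        MPI_Status status;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &id);

        if (id == 1) {                            /* receiver posts first... */
            MPI_Recv (&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf ("Process 1 received %d\n", value);
        }
        if (id == 0) {                            /* ...and blocks until this send */
            value = 42;
            MPI_Send (&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize ();
        return 0;
    }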