Overview of Parallel Computing

Notice!
- Course Portal

Agenda
- 1. Parallel computing models
- 2. Technical issues in parallel algorithm design

Von Neumann Model

Instruction Processing
- Fetch instruction from memory
- Decode instruction
- Evaluate address
- Fetch operands from memory
- Execute operation
- Store result

Parallel Computing Model
- A computing model is the bridge between software and hardware: general-purpose, scalable hardware; transportable software
- An abstract architecture for algorithm development
- E.g. PRAM, BSP, LogP

Parallel Programming Model
- What the programmer uses when coding applications
- Specifies communication and synchronization
- The communication primitives exposed at the user level realize the programming model
- E.g. uniprocessor, multiprogramming, data parallel, message passing, shared address space

Aspects of Parallel Processing
- (Figure) Layers of parallel processing: the algorithm developer and application developer work against the parallel computing model, while the system programmer and architecture designer work against the parallel programming model and the middleware that realizes it

Parallel Computing Models: PRAM
- PRAM: Parallel Random Access Machine
- A set of p processors
- A global shared memory
- Each processor can access any memory location in one time step
- Globally synchronized, executing the same program in lockstep

Illustration of PRAM
- (Figure) Processors P1, P2, ..., Pp connected to a single shared memory and driven by a common clock
- Each processor has a unique index
- A single program is executed in MIMD mode

Features
- Model architecture: RAMs synchronized by a common clock, but not SIMD; operation is MIMD
- No local memory in each RAM
- One global shared memory: a single-address-space architecture
- Synchronization, communication, and parallelism overheads are zero

Features (cont'd)
- Operations per step: read/write a word from/to memory, or a local operation
- One instruction can perform the following three operations in one cycle (a small sketch of such a lockstep computation follows):
  - Fetch one or two words from memory as operands
  - Perform an arithmetic/logic operation
  - Store the result back in memory
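As a concrete illustration (not part of the slides), the sketch below simulates a classic PRAM-style parallel sum: in each globally synchronized step, every active processor reads two words from shared memory, adds them, and writes the result back, so n values are reduced in O(log n) steps.

```python
# Minimal simulation of a PRAM-style parallel sum (illustrative sketch only).
# Shared memory is a Python list; each while-iteration models one globally
# synchronized step in which all active processors operate in lockstep.

def pram_parallel_sum(values):
    shared = list(values)                  # global shared memory
    n = len(shared)
    stride = 1
    while stride < n:                      # O(log n) synchronous steps
        updates = []
        for i in range(0, n - stride, 2 * stride):
            # processor for pair i: fetch two operands, add, store
            updates.append((i, shared[i] + shared[i + stride]))
        for i, v in updates:               # all writes take effect "simultaneously"
            shared[i] = v
        stride *= 2
    return shared[0]

print(pram_parallel_sum(range(1, 9)))      # 36
```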

Problems with PRAM
- An inaccurate description of real-world parallel systems
- Unaccounted costs: latency, bandwidth, non-local memory access, memory access contention, synchronization costs, etc.
- Algorithms that appear to work well on the PRAM may perform poorly in practice

PRAM Variants
- Variants arise to model some of these costs
- Each introduces some practical aspect of the machine
- Gives the algorithm designer a better target for optimization
- The variants can be grouped into 4 categories: memory access, synchronization, latency, bandwidth

Memory Access
- It is impractical to allow concurrent reads and writes to the same memory location: contention issues
- CRCW PRAM variants (the write-conflict rules are illustrated in the sketch below):
  - CPRAM-CRCW (Common PRAM-CRCW): concurrent writes are allowed only if all processors write the same value
  - PPRAM-CRCW (Priority PRAM-CRCW): only the highest-priority processor is allowed to write
  - APRAM-CRCW (Arbitrary PRAM-CRCW): an arbitrary processor is allowed to write
- EREW or CREW PRAM
- QRQW (queue-read, queue-write) PRAM
- Expensive: the multiple ports required for concurrent access may be prohibitively expensive
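The three CRCW rules differ only in how simultaneous writes to one cell are resolved. This illustrative sketch (not from the slides) resolves a batch of concurrent writes under each rule; treating a lower processor index as higher priority is an assumption made for the example.

```python
# Resolving concurrent writes to a single memory cell under the CRCW rules.
# Each write is (processor_id, value); lower id = higher priority (assumed).

def resolve_crcw(writes, rule):
    if rule == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common CRCW: all writers must write the same value"
        return values.pop()
    if rule == "priority":
        return min(writes)[1]      # the highest-priority (smallest id) writer wins
    if rule == "arbitrary":
        return writes[0][1]        # any one writer may win
    raise ValueError(rule)

writes = [(3, 7), (1, 5), (2, 7)]
print(resolve_crcw(writes, "priority"))   # 5
print(resolve_crcw(writes, "arbitrary"))  # 7 here, but any attempted value is legal
```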

Synchronization
- The standard PRAM is globally synchronized and charges no cost for synchronization
- Unrealistic: synchronization is necessary and expensive in practical parallel systems
- Variants that model the cost of synchronization:
  - APRAM (asynchronous PRAM): each processor has its own local memory, local clock, and local program; there is no global clock, and the processors execute asynchronously; processors communicate through the shared memory; dependences between processors must be enforced by explicit synchronization barriers in the parallel program
  - XPRAM (bulk-synchronous PRAM, also known as the BSP model)
- These variants give algorithm designers an incentive to synchronize only when necessary

Latency
- The standard PRAM assumes unit cost for non-local memory access
- In practice, non-local memory access has a severe effect on performance
- PRAM variant: LPRAM (local-memory PRAM)
  - A set of nodes, each with a processor and a local memory; the nodes communicate through a globally shared memory
  - Two types of steps are defined and accounted for separately: computation steps, in which each processor performs one operation on local data, and communication steps, in which each processor can write, and then read, a word from global memory
  - A cost of L units is charged for each access to global memory

Synchronization (cont.)
- BPRAM (Block PRAM): like the BSP model, it assumes n nodes, each containing a processor and a memory module, interconnected by a communication medium
- A BSP computation is a sequence of phases called supersteps: within one superstep each processor can operate on data residing in its local memory, send messages, and, at the end, execute a global synchronization instruction; a message sent during a superstep becomes available to its destination only after that synchronization
- Cost: L units are charged for the first message and b units for each subsequent contiguous block

Bandwidth
- The standard PRAM assumes unlimited bandwidth; in practice, bandwidth is limited
- PRAM variants:
  - DRAM (Distributed Random Access Machine): a 2-level memory hierarchy; an access to global memory is charged a cost based on the possible data congestion
  - PRAM(m): the global memory is segmented into modules; in any given step, only m memory accesses can be serviced

Other Distributed Models
- Distributed Memory Model: no global memory; each processor has its own local memory
- Postal Model: a processor sends a request for non-local memory and, instead of stalling, continues working while the data is en route

Network Models
- Focus on the impact of the topology of the communication network
- An early focus of parallel computation
- Unlike the Distributed Memory Model, the cost of a remote memory access is a function of both the topology and the access pattern
- Provides incentives for efficient data mappings and communication routing

LogP Model
- (Figure) A sender/receiver timeline showing the send overhead o, the network latency L, the gap g between successive messages, and the receive overhead o

Model Parameters
- Latency (L): the delay incurred in communicating a message from source to destination (hop count and hop delay)
- Communication overhead (o): the length of time a processor is engaged in sending or receiving a message; the per-node overhead of processing a send or receive
- Communication bandwidth, or gap (g): the minimum time interval between consecutive messages
- Processor count (P): the number of processors

LogP
- The LogP model is a distributed-memory, point-to-point-communication multiprocessor model in which the communication network is described by four parameters:
  - (1) L (latency): an upper bound on the delay incurred in communicating a message (of one or a few words) from a source processor to a destination processor; it represents the message delay in the network
  - (2) o (overhead): the time a processor spends preparing to send or receive each message (including operating-system kernel overhead and network software overhead), during which it cannot perform any other operation
  - (3) g (gap): the minimum time interval between two consecutive message sends or receives at a single processor; its reciprocal is the per-processor communication bandwidth
  - (4) P: the number of processor/memory modules
- Assuming that one local operation completes in one cycle, defined as one time unit, L, o, and g can all be expressed as integer multiples of the processor cycle (a rough cost sketch follows)
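As a back-of-the-envelope illustration (an assumption-laden sketch, not part of the slides), the four parameters can be combined to estimate simple communication patterns: a single message costs roughly o + L + o, and a processor streaming k messages to one destination is paced by the gap g.

```python
# Rough LogP cost estimates, all in processor cycles.
# The parameter values are invented purely for illustration.
L, o, g, P = 10, 2, 4, 8

def one_message():
    # send overhead + network latency + receive overhead
    return o + L + o

def k_messages_one_receiver(k):
    # Successive sends are spaced by max(o, g); the last message still needs
    # L cycles in flight plus the receive overhead at the destination.
    return (k - 1) * max(o, g) + o + L + o

print(one_message())                # 14
print(k_messages_one_receiver(5))   # 4*4 + 2 + 10 + 2 = 30
```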

Characteristics of LogP
- (1) It captures the performance bottleneck between the network and the processors: g reflects the communication bandwidth, and at most L/g messages can be in transit between processors at any one time
- (2) Processors work asynchronously and synchronize through inter-processor message passing
- (3) It reflects multithreading to some extent: each physical processor can simulate several virtual processors (VPs); when a VP issues an access request, the computation does not stall, but the number of VPs is limited by the communication bandwidth and the context-switch overhead; bounded by the network capacity, there can be at most L/g VPs
- (4) Message latency is nondeterministic but bounded by L: the waiting time a message experiences is unpredictable, but in the absence of blocking it is at most L
- (5) LogP encourages programmers to adopt good strategies such as careful work assignment, overlapping computation with communication, and balanced communication patterns
- (6) It allows the actual running time of an algorithm to be estimated

Shortcomings of LogP
- (1) It does not describe the communication patterns inside the network in depth; for example, retransmitted messages saturating the bandwidth or intermediate router buffers filling up are not modeled
- (2) LogP is mainly suited to the design of message-passing algorithms; for shared-memory models it simply treats a remote read as two message transmissions, ignoring pipelined prefetching, cache-induced data inconsistency, and the effect of cache hit rates on the computation
- (3) It does not account for the context-switch overhead of multithreading
- (4) LogP assumes that communication uses point-to-point message routers, which places on the programmer the burden of reasoning about the routing-related communication operations

Bridging Models
- Candidate Type Architecture (CTA) model
- A finite number of von Neumann computers executing asynchronously
- A global controller for synchronization
- A 2-level memory hierarchy: provides incentives for locality of reference
- Assumes unlimited bandwidth and zero latency
- Does not provide incentives for reducing the number of messages or for avoiding synchronization

Bridging Models 2
- Bulk Synchronous Parallel (BSP)
- P processors with local memory
- A router
- Facilities for periodic global synchronization, every l steps
- Models bandwidth limitations, latency, and synchronization costs
- Does not model communication overhead or processor topology

BSP Computer
- A distributed-memory architecture with 3 components:
  - Node: a processor plus local memory
  - Router (communication network): point-to-point message passing (or shared variables)
  - Barrier synchronizing facility: for all processors or a subset

Illustration of BSP
- (Figure) Nodes (local computation rate w) connected by a communication network (bandwidth parameter g), with a barrier facility (synchronization cost l)

BSP
- The BSP model is a distributed-memory MIMD computing model with the following characteristics:
  - It separates the processors from the router, emphasizing the separation of computation and communication; the router only delivers point-to-point messages and provides no combining, replication, or broadcast functions, which both hides the concrete interconnection topology and simplifies the communication protocol
  - Global synchronization by hardware barriers happens at a controllable, coarse-grained level, giving an efficient way to execute tightly coupled, synchronous parallel algorithms without placing an excessive burden on the programmer
  - When analyzing the performance of the BSP model, it is assumed that a local operation completes in one time step and that in each superstep a processor sends or receives at most h messages (an h-relation)

Three Parameters
- w parameter: the maximum computation time within each superstep; a computation operation takes at most w cycles
- g parameter: the number of cycles needed to communicate a unit message when all processors are involved in communication (the network bandwidth); with h-relation coefficient h, a communication operation takes g*h cycles
- l parameter: a barrier synchronization takes l cycles (a cost sketch follows)
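A hedged sketch, not from the slides: combining these parameters, the cost of one superstep is commonly taken as w + g*h + l, and the cost of a program is the sum over its supersteps; the parameter values below are invented for illustration.

```python
# BSP cost-model sketch: cost(superstep) = w + g*h + l, all in cycles.
# g, l and the per-superstep (w, h) values are illustrative assumptions.

def bsp_cost(supersteps, g, l):
    # supersteps: list of (w, h) = (max local computation, max messages per node)
    return sum(w + g * h + l for w, h in supersteps)

program = [(1000, 20), (400, 50), (800, 10)]   # three supersteps
print(bsp_cost(program, g=4, l=100))           # 1180 + 700 + 940 = 2820
```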

BSP Program
- A BSP computation consists of S supersteps
- A superstep is a sequence of computation steps followed by a barrier synchronization
- Any remote memory accesses take effect at the barrier: the model is loosely synchronous

BSP Program (illustration)
- (Figure) Processors P1 through P4 alternate computation and communication within superstep 1 and superstep 2, separated by barriers

Model Survey Summary
- No single model is acceptable
- Across the models, a common subset of characteristics receives most of the attention:
  - Computational parallelism
  - Communication latency
  - Communication overhead
  - Communication bandwidth
  - Execution synchronization
  - Memory hierarchy
  - Network topology

Parallel Machine Trends
- The machine organization of most parallel machines is similar: a collection of complete computers (microprocessor, cache memory, sizable DRAM memory) connected by a robust communication network
- Motivated by cost and by the availability of commodity components
- No single programming methodology is dominant

Other Considerations
- Processor count: the number of nodes, relative to the price of the most expensive supercomputer divided by the cost of a node
- The communication interval lags far behind processor-memory bandwidth
- The presence of adaptive routing and fault-recovery networking affects algorithm design
- Parallel algorithms are developed assuming a large number of data elements per processor
- Attempts to exploit the network topology or the exact processor count are not very robust

Computational Parallelism
- Number of physical processors
- Static versus dynamic parallelism: should the number of processors be fixed?
- Fault-recovery networks allow for node failure
- Many parallel systems allow incremental upgrades by increasing the node count

Latency
- Fixed or variable message length?
- Network topology?
- Communication overhead?
- Contention-based latency?
- Memory hierarchy?

Bandwidth
- A limited resource
- With low latency there is a tendency to abuse bandwidth by flooding the network

Synchronization
- Solving a wide class of problems requires asynchronous parallelism
- Synchronization is achieved via message passing
- Synchronization is therefore a communication cost

Unified Model?
- Compelling motivation, but difficult: parallel machines are complicated and still evolving, and users come from diverse disciplines
- Requires a common set of characteristics derived from the needs of different users
- Again, a balance is needed between descriptivity and prescriptivity

Algorithms and Concurrency
- Introduction to parallel algorithms: tasks and decomposition; processes and mapping; processes versus processors
- Decomposition techniques: recursive decomposition, data decomposition, exploratory decomposition, hybrid decomposition
- Characteristics of tasks and interactions: task generation, granularity, and context; characteristics of task interactions

Concurrency and Mapping
- Mapping techniques for load balancing: static and dynamic mapping
- Methods for minimizing interaction overheads: maximizing data locality, minimizing contention and hot spots, overlapping communication and computation, replication versus communication, group communication versus point-to-point communication
- Parallel algorithm design models: data-parallel, work-pool, task-graph, master-slave, pipeline, and hybrid models

Preliminaries: Decomposition, Tasks, and Dependency Graphs
- The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently
- A given problem may be decomposed into tasks in many different ways
- Tasks may be of the same, different, or even indeterminate sizes
- A decomposition can be illustrated as a directed graph whose nodes correspond to tasks and whose edges indicate that the result of one task is required for processing the next; such a graph is called a task dependency graph (a small sketch follows)
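As a minimal illustration (the four-task graph is invented, not taken from the slides), a task dependency graph can be stored directly as a mapping from each task to the tasks it depends on; the tasks with no predecessors are the ones that may start concurrently.

```python
# A task dependency graph as a mapping: task -> set of tasks it depends on.
# The four-task DAG below is invented purely for illustration.
deps = {
    "A": set(),        # A and B have no predecessors
    "B": set(),
    "C": {"A", "B"},   # C needs the results of A and B
    "D": {"C"},        # D needs the result of C
}

ready = [t for t, predecessors in deps.items() if not predecessors]
print(ready)           # ['A', 'B'] -- these tasks can start concurrently
```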

Example: Multiplying a Dense Matrix with a Vector
- The computation of each element of the output vector y is independent of the other elements
- Based on this observation, a dense matrix-vector product can be decomposed into n tasks
- The figure highlights the portion of the matrix and vector accessed by Task 1
- Observations: while the tasks share data (namely, the vector b), they have no control dependencies, i.e., no task needs to wait for the (partial) completion of any other; all tasks are of the same size in terms of the number of operations
- Is this the maximum number of tasks into which we could decompose this problem? (A sketch of the decomposition follows.)
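A minimal sketch of this fine-grained decomposition (the matrix and vector contents are invented): task i computes y[i], and because the n tasks are independent they can be handed to a thread pool in any order.

```python
# Fine-grained decomposition of y = A*b: task i computes y[i] = sum_j A[i][j]*b[j].
# The matrix and vector below are arbitrary illustration data.
from concurrent.futures import ThreadPoolExecutor

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [1, 0, 2]

def task(i):
    # Each task reads only row i of A and the shared vector b.
    return sum(A[i][j] * b[j] for j in range(len(b)))

with ThreadPoolExecutor() as pool:
    y = list(pool.map(task, range(len(A))))
print(y)   # [7, 16, 25]
```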

Example: Database Query Processing
- Consider the execution of the query MODEL = CIVIC AND YEAR = 2001 AND (COLOR = GREEN OR COLOR = WHITE) on the following database
- The execution of the query can be divided into subtasks in various ways; each task can be thought of as generating an intermediate table of entries that satisfy a particular clause
- (Figure) Decomposing the given query into a number of tasks; edges in this graph denote that the output of one task is needed to accomplish the next

Example: Database Query Processing (cont.)
- The same problem can be decomposed into subtasks in other ways as well
- (Figure) An alternate decomposition of the given problem into subtasks, along with their data dependencies
- Different task decompositions may lead to significant differences in their eventual parallel performance (a toy sketch of the two decompositions follows)
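To make the two decompositions concrete, here is a toy sketch (the table contents are invented; only the task structure matters): each leaf task produces a set of matching record ids, and the edges of the task graphs correspond to set intersections and unions.

```python
# Toy version of the query decompositions: each task yields a set of record ids.
# The car records are invented for illustration.
cars = {
    1: ("CIVIC", 2001, "GREEN"), 2: ("CIVIC", 2001, "WHITE"),
    3: ("CIVIC", 1999, "GREEN"), 4: ("COROLLA", 2001, "WHITE"),
}

def matching(predicate):
    return {rec_id for rec_id, rec in cars.items() if predicate(rec)}

# Independent leaf tasks
civic = matching(lambda r: r[0] == "CIVIC")
year  = matching(lambda r: r[1] == 2001)
white = matching(lambda r: r[2] == "WHITE")
green = matching(lambda r: r[2] == "GREEN")

# Decomposition 1: (CIVIC AND 2001) combined with (GREEN OR WHITE)
result1 = (civic & year) & (green | white)
# Decomposition 2: CIVIC combined with (2001 AND (GREEN OR WHITE))
result2 = civic & (year & (green | white))
print(result1, result2)   # both {1, 2}
```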

Granularity of Task Decompositions
- The number of tasks into which a problem is decomposed determines its granularity
- Decomposition into a large number of tasks results in a fine-grained decomposition; decomposition into a small number of tasks results in a coarse-grained decomposition
- (Figure) A coarse-grained counterpart to the dense matrix-vector product example: each task corresponds to the computation of three elements of the result vector (a sketch follows)
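A quick sketch of the coarse-grained variant (data and block size invented for illustration): rows of A are grouped into blocks of three, so each task now produces three elements of y.

```python
# Coarse-grained decomposition: one task per block of 3 rows of A.
# The matrix data and the block size are illustrative assumptions.
A = [[i + j for j in range(6)] for i in range(6)]   # 6x6 example matrix
b = [1] * 6

def block_task(rows):
    return [sum(A[i][j] * b[j] for j in range(len(b))) for i in rows]

blocks = [range(0, 3), range(3, 6)]   # 2 coarse tasks instead of 6 fine ones
y = [v for block in blocks for v in block_task(block)]
print(y)   # [15, 21, 27, 33, 39, 45]
```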

Degree of Concurrency
- The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition
- Since the number of tasks that can be executed in parallel may change over program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution
- What is the maximum degree of concurrency of the database query examples?
- The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program
- Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition?
- The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa

Critical Path Length
- A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other
- The longest such path determines the shortest time in which the program can be executed in parallel
- The length of the longest path in a task dependency graph is called the critical path length (see the sketch below)
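The sketch below computes the critical path length and the maximum degree of concurrency of a task dependency graph. The seven-task DAG is an invented example shaped like the first query decomposition, and the uniform 10-unit task cost follows the simplification posed on the next slide; the actual graphs on the slides may give different numbers.

```python
# Critical path length and maximum degree of concurrency of a task DAG.
# 'deps' maps each task to its predecessors; the graph shape and the uniform
# 10-unit task cost are illustrative assumptions.
from functools import lru_cache

deps = {
    "civic": set(), "y2001": set(), "white": set(), "green": set(),
    "green_or_white": {"white", "green"},
    "civic_and_2001": {"civic", "y2001"},
    "final": {"green_or_white", "civic_and_2001"},
}
COST = 10

@lru_cache(maxsize=None)
def finish_time(task):
    return COST + max((finish_time(p) for p in deps[task]), default=0)

print(max(finish_time(t) for t in deps))   # 30: three tasks lie on the longest path

# A simple level-by-level view of concurrency: with equal task costs, tasks
# sharing the same earliest finish time can execute at the same time.
levels = {}
for t in deps:
    levels.setdefault(finish_time(t), []).append(t)
print(max(len(tasks) for tasks in levels.values()))   # 4: the four leaf selections
```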

Critical Path Length (cont.)
- Consider the task dependency graphs of the two database query decompositions:
  - What are the critical path lengths of the two task dependency graphs?
  - If each task takes 10 time units, what is the shortest parallel execution time for each decomposition?
  - How many processors are needed in each case to achieve this minimum parallel execution time?
  - What is the maximum degree of concurrency?

Limits on Parallel Performance
- It would appear that the parallel time can be made arbitrarily small by making the decomposition ever finer in granularity
- There is, however, an inherent bound on how fine the granularity of a computation can be; for example, when multiplying a dense matrix with a vector there can be no more than Θ(n²) concurrent tasks
- Concurrent tasks may also have to exchange data with other tasks, which results in communication overhead
- The tradeoff between the granularity of a decomposition and the associated overheads often determines the performance bounds

Task Interaction Graphs
- Subtasks in a decomposition generally exchange data with one another; for example, even in the trivial decomposition of the dense matrix-vector product, if the vector is not replicated across all tasks, they have to communicate elements of the vector
- The graph of tasks (nodes) and their interactions/data exchanges (edges) is referred to as a task interaction graph
- Note that task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies

Task Interaction Graphs: An Example
- Consider the problem of multiplying a sparse matrix A with a vector b; the following observations can be made:
  - As before, the computation of each element of the result vector can be viewed as an independent task
  - Unlike the dense matrix-vector product, only the non-zero elements of matrix A participate in the computation
  - If, for memory optimality, we also partition b across the tasks, then the task interaction graph of the computation is identical to the graph of the matrix A (the graph for which A represents the adjacency structure); a sketch follows
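A sketch of this observation (the sparse pattern is invented, chosen so that task 0 has the three neighbors used in the example on the next slide): the nonzero structure of A directly yields the task interaction graph when task i owns row i of A and element b[i].

```python
# Task interaction graph for sparse y = A*b when task i owns row i and b[i].
# Task i needs b[j] from task j whenever A[i][j] != 0 and i != j.
# The sparse pattern below is invented for illustration.
nonzero_cols = {
    0: {0, 1, 4, 5}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3},
    4: {0, 4, 5, 6, 7}, 5: {0, 4, 5, 8},
    6: {4, 6}, 7: {4, 7}, 8: {5, 8},
}

interaction_graph = {i: sorted(j for j in cols if j != i)
                     for i, cols in nonzero_cols.items()}
for i, neighbors in interaction_graph.items():
    print(i, "->", neighbors)       # e.g. 0 -> [1, 4, 5]
```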

Task Interaction Graphs, Granularity, and Communication
- In general, the finer the granularity of a decomposition, the larger the associated overhead as a fraction of the useful work per task
- Example: consider the sparse matrix-vector product from the previous slide, and assume that each node takes unit time to process and that each interaction (edge) causes an overhead of one time unit
- Viewing node 0 as an independent task involves a useful computation of one time unit and a communication overhead of three time units
- If instead we treat nodes 0, 4, and 5 as one task, that task performs three time units of useful computation and incurs communication corresponding to four time units (four edges); clearly a more favorable ratio than the former case (see the sketch below)
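A tiny sketch of this arithmetic (using the invented interaction graph from the previous sketch): useful work is one unit per node inside the task, and overhead is one unit per edge leaving the task.

```python
# Computation-to-communication ratio when nodes of the interaction graph are
# fused into one task: one unit of work per node, one unit of overhead per
# edge that leaves the task. The graph is the invented one from the sketch above.
graph = {0: {1, 4, 5}, 1: {0}, 2: {3}, 3: {2}, 4: {0, 5, 6, 7},
         5: {0, 4, 8}, 6: {4}, 7: {4}, 8: {5}}

def work_and_overhead(task_nodes):
    work = len(task_nodes)
    overhead = sum(1 for n in task_nodes for m in graph[n] if m not in task_nodes)
    return work, overhead

print(work_and_overhead({0}))        # (1, 3): node 0 alone
print(work_and_overhead({0, 4, 5}))  # (3, 4): the coarser task has a better ratio
```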

Processes and Mapping
- In general, the number of tasks in a decomposition exceeds the number of processing elements available; for this reason, a parallel algorithm must also provide a mapping of tasks to processes
- Note: we refer to the mapping as being from tasks to processes, rather than directly to physical processors
