Chapter 5: General Methods of Parallel Program Design

Creating a Parallel Program
- In theory, this can be done by the programmer, the compiler, the run-time system, or the OS.
- In practice, parallel programs are created with:
  - an explicitly parallel language (e.g., High Performance Fortran), or
  - a library implementing a programming model:
    - shared-memory libraries (POSIX threads, PARMACS);
    - message-passing libraries (the Message Passing Interface).
- What will you realize by the end of this section? Parallel programming is difficult!

Terminology
- A task (subtask) is a piece of work.
  - Ocean: a grid point, a row, or a plane. Apache: a single query.
- Task granularity: small tasks are fine-grain; large tasks are coarse-grain.
- A process (thread) performs tasks.
  - To the OS, process = thread(s) + address space.
  - A process is executed on a processor.

Parallelizing a program involves:
- identifying portions of the work that can be performed concurrently;
- mapping the concurrent pieces of work onto multiple processes running in parallel;
- distributing the input, output, and intermediate data associated with the problem;
- managing accesses to data shared by multiple processors;
- synchronizing the processors at various stages of the parallel program execution.

A First Look at Parallel Algorithm Design

Steps for Creating a Parallel Program
1. Decomposition of the computation into tasks
2. Assignment of tasks to processes
3. Orchestration of data access, communication, etc.
4. Mapping of processes to processors
(Figure: the general flow of parallel algorithm design. Partitioning (decomposition + assignment) yields processes P0, P1, ...; orchestration yields the parallel program; mapping places it on the parallel computer.)

Decomposition
- Decompose the computation into a set of tasks.
- Determine the size and number of tasks in the parallel program; this can be static or dynamic.
- Remember Amdahl's Law! (What does it imply here?)

Decomposition: Basic Concepts
- The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
- A given problem may be decomposed into tasks in many different ways.
- Tasks may be of the same, different, or even indeterminate sizes.
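As a toy illustration of choosing the number and size of tasks, here is a sketch (the chunk-sum example and all names in it are ours, not from the slides) of statically decomposing one computation into fine-grain or coarse-grain tasks:

```python
def decompose_sum(data, chunk_size):
    """Statically decompose a summation into independent tasks, one per chunk.

    Small chunks give many fine-grain tasks; large chunks give few
    coarse-grain tasks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

data = list(range(100))
fine = decompose_sum(data, 5)      # 20 fine-grain tasks
coarse = decompose_sum(data, 50)   # 2 coarse-grain tasks

# The tasks are independent, so their partial sums could run concurrently
# and then be combined by a reduction:
total = sum(sum(task) for task in fine)
```

The choice of chunk_size is exactly the granularity decision: more tasks expose more concurrency, but add management and communication overhead.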
A decomposition can be illustrated in the form of a directed graph, with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.

Goals of Decomposition
- Exploit potential concurrency.
- Minimize inter-process communication.
- Limit the sizes of processes.

Difficulties
- How do we measure the effectiveness of a decomposition before allocation?
- The partitioning criteria conflict: exploiting parallelism tends to raise inter-process communication overhead.

Assignment
- Assign tasks to processes or threads (static vs. dynamic).
- Balance the workload (load balancing).
- Reduce communication.
- Minimize overhead.
When we are not being strict, decomposition and assignment are lumped together: Assignment + Decomposition = Partitioning.

Partitioning Techniques
How do we decompose a task into subtasks? There is no single recipe that works for all problems! Commonly used techniques include:
- recursive decomposition, or divide and conquer;
- data decomposition, or domain partitioning;
- functional decomposition;
- exploratory decomposition;
- speculative decomposition;
- hybrid approaches.

Recursive Decomposition
- Generally suited to problems that are solved using the divide-and-conquer strategy.
- A given problem is first decomposed into a set of sub-problems; these sub-problems are recursively decomposed further until a desired granularity is reached.

Example 1
A classic example of a divide-and-conquer algorithm on which we can apply recursive decomposition is Quicksort. Once the list has been partitioned around the pivot, each sub-list can be processed concurrently.

Example 2
The problem of finding the minimum number in a given list (or indeed any other associative operation, such as sum, AND, max, n!, etc.) can be fashioned as a divide-and-conquer algorithm. That is, it is a reduction S := S op c, where op is the associative operator. We first start with a simple serial loop for computing the minimum entry in a given list:

procedure SERIAL_MIN (A, n)
begin
  min := A[0];
  for i := 1 to n-1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end SERIAL_MIN

(the corresponding serial algorithm)

The corresponding recursive algorithm (the middle of the listing is garbled in the source; restored here from the standard divide-and-conquer formulation):

procedure RECURSIVE_MIN (A, n)
begin
  if (n = 1) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN (A, n/2);
    rmin := RECURSIVE_MIN (&A[n/2], n - n/2);
    if (lmin < rmin) then min := lmin;
    else min := rmin;
  endelse;
  return min;
end RECURSIVE_MIN

Example 3: Simulating Tree Growth
(Figures: Tree0, Tree1, Tree2.) Let Tree0 be a straight line segment of length L0. From the midpoint and the trisection points of Tree0, segments of length L0/2 or L0/3 extend at a fixed angle, giving Tree1. Applying the same transformation to every branch of Tree1 gives Tree2, and so on, until a complete tree Tn is obtained, where n is the simulation precision, i.e., the number of iterations.

Data Decomposition
- Identify the data on which computations are performed.
- Partition this data across various tasks. This partitioning induces a decomposition of the problem.
- Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.

Output Data Decomposition
- Often, each element of the output can be computed independently of the others (simply as a function of the input).
- A partition of the output across tasks decomposes the problem naturally (partitioning according to the output domain).

Example
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. The output matrix C can be partitioned into four tasks as follows (the block equations were lost in extraction and are restored here from the standard formulation):
  Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
  Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
  Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
  Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2

Input Data Partitioning
- Generally applicable if each output can be naturally computed as a function of the input.
- In many cases, this is the only natural decomposition because the output is not clearly known a priori (e.g., finding the minimum in a list, sorting a given list, etc.).
- A task is associated with each input data partition. Each task performs as much of the computation as it can with its part of the data; subsequent processing combines these partial results.

Example
In the database counting example, the input (i.e., the transaction set) can be partitioned. This induces a task decomposition in which each task generates partial counts for all itemsets; these are combined subsequently for aggregate counts.

Intermediate Data Partitioning
- Computation can often be viewed as a sequence of transformations from the input to the output data.
- In these cases, it is often beneficial to use one of the intermediate stages as a basis for decomposition.

Example: in numerical computing, matrix operations frequently rely on partitioning. The main methods are partitioning by rows, partitioning by columns, and partitioning into sub-matrices (blocks).

Domain Decomposition
- Works well for scientific, engineering, graphics, ... applications.
- Exploits the locally-biased nature of physical problems: information requirements are often short-range, or long-range but falling off with distance.
- Simple example: nearest-neighbor grid computation. The communication-to-computation ratio behaves like perimeter to area (area to volume in 3-D); it depends on n and p, decreasing with n and increasing with p.
- Another example: finding the shortest path on a map.

Example: Newton's method for a root of f(x) = 0. Using the derivative f'(x), iterate
  x_{n+1} = x_n - f(x_n) / f'(x_n)
and stop when |x_{n+1} - x_n| < ε. (Figure: the iterates on the graph of f.)

Functional Decomposition
- Discovers parallelism at the level of control, procedures, functions, and tasks.
- Example: a traffic-information service grid, partitioned by function into a map query service, a road-condition display service, and a travel-planning service.
- Partition the computation into many small tasks of approximately uniform computational requirements, and associate data with each task.
- Common in problems where there are no obvious data structures to partition, or where the data structures are highly unstructured.

Exploratory Decomposition
- Commonly used in search-space exploration.
- Unlike data decomposition, the search space is not known beforehand.
- Computation can terminate as soon as a solution is found.
- The amount of work can be more or less than in the sequential case.

Speculative Decomposition
- In some applications, dependencies between tasks are not known a priori, so it is impossible to identify independent tasks in advance.
- There are generally two approaches to such applications: conservative approaches, which identify independent tasks only when they are guaranteed not to have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
- Conservative approaches may yield little concurrency; optimistic approaches may require a roll-back mechanism in the case of an error.

Example
A classic example of speculative decomposition is discrete event simulation.
- The central data structure in a discrete event simulation is a time-ordered event list. Events are extracted precisely in time order, processed, and, if required, resulting events are inserted back into the event list.
- Consider your day as a discrete event system: you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep. Each of these events may be processed independently; however, while driving to work you might meet with an unfortunate accident and not get to work at all. An optimistic scheduling of the later events would then have to be rolled back.

Example
Another example is the simulation of a network of nodes (for instance, an assembly line, or a computer network through which packets pass). The task is to simulate the behavior of this network for various inputs and node-delay parameters (note that networks may become unstable for certain values of service rates, queue sizes, etc.).

Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the following examples:
- In Quicksort, recursive decomposition alone limits concurrency (why?). A mix of data and recursive decomposition is more desirable.
- In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well.
- Even for simple problems like finding the minimum of a list of numbers, a mix of data and recursive decomposition works well.

Communication
- Data association and data dependence make communication an objective fact of parallel programs.
- Communication is expensive. Why?

Orchestration
- Choreograph data access, communication, and synchronization.
- Reduce the cost of communication and synchronization.
- Preserve data locality (data layout).
- Schedule tasks (order of execution).
- Reduce the overhead of managing parallelism.
- Requires good primitives (from the architecture and the programming model).
(Orchestration turns the processes into a parallel program.)

Orchestration for Performance
- Exploit temporal and spatial locality.
- Temporal locality affects replication: touching too much data causes capacity misses; counter this with computation blocking.

Spatial Locality
- Granularities: communication grain, allocation grain, and coherence grain (for cache-coherent shared memory).
- What benefits do you get from a larger block size? A potential disadvantage is false sharing: two or more processors access the same cache block but don't share any of the data in it.

Minimizing Communication Overhead
- Maximizing data locality
- Minimizing contention and hot spots
- Overlapping computation with communication
- Replicating data or computation
- Using optimized collective communication routines
- Overlapping communication

Agglomeration (Granularity)
- Agglomeration is mainly about balancing computation against communication.
- (Figures: a fine-grain process communication graph vs. a coarse-grain process communication graph.)
- Placement also interacts with the other phases, e.g., scheduling.
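The slides note earlier that for nearest-neighbor grid computations the communication-to-computation ratio behaves like perimeter to area, decreasing with n and increasing with p. A numeric sketch of that claim (the function names and the strip vs. block comparison are our illustration):

```python
def strip_ratio(n, p):
    """1-D (row-strip) partition of an n x n nearest-neighbor grid among p
    processes: each process exchanges its top and bottom boundary rows."""
    comm = 2 * n            # two boundary rows of n points each
    comp = n * n / p        # grid points owned, i.e. computation per step
    return comm / comp      # simplifies to 2p/n

def block_ratio(n, p):
    """2-D (block) partition: each process owns a square block of side
    n/sqrt(p) and exchanges its four edges."""
    side = n / p ** 0.5
    return (4 * side) / (side * side)   # simplifies to 4*sqrt(p)/n

r_strip = strip_ratio(1024, 64)   # 0.125
r_block = block_ratio(1024, 64)   # 0.03125
# Both ratios fall as n grows and rise as p grows; for p > 4 the block
# partition communicates less per unit of computation than the strip.
```

This is the quantitative reason agglomeration favors "chunky" 2-D blocks over thin strips once enough processes are involved.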
- Can agglomeration be dynamic? For message passing (MP), partitioning should yield a coarse process communication graph.

Agglomeration Example
(Figures: the task graph above can be re-agglomerated as shown, and after agglomeration the graph below is obtained.)

Grain Size
- Task size affects the communication-to-computation ratio (C-to-C ratio) and communication overheads.
- By task (grain) size we distinguish fine grain, medium grain, and coarse grain: finer grain gives a higher degree of parallelism, but also a higher C-to-C ratio and increasing communication demand and mapping/scheduling overhead.

Impact of Task Granularity
Granularity = the amount of work associated with a task.
- Large tasks: worse load balancing, but lower overhead, less contention, and less communication.
- Small tasks: better load balancing, but too much synchronization, too much management overhead, and possibly too much communication (affinity scheduling can help).

Cost-Effective?
- Isn't Speedup(P) > Costup(P) good enough? E.g., for an SGI PowerChallenge with 500 MB: Costup(32) = 8.6.

Efficient and Optimal Parallel Algorithms
- A parallel algorithm is efficient iff it is fast (e.g., polynomial time) and the product of the parallel time and the number of processors is close to the time of the best known sequential algorithm: T_sequential ≈ T_parallel × N_processors.
- A parallel algorithm is optimal iff this product is of the same order as the best known sequential time.

Mapping (Allocation) and Scheduling
- Mapping: assigning tasks to processors, commonly called allocation. Viewed statically, it is completed at compile time, independent of run time.
- Scheduling: assigning tasks to processors while also deciding the moment of each assignment. Strictly speaking, it can only be completed at run time.
- When we are not being strict, allocation, mapping, and scheduling are often lumped together as one concept.

WHY PROCESS GROUPS?
If the number of processes equals the number of processors, and it is possible to satisfy all processes' needs for processor types, then allocate each process to a separate processor, keeping all processor types compatible with the processes' needs; else we must form groups of processes, with each group allocated a single processor and all processes within a group running in time-sharing mode.

We may face two possible situations. Given: a computer with N processors (of possibly differing characteristics, like speed, amount of local memory, presence or absence of floating-point facilities, graphics, FFT, etc.)
and an algorithm consisting of a mix of P processes (of possibly differing processor-type preferences).

The Processor Allocation Problem
Decide:
- which process(es) to run on which processor(s), when, and for how long;
- the order sequence of such assignments;
- all of it depending on the computer architecture and how it facilitates interprocess communication.
Our measure of success is calculated according to some selected criterion (like speedup, node efficiency, area efficiency, etc.).

A Preliminary Analysis of Task Allocation
- Combinatorially, with P processes and N processors there are N^P possible allocations (or more, if we allow replication to enhance fault tolerance).
- Allocation effectiveness depends on the allocation GOAL, like: minimize total IPC cost; minimize total computation and IPC cost; minimize completion time; minimize load imbalance; maximize system reliability.
- Feasible allocations must meet system constraints, like memory capacity and processing-time limits.

The Parallel Task Scheduling Problem
There are two logically independent phases: mapping and scheduling.
- Task mapping (Map) determines which processing element each task node is placed on:
  Assign := { (Ti, Pj) | Ti → Pj, i ≤ n, j ≤ k },
  where Ti is a task node, Pj a processing element, n the number of task nodes, and k the number of processing elements.
- Concrete scheduling then determines, for every processing element, when each task node mapped onto it starts executing:
  Scheduling := ∪_{j=1..k} { Ti | (Ti, Pj) ∈ Assign } → ST(Ti),
  where ST(Ti) is the start time of task Ti.
- In some algorithms the two phases are explicit, as in cluster-based scheduling; in others the boundary is invisible, as in list-based scheduling.

A Framework for Scheduling
The main factors affecting scheduling are the dependencies among subtasks, processor speeds, the communication overhead between subtasks, load balance, and others; these feed a formal description and a deeper analysis.

Deeper Analysis
- In the vast majority of cases, finding an optimal schedule is NP-complete.
- When the task graph is a free tree, all task nodes have identical weights, and the number of processors is unlimited, Hu gave an optimal algorithm of linear time complexity.
- Ullman proved that scheduling unit-weight task graphs onto P processors, and scheduling task graphs whose node weights are one or two units onto two processors, are both NP-complete; Papadimitriou et al. proved that even with task duplication allowed, scheduling arbitrary unit-weight task graphs onto P processors remains NP-complete.
- When the task graph has arbitrary structure, nodes have identical weights, and the number of processors is limited to two, Coffman et al. gave an optimal O(n^4) algorithm (n is the number of task nodes); Sethi then improved Coffman's algorithm to an almost linear-time optimal solution. None of these three results takes communication between nodes into account.
- Under certain restrictions, duplication-based scheduling can also reach optimal solutions in non-exponential time.

Classifying Parallel Task Scheduling
- dedicated scheduling
- self-scheduling

A Model for Parallel Task Scheduling
Tree structures and Fork/Join structures have been used to model parallel task scheduling, but they only capture special cases. To represent the broad variety of parallel task scheduling problems, research naturally turned to graph theory: the parallel task graph, a directed acyclic graph (DAG).

DAG Terminology
- join node: a task node with multiple predecessor nodes
- fork node: a task node with multiple successor nodes
- single node: an isolated node left after graph decomposition
- pseudo-entry and pseudo-exit nodes
- predecessor set of node i: pred(vi) = { vj | e_ji ∈ E }
- successor set of node i: succ(vi) = { vj | e_ij ∈ E }
- earliest start time of node i: est(vi)
- ready time of task vi: rdy(vs, vi)
- earliest completion time of node i: ect(vi)
- critical path
Formally, DAG := (V, E), V = { v1, v2, ..., vn }, E = { e1, e2, ..., em }, ei ∈ V × V.

A question (and a possible topic for a short paper): how is the parallel task graph (DAG) obtained (generated) in the first place?

Comparing Scheduling Algorithms: 1. List Scheduling
The basic idea of list scheduling is to build a scheduling list by ranking the nodes by priority, and then, until all nodes are scheduled, repeatedly take the next node from the scheduling list and assign it to the machine that gives it the earliest start time. Representative algorithms: HLFET (Highest Level First with Estimated Time), ISH (Insertion Scheduling Heuristic), ETF (Earliest Time First), MCP (Modified Critical Path), DLS (Dynamic Level Scheduling), and MH (Mapping Heuristic).

Comparing Scheduling Algorithms: 2. Duplication-Based Scheduling
The main idea is to map some tasks of the task graph redundantly onto several processors so as to reduce interprocessor communication: idle processor time is used to duplicate predecessor tasks, which avoids transferring some predecessors' communication data and thereby shortens the gaps during which processors wait. Representative algorithms: DSH (Duplication Scheduling Heuristic), BTDH (Bottom-Up Top-Down Duplication Heuristic), CPFD (Critical Path Fast Duplication), PY (proposed by Papadimitriou and Yannakakis), and DFRN (Duplication First and Reduction Next).

Comparing Scheduling Algorithms: 3. Cluster-Based Scheduling
The basic idea is to map all tasks of the given task graph onto an unrestricted number of clusters. At each step, the task chosen for clustering may be any task, not necessarily a ready one; each clustering round merges some clusters from the previous round into new clusters. Two tasks assigned to the same cluster execute on the same processor. Besides the clustering steps, the algorithm must finally schedule the mapped clusters, i.e., order the tasks of each cluster in time on its processor; and if the number of processors is limited, the clustering steps must also bring the number of clusters down to the number of available processors. Representative algorithms: MD (Mobility Directed), LC (Linear Clustering), EZ (Edge-Zeroing), DSC (Dominant Sequence Clustering), and DCP (Dynamic Critical Path).

Comparing Scheduling Algorithms: 4. Nondeterministic Scheduling
Nondeterministic scheduling, also called random-search scheduling, searches the solution space through guided random choices rather than purely random search, combining knowledge from earlier search results with specific random-search features to generate new results. Genetic algorithms are the most popular and widely used technique of this kind; their scheduling times are generally higher than those of other techniques, and a set of control parameters tuned for one kind of task graph does not suit another, i.e., for a new task graph a genetic algorithm needs a long period of training. Simulated annealing also belongs to this class. Examples: genetic algorithms, simulated annealing, A*, tabu search, and hill climbing.

Heuristics for independent tasks:
- Min-Min: the task/resource pair that can complete earliest is assigned first: min_i min_j predtime(task_i, processor_j).
- Max-Min: the task whose earliest completion time is longest is assigned first: max_i min_j predtime(task_i, processor_j).
- Sufferage: the task that would "suffer" most if given a poor schedule is assigned first, where sufferage = (second-best completion time) - (best completion time) over processors.
- Extended Sufferage: minimal completion times are computed for each task on each cluster, and the sufferage
heuristic is applied to those values.
- Workqueue: a randomly chosen task is assigned first.

Some Other Terms

A generic timing algorithm for parallel task graphs (the listing is partially garbled and truncated in the source; cleaned up only lightly):
  L1:  L ← { i | indegree(i) = 0, 1 ≤ i ≤ N }
  L2:  ts(i) = 0, te(i) = 0, 1 ≤ i ≤ N; _pidle(x) = 0, 0 ≤ x < P
  L3:
  L4:  Do until L empty
  L5:    (1) For each idle processor p(x)
  L6:          i ← Sched(L, p(x))
  L7:          if i ≠ 0
  L8:            L ← L − {i}
  L9:            +i
  L10:           ts(i) ← max(ts(i), _pidle(x))
  L11:   (2) For each executable task j
  L12:         j
  L13:         te(j) ← ts(j) + Tcomp(j) + Tcomm(j)
  L14:   _
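The Min-Min rule listed above (assign first the task/processor pair with the smallest predicted completion time) can be sketched for independent tasks as follows; the execution-time table and all names are our made-up illustration:

```python
def min_min(exec_time):
    """Min-Min scheduling of independent tasks.

    exec_time[t][p] is the predicted cost of task t on processor p.
    Repeatedly pick the (task, processor) pair with the smallest predicted
    completion time, assign it, and update that processor's ready time."""
    n_procs = len(exec_time[0])
    ready_at = [0.0] * n_procs                  # when each processor frees up
    unassigned = set(range(len(exec_time)))
    schedule = {}
    while unassigned:
        # min over tasks of (min over processors of completion time):
        finish, t, proc = min((ready_at[p] + exec_time[t][p], t, p)
                              for t in unassigned for p in range(n_procs))
        schedule[t] = proc
        ready_at[proc] = finish
        unassigned.remove(t)
    return schedule, max(ready_at)              # assignment and makespan

times = [[3, 5],    # task 0 on processors 0 and 1 (made-up numbers)
         [2, 4],    # task 1
         [6, 1]]    # task 2
schedule, makespan = min_min(times)
# Task 2 is placed first (cost 1 on processor 1), then task 1 and task 0
# both land on processor 0; the makespan is 5.
```

A Max-Min variant would instead take, at each step, the task whose best completion time is largest; Sufferage would rank tasks by the gap between their best and second-best completion times.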