CS 267: Applications of Parallel Computers
Lecture 19: Graph Partitioning, Part II
Kathy Yelick (/cs267), 8/7/2020

Slide 2: Recap of Last Lecture
- Partitioning with nodal coordinates: inertial method; projection onto a sphere
  - Algorithms are efficient
  - Rely on graphs having nodes connected (mostly) to "nearest neighbors" in space
- Partitioning without nodal coordinates:
  - Breadth-first search: simple, but not a great partition
  - Kernighan-Lin: good corrector given a reasonable partition
  - Spectral method: good partitions, but slow
- Today: spectral methods revisited; multilevel methods

Slide 3: Basic Definitions
- Definition: the Laplacian matrix L(G) of a graph G(N,E) is an |N|-by-|N| symmetric matrix, with one row and column for each node. It is defined by:
  - L(G)(i,i) = degree of node i (number of incident edges)
  - L(G)(i,j) = -1 if i != j and there is an edge (i,j)
  - L(G)(i,j) = 0 otherwise
- Example: for the graph G on nodes 1..5 with edges (1,2), (1,3), (2,3), (3,4), (3,5), (4,5):

            2 -1 -1  0  0
           -1  2 -1  0  0
    L(G) = -1 -1  4 -1 -1
            0  0 -1  2 -1
            0  0 -1 -1  2

Slide 4: Properties of Laplacian Matrix
- Theorem 1: given G, L(G) has the following properties (proof on web page):
  - L(G) is symmetric. This means the eigenvalues of L(G) are real and its eigenvectors are real and orthogonal.
  - Rows of L sum to zero: let e = [1,...,1]^T, the column vector of all ones. Then L(G)*e = 0.
  - The eigenvalues of L(G) are nonnegative: 0 = λ1 ≤ λ2 ≤ ... ≤ λn.
  - The number of connected components of G is equal to the number of λi equal to 0.
- Definition: λ2(L(G)) is the algebraic connectivity of G
  - The magnitude of λ2 measures connectivity
  - In particular, λ2 != 0 if and only if G is connected.

Slide 5: Spectral Bisection Algorithm
- Compute the eigenvector v2 corresponding to λ2(L(G))
- For each node n of G: if v2(n) < 0, put node n in partition N-; else put node n in partition N+
- Why does this make sense? Recall that λ2(L(G)) is the algebraic connectivity of G.
- Theorem (Fiedler): let G1(N,E1) be a subgraph of G(N,E), so that G1 is "less connected" than G. Then λ2(L(G1)) ≤ λ2(L(G)), i.e. the algebraic connectivity of G1 is less than or equal to the algebraic connectivity of G. (Proof on web page.)
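The definition, Theorem 1, and the bisection rule can all be checked numerically. The sketch below uses the 5-node example graph from the slides with 0-based node numbering (the edge list is read off the -1 entries of the example L(G)); NumPy's dense `eigh` stands in for a real sparse eigensolver.

```python
import numpy as np

# The 5-node example graph: triangle {0,1,2} and triangle {2,3,4} sharing node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
n = 5

# Build the Laplacian: L[i][i] = degree(i), L[i][j] = -1 on each edge (i,j).
L = np.zeros((n, n))
for i, j in edges:
    L[i, j] = L[j, i] = -1
    L[i, i] += 1
    L[j, j] += 1

# Properties from Theorem 1.
eigvals, eigvecs = np.linalg.eigh(L)      # symmetric -> real, orthogonal spectrum
assert np.allclose(L @ np.ones(n), 0)     # rows sum to zero: L e = 0
assert eigvals[0] > -1e-10                # eigenvalues are nonnegative
assert eigvals[1] > 1e-10                 # lambda_2 != 0: the graph is connected

# Spectral bisection: split on the sign of the second eigenvector v2.
v2 = eigvecs[:, 1]
N_minus = [i for i in range(n) if v2[i] < 0]
N_plus = [i for i in range(n) if v2[i] >= 0]
```

For this graph the two triangles land on opposite sides, with the shared node 2 near the zero crossing, which is exactly the weak connection the separator should cut.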

Slide 6: Motivation for Spectral Bisection
- A vibrating string has modes of vibration, or harmonics
- The modes are computable as follows:
  - Model the string as masses connected by springs (a 1D mesh)
  - Write down F = ma for the coupled system to get a matrix A
  - Eigenvalues and eigenvectors of A are the frequencies and shapes of the modes
- Label nodes by whether the mode is - or + there to get N- and N+
- The same idea works for other graphs (e.g. a planar graph is like a trampoline)

Slide 7: Eigenvectors of L(1D mesh)
[Figure: eigenvector 1 (all ones), eigenvector 2, and eigenvector 3 of L for a 1D mesh.]

Slide 8: 2nd eigenvector of L(planar mesh)
[Figure: the second eigenvector plotted over a planar mesh.]

Slide 9: Computing v2 and λ2 of L(G) using Lanczos
- Given any n-by-n symmetric matrix A (such as L(G)), Lanczos computes a k-by-k "approximation" T by doing k matrix-vector products, k << n
- Approximate A's eigenvalues/eigenvectors using T's

    Choose an arbitrary starting vector r
    b(0) = ||r||
    j = 0
    repeat
        j = j + 1
        q(j) = r / b(j-1)          ... scale a vector
        r = A * q(j)               ... matrix-vector multiplication, the most expensive step
        r = r - b(j-1) * q(j-1)    ... "saxpy", or scalar*vector + vector
        a(j) = q(j)^T * r          ... dot product
        r = r - a(j) * q(j)        ... "saxpy"
        b(j) = ||r||               ... compute vector norm
    until convergence              ... details omitted

- T is the k-by-k symmetric tridiagonal matrix

    T =  a(1) b(1)
         b(1) a(2) b(2)
              b(2) a(3) b(3)
                    ...
                   b(k-2) a(k-1) b(k-1)
                          b(k-1) a(k)

Slide 10: Spectral Bisection: Summary
- The Laplacian matrix represents graph connectivity
- The second eigenvector gives a graph bisection:
  - Roughly equal "weights" in the two parts
  - A weak connection in the graph will be the separator
- Implementation via the Lanczos algorithm:
  - To optimize sparse matrix-vector multiply, we graph partition
  - To graph partition, we find an eigenvector of a matrix associated with the graph
  - To find an eigenvector, we do sparse matrix-vector multiply
  - Have we made progress? The first matrix-vector multiplies are slow, but we use them to learn how to make the rest faster
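The Lanczos loop above can be sketched directly in NumPy. This is an illustrative version only: it does no reorthogonalization and no convergence test (it just runs a fixed k steps), and the test matrix (a 1D-mesh Laplacian) and step count are arbitrary choices, not from the slides.

```python
import numpy as np

def lanczos(A, k, rng=np.random.default_rng(0)):
    """k steps of Lanczos: build a k-by-k tridiagonal T whose extreme
    eigenvalues approximate A's.  One matrix-vector product per step."""
    n = A.shape[0]
    alpha, beta, Q = [], [], []
    q_prev = np.zeros(n)
    r = rng.standard_normal(n)           # arbitrary starting vector
    b = np.linalg.norm(r)                # b(0) = ||r||
    for _ in range(k):
        q = r / b                        # scale a vector
        Q.append(q)
        r = A @ q                        # matrix-vector multiply (most expensive step)
        r = r - b * q_prev               # "saxpy"
        a = q @ r                        # dot product
        r = r - a * q                    # "saxpy"
        alpha.append(a)
        beta.append(np.linalg.norm(r))   # compute vector norm
        b, q_prev = beta[-1], q
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return T, np.array(Q).T

# Example: Laplacian of a 1D mesh (path graph) with 20 nodes, k = 10 << n... ish.
n = 20
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = A[-1, -1] = 1
T, Q = lanczos(A, k=10)
```

The eigenvalues of T (Ritz values) converge fastest at the extremes of A's spectrum, which is what spectral bisection needs.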

Slide 11: Introduction to Multilevel Partitioning
- If we want to partition G(N,E), but it is too big to do efficiently, what can we do?
  1) Replace G(N,E) by a coarse approximation Gc(Nc,Ec), and partition Gc instead
  2) Use the partition of Gc to get a rough partitioning of G, and then iteratively improve it
- What if Gc is still too big? Apply the same idea recursively.

Slide 12: Multilevel Partitioning - High Level Algorithm

    (N+, N-) = Multilevel_Partition(N, E)
        ... recursive partitioning routine returns N+ and N-, where N = N+ U N-
        if |N| is small
            (1) Partition G = (N,E) directly to get N = N+ U N-
            Return (N+, N-)
        else
            (2) Coarsen G to get an approximation Gc = (Nc, Ec)
            (3) (Nc+, Nc-) = Multilevel_Partition(Nc, Ec)
            (4) Expand (Nc+, Nc-) to a partition (N+, N-) of N
            (5) Improve the partition (N+, N-)
            Return (N+, N-)
        endif

- The recursion traces a "V-cycle": steps (2,3) coarsen on the way down, step (1) partitions the smallest graph at the bottom, and steps (4,5) expand and improve on the way back up
- How do we Coarsen? Expand? Improve?

Slide 13: Multilevel Kernighan-Lin
- Coarsen the graph and expand the partition using maximal matchings
- Improve the partition using Kernighan-Lin
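The high-level algorithm above can be sketched as a short recursive function. This is a toy, not METIS: the coarsening greedily collapses one matching level, the direct partition at the bottom is a trivial split, and step (5), the Kernighan-Lin improvement, is omitted.

```python
def multilevel_partition(nodes, edges, small=4):
    """Toy multilevel bisection following steps (1)-(5) above."""
    if len(nodes) <= small:
        # (1) Partition directly: trivial split of a small node set.
        half = len(nodes) // 2
        return set(nodes[:half]), set(nodes[half:])
    # (2) Coarsen: greedily collapse matched edges into supernodes.
    parent = {}
    for i, j in edges:
        if i not in parent and j not in parent:
            parent[i] = parent[j] = i            # collapse edge (i,j)
    for v in nodes:
        parent.setdefault(v, v)                  # unmatched nodes survive as-is
    coarse_nodes = sorted(set(parent.values()))
    coarse_edges = {(min(parent[i], parent[j]), max(parent[i], parent[j]))
                    for i, j in edges if parent[i] != parent[j]}
    # (3) Recurse on the coarse graph.
    cplus, cminus = multilevel_partition(coarse_nodes, sorted(coarse_edges), small)
    # (4) Expand: each fine node inherits its supernode's side.
    nplus = {v for v in nodes if parent[v] in cplus}
    nminus = {v for v in nodes if parent[v] in cminus}
    # (5) Improve: omitted (Kernighan-Lin would go here).
    return nplus, nminus

# Example: a 16-node path graph 0-1-2-...-15.
nodes = list(range(16))
edges = [(i, i + 1) for i in range(15)]
Nplus, Nminus = multilevel_partition(nodes, edges)
```

On the path graph this happens to find the optimal bisection (two halves of 8, cutting a single edge), because each coarsening level halves the path.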

Slide 14: Maximal Matching
- Definition: a matching of a graph G(N,E) is a subset Em of E such that no two edges in Em share an endpoint
- Definition: a maximal matching of a graph G(N,E) is a matching Em to which no more edges can be added while remaining a matching
- A simple greedy algorithm computes a maximal matching:

    let Em be empty
    mark all nodes in N as unmatched
    for i = 1 to |N|                 ... visit the nodes in any order
        if i has not been matched
            mark i as matched
            if there is an edge e = (i,j) where j is also unmatched
                add e to Em
                mark j as matched
            endif
        endif
    endfor

Slide 15: Maximal Matching: Example
[Figure: a maximal matching on an example graph.]

Slide 16: Coarsening using a maximal matching
1) Construct a maximal matching Em of G(N,E)
2) Collapse matched nodes into a single node:
    for all edges e = (j,k) in Em
        put node n(e) in Nc
        W(n(e)) = W(j) + W(k)        ... update node weights
3) Add unmatched nodes:
    for all nodes n in N not incident on an edge in Em
        put n in Nc; do not change W(n)
   Now each node r in N is "inside" a unique node n(r) in Nc
4) Connect two nodes in Nc if the nodes inside them are connected in E:
    for all edges e = (j,k) in Em
        for each other edge e' = (j,r) in E incident on j
            put edge ee = (n(e), n(r)) in Ec; W(ee) = W(e')     ... update edge weights
        for each other edge e' = (r,k) in E incident on k
            put edge ee = (n(r), n(e)) in Ec; W(ee) = W(e')
   If there are multiple edges connecting two nodes in Nc, collapse them, adding their edge weights

Slide 17: Example of Coarsening
[Figure: coarsening an example graph via a maximal matching.]
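The greedy matching and weighted coarsening steps above can be sketched as follows. Node and edge weights are plain dicts; this follows the slide's steps but with 0-based nodes and with a dict `rep` playing the role of n(r).

```python
def maximal_matching(nodes, edges):
    """Greedy maximal matching, as in the algorithm on slide 14."""
    matched, Em = set(), []
    adj = {v: [] for v in nodes}
    for e in edges:
        adj[e[0]].append(e)
        adj[e[1]].append(e)
    for i in nodes:                          # visit the nodes in any order
        if i not in matched:
            matched.add(i)
            for e in adj[i]:                 # look for an edge to an unmatched node
                j = e[1] if e[0] == i else e[0]
                if j not in matched:
                    Em.append(e)
                    matched.add(j)
                    break
    return Em

def coarsen(nodes, edges, W_node, W_edge):
    """Steps 1-4 of slide 16: collapse matched pairs, carrying weights along."""
    Em = maximal_matching(nodes, edges)
    rep = {v: v for v in nodes}              # n(r): coarse node each r lands in
    Wc_node = dict(W_node)
    for j, k in Em:                          # collapse matched pairs: W(n(e)) = W(j)+W(k)
        rep[k] = j
        Wc_node[j] = W_node[j] + W_node[k]
        del Wc_node[k]
    Wc_edge = {}
    for j, k in edges:                       # collapse multi-edges, adding weights
        a, b = rep[j], rep[k]
        if a != b:
            key = (min(a, b), max(a, b))
            Wc_edge[key] = Wc_edge.get(key, 0) + W_edge[(j, k)]
    return Wc_node, Wc_edge

# Example: a 4-cycle with unit weights.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
W_node = {v: 1 for v in nodes}
W_edge = {e: 1 for e in edges}
Wc_node, Wc_edge = coarsen(nodes, edges, W_node, W_edge)
```

The 4-cycle collapses to two supernodes of weight 2 joined by a single edge of weight 2 (the two original cross edges merged), so total node weight is preserved.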

Slide 18: Expanding a partition of Gc to a partition of G
[Figure: mapping a coarse partition back to the original graph.]

Slide 19: Multilevel Spectral Bisection
- Coarsen the graph and expand the partition using maximal independent sets
- Improve the partition using Rayleigh quotient iteration

Slide 20: Maximal Independent Sets
- Definition: an independent set of a graph G(N,E) is a subset Ni of N such that no two nodes in Ni are connected by an edge
- Definition: a maximal independent set of a graph G(N,E) is an independent set Ni to which no more nodes can be added while remaining an independent set
- A simple greedy algorithm computes a maximal independent set:

    let Ni be empty
    for k = 1 to |N|                 ... visit the nodes in any order
        if node k is not adjacent to any node already in Ni
            add k to Ni
        endif
    endfor

Slide 21: Coarsening using Maximal Independent Sets
- Build "domains" D(k) around each node k in Ni to get the nodes in Nc
- Add an edge to Ec whenever it would connect two such domains

    Ec = empty set
    for all nodes k in Ni
        D(k) = ({k}, empty set)      ... first set holds the nodes in D(k), second set the edges
    unmark all edges in E
    repeat
        choose an unmarked edge e = (k,j) from E
        if exactly one of k and j (say k) is in some D(m)
            mark e; add j and e to D(m)
        else if k and j are in two different domains (say D(mi) and D(mj))
            mark e; add edge (mi, mj) to Ec
        else if both k and j are in the same D(m)
            mark e; add e to D(m)
        else
            leave e unmarked
        endif
    until no unmarked edges

Slide 22: Example of Coarsening
[Figure: each domain D(k) is enclosed around its independent-set node and becomes one node of Nc.]
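The greedy maximal-independent-set construction on slide 20 is a few lines of Python (0-based nodes, visiting order fixed by the node list):

```python
def maximal_independent_set(nodes, edges):
    """Greedy MIS: take each node unless a neighbor is already taken."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    Ni = []
    for k in nodes:                  # visit the nodes in any order
        if not adj[k] & set(Ni):     # k is not adjacent to any node already in Ni
            Ni.append(k)
    return Ni

# Example: the path graph 0-1-2-3-4 gives the alternating set {0, 2, 4}.
nodes = list(range(5))
edges = [(i, i + 1) for i in range(4)]
Ni = maximal_independent_set(nodes, edges)
```

Note the result is maximal (no node can be added) but not necessarily maximum: a different visiting order can give a different, even smaller, maximal set.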

Slide 23: Expanding a partition of Gc to a partition of G
- Need to convert an eigenvector vc of L(Gc) into an approximate eigenvector v of L(G)
- Use interpolation:

    for each node j in N
        if j is also a node in Nc
            v(j) = vc(j)             ... use the same eigenvector component
        else
            v(j) = average of vc(k) over all neighbors k of j in Nc
        endif
    endfor

Slide 24: Example: 1D mesh of 9 nodes
[Figure: interpolating a coarse eigenvector onto a 9-node 1D mesh.]

Slide 25: Improve eigenvector: Rayleigh Quotient Iteration

    j = 0
    pick starting vector v(0) from expanding vc
    repeat
        j = j + 1
        r(j) = v(j-1)^T * L(G) * v(j-1)
            ... r(j) = Rayleigh quotient of v(j-1) = good approximate eigenvalue
        v(j) = (L(G) - r(j)*I)^(-1) * v(j-1)
            ... expensive to do exactly, so solve approximately using an iterative
                method called SYMMLQ, which uses matrix-vector multiply (no surprise)
        v(j) = v(j) / ||v(j)||       ... normalize v(j)
    until v(j) converges

- Convergence is very fast: cubic

Slide 26: Example of convergence for 1D mesh
[Figure: convergence of Rayleigh quotient iteration on the 1D mesh example.]

Slide 27: Available Implementations
- Multilevel Kernighan/Lin: METIS (/metis); ParMETIS is the parallel version
- Multilevel spectral bisection: S. Barnard and H. Simon, "A fast multilevel implementation of recursive spectral bisection", Proc. 6th SIAM Conf. on Parallel Processing, 1993; Chaco (/CRF/papers_chaco.html)
- Hybrids are possible, e.g. using Kernighan/Lin to improve a partition from spectral bisection
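A sketch of the Rayleigh quotient iteration above, on the slides' 9-node 1D mesh. One deliberate simplification: the shifted system is solved exactly with a dense solver, where the slides solve it approximately with SYMMLQ; the linear starting vector stands in for an eigenvector expanded from the coarse mesh.

```python
import numpy as np

def rayleigh_quotient_iteration(L, v, steps=5):
    """Refine an approximate eigenvector v of symmetric L (exact shifted
    solves here; the slides use SYMMLQ to solve approximately)."""
    n = len(v)
    v = v / np.linalg.norm(v)
    for _ in range(steps):
        rho = v @ L @ v                                  # Rayleigh quotient
        try:
            w = np.linalg.solve(L - rho * np.eye(n), v)  # shifted solve
        except np.linalg.LinAlgError:
            break                                        # shift hit an eigenvalue exactly
        v = w / np.linalg.norm(w)                        # normalize
    return v @ L @ v, v

# 1D mesh (path graph) of 9 nodes, as in slide 24.
n = 9
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1
v0 = np.linspace(-1, 1, n)       # rough, monotone guess at the Fiedler vector
lam, v = rayleigh_quotient_iteration(L, v0)
```

Starting anywhere near the Fiedler vector, a handful of iterations reaches machine precision, consistent with the cubic convergence claim.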

Slide 28: Comparison of methods
- Compare only methods that use edges, not nodal coordinates (the CS267 web page and KK95a, below, have other comparisons)
- Metrics: speed of partitioning; number of edge cuts; other application-dependent metrics
- Summary:
  - No one method is best
  - Multilevel Kernighan/Lin is fastest by far, and comparable to spectral in the number of edge cuts (/karypis/metis/publications/mail.html, see publications KK95a and KK95b)
  - Spectral gives much better cuts for some applications, e.g. image segmentation (/jshi/Grouping/overview.html, see "Normalized Cuts and Image Segmentation")

Slide 29: Number of edges cut for a 64-way partition
(For multilevel Kernighan/Lin, as implemented in METIS; see KK95a.)

    Graph     # Nodes   # Edges   Description      # Edges cut   Expected cuts   Expected cuts
                                                   (64-way)      (2D mesh)       (3D mesh)
    144       144649    1074393   3D FE mesh        88806         6427           31805
    4ELT       15606      45878   2D FE mesh         2965         2111            7208
    ADD32       4960       9462   32-bit adder        675         1190            3357
    AUTO      448695    3314611   3D FE mesh       194436        11320           67647
    BBMAT      38744     993481   2D stiffness m.   55753         3326           13215
    FINAN512   74752     261120   Lin. prog.        11388         4620           20481
    LHR10      10672     209093   Chem. eng.        58784         1746            5595
    MAP1      267241     334931   Highway net.       1388         8736           47887
    MEMPLUS    17758      54196   Memory circuit    17894         2252            7856
    SHYY161    76480     152002   Navier-Stokes      4365         4674           20796
    TORSO     201142    1479989   3D FE mesh       117997         7579           39623

- Expected # cuts for a 64-way partition of a 2D mesh of n nodes:
  n^(1/2) + 2*(n/2)^(1/2) + 4*(n/4)^(1/2) + ... + 32*(n/32)^(1/2) ≈ 17 * n^(1/2)
- Expected # cuts for a 64-way partition of a 3D mesh of n nodes:
  n^(2/3) + 2*(n/2)^(2/3) + 4*(n/4)^(2/3) + ... + 32*(n/32)^(2/3) ≈ 11.5 * n^(2/3)

Slide 30: Speed of 256-way partitioning (from KK95a)
Partitioning time in seconds:

    Graph     # Nodes   # Edges   Description      Multilevel Spectral   Multilevel
                                                   Bisection             Kernighan/Lin
    144       144649    1074393   3D FE mesh        607.3                 48.1
    4ELT       15606      45878   2D FE mesh         25.0                  3.1
    ADD32       4960       9462   32-bit adder       18.7                  1.6
    AUTO      448695    3314611   3D FE mesh       2214.2                179.2
    BBMAT      38744     993481   2D stiffness m.   474.2                 25.5
    FINAN512   74752     261120   Lin. prog.        311.0                 18.0
    LHR10      10672     209093   Chem. eng.        142.6                  8.1
    MAP1      267241     334931   Highway net.      850.2                 44.8
    MEMPLUS    17758      54196   Memory circuit    117.9                  4.3
    SHYY161    76480     152002   Navier-Stokes     130.0                 10.1
    TORSO     201142    1479989   3D FE mesh       1053.4                 63.9

- Kernighan/Lin is much faster than spectral bisection!
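The constants 17 and 11.5 in the expected-cuts formulas are geometric sums over the six levels of recursive bisection (2^6 = 64 parts), which a two-line check confirms:

```python
# Six levels of recursive bisection: at level i there are 2^i cuts,
# each through a submesh of n/2^i nodes.
s2 = sum(2**i * (1 / 2**i) ** 0.5 for i in range(6))      # coefficient of n^(1/2), ~16.9
s3 = sum(2**i * (1 / 2**i) ** (2 / 3) for i in range(6))  # coefficient of n^(2/3), ~11.5

# Sanity check against a table row: 4ELT has n = 15606, expected 2D cuts 2111.
expected_4elt = s2 * 15606 ** 0.5
```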

Slide 31: Coordinate-Free Partitioning: Summary
- Several techniques for partitioning without coordinates:
  - Breadth-first search: simple, but not a great partition
  - Kernighan-Lin: good corrector given a reasonable partition
  - Spectral method: good partitions, but slow
- Multilevel methods:
  - Used to speed up problems that are too large/slow
  - Coarsen, partition, expand, improve
  - Can be used with K-L, the spectral method, and others
- Speed/quality:
  - For load balancing of grids, multilevel K-L is probably best
  - For other partitioning problems (vision, clustering, etc.), spectral may be better
  - Good software is available

Slide 32: Is Graph Partitioning a Solved Problem?
- Myths of partitioning, due to Bruce Hendrickson:
  - Edge cut = communication cost
  - Simple graphs are sufficient
  - Edge cut is the right metric
  - Existing tools solve the problem
  - The key is finding the right partition
  - Graph partitioning is a solved problem
- Slides and myths based on Bruce Hendrickson's "Load Balancing Myths, Fictions ..."

... show p log p tasks leads to "good" balance:
- Karp and Zhang '88 show this for a tree of unit-cost (equal-size) tasks: a parent must be done before its children, the tree unfolds at runtime, and children are "pushed" to random processors
- Blumofe and Leiserson '94 show this for a fixed task tree of variable-cost tasks: their algorithm uses task pulling (stealing) instead of pushing, which is good for locality, i.e. when a processor becomes idle, it steals from a random processor; they also have (loose) bounds on the total memory required
- Chakrabarti et al '94 show this for a dynamic tree of variable-cost tasks: it works for branch and bound, i.e. the tree structure can depend on execution order; it uses randomized pushing of tasks instead of pulling, so worse locality
- Open problem: does task pulling provably work well for dynamic trees?

Slide 55: Engineering Distributed Task Queues
- A lot of papers exist on engineering these systems on various machines, and on their applications
- If nothing is known about task costs when they are created:
  - Organize local tasks as a stack (push/pop from the top)
  - Steal from the stack bottom (as if it were a queue), because old tasks are likely to cost more
- If something is known about task costs and communication costs, it can be used as hints (see Wen, UCB PhD, 1996; part of Multipol, /projects/multipol):
  - Try to push tasks with a high ratio of cost to compute / cost to push
  - Ex: for matmul, ratio = 2n^3 * cost(flop) / 2n^2 * cost(send a word)
- Goldstein, Rogers, Grunwald, and others (in independent work) have all shown advantages of integrating into the language framework: very lightweight thread creation; CILK (Leiserson et al, /cilk)

Slide 56: Diffusion-Based Load Balancing
- In the randomized schemes, the machine is treated as fully connected
- Diffusion-based load balancing takes topology into account:
  - Locality properties are better than prior work
  - Load balancing is somewhat slower than the randomized schemes
- The cost of tasks must be known at creation time
- No dependencies between tasks
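The queue discipline described above (owner pushes and pops at the top, thieves take from the bottom, where the oldest and likely largest tasks sit) can be sketched with a double-ended queue. This single-threaded toy ignores all synchronization, which is where the real engineering effort goes.

```python
from collections import deque

class LocalTaskQueue:
    """Sketch of the local task-queue discipline: LIFO for the owner,
    FIFO steals from the bottom for thieves.  No synchronization."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        self._tasks.append(task)          # owner: push on top of the stack

    def pop(self):
        return self._tasks.pop() if self._tasks else None       # owner: pop from the top

    def steal(self):
        return self._tasks.popleft() if self._tasks else None   # thief: take the oldest task

q = LocalTaskQueue()
for t in ["oldest", "middle", "newest"]:
    q.push(t)
```

Popping from the top keeps the owner working on the freshest, most cache-local task, while stealing from the bottom hands the thief a task that has waited longest and so likely represents a big untouched subtree.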

Slide 57: Diffusion-based load balancing
- The machine is modeled as a graph
- At each step, compute the weight of the tasks remaining on each processor (this is simply the count if they are unit-cost tasks)
- Each processor compares its weight with its neighbors and performs some averaging
- Analyzed using Markov chains
- See Ghosh et al, SPAA '96 for a second-order diffusive load balancing algorithm:
  - Takes into account the amount of work sent last time
  - Avoids some of the oscillation of first-order schemes
- Note: locality is still not a major concern, although balancing with neighbors may be better than random

Slide 58: DAG Scheduling
- For some problems, you have a directed acyclic graph (DAG) of tasks:
  - Nodes represent computation (may be weighted)
  - Edges represent orderings and usually communication (may also be weighted)
  - It is not that common to have the DAG in advance
- Two application domains where DAGs are known:
  - Digital signal processing computations
  - Sparse direct solvers (mainly Cholesky, since it doesn't require pivoting); more on this in another lecture
- The basic offline strategy: partition the DAG to minimize communication and keep all processors busy
  - NP-complete, so approximations are needed
  - Different from graph partitioning, which was for tasks with communication but no dependencies
- See Gerasoulis and Yang, IEEE Trans. ...
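The first-order diffusion step of slide 57 can be sketched on a small machine graph. Everything here is a modeling assumption, not from the slides: unit-cost tasks, synchronous rounds, a ring topology, and a diffusion parameter alpha = 0.25 (any alpha below 1/max-degree keeps the averaging stable).

```python
def diffuse(load, neighbors, alpha=0.25, rounds=50):
    """First-order diffusive balancing: each processor repeatedly moves a
    fraction alpha of the load difference toward each lighter neighbor."""
    load = list(load)
    for _ in range(rounds):
        flows = [0.0] * len(load)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                flows[i] += alpha * (load[j] - load[i])   # average with each neighbor
        load = [l + f for l, f in zip(load, flows)]
    return load

# 8 processors in a ring; all 80 units of work start on processor 0.
n = 8
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
balanced = diffuse([80.0] + [0.0] * (n - 1), neighbors)
```

Because every transfer is between neighbors, the flows cancel in pairs and the total load is conserved while the per-processor loads converge toward the mean, at a rate governed by the topology (the Markov-chain view from the slide).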
