


On Improving the Performance of Sparse Matrix-Vector Multiplication

James B. White III
Ohio Supercomputer Center, Columbus, OH 43221

P. Sadayappan
Ohio State University, Columbus, OH 43210

(Supported in part by the National Science Foundation under grants DMR-9520319 and CDA-9413962.)

Abstract

We analyze single-node performance of sparse matrix-vector multiplication by investigating issues of data locality and fine-grained parallelism. We examine the data locality characteristics of the compressed sparse row representation and consider improvements in locality through matrix permutation. Motivated by potential improvements in fine-grained parallelism, we evaluate modified sparse matrix representations. The results lead to general conclusions about improving single-node performance of sparse matrix-vector multiplication in parallel libraries of sparse iterative solvers.

1 Introduction

One of the core operations of iterative sparse solvers is sparse matrix-vector multiplication. In order to achieve high performance, a parallel implementation of sparse matrix-vector multiplication must maintain scalability. This scalability comes from a balanced mapping of the matrix and vectors among the distributed processors, a mapping that minimizes inter-processor communication. Load balancing and minimizing communication do not guarantee high performance, however; high single-node performance is also essential.

In this work we extend the analysis of the mapping of matrices and vectors beyond the global multiple-processor distribution. We consider the memory hierarchy and architecture of a single processor. The local representation of the matrix and vectors can affect performance in two significant ways: through its impact on data-cache locality, and through its effect on fine-grained parallelism for pipelined superscalar execution.

The de facto standard storage format for unstructured sparse matrices is the compressed sparse row format (CSR). For a parallel implementation, the rows of a sparse matrix can be distributed among the processors, with the rows local to a processor stored in CSR. Matrix-vector multiplication is then performed using an owner-computes strategy. Therefore, we use CSR as a starting point for our analysis.

2 Compressed Sparse Row Format

The idea behind CSR is to pack each row, storing only the nonzero elements (Figure 1). Each row may have a different structure, so the column index of each nonzero element must also be stored. Since each row may have an arbitrary number of nonzero elements, a 2-D array is not an appropriate container for the compressed matrix. Instead, the nonzero elements and the column indices are stored in 1-D arrays, and a third array stores the index of the start of each row within these two arrays.

Figure 1: Compressed sparse row format (CSR). The matrix is represented by three arrays: the start index of each row (i[]), the column index of each nonzero element (j[]), and the nonzero matrix values (a[]).

With a sparse matrix in CSR, matrix-vector multiplication can be implemented with a simple nested pair of loops. A C implementation is shown in Figure 2. The outer loop is over the rows of the matrix, and the inner loop is over the elements of a particular row, as defined by the row index array i[0..n]. The target vector is y[0..n-1], and the source vector is x[0..n-1]. The nonzero elements of the matrix are stored in the value array a[0..i[n]-1], and the column indices of the nonzero elements are stored in the column index array j[0..i[n]-1].

    for (k = 0; k < n; k++) {
        temp = 0.0;
        for (l = i[k]; l < i[k+1]; l++)
            temp = temp + a[l] * x[j[l]];
        y[k] = temp;
    }

Figure 2: Sample C code for CSR sparse matrix-vector multiplication.
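To make the data layout concrete, the following sketch builds the three CSR arrays for a small example matrix and applies the loop of Figure 2 wrapped in a function. The 4x4 matrix, its values, and the function name csr_matvec are our own illustrative assumptions, not data from the paper.

    #include <stdio.h>

    /* CSR matrix-vector product, y = A*x, as in Figure 2. */
    static void csr_matvec(int n, const int *i, const int *j,
                           const double *a, const double *x, double *y)
    {
        for (int k = 0; k < n; k++) {
            double temp = 0.0;
            for (int l = i[k]; l < i[k+1]; l++)
                temp = temp + a[l] * x[j[l]];
            y[k] = temp;
        }
    }

    int main(void)
    {
        /* 4x4 example (illustrative, not the matrix of Figure 1):
         *   [ 4 0 1 0 ]
         *   [ 0 3 0 2 ]
         *   [ 1 0 5 0 ]
         *   [ 0 2 0 6 ]                                              */
        int    i[5] = { 0, 2, 4, 6, 8 };              /* row start indices */
        int    j[8] = { 0, 2,  1, 3,  0, 2,  1, 3 };  /* column indices    */
        double a[8] = { 4, 1,  3, 2,  1, 5,  2, 6 };  /* nonzero values    */
        double x[4] = { 1, 1, 1, 1 };
        double y[4];

        csr_matvec(4, i, j, a, x, y);
        for (int k = 0; k < 4; k++)
            printf("y[%d] = %g\n", k, y[k]);
        return 0;
    }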
3 Data Locality

To analyze the single-node performance characteristics of CSR, we first consider its data locality characteristics. Start with the value array a[] and the column index array j[]. For a given matrix-vector multiplication, each element of these arrays is used only once, so only spatial locality is possible. Since all the elements of each array are used consecutively, this spatial locality is achieved.

Next consider the target vector y[k]. Here each element is used a number of times, so spatial locality and temporal locality are both possible. Again, all the elements are used consecutively, so spatial locality is achieved. In addition, all of the multiple accesses of a particular element are performed consecutively, so temporal locality is also achieved, potentially in the form of CPU register locality.

Problems with locality finally appear in the source vector x[j[l]]. Since the vector is accessed according to an index array, the access pattern is non-contiguous. Though there is potential for spatial and temporal locality, whether this locality is achieved depends on the particular structure of the sparse matrix. A sparse matrix with a relatively random structure, for example, would yield relatively poor memory-access performance.

For a particular sparse matrix, it may be possible to improve performance by modifying the structure of the matrix. For parallel sparse matrix-vector multiplication on distributed-memory multicomputers, the amount of communication is significantly affected by the way the rows of the matrix are partitioned among the processors. It may be possible to use the same partitioning strategy to more finely partition a sparse matrix on a single node to enhance data locality: the matrix can be renumbered to make elements within a cache-sized partition contiguous. With a partitioning scheme that minimizes references between partitions, most references could be to elements nearby in memory, resulting in enhanced locality.
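As an illustration of this renumbering step, the sketch below applies a symmetric row/column permutation to a CSR matrix, producing a new CSR matrix whose rows and columns follow the partition ordering. The permutation itself would be derived from the output of a partitioner; the function and array names here (permute_csr, perm, and so on) are our own assumptions, not an interface from the paper or from METIS.

    #include <stdlib.h>

    /* Symmetrically permute an n x n CSR matrix (ai, aj, av) into
     * (bi, bj, bv).  perm[k] is the ORIGINAL index of the row placed at
     * NEW position k; the inverse mapping is built internally so that
     * column indices are renumbered consistently.  The output arrays
     * must be preallocated with the same sizes as the inputs. */
    static void permute_csr(int n,
                            const int *ai, const int *aj, const double *av,
                            const int *perm,
                            int *bi, int *bj, double *bv)
    {
        int *inv = malloc((size_t)n * sizeof(int));
        for (int k = 0; k < n; k++)
            inv[perm[k]] = k;                 /* inv[old] = new */

        bi[0] = 0;
        for (int k = 0; k < n; k++) {
            int old = perm[k];
            int len = ai[old + 1] - ai[old];
            bi[k + 1] = bi[k] + len;
            for (int l = 0; l < len; l++) {
                bj[bi[k] + l] = inv[aj[ai[old] + l]];   /* renumber column */
                bv[bi[k] + l] = av[ai[old] + l];
            }
        }
        free(inv);
    }

A graph partitioner reports, for each row, the partition it belongs to; sorting the rows by partition number yields one suitable perm[] array, making the rows (and, with the symmetric renumbering above, the columns) of each partition contiguous in memory.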
We tested the effectiveness of reordering via partitioning on a number of sample sparse matrices using METIS [3], a popular partitioning program from the University of Minnesota. The tests were performed on a single node of the Convex Exemplar SPP1200 at the Ohio Supercomputer Center (Table 1). The SPP1200 includes hardware that collects cache-usage statistics, and these statistics can be analyzed with the Convex performance analyzer. Cache hit rates and execution times were recorded for each matrix before and after partitioning by METIS.

Table 1: Test machine, a single node of the Convex Exemplar SPP1200.

    Processor:          120 MHz HP PA-RISC 7200
    Data cache:         256 KB
    Operating system:   SPP-UX 4.2
    C compiler:         Convex C 6.5
    Compiler options:   -O2

The performance results for matrix-vector multiplication for a grid-based test matrix are given in Table 2. The test matrix has 2 x 10^5 nonzero elements, making the system much larger than the SPP1200's 256 KB data cache. The results show that reordering by METIS did not improve the cache hit rate or the execution time over a wide range of partition sizes. We also tested large matrices from the Harwell-Boeing collection [1]; the results are summarized in Table 3. The original matrices gave cache hit rates between 90% and 98%, and for no matrix did reordering by METIS improve the hit rate or the execution time.

Table 2: The effects of METIS partitions on performance for a 2 x 10^5-element 2-D grid (5-point stencil) matrix on a single node of the SPP1200.

                       Original matrix           Randomly reordered
    Partitions       MFLOPS   Cache hit rate    MFLOPS   Cache hit rate
    none              5.64        94.7%          2.27        80.7%
    10                5.49        94.4%          4.30        91.2%
    50                5.49        94.3%          5.03        93.2%
    100               5.49        94.3%          5.11        93.6%
    500               5.44        94.1%          5.16        93.8%
    1000              5.34        94.1%          5.30        93.8%

Table 3: Large matrices from the Harwell-Boeing collection, before and after reordering by METIS.

    Matrix      Type   nnz          Partitions   MFLOPS before/after   Cache hits before/after
    bcsstk32    RSA    ~10^6        3000         8.24 / 8.08           90.6% / 90.5%
    lns_3937    RUA    2.5 x 10^4   60           7.87 / 7.72           96.6% / 96.4%
    psmigr_2    RUA    5 x 10^5     1500         8.64 / 8.64           96.6% / 96.4%
    saylr4      RUA    2.2 x 10^4   60           8.73 / 8.73           97.9% / 97.5%

A possible explanation for why reordering by METIS did not improve performance is that the tested matrices already had a significant level of structure. To determine the validity of this explanation, we randomly renumbered the grid of the original test matrix. We then used METIS to partition and reorder this randomly reordered matrix. The results are given in Table 2, and they confirm the explanation: given a matrix with a random structure, a METIS reordering was indeed able to improve performance. The improvement, however, did not bring performance back to the level of the original matrix. We conclude from this that the sparse matrices resulting from real-world problems usually have a significant level of structure. Toledo [4] reports similar results for partitioning schemes on an IBM RS/6000.

4 Fine-Grained Parallelism

In the previous section we analyzed the data locality characteristics of CSR matrix-vector multiplication. Here we discuss the other primary issue in single-node performance: fine-grained parallelism. In order for modern superscalar, pipelined processors to operate efficiently, they must receive an adequate mix of instructions that can be executed concurrently. For looping algorithms such as sparse matrix-vector multiplication, this requirement means there must be an adequate number of arithmetic operations between the branch statements that perform the loops. The original CSR code falls short in this respect: only a small number of operations are performed at each iteration. The well-established solution to this problem is loop unrolling; each iteration is modified to include operations on multiple elements instead of just one, and the number of iterations is consequently decreased.

However, CSR can make loop unrolling relatively difficult and inefficient. If the number of nonzero elements per row is small, the inner loop of a CSR code runs over only a few iterations. This small number of iterations may marginalize the benefits of inner-loop unrolling. In addition, CSR matrices can have rows of variable length. This variability makes outer-loop unrolling almost impossible, and it forces the overhead of inner-loop unrolling to be incurred for every row, marginalizing the benefits of unrolling even further.

Improving fine-grained parallelism for CSR, then, amounts to improving the effectiveness of loop unrolling. A first step is improving the structure of the matrix by rearranging the rows according to length (Figure 3). The resulting matrix has a block structure where each block has equal-length rows. Note that the reordering may have detrimental side effects: if the original matrix has a structure giving good data locality, the reordering may move neighboring rows far apart, eliminating some of this locality. Improvements in fine-grained parallelism must overcome any such loss of locality in order for the performance to improve. The next step is to store the reordered matrix in a form that facilitates loop unrolling.

Figure 3: Sorting rows by length. Here the 1-D value array is represented pictorially by a matrix with jagged rows.
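For reference, the following sketch shows what inner-loop unrolling looks like when applied directly to the original CSR kernel of Figure 2, with an unroll factor of four. The function name and the remainder-loop structure are our own; the point is that the unrolling overhead (the remainder loop and the extra partial sums) is paid once per row, which is exactly the cost the reordered and blocked representations below try to avoid.

    /* Sketch: CSR matrix-vector product with the inner (within-row)
     * loop unrolled by four.  Array names follow Figure 2.  The
     * remainder loop handles rows whose length is not a multiple of
     * four, and it runs for every row of the matrix. */
    static void csr_matvec_unroll4(int n, const int *i, const int *j,
                                   const double *a, const double *x,
                                   double *y)
    {
        for (int k = 0; k < n; k++) {
            double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
            int l   = i[k];
            int end = i[k + 1];
            for (; l + 3 < end; l += 4) {        /* unrolled body */
                t0 += a[l]     * x[j[l]];
                t1 += a[l + 1] * x[j[l + 1]];
                t2 += a[l + 2] * x[j[l + 2]];
                t3 += a[l + 3] * x[j[l + 3]];
            }
            for (; l < end; l++)                 /* per-row remainder */
                t0 += a[l] * x[j[l]];
            y[k] = t0 + t1 + t2 + t3;
        }
    }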
We analyze three options for data representations that may enhance the effectiveness of unrolling. The first option is column-major CSR (CM-CSR), where the rows are compressed but the resulting jagged-row matrix is stored in column-major form (Figure 4). The array that stores the beginning of each row in original CSR instead stores the beginning of each column in CM-CSR. CM-CSR may make inner-loop unrolling more effective because the inner loop runs over much longer subsections of the value array. This form is equivalent to the jagged diagonal storage format [5], which was designed for vector machines such as the Cray Y-MP, where short vector operations are relatively inefficient. For cache-based architectures, however, the advantage in unrolling may not be able to overcome the disadvantage in data locality: in column-major form, the temporal locality of the target vector may be lost, since each element is reused only after all other elements have been accessed.

Figure 4: CM-CSR stores the row-sorted matrix in column-major form.

This locality problem is a motivation for the second modified form, blocked column-major CSR (BCM-CSR). After reordering, the matrix is stored in blocks of equal row length, where each block is stored in column-major form (Figure 5). The columns of a block may be short enough that some temporal locality of the target vector can be exploited. Conversely, the columns are likely to be longer than the rows, making inner-loop unrolling more efficient. The blocking also gives other bonuses: since all the columns of a block are the same length, outer-loop unrolling is practical, and for the same reason that inner-loop unrolling is more efficient, the overhead for the unrolling may be moved to an outer loop.

Figure 5: Blocked matrix formats (BCM-CSR and BRM-CSR).

The final option for modifying CSR simply takes the blocking of BCM-CSR and applies it to row-major CSR, giving BRM-CSR (Figure 5). BRM-CSR maintains all the locality of the original CSR, except what is lost in reordering, and it allows the full range of inner-loop and outer-loop unrolling, including moving unrolling overhead to outer loops.
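As an illustration of how blocking enables outer-loop unrolling, here is a sketch in the spirit of a 4x1 BRM-CSR kernel: within each block of equal-length rows, four rows are processed per iteration of the row loop, and the unrolling cleanup is needed only once per block rather than once per row. The block arrays (blen, brow, bstart) and the function name are our own assumptions about how such a representation might be laid out; they are not notation from the paper.

    /* Sketch of BRM-CSR matrix-vector multiplication with the loop
     * over rows unrolled by four inside each block.
     *   nblk      : number of blocks of equal-length rows
     *   blen[b]   : row length within block b
     *   brow[b]   : first (sorted) row index of block b; brow[nblk] = n
     *   bstart[b] : offset of block b in a[] and j[]
     * Nonzeros within a block are stored row by row; y[] is indexed in
     * the sorted row order. */
    static void brmcsr_matvec_4x1(int nblk, const int *blen,
                                  const int *brow, const int *bstart,
                                  const int *j, const double *a,
                                  const double *x, double *y)
    {
        for (int b = 0; b < nblk; b++) {
            int len = blen[b];
            int r   = brow[b];
            int p   = bstart[b];
            /* unrolled: four equal-length rows per iteration */
            for (; r + 3 < brow[b + 1]; r += 4, p += 4 * len) {
                double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
                for (int l = 0; l < len; l++) {
                    t0 += a[p + l]           * x[j[p + l]];
                    t1 += a[p + len + l]     * x[j[p + len + l]];
                    t2 += a[p + 2*len + l]   * x[j[p + 2*len + l]];
                    t3 += a[p + 3*len + l]   * x[j[p + 3*len + l]];
                }
                y[r] = t0;  y[r + 1] = t1;  y[r + 2] = t2;  y[r + 3] = t3;
            }
            /* cleanup: leftover rows of the block, at most three */
            for (; r < brow[b + 1]; r++, p += len) {
                double t = 0.0;
                for (int l = 0; l < len; l++)
                    t += a[p + l] * x[j[p + l]];
                y[r] = t;
            }
        }
    }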
5 Performance Analysis of Alternative Implementations

The various modifications of CSR have different strengths and weaknesses, and the relative performance of a particular form can vary with problem domain and machine type. Therefore, we tested each representation using matrices of various sizes from different application domains, and we tested them on a number of machines. We started with a number of sample sparse matrices from the Matrix Market [1] and the University of Florida collection [2]. The size and average row length of each of these matrices are given in Table 4; as the table illustrates, the matrices were chosen to span a wide range of row size and total size.

Table 4: Sparse matrices tested. The number of nonzero elements in each matrix is nnz, and the average number of nonzero elements per row is nnz/row.

    Matrix           nnz     nnz/row
    bcsstk20       1,810        3.7
    bcsstk32   2,014,701       45.2
    blckhole       8,502        4.0
    crystk02     968,583       69.4
    ct20stif   2,698,463       51.6
    e05r0000       5,856       24.8
    memplus      126,150        7.1
    msc10848   1,229,778      113.4
    onetone1     341,088        9.5
    watson5       10,847        5.9

For each of these matrices, we timed multiplication using multiple implementations of each representation, with varying levels of inner-loop and outer-loop unrolling (Table 5). For each case, a number of successive matrix-vector multiplications were performed using the same matrix. This was done so that the measured performance would be representative of its use in a context such as the iterative solution of sparse linear systems, where repeated matrix-vector products are performed using the same matrix.

Table 5: Levels of unrolling for each form of CSR (original, CM, BRM, and BCM): 1x1, 1x2, 1x4, 1x8, 4x1, 8x1, and 2x8, with not every level implemented for every form. The levels of unrolling are displayed as column-loop unrolling x row-loop unrolling; for original and BRM the row loop is the inner loop, while the column loop is the inner loop for CM and BCM.

We measured double-precision (double) performance on single nodes of four parallel machines at the Ohio Supercomputer Center: a Convex Exemplar SPP1200, an IBM SP2, an SGI Power Challenge, and an SGI Onyx. In addition, we measured performance on a single-processor DEC 3000/300 AXP workstation. Because of space constraints, we only discuss results for the SPP1200 in this paper.

Figure 6 has performance plots for execution on the SPP1200. Along with the results for original CSR, only the best-performing alternative representations are shown: 1x8 BRM-CSR, 4x1 BRM-CSR, and 1x4 BCM-CSR.

Figure 6: Performance of sample sparse matrix-vector multiplication for a single node of the SPP1200. The two charts divide the sample systems into those that fit into the data cache (256 KB) and those that do not.

Four performance regimes emerge from these figures: small systems with short rows, small systems with long rows, large systems with short rows, and large systems with long rows. To more carefully and systematically evaluate each of these regimes, we used synthetic matrices based on nearest-neighbor grid structures. Their size and average row length were varied as shown in Figure 7; the stencil sizes roughly translate into average row sizes. Again, only the results of the best-performing implementations are shown.

The figures illustrate differences among the four regimes defined above. For systems that fit into cache, the alternative representations give higher absolute performance and greater improvements in performance over CSR. The higher absolute performance comes as no surprise, since cache access is much less expensive than main-memory access. For systems that do not fit in cache, the benefits of improved fine-grained parallelism do not yield as much performance improvement as for the systems that fit within cache; this is due to the functional-unit pipeline stalls caused by main-memory accesses in the case of the large systems.

The figures also illustrate more subtle differences between long-row and short-row systems. The opportunities for fine-grained parallelism differ for long and short rows; because of this, different implementations have better characteristics in the two regimes. The results from the other computer systems we tested reveal that the four-regime picture of performance is generally applicable. The split between large and small clearly depends upon the cache size, while the split between long and short rows is consistently around 10 elements.

For the SPP1200, the results show that a particular single implementation is most effective across each regime: for small-short, 1x8 BRM-CSR consistently performs the best, while 1x4 BCM-CSR is best for large-short, and 4x1 BRM-CSR is best for small-long and large-long. Again, the results for other machines show that this holds across architectures: each regime tends to have a single best implementation. Which implementation proves best for a particular regime depends on the architecture, however; the best choices for the Convex SPP1200 differ from those for the SGI Power Challenge or IBM SP2.
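The regime structure suggests a simple selection rule when choosing a kernel inside a solver library. The sketch below picks an implementation from the matrix size relative to the data cache and from the average row length, using the roughly 10-element split observed above. The kernel identifiers, the working-set estimate, and the exact thresholds are our own illustrative assumptions, and as noted above the best choice per regime is architecture dependent; the mapping shown here mirrors the SPP1200 results.

    #include <stddef.h>

    /* Sketch: choose a matrix-vector kernel from the four performance
     * regimes described above.  Kernel names are placeholders; on
     * another architecture the regime-to-kernel mapping (and the cache
     * size) would differ. */
    enum spmv_kernel { CSR_PLAIN, BRM_1x8, BRM_4x1, BCM_1x4 };

    static enum spmv_kernel choose_kernel(int n, int nnz,
                                          size_t cache_bytes)
    {
        /* rough working set: values + column indices + both vectors */
        size_t bytes = (size_t)nnz * (sizeof(double) + sizeof(int))
                     + 2 * (size_t)n * sizeof(double);
        int small     = bytes <= cache_bytes;
        int short_row = (nnz / n) < 10;      /* ~10-element split */

        if (small && short_row)  return BRM_1x8;   /* small, short rows */
        if (!small && short_row) return BCM_1x4;   /* large, short rows */
        return BRM_4x1;                            /* long rows, any size */
    }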
Figure 7: Performance on synthetic grid matrices for a 2-D grid (5-point stencil) and a 3-D grid (27-point stencil), as a function of the number of nonzero elements; curves are shown for original CSR and the best-performing alternative representations.