第三部分程序的执行流水线及优化性能10cache memories liuyan

上传人：我*** IP属地：北京上传时间：2022-03-16 格式：PPTX 页数：53 大小：244.54KB 积分：16 举报 版权申诉

第三部分程序的执行流水线及优化性能10cache memories liuyan_第2页

第三部分程序的执行流水线及优化性能10cache memories liuyan_第3页

第三部分程序的执行流水线及优化性能10cache memories liuyan_第4页

第三部分程序的执行流水线及优化性能10cache memories liuyan_第5页

已阅读5页，还剩48页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、Carnegie Mellon1Cache Memories15-213: Introduction to Computer Systems10th Lecture, Sep. 23, 2010.used : Computer Organization and ArchitectureChapter 5, Spring, 2016Instructors: Randy Bryant and Dave OHallaronSpeaker:Yan Liu, Hunan University （）Carnegie Mellon2TodayCache memory organization and ope

2、rationPerformance impact of cachesThe memory mountainRearranging loops to improve spatial localityUsing blocking to improve temporal localityCarnegie Mellon3Cache MemoriesCache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main mem

3、oryCPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.Typical system structure:MainmemoryI/ObridgeBus interfaceALURegister fileCPU chipSystem busMemory busCache memoriesCarnegie Mellon4当把一个块调入高一层当把一个块调入高一层( (靠近靠近CPU)CPU)存储器时，存储器时，可以放在哪些位置上可以放在哪些位置上? ? （映像规则）当所要访问的块在高一层存储器

4、中时，如何当所要访问的块在高一层存储器中时，如何找到该块找到该块? ?（查找算法）当发生失效时，应替换哪一块？当发生失效时，应替换哪一块？（替换算法）当进行写访问时，应进行哪些操作当进行写访问时，应进行哪些操作? ? （写策略）存储存储层次的四个问题层次的四个问题Carnegie Mellon5映象规则映象规则 1. 全相联映象全相联映象全相联：全相联：主存中的任一块可以被放置到主存中的任一块可以被放置到CacheCache中的任意一个位置。中的任意一个位置。举例举例对比：对比：阅览室位置阅览室位置随便坐随便坐特点：特点：空间利用率最高，冲突概率最低，实空间利用率最高，冲突概率最低，实现最

5、复杂。现最复杂。 Carnegie Mellon6CacheCache的基本知识的基本知识Carnegie Mellon7CacheCache的基本知识的基本知识直接映象直接映象直接映象：主存中的每一块只能被放置到Cache中唯一的一个位置。举例（循环分配）对比：阅览室位置只有一个位置可以坐特点：空间利用率最低，冲突概率最高，实现最简单。Carnegie Mellon85.2 Cache5.2 Cache的基本知识的基本知识Carnegie Mellon9组相联映象组相联映象组相联：主存中的每一块可以被放置到Cache中唯一的一个组中的任何一个位置。举例组相联是直接映象和全相联的一种折

6、中Carnegie Mellon10General Cache Organization (S, E, B)E = 2e lines per setS = 2s setssetline01 2B-1tagvB = 2b bytes per cache block (the data)Cache size:C = S x E x B data bytesvalid bitCarnegie Mellon11Cache ReadE = 2e lines per setS = 2s sets01 2B-1tagvvalid bitB = 2b bytes per cache block (the da

7、ta)t bitss bitsb bitsAddress of word:tagsetindexblockoffsetdata begins at this offsetLocate setCheck if any line in sethas matching tagYes + line valid: hitLocate data startingat offsetCarnegie Mellon12Example: Direct Mapped Cache (E = 1)S = 2s setsDirect mapped: One line per setAssume: cache block

8、size 8 bytest bits001100Address of int:01 27tagv365401 27tagv365401 27tagv365401 27tagv3654find setCarnegie Mellon13Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits001100Address of int:01 27tagv3654match: assume yes = hitvalid? +block offsetta

9、gCarnegie Mellon14Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits001100Address of int:01 27tagv3654match: assume yes = hitvalid? +int (4 Bytes) is hereblock offsetNo match: old line is evicted and replacedCarnegie Mellon15Direct-Mapped Cache

10、SimulationM=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 Blocks/setAddress trace (reads, one byte per read):000002, 100012, 701112, 810002, 000002xt=1s=2b=1xxx0?vTagBlockmiss10M0-1hitmiss10M6-7miss11M8-9miss10M0-1Set 0Set 1Set 2Set 3Carnegie Mellon16A Higher Level Exampleint sum_array_rows(doub

11、le a1616) int i, j; double sum = 0; for (i = 0; i 16; i+) for (j = 0; j 16; j+) sum += aij; return sum;32 B = 4 doublesassume: cold (empty) cache,a00 goes hereint sum_array_cols(double a1616) int i, j; double sum = 0; for (j = 0; i 16; i+) for (i = 0; j 16; j+) sum += aij; return sum;blackboardIgnor

12、e the variables sum, i, jCarnegie Mellon17E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv3654find setCarnegie

13、 Mellon18E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv3654compare bothvalid? + match: yes = hitblock offsettagCarnegie Mellon19E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per se

14、tAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv3654compare bothvalid? + match: yes = hitblock offsetshort int (2 Bytes) is hereNo match: One line in set is selected for eviction and replacement Replacement policies: random, least recently used (LRU), Carneg

15、ie Mellon202-Way Set Associative Cache SimulationM=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/setAddress trace (reads, one byte per read):000002, 100012, 701112, 810002, 000002xxt=2s=1b=1xx0?vTagBlock000miss100M0-1hitmiss101M6-7miss110M8-9hitSet 0Set 1Carnegie Mellon21A Higher Level Ex

16、ampleint sum_array_rows(double a1616) int i, j; double sum = 0; for (i = 0; i 16; i+) for (j = 0; j 16; j+) sum += aij; return sum;32 B = 4 doublesassume: cold (empty) cache,a00 goes hereint sum_array_rows(double a1616) int i, j; double sum = 0; for (j = 0; i 16; i+) for (i = 0; j 16; j+) sum += aij

17、; return sum;blackboardIgnore the variables sum, i, jCarnegie Mellon22What about writes?Multiple copies of data exist:L1, L2, Main Memory, DiskWhat to do on a write-hit?Write-through (write immediately to memory)Write-back (defer write to memory until replacement of line)Need a dirty bit (line diffe

18、rent from memory or not)What to do on a write-miss?Write-allocate (load into cache, update line in cache)Good if more writes to the location followNo-write-allocate (writes immediately to memory)TypicalWrite-through + No-write-allocateWrite-back + Write-allocateCarnegie Mellon23Intel Core i7 Cache H

19、ierarchyRegsL1 d-cacheL1 i-cacheL2 unified cacheCore 0RegsL1 d-cacheL1 i-cacheL2 unified cacheCore 3L3 unified cache(shared by all cores)Main memoryProcessor packageL1 i-cache and d-cache:32 KB, 8-way, Access: 4 cyclesL2 unified cache: 256 KB, 8-way, Access: 11 cyclesL3 unified cache:8 MB, 16-way,Ac

20、cess: 30-40 cyclesBlock size: 64 bytes for all caches. Carnegie Mellon24Cache Performance MetricsMiss RateFraction of memory references not found in cache (misses / accesses)= 1 hit rateTypical numbers (in percentages):3-10% for L1can be quite small (e.g., 1%) for L2, depending on size, etc.Hit Time

21、Time to deliver a line in the cache to the processorincludes time to determine whether the line is in the cacheTypical numbers:1-2 clock cycle for L15-20 clock cycles for L2Miss PenaltyAdditional time required because of a misstypically 50-200 cycles for main memory (Trend: increasing!)Carnegie Mell

22、on25Lets think about those numbersHuge difference between a hit and a missCould be 100 x, if just L1 and main memoryWould you believe 99% hits is twice as good as 97%?Consider: cache hit time of 1 cyclemiss penalty of 100 cyclesAverage access time: 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles 99

23、% hits: 1 cycle + 0.01 * 100 cycles = 2 cyclesThis is why “miss rate” is used instead of “hit rate”Carnegie Mellon26Writing Cache Friendly CodeMake the common case go fastFocus on the inner loops of the core functionsMinimize the misses in the inner loopsRepeated references to variables are good (te

24、mporal locality)Stride-1 reference patterns are good (spatial locality)Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.Carnegie Mellon27TodayCache organization and operationPerformance impact of cachesThe memory mountainRearranging loops to impr

25、ove spatial localityUsing blocking to improve temporal localityCarnegie Mellon28The Memory MountainRead throughput (read bandwidth)Number of bytes read from memory per second (MB/s)Memory mountain: Measured read throughput as a function of spatial and temporal locality.Compact way to characterize me

26、mory system performance. Carnegie Mellon29Memory Mountain Test Function/* The test function */void test(int elems, int stride) int i, result = 0; volatile int sink; for (i = 0; i elems; i += stride)result += datai; sink = result; /* So compiler doesnt optimize away the loop */* Run test(elems, strid

27、e) and return read throughput (MB/s) */double run(int size, int stride, double Mhz) double cycles; int elems = size / sizeof(int); test(elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cy

28、cles to MB/s */Carnegie Mellon30The Memory Mountain01000200030004000500060007000s1s3s5s7s9s11s13s15s32Read throughput (MB/s)Stride (x8 bytes)Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipCarnegie Mellon31The Memory Mountain010002000300040005

29、00060007000s1s3s5s7s9s11s13s15s32Read throughput (MB/s)Stride (x8 bytes)Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipSlopes ofspatial localityCarnegie Mellon32The Memory Mountain01000200030004000500060007000s1s3s5s7s9s11s13s15s32Read throug

30、hput (MB/s)Stride (x8 bytes)L1L2MemL3Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipSlopes ofspatial localityRidges of Temporal localityCarnegie Mellon33TodayCache organization and operationPerformance impact of cachesThe memory mountainRearr

31、anging loops to improve spatial localityUsing blocking to improve temporal localityCarnegie Mellon34Miss Rate Analysis for Matrix MultiplyAssume:Line size = 32B (big enough for four 64-bit words)Matrix dimension (N) is very largeApproximate 1/N as 0.0Cache is not even big enough to hold multiple row

32、sAnalysis Method:Look at access pattern of inner loopAkiBkjCijCarnegie Mellon35Matrix Multiplication ExampleDescription:Multiply N x N matricesO(N3) total operationsN reads per source elementN values summed per destinationbut may be able to hold in register/* ijk */for (i=0; in; i+) for (j=0; jn; j+

33、) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum; Variable sumheld in registerCarnegie Mellon36Layout of C Arrays in Memory (review)C arrays allocated in row-major ordereach row in contiguous memory locationsStepping through columns in one row:for (i = 0; i 4 bytes, exploit spatial localit

34、ycompulsory miss rate = 4 bytes / BStepping through rows in one column:for (i = 0; i n; i+)sum += ai0;accesses distant elementsno spatial locality!compulsory miss rate = 1 (i.e. 100%)Carnegie Mellon37Matrix Multiplication (ijk)/* ijk */for (i=0; in; i+) for (j=0; jn; j+) sum = 0.0; for (k=0; kn; k+)

35、 sum += aik * bkj; cij = sum; ABC(i,*)(*,j)(i,j)Inner loop:Column-wiseRow-wiseFixedMisses per inner loop iteration:ABC0.251.00.0Carnegie Mellon38Matrix Multiplication (jik)/* jik */for (j=0; jn; j+) for (i=0; in; i+) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum ABC(i,*)(*,j)(i,j)Inner lo

36、op:Row-wiseColumn-wiseFixedMisses per inner loop iteration:ABC0.251.00.0Carnegie Mellon39Matrix Multiplication (kij)/* kij */for (k=0; kn; k+) for (i=0; in; i+) r = aik; for (j=0; jn; j+) cij += r * bkj; ABC(i,*)(i,k)(k,*)Inner loop:Row-wise Row-wiseFixedMisses per inner loop iteration:ABC0.00.250.2

37、5Carnegie Mellon40Matrix Multiplication (ikj)/* ikj */for (i=0; in; i+) for (k=0; kn; k+) r = aik; for (j=0; jn; j+) cij += r * bkj; ABC(i,*)(i,k)(k,*)Inner loop:Row-wise Row-wiseFixedMisses per inner loop iteration:ABC0.00.250.25Carnegie Mellon41Matrix Multiplication (jki)/* jki */for (j=0; jn; j+)

38、 for (k=0; kn; k+) r = bkj; for (i=0; in; i+) cij += aik * r; ABC(*,j)(k,j)Inner loop:(*,k)Column-wiseColumn-wiseFixedMisses per inner loop iteration:ABC1.00.01.0Carnegie Mellon42Matrix Multiplication (kji)/* kji */for (k=0; kn; k+) for (j=0; jn; j+) r = bkj; for (i=0; in; i+) cij += aik * r; ABC(*,

39、j)(k,j)Inner loop:(*,k)FixedColumn-wiseColumn-wiseMisses per inner loop iteration:ABC1.00.01.0Carnegie Mellon43Summary of Matrix Multiplicationijk (& jik): 2 loads, 0 stores misses/iter = 1.25kij (& ikj): 2 loads, 1 store misses/iter = 0.5jki (& kji): 2 loads, 1 store misses/iter = 2.0fo

40、r (i=0; in; i+) for (j=0; jn; j+) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum; for (k=0; kn; k+) for (i=0; in; i+) r = aik; for (j=0; jn; j+) cij += r * bkj; for (j=0; jn; j+) for (k=0; kn; k+) r = bkj; for (i=0; in; i+) cij += aik * r; Carnegie Mellon44Core i7 Matrix Multiply Performan

41、ce010203040506050100150200250300350400450500550600650700750Cycles per inner loop iterationArray size (n)jkikjiijkjikkijikjjki / kjiijk / jikkij / ikjCarnegie Mellon45TodayCache organization and operationPerformance impact of cachesThe memory mountainRearranging loops to improve spatial localityUsing

42、 blocking to improve temporal localityCarnegie Mellon46Example: Matrix Multiplicationabij*c=c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) int i, j, k; for (i = 0; i n; i+)for (j = 0; j n; j+) for (k = 0; k n; k+) ci*n

43、+j += ai*n + k*bk*n + j;Carnegie Mellon47Cache Miss AnalysisAssume: Matrix elements are doublesCache block = 8 doublesCache size C n (much smaller than n)First iteration:n/8 + n = 9n/8 missesAfterwards in cache:(schematic)*=n*=8 wideCarnegie Mellon48Cache Miss AnalysisAssume: Matrix elements are dou

44、blesCache block = 8 doublesCache size C n (much smaller than n)Second iteration:Again:n/8 + n = 9n/8 missesTotal misses:9n/8 * n2 = (9/8) * n3 n*=8 wideCarnegie Mellon49Blocked Matrix Multiplicationc = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) int i, j, k; for (i = 0; i n; i+=B)for (j = 0; j n; j+=B) for (k = 0; k n; k+=B) /* B x B mini matrix multiplicati

人人文库> 全部分类> 应用文书

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

第三部分程序的执行流水线及优化性能10cache memories liuyan

文档简介

温馨提示

最新文档

评论

第三部分程序的执行流水线及优化性能10cache memories liuyan

文档简介

温馨提示

最新文档

评论

相关文档