




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Carnegie Mellon1Cache Memories15-213: Introduction to Computer Systems10th Lecture, Sep. 23, 2010.used : Computer Organization and ArchitectureChapter 5, Spring, 2016Instructors: Randy Bryant and Dave OHallaronSpeaker:Yan Liu, Hunan University ()Carnegie Mellon2TodayCache memory organization and ope
2、rationPerformance impact of cachesThe memory mountainRearranging loops to improve spatial localityUsing blocking to improve temporal localityCarnegie Mellon3Cache MemoriesCache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main mem
3、oryCPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.Typical system structure:MainmemoryI/ObridgeBus interfaceALURegister fileCPU chipSystem busMemory busCache memoriesCarnegie Mellon4当把一个块调入高一层当把一个块调入高一层( (靠近靠近CPU)CPU)存储器时,存储器时,可以放在哪些位置上可以放在哪些位置上? ? (映像规则)当所要访问的块在高一层存储器
4、中时,如何当所要访问的块在高一层存储器中时,如何找到该块找到该块? ?(查找算法)当发生失效时,应替换哪一块?当发生失效时,应替换哪一块?(替换算法)当进行写访问时,应进行哪些操作当进行写访问时,应进行哪些操作? ? (写策略)存储存储层次的四个问题层次的四个问题Carnegie Mellon5映象规则映象规则 1. 全相联映象全相联映象 全相联:全相联:主存中的任一块可以被放置到主存中的任一块可以被放置到CacheCache中的任意一个位置。中的任意一个位置。举例举例对比:对比:阅览室位置阅览室位置 随便坐随便坐特点:特点:空间利用率最高,冲突概率最低,实空间利用率最高,冲突概率最低,实现最
5、复杂。现最复杂。 Carnegie Mellon6CacheCache的基本知识的基本知识Carnegie Mellon7CacheCache的基本知识的基本知识直接映象直接映象 直接映象:主存中的每一块只能被放置到Cache中唯一的一个位置。举例(循环分配)对比:阅览室位置 只有一个位置可以坐特点:空间利用率最低,冲突概率最高,实现最简单。Carnegie Mellon85.2 Cache5.2 Cache的基本知识的基本知识Carnegie Mellon9组相联映象组相联映象 组相联:主存中的每一块可以被放置到Cache中唯一的一个组中的任何一个位置。 举例组相联是直接映象和全相联的一种折
6、中Carnegie Mellon10General Cache Organization (S, E, B)E = 2e lines per setS = 2s setssetline01 2B-1tagvB = 2b bytes per cache block (the data)Cache size:C = S x E x B data bytesvalid bitCarnegie Mellon11Cache ReadE = 2e lines per setS = 2s sets01 2B-1tagvvalid bitB = 2b bytes per cache block (the da
7、ta)t bitss bitsb bitsAddress of word:tagsetindexblockoffsetdata begins at this offsetLocate setCheck if any line in sethas matching tagYes + line valid: hitLocate data startingat offsetCarnegie Mellon12Example: Direct Mapped Cache (E = 1)S = 2s setsDirect mapped: One line per setAssume: cache block
8、size 8 bytest bits001100Address of int:01 27tagv365401 27tagv365401 27tagv365401 27tagv3654find setCarnegie Mellon13Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits001100Address of int:01 27tagv3654match: assume yes = hitvalid? +block offsetta
9、gCarnegie Mellon14Example: Direct Mapped Cache (E = 1)Direct mapped: One line per setAssume: cache block size 8 bytest bits001100Address of int:01 27tagv3654match: assume yes = hitvalid? +int (4 Bytes) is hereblock offsetNo match: old line is evicted and replacedCarnegie Mellon15Direct-Mapped Cache
10、SimulationM=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 Blocks/setAddress trace (reads, one byte per read):000002, 100012, 701112, 810002, 000002xt=1s=2b=1xxx0?vTagBlockmiss10M0-1hitmiss10M6-7miss11M8-9miss10M0-1Set 0Set 1Set 2Set 3Carnegie Mellon16A Higher Level Exampleint sum_array_rows(doub
11、le a1616) int i, j; double sum = 0; for (i = 0; i 16; i+) for (j = 0; j 16; j+) sum += aij; return sum;32 B = 4 doublesassume: cold (empty) cache,a00 goes hereint sum_array_cols(double a1616) int i, j; double sum = 0; for (j = 0; i 16; i+) for (i = 0; j 16; j+) sum += aij; return sum;blackboardIgnor
12、e the variables sum, i, jCarnegie Mellon17E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv36540 1 27tagv3654find setCarnegie
13、 Mellon18E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per setAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv3654compare bothvalid? + match: yes = hitblock offsettagCarnegie Mellon19E-way Set Associative Cache (Here: E = 2)E = 2: Two lines per se
14、tAssume: cache block size 8 bytest bits001100Address of short int:0 1 27tagv36540 1 27tagv3654compare bothvalid? + match: yes = hitblock offsetshort int (2 Bytes) is hereNo match: One line in set is selected for eviction and replacement Replacement policies: random, least recently used (LRU), Carneg
15、ie Mellon202-Way Set Associative Cache SimulationM=16 byte addresses, B=2 bytes/block, S=2 sets, E=2 blocks/setAddress trace (reads, one byte per read):000002, 100012, 701112, 810002, 000002xxt=2s=1b=1xx0?vTagBlock000miss100M0-1hitmiss101M6-7miss110M8-9hitSet 0Set 1Carnegie Mellon21A Higher Level Ex
16、ampleint sum_array_rows(double a1616) int i, j; double sum = 0; for (i = 0; i 16; i+) for (j = 0; j 16; j+) sum += aij; return sum;32 B = 4 doublesassume: cold (empty) cache,a00 goes hereint sum_array_rows(double a1616) int i, j; double sum = 0; for (j = 0; i 16; i+) for (i = 0; j 16; j+) sum += aij
17、; return sum;blackboardIgnore the variables sum, i, jCarnegie Mellon22What about writes?Multiple copies of data exist:L1, L2, Main Memory, DiskWhat to do on a write-hit?Write-through (write immediately to memory)Write-back (defer write to memory until replacement of line)Need a dirty bit (line diffe
18、rent from memory or not)What to do on a write-miss?Write-allocate (load into cache, update line in cache)Good if more writes to the location followNo-write-allocate (writes immediately to memory)TypicalWrite-through + No-write-allocateWrite-back + Write-allocateCarnegie Mellon23Intel Core i7 Cache H
19、ierarchyRegsL1 d-cacheL1 i-cacheL2 unified cacheCore 0RegsL1 d-cacheL1 i-cacheL2 unified cacheCore 3L3 unified cache(shared by all cores)Main memoryProcessor packageL1 i-cache and d-cache:32 KB, 8-way, Access: 4 cyclesL2 unified cache: 256 KB, 8-way, Access: 11 cyclesL3 unified cache:8 MB, 16-way,Ac
20、cess: 30-40 cyclesBlock size: 64 bytes for all caches. Carnegie Mellon24Cache Performance MetricsMiss RateFraction of memory references not found in cache (misses / accesses)= 1 hit rateTypical numbers (in percentages):3-10% for L1can be quite small (e.g., 1%) for L2, depending on size, etc.Hit Time
21、Time to deliver a line in the cache to the processorincludes time to determine whether the line is in the cacheTypical numbers:1-2 clock cycle for L15-20 clock cycles for L2Miss PenaltyAdditional time required because of a misstypically 50-200 cycles for main memory (Trend: increasing!)Carnegie Mell
22、on25Lets think about those numbersHuge difference between a hit and a missCould be 100 x, if just L1 and main memoryWould you believe 99% hits is twice as good as 97%?Consider: cache hit time of 1 cyclemiss penalty of 100 cyclesAverage access time: 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles 99
23、% hits: 1 cycle + 0.01 * 100 cycles = 2 cyclesThis is why “miss rate” is used instead of “hit rate”Carnegie Mellon26Writing Cache Friendly CodeMake the common case go fastFocus on the inner loops of the core functionsMinimize the misses in the inner loopsRepeated references to variables are good (te
24、mporal locality)Stride-1 reference patterns are good (spatial locality)Key idea: Our qualitative notion of locality is quantified through our understanding of cache memories.Carnegie Mellon27TodayCache organization and operationPerformance impact of cachesThe memory mountainRearranging loops to impr
25、ove spatial localityUsing blocking to improve temporal localityCarnegie Mellon28The Memory MountainRead throughput (read bandwidth)Number of bytes read from memory per second (MB/s)Memory mountain: Measured read throughput as a function of spatial and temporal locality.Compact way to characterize me
26、mory system performance. Carnegie Mellon29Memory Mountain Test Function/* The test function */void test(int elems, int stride) int i, result = 0; volatile int sink; for (i = 0; i elems; i += stride)result += datai; sink = result; /* So compiler doesnt optimize away the loop */* Run test(elems, strid
27、e) and return read throughput (MB/s) */double run(int size, int stride, double Mhz) double cycles; int elems = size / sizeof(int); test(elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cy
28、cles to MB/s */Carnegie Mellon30The Memory Mountain01000200030004000500060007000s1s3s5s7s9s11s13s15s32Read throughput (MB/s)Stride (x8 bytes)Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipCarnegie Mellon31The Memory Mountain010002000300040005
29、00060007000s1s3s5s7s9s11s13s15s32Read throughput (MB/s)Stride (x8 bytes)Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipSlopes ofspatial localityCarnegie Mellon32The Memory Mountain01000200030004000500060007000s1s3s5s7s9s11s13s15s32Read throug
30、hput (MB/s)Stride (x8 bytes)L1L2MemL3Intel Core i732 KB L1 i-cache32 KB L1 d-cache256 KB unified L2 cache8M unified L3 cacheAll caches on-chipSlopes ofspatial localityRidges of Temporal localityCarnegie Mellon33TodayCache organization and operationPerformance impact of cachesThe memory mountainRearr
31、anging loops to improve spatial localityUsing blocking to improve temporal localityCarnegie Mellon34Miss Rate Analysis for Matrix MultiplyAssume:Line size = 32B (big enough for four 64-bit words)Matrix dimension (N) is very largeApproximate 1/N as 0.0Cache is not even big enough to hold multiple row
32、sAnalysis Method:Look at access pattern of inner loopAkiBkjCijCarnegie Mellon35Matrix Multiplication ExampleDescription:Multiply N x N matricesO(N3) total operationsN reads per source elementN values summed per destinationbut may be able to hold in register/* ijk */for (i=0; in; i+) for (j=0; jn; j+
33、) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum; Variable sumheld in registerCarnegie Mellon36Layout of C Arrays in Memory (review)C arrays allocated in row-major ordereach row in contiguous memory locationsStepping through columns in one row:for (i = 0; i 4 bytes, exploit spatial localit
34、ycompulsory miss rate = 4 bytes / BStepping through rows in one column:for (i = 0; i n; i+)sum += ai0;accesses distant elementsno spatial locality!compulsory miss rate = 1 (i.e. 100%)Carnegie Mellon37Matrix Multiplication (ijk)/* ijk */for (i=0; in; i+) for (j=0; jn; j+) sum = 0.0; for (k=0; kn; k+)
35、 sum += aik * bkj; cij = sum; ABC(i,*)(*,j)(i,j)Inner loop:Column-wiseRow-wiseFixedMisses per inner loop iteration:ABC0.251.00.0Carnegie Mellon38Matrix Multiplication (jik)/* jik */for (j=0; jn; j+) for (i=0; in; i+) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum ABC(i,*)(*,j)(i,j)Inner lo
36、op:Row-wiseColumn-wiseFixedMisses per inner loop iteration:ABC0.251.00.0Carnegie Mellon39Matrix Multiplication (kij)/* kij */for (k=0; kn; k+) for (i=0; in; i+) r = aik; for (j=0; jn; j+) cij += r * bkj; ABC(i,*)(i,k)(k,*)Inner loop:Row-wise Row-wiseFixedMisses per inner loop iteration:ABC0.00.250.2
37、5Carnegie Mellon40Matrix Multiplication (ikj)/* ikj */for (i=0; in; i+) for (k=0; kn; k+) r = aik; for (j=0; jn; j+) cij += r * bkj; ABC(i,*)(i,k)(k,*)Inner loop:Row-wise Row-wiseFixedMisses per inner loop iteration:ABC0.00.250.25Carnegie Mellon41Matrix Multiplication (jki)/* jki */for (j=0; jn; j+)
38、 for (k=0; kn; k+) r = bkj; for (i=0; in; i+) cij += aik * r; ABC(*,j)(k,j)Inner loop:(*,k)Column-wiseColumn-wiseFixedMisses per inner loop iteration:ABC1.00.01.0Carnegie Mellon42Matrix Multiplication (kji)/* kji */for (k=0; kn; k+) for (j=0; jn; j+) r = bkj; for (i=0; in; i+) cij += aik * r; ABC(*,
39、j)(k,j)Inner loop:(*,k)FixedColumn-wiseColumn-wiseMisses per inner loop iteration:ABC1.00.01.0Carnegie Mellon43Summary of Matrix Multiplicationijk (& jik): 2 loads, 0 stores misses/iter = 1.25kij (& ikj): 2 loads, 1 store misses/iter = 0.5jki (& kji): 2 loads, 1 store misses/iter = 2.0fo
40、r (i=0; in; i+) for (j=0; jn; j+) sum = 0.0; for (k=0; kn; k+) sum += aik * bkj; cij = sum; for (k=0; kn; k+) for (i=0; in; i+) r = aik; for (j=0; jn; j+) cij += r * bkj; for (j=0; jn; j+) for (k=0; kn; k+) r = bkj; for (i=0; in; i+) cij += aik * r; Carnegie Mellon44Core i7 Matrix Multiply Performan
41、ce010203040506050100150200250300350400450500550600650700750Cycles per inner loop iterationArray size (n)jkikjiijkjikkijikjjki / kjiijk / jikkij / ikjCarnegie Mellon45TodayCache organization and operationPerformance impact of cachesThe memory mountainRearranging loops to improve spatial localityUsing
42、 blocking to improve temporal localityCarnegie Mellon46Example: Matrix Multiplicationabij*c=c = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) int i, j, k; for (i = 0; i n; i+)for (j = 0; j n; j+) for (k = 0; k n; k+) ci*n
43、+j += ai*n + k*bk*n + j;Carnegie Mellon47Cache Miss AnalysisAssume: Matrix elements are doublesCache block = 8 doublesCache size C n (much smaller than n)First iteration:n/8 + n = 9n/8 missesAfterwards in cache:(schematic)*=n*=8 wideCarnegie Mellon48Cache Miss AnalysisAssume: Matrix elements are dou
44、blesCache block = 8 doublesCache size C n (much smaller than n)Second iteration:Again:n/8 + n = 9n/8 missesTotal misses:9n/8 * n2 = (9/8) * n3 n*=8 wideCarnegie Mellon49Blocked Matrix Multiplicationc = (double *) calloc(sizeof(double), n*n);/* Multiply n x n matrices a and b */void mmm(double *a, double *b, double *c, int n) int i, j, k; for (i = 0; i n; i+=B)for (j = 0; j n; j+=B) for (k = 0; k n; k+=B) /* B x B mini matrix multiplicati
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 网络管理员考试必知要点试题及答案
- 用户反馈的计算机二级VB试题与答案
- 软考网络管理员评估试题及答案合集
- 2025年软件设计师考试快速掌握技巧试题及答案
- 2025年不同文化对公司战略的挑战及试题及答案
- 未来公司的治理结构与风险控制探索试题及答案
- 行政法学考试常见知识点:试题及答案
- 计算机教程与编程实践试题及答案
- 2025租房合同协议书
- 网络架构所需技能分析试题及答案
- 2022年阜宁县(中小学、幼儿园)教师招聘考试《教育综合知识》试题及答案解析
- GB/T 15608-2006中国颜色体系
- 建筑架子工(普通脚手架)操作技能考核标准
- 山推SD16结构原理课件
- 病假医疗期申请单(新修订)
- 95598工单大数据分析及压降策略
- 《游园不值》-完整版课件
- 大连银行招聘考试最新笔试复习材料题目内容试卷真题复习
- 卷烟纸生产工艺
- 肩关节镜下肩袖修补术的护理查房ppt
- 回旋镖运动轨迹的模拟
评论
0/150
提交评论