版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、CS252 Graduate Computer Architecture Lecture 11 Data Value Prediction Limits to ILP / Multithreading February 27th, 2012 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley /kubitron/cs252 2/27/20122cs252-S12, Lecture11 Data Val
2、ue Prediction Why do it? Can “Break the DataFlow Boundary” Before: Critical path = 4 operations (probably worse) After: Critical path = 1 operation (plus verification) + * / AB + YX + * / AB + YX Guess Guess Guess 2/27/20123cs252-S12, Lecture11 Data Value Predictability “The Predictability of Data V
3、alues” Yiannakis Sazeides and James Smith, Micro 30, 1997 Three different types of Patterns: Constant (C): 5 5 5 5 5 5 5 5 5 5 Stride (S): 1 2 3 4 5 6 7 8 9 Non-Stride (NS):28 13 99 107 23 456 Combinations: Repeated Stride (RS):1 2 3 1 2 3 1 2 3 1 2 3 Repeadted Non-Stride (RNS):1 -13 -99 7 1 -13 -99
4、 7 2/27/20124cs252-S12, Lecture11 Computational Predictors Last Value Predictors Predict that instruction will produce same value as last time Requires some form of hysteresis. Two subtle alternatives: Saturating counter incremented/decremented on success/failure replace when the count is below thre
5、shold Keep old value until new value seen frequently enough Second version predicts a constant when appears temporarily constant Stride Predictors Predict next value by adding the sum of most recent value to difference of two most recent values: If vn-1 and vn-2 are the two most recent values, then
6、predict next value will be: vn-1 + (vn-1 vn-2) The value (vn-1 vn-2) is called the “stride” Important variations in hysteresis: Change stride only if saturating counter falls below threshold Or “two-delta” method. Two strides maintained. First (S1) always updated by difference between two most recen
7、t values Other (S2) used for computing predictions When S1 seen twice in a row, then S1S2 More complex predictors: Multiple strides for nested loops Complex computations for complex loops (polynomials, etc!) 2/27/20125cs252-S12, Lecture11 Context Based Predictors Context Based Predictor Relies on Ta
8、bles to do trick Classified according to the order: an “n-th” order model takes last n values and uses this to produce prediction So 0th order predictor will be entirely frequency based Consider sequence: a a a b c a a a b c a a a Next value is? “Blending”: Use prediction of highest order available
9、2/27/20126cs252-S12, Lecture11 Which is better? Stride-based: Learns faster less state Much cheaper in terms of hardware! runs into errors for any pattern that is not an infinite stride Context-based: Much longer to train Performs perfectly once trained Much more expensive hardware 2/27/20127cs252-S
10、12, Lecture11 How predictable are data items? Assumptions looking for limits Prediction done with no table aliasing (every instruction has own set of tables/strides/etc. Only instructions that write into registers are measured Excludes stores, branches, jumps, etc Overall Predictability: L = Last Va
11、lue S = Stride (delta-2) FCMx = Order x context based predictor 2/27/20128cs252-S12, Lecture11 Correlation of Predicted Sets Way to interpret: l = last val s = stride f = fcm3 Combinations: ls = both l and s Etc. Conclusion? Only 18% not predicted correctly by any model About 40% captured by all pre
12、dictors A significant fraction (over 20%) only captured by fcm Stride does well! Over 60% of correct predictions captured Last-Value seems to have very little added value 2/27/20129cs252-S12, Lecture11 Number of unique values Data Observations: Many static instructions (50%) generate only one value
13、Majority of static instructions (90%) generate fewer than 64 values Majority of dynamic instructions (50%) correspond to static insts that generate fewer than 64 values Over 90% of dynamic instructions correspond to static insts that generate fewer than 4096 unique values Suggests that a relatively
14、small number of values would be required for actual context prediction 2/27/201210cs252-S12, Lecture11 General Idea: Confidence Prediction Data Predictor Confidence Prediction FetchDecode Execute Commit Reorder Buffer PC Complete Check Results Result Kill Adjust Adjust Correct PC Separate mechanisms
15、 for data and confidence prediction Data predictor keeps track of values via multiple mechanisms Confidence predictor tracks history of correctness (good/bad) Confidence prediction options: Saturating counter History register (like branch prediction) ? 2/27/201211cs252-S12, Lecture11 Administrivia M
16、idterm I: Wednesday 3/21 Location: 405 Soda Hall TIME: 5:00-8:00 Can have 1 sheet of 8x11 handwritten notes both sides No microfiche of the book! Bring DUMB calculator Meet at LaVals afterwards for Pizza and Beverages Great way for me to get to know you better Ill Buy! CS252 First Project proposal d
17、ue by Friday 3/2 Need two people/project (although can justify three for right project) Complete Research project in 9 weeks Typically investigate hypothesis by building an artifact and measuring it against a “base case” Generate conference-length paper/give oral presentation Often, can lead to an a
18、ctual publication. 2/27/201212cs252-S12, Lecture11 Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to inven
19、t new HW/SW mechanisms to keep on processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc. GPUs? 2/27/201213cs252-S12, Lecture11 Overcoming Lim
20、its Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future 2/27/201214cs252-S12, Lecture11 Limits to IL
21、P Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming infinite virtual registers all register WAW no mispredictions 3. Jump prediction all jumps perfectly predicted (returns, case statements) 2 perfect speculation 1 1 cycle latency for all inst
22、ructions (FP *,/); unlimited instructions issued/clock cycle; 2/27/201215cs252-S12, Lecture11 ModelPower 5 Instructions Issued per clock Infinite4 Instruction Window Size Infinite200 Renaming Registers Infinite48 integer + 40 Fl. Pt. Branch PredictionPerfect2% to 6% misprediction (Tournament Branch
23、Predictor) CachePerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Analysis Perfect? Limits to ILP HW Model comparison 2/27/201216cs252-S12, Lecture11 Upper Limit to ILP: Ideal Machine (Figure 3.1) ProgramsPrograms Instruction Issues per cycleInstruction Issues per cycle 0 0 2020 4040 6060 8080 100
24、100 120120 140140 160160 gccgccespressoespressolilifppppfppppdoducddoducdtomcatvtomcatv 54.854.8 62.662.6 17.917.9 75.275.2 118.7118.7 150.1150.1 Integer: 18 - 60 FP: 75 - 150 Instructions Per Clock 2/27/201217cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock InfiniteInfinite4
25、 Instruction Window Size Infinite, 2K, 512, 128, 32 Infinite200 Renaming Registers InfiniteInfinite48 integer + 40 Fl. Pt. Branch Prediction PerfectPerfect2% to 6% misprediction (Tournament Branch Predictor) CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfect? Limits to IL
26、P HW Model comparison 2/27/201218cs252-S12, Lecture11 More Realistic HW: Window Impact Figure 3.2 Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 Integer: 8 - 63 IPC 2/27/201219cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64Infinite4 Instruction Window Size 204
27、8Infinite200 Renaming Registers InfiniteInfinite48 integer + 40 Fl. Pt. Branch Prediction Perfect vs. 8K Tournament vs. 512 2-bit vs. static vs. none Perfect2% to 6% misprediction (Tournament Branch Predictor) CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfect? Limits to
28、ILP HW Model comparison 2/27/201220cs252-S12, Lecture11 35 41 16 61 58 60 9 12 10 48 15 6 7 6 46 13 45 66 7 45 14 45 2 2 2 29 4 19 46 0 10 20 30 40 50 60 gccespressolifppppdoducdtomcatv ProgramProgram PerfectSelective predictorStandard 2-bitStatic None More Realistic HW: Branch Impact Figure 3.3 Cha
29、nge from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle StaticBHT (512)TournamentPerfect No prediction FP: 15 - 45 Integer: 6 - 12 IPC 2/27/201221cs252-S12, Lecture11 Misprediction Rates 2/27/201222cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issue
30、d per clock 64Infinite4 Instruction Window Size 2048Infinite200 Renaming Registers Infinite v. 256, 128, 64, 32, none Infinite48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bitPerfectTournament Branch Predictor CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfectPerfect Lim
31、its to ILP HW Model comparison 2/27/201223cs252-S12, Lecture11 11 15 12 29 54 10 15 12 49 16 10 13 12 35 15 44 9 10 11 20 11 28 5 5 6 55 7 4 4 5 4 5 5 59 45 0 10 20 30 40 50 60 70 gccespressolifppppdoducdtomcatv ProgramProgram Infinite2561286432None More Realistic HW: Renaming Register Impact (N int
32、 + N fp) Figure 3.5 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction 64None256Infinite32128 Integer: 5 - 15 FP: 11 - 45 IPC 2/27/201224cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64Infinite4 Instruction Window Size 2048Infinite200 Renaming Registers 256 In
33、t + 256 FPInfinite48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bitPerfectTournament CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect v. Stack v. Inspect v. none PerfectPerfect Limits to ILP HW Model comparison 2/27/201225cs252-S12, Lecture11 ProgramProgram Instruction issues
34、 per cycleInstruction issues per cycle 0 5 10 15 20 25 30 35 40 45 50 gccespressolifppppdoducdtomcatv 10 15 12 49 16 45 7 7 9 49 16 4 5 4 4 6 5 3 5 3 3 4 4 45 PerfectGlobal/stack PerfectInspectionNone More Realistic HW: Memory Address Alias Impact Figure 3.6 Change 2048 instr window, 64 instr issue,
35、 8K 2 level Prediction, 256 renaming registers NoneGlobal/Stack perf; heap conflicts PerfectInspec. Assem. FP: 4 - 45 (Fortran, no heap) Integer: 4 - 9 IPC 2/27/201226cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64 (no restrictions) Infinite4 Instruction Window Size Infin
36、ite vs. 256, 128, 64, 32 Infinite200 Renaming Registers 64 Int + 64 FPInfinite48 integer + 40 Fl. Pt. Branch Prediction 1K 2-bitPerfectTournament CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias HW disambiguation PerfectPerfect Limits to ILP HW Model comparison 2/27/201227cs252-S12, L
37、ecture11 ProgramProgram Instruction issues per cycleInstruction issues per cycle 0 10 20 30 40 50 60 gccexpressolifppppdoducdtomcatv 10 15 12 52 17 56 10 15 12 47 16 10 13 11 35 15 34 9 10 11 22 12 8 8 9 14 9 14 66 6 8 7 9 4 4 4 5 4 6 3 2 3 3 3 3 45 22 Infinite25612864321684 Realistic HW: Window Imp
38、act (Figure 3.7) Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window 6416256Infinite3212884 Integer: 6 - 12 FP: 8 - 45 IPC 2/27/201228cs252-S12, Lecture11 How to Exceed ILP Limits of this study? These are not laws of physics; just practical li
39、mits for today, and perhaps overcome via research Compiler and ISA advances could change results WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage Can get conflicts via allocation of stack frames as a called procedure reuses the mem
40、ory addresses of a previous frame on the stack 2/27/201229cs252-S12, Lecture11 HW v. SW to increase ILP Memory disambiguation: HW best Speculation: HW best when dynamic branch prediction better than compile time prediction Exceptions easier for HW HW doesnt need bookkeeping code or compensation code
41、 Very complicated to get right Scheduling: SW can look ahead to schedule better Compiler independence: does not require new compiler, recompilation to run well 2/27/201230cs252-S12, Lecture11 “Complexity-effective superscalar processors”, Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Severa
42、l data structures analyzed for complexity WRT issue width Rename: Roughly Linear in IW, steeper slope for smaller feature size Wakeup: Roughly Linear in IW, but quadratic in window size Bypass: Strongly quadratic in IW Overall results: Bypass significant at high window size/issue width Wakeup+Select
43、 delay dominates otherwise Proposed Complexity-effective design: Replace issue window with FIFOs/steer dependent Insts to same FIFO Discussion of papers: Complexity-effective superscalar processors 2/27/201231cs252-S12, Lecture113/2/2011cs252-S11, Lecture 1231 Discussion of SPARCLE paper Example of
44、close coupling between processor and memory controller (CMMU) All of features mentioned in this paper implemented by combination of processor and memory controller Some functions implemented as special “coprocessor” instructions Others implemented as “Tagged” loads/stores/swaps Course Grained Multit
45、hreading Using SPARC register windows Automatic synchronous trap on cache miss Fast handling of all other traps/interrupts (great for message interface!) Multithreading half in hardware/half software (hence 14 cycles) Fine Grained Synchronization Full-Empty bit/32 bit word (effectively 33 bits) Grou
46、ps of 4 words/cache line F/E bits put into memory TAG Fast TRAP on bad condition Multiple instructions. Examples: LDT (load/trap if empty) LDET (load/set empty/trap if empty) STF (Store/set full) STFT (store/set full/trap if full) 2/27/201232cs252-S12, Lecture113/2/2011cs252-S11, Lecture 1232 Discus
47、sion of Papers: Sparcle (Cont) Message Interface Closely couple with processor Interface at speed of first-level cache Atomic message launch: Describe message (including DMA ops) with simple stio insts Atomic launch instruction (ipilaunch) Message Reception Possible interrupt on message receive: use
48、 fast context switch Examine message with simple ldio instructions Discard in pieces, possibly with DMA Free message (ipicst, i.e “coherent storeback”) We will talk about message interface in greater detail 2/27/201233cs252-S12, Lecture11 Performance beyond single thread ILP There can be much higher
49、 natural parallelism in some applications (e.g., Database or Scientific codes) Explicit Thread Level Parallelism or Data Level Parallelism Thread: instruction stream with own PC and data thread may be a process part of a parallel program of multiple processes, or it may be an independent program Eac
50、h thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute Data Level Parallelism: Perform identical operations on data, and lots of data 2/27/201234cs252-S12, Lecture11 Thread Level Parallelism (TLP) ILP exploits implicit parallel operations with
51、in a loop or straight-line code segment TLP explicitly represented by the use of multiple threads of execution that are inherently parallel Goal: Use multiple instruction streams to improve 1. Throughput of computers that run many programs 2. Execution time of multi-threaded programs TLP could be mo
52、re cost-effective to exploit than ILP 2/27/201235cs252-S12, Lecture11 Another Approach: Multithreaded Execution Multithreading: multiple threads to share the functional units of 1 processor via overlapping processor must duplicate independent state of each thread e.g., a separate copy of register fi
53、le, a separate PC, and for running independent programs, a separate page table memory shared through the virtual memory mechanisms, which already support multiple processes HW for fast thread switch; much faster than full process switch 100s to 1000s of clocks When switch? Alternate instruction per
54、thread (fine grain) When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain) 2/27/201236cs252-S12, Lecture11 Fine-Grained Multithreading Switches between threads on each instruction, causing the execution of multiples threads to be interleaved Usually done in
55、 a round-robin fashion, skipping any stalled threads CPU must be able to switch threads every clock Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls Disadvantage is it slows down execution of individual threads, since a thread
56、 ready to execute without stalls will be delayed by instructions from other threads Used on Suns Niagara (will see later) 2/27/201237cs252-S12, Lecture11 Course-Grained Multithreading Switches threads only on costly stalls, such as L2 cache misses Advantages Relieves need to have very fast thread-sw
57、itching Doesnt slow down thread, since instructions from other threads issued only when the thread encounters a costly stall Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs Since CPU issues instructions from 1 thread, when a stall occurs, the pi
58、peline must be emptied or frozen New thread must fill pipeline before instructions can complete Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill stall time Used in IBM AS/400, Alewife 2/27/201238cs252-S12, Lect
59、ure11 For most apps: most execution units lie idle From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995. For an 8-way superscalar. 2/27/201239cs252-S12, Lecture11 Do both ILP and TLP? TLP and ILP exploit two different kinds of parallel structure in
60、 a program Could a processor oriented at ILP to exploit TLP? functional units are often idle in data path designed for ILP because of either stalls or dependences in the code Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls? Could TLP be
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 湖南省长沙市开福区2025-2026学年初三下学期第一次月考英语试题含解析
- 陕西省西安市滨河区2025-2026学年初三中考模拟训练评估卷(2)英语试题含解析
- 项目预算成本费用计算及审批模板
- 制造业设备维护保养周期规划手册
- 企业产品(服务)用户调查问卷模板
- 企业市场调研与策略制定工具
- 高等职业技术教育电力系统自动化技术专业人才培养方案
- 2026年职业生涯规划书民航气象
- 2026年食品行业现场管理(6S)专员职责与能力
- 博物馆捐款协议书范本
- 2026年江苏苏锡常镇四市高三一模高考数学试卷(答案详解)
- 胖东来售后服务合规管理实施细则
- 2026年安庆职业技术学院单招职业技能考试题库附参考答案详解(典型题)
- 2026年安徽工业经济职业技术学院单招职业技能测试题库附答案详解(a卷)
- 第三单元整本书阅读《骆驼祥子》 课件(内嵌视频) 2025-2026学年统编版语文七年级下册
- 2025 国际经济合作中的区域贸易协定课件
- 2026年徽商职业学院单招职业适应性测试题库附答案解析
- 医务人员职业暴露防护知识更新培训课件
- 小学四年级科学核心素养国测模拟测试题(含参考答案)
- 2025年事业单位教师招聘考试英语学科专业知识试卷(英语教学课件)试题
- 陕22N1 供暖工程标准图集
评论
0/150
提交评论