lec11-multithreading

上传人：2*** IP属地：湖北上传时间：2021-10-11 格式：PPTX 页数：58 大小：944.93KB 积分：30 举报 版权申诉

已阅读5页，还剩53页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、CS252 Graduate Computer Architecture Lecture 11 Data Value Prediction Limits to ILP / Multithreading February 27th, 2012 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley /kubitron/cs252 2/27/20122cs252-S12, Lecture11 Data Val

2、ue Prediction Why do it? Can “Break the DataFlow Boundary” Before: Critical path = 4 operations (probably worse) After: Critical path = 1 operation (plus verification) + * / AB + YX + * / AB + YX Guess Guess Guess 2/27/20123cs252-S12, Lecture11 Data Value Predictability “The Predictability of Data V

3、alues” Yiannakis Sazeides and James Smith, Micro 30, 1997 Three different types of Patterns: Constant (C): 5 5 5 5 5 5 5 5 5 5 Stride (S): 1 2 3 4 5 6 7 8 9 Non-Stride (NS):28 13 99 107 23 456 Combinations: Repeated Stride (RS):1 2 3 1 2 3 1 2 3 1 2 3 Repeadted Non-Stride (RNS):1 -13 -99 7 1 -13 -99

4、 7 2/27/20124cs252-S12, Lecture11 Computational Predictors Last Value Predictors Predict that instruction will produce same value as last time Requires some form of hysteresis. Two subtle alternatives: Saturating counter incremented/decremented on success/failure replace when the count is below thre

5、shold Keep old value until new value seen frequently enough Second version predicts a constant when appears temporarily constant Stride Predictors Predict next value by adding the sum of most recent value to difference of two most recent values: If vn-1 and vn-2 are the two most recent values, then

6、predict next value will be: vn-1 + (vn-1 vn-2) The value (vn-1 vn-2) is called the “stride” Important variations in hysteresis: Change stride only if saturating counter falls below threshold Or “two-delta” method. Two strides maintained. First (S1) always updated by difference between two most recen

7、t values Other (S2) used for computing predictions When S1 seen twice in a row, then S1S2 More complex predictors: Multiple strides for nested loops Complex computations for complex loops (polynomials, etc!) 2/27/20125cs252-S12, Lecture11 Context Based Predictors Context Based Predictor Relies on Ta

8、bles to do trick Classified according to the order: an “n-th” order model takes last n values and uses this to produce prediction So 0th order predictor will be entirely frequency based Consider sequence: a a a b c a a a b c a a a Next value is? “Blending”: Use prediction of highest order available

9、2/27/20126cs252-S12, Lecture11 Which is better? Stride-based: Learns faster less state Much cheaper in terms of hardware! runs into errors for any pattern that is not an infinite stride Context-based: Much longer to train Performs perfectly once trained Much more expensive hardware 2/27/20127cs252-S

10、12, Lecture11 How predictable are data items? Assumptions looking for limits Prediction done with no table aliasing (every instruction has own set of tables/strides/etc. Only instructions that write into registers are measured Excludes stores, branches, jumps, etc Overall Predictability: L = Last Va

11、lue S = Stride (delta-2) FCMx = Order x context based predictor 2/27/20128cs252-S12, Lecture11 Correlation of Predicted Sets Way to interpret: l = last val s = stride f = fcm3 Combinations: ls = both l and s Etc. Conclusion? Only 18% not predicted correctly by any model About 40% captured by all pre

12、dictors A significant fraction (over 20%) only captured by fcm Stride does well! Over 60% of correct predictions captured Last-Value seems to have very little added value 2/27/20129cs252-S12, Lecture11 Number of unique values Data Observations: Many static instructions (50%) generate only one value

13、Majority of static instructions (90%) generate fewer than 64 values Majority of dynamic instructions (50%) correspond to static insts that generate fewer than 64 values Over 90% of dynamic instructions correspond to static insts that generate fewer than 4096 unique values Suggests that a relatively

14、small number of values would be required for actual context prediction 2/27/201210cs252-S12, Lecture11 General Idea: Confidence Prediction Data Predictor Confidence Prediction FetchDecode Execute Commit Reorder Buffer PC Complete Check Results Result Kill Adjust Adjust Correct PC Separate mechanisms

15、 for data and confidence prediction Data predictor keeps track of values via multiple mechanisms Confidence predictor tracks history of correctness (good/bad) Confidence prediction options: Saturating counter History register (like branch prediction) ? 2/27/201211cs252-S12, Lecture11 Administrivia M

16、idterm I: Wednesday 3/21 Location: 405 Soda Hall TIME: 5:00-8:00 Can have 1 sheet of 8x11 handwritten notes both sides No microfiche of the book! Bring DUMB calculator Meet at LaVals afterwards for Pizza and Beverages Great way for me to get to know you better Ill Buy! CS252 First Project proposal d

17、ue by Friday 3/2 Need two people/project (although can justify three for right project) Complete Research project in 9 weeks Typically investigate hypothesis by building an artifact and measuring it against a “base case” Generate conference-length paper/give oral presentation Often, can lead to an a

18、ctual publication. 2/27/201212cs252-S12, Lecture11 Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to inven

19、t new HW/SW mechanisms to keep on processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc. GPUs? 2/27/201213cs252-S12, Lecture11 Overcoming Lim

20、its Advances in compiler technology + significantly new and different hardware techniques may be able to overcome limitations assumed in studies However, unlikely such advances when coupled with realistic hardware will overcome these limits in near future 2/27/201214cs252-S12, Lecture11 Limits to IL

21、P Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming infinite virtual registers all register WAW no mispredictions 3. Jump prediction all jumps perfectly predicted (returns, case statements) 2 perfect speculation 1 1 cycle latency for all inst

22、ructions (FP *,/); unlimited instructions issued/clock cycle; 2/27/201215cs252-S12, Lecture11 ModelPower 5 Instructions Issued per clock Infinite4 Instruction Window Size Infinite200 Renaming Registers Infinite48 integer + 40 Fl. Pt. Branch PredictionPerfect2% to 6% misprediction (Tournament Branch

23、Predictor) CachePerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Analysis Perfect? Limits to ILP HW Model comparison 2/27/201216cs252-S12, Lecture11 Upper Limit to ILP: Ideal Machine (Figure 3.1) ProgramsPrograms Instruction Issues per cycleInstruction Issues per cycle 0 0 2020 4040 6060 8080 100

24、100 120120 140140 160160 gccgccespressoespressolilifppppfppppdoducddoducdtomcatvtomcatv 54.854.8 62.662.6 17.917.9 75.275.2 118.7118.7 150.1150.1 Integer: 18 - 60 FP: 75 - 150 Instructions Per Clock 2/27/201217cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock InfiniteInfinite4

25、 Instruction Window Size Infinite, 2K, 512, 128, 32 Infinite200 Renaming Registers InfiniteInfinite48 integer + 40 Fl. Pt. Branch Prediction PerfectPerfect2% to 6% misprediction (Tournament Branch Predictor) CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfect? Limits to IL

26、P HW Model comparison 2/27/201218cs252-S12, Lecture11 More Realistic HW: Window Impact Figure 3.2 Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 Integer: 8 - 63 IPC 2/27/201219cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64Infinite4 Instruction Window Size 204

27、8Infinite200 Renaming Registers InfiniteInfinite48 integer + 40 Fl. Pt. Branch Prediction Perfect vs. 8K Tournament vs. 512 2-bit vs. static vs. none Perfect2% to 6% misprediction (Tournament Branch Predictor) CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfect? Limits to

28、ILP HW Model comparison 2/27/201220cs252-S12, Lecture11 35 41 16 61 58 60 9 12 10 48 15 6 7 6 46 13 45 66 7 45 14 45 2 2 2 29 4 19 46 0 10 20 30 40 50 60 gccespressolifppppdoducdtomcatv ProgramProgram PerfectSelective predictorStandard 2-bitStatic None More Realistic HW: Branch Impact Figure 3.3 Cha

29、nge from Infinite window to examine to 2048 and maximum issue of 64 instructions per clock cycle StaticBHT (512)TournamentPerfect No prediction FP: 15 - 45 Integer: 6 - 12 IPC 2/27/201221cs252-S12, Lecture11 Misprediction Rates 2/27/201222cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issue

30、d per clock 64Infinite4 Instruction Window Size 2048Infinite200 Renaming Registers Infinite v. 256, 128, 64, 32, none Infinite48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bitPerfectTournament Branch Predictor CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias PerfectPerfectPerfect Lim

31、its to ILP HW Model comparison 2/27/201223cs252-S12, Lecture11 11 15 12 29 54 10 15 12 49 16 10 13 12 35 15 44 9 10 11 20 11 28 5 5 6 55 7 4 4 5 4 5 5 59 45 0 10 20 30 40 50 60 70 gccespressolifppppdoducdtomcatv ProgramProgram Infinite2561286432None More Realistic HW: Renaming Register Impact (N int

32、 + N fp) Figure 3.5 Change 2048 instr window, 64 instr issue, 8K 2 level Prediction 64None256Infinite32128 Integer: 5 - 15 FP: 11 - 45 IPC 2/27/201224cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64Infinite4 Instruction Window Size 2048Infinite200 Renaming Registers 256 In

33、t + 256 FPInfinite48 integer + 40 Fl. Pt. Branch Prediction 8K 2-bitPerfectTournament CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias Perfect v. Stack v. Inspect v. none PerfectPerfect Limits to ILP HW Model comparison 2/27/201225cs252-S12, Lecture11 ProgramProgram Instruction issues

34、 per cycleInstruction issues per cycle 0 5 10 15 20 25 30 35 40 45 50 gccespressolifppppdoducdtomcatv 10 15 12 49 16 45 7 7 9 49 16 4 5 4 4 6 5 3 5 3 3 4 4 45 PerfectGlobal/stack PerfectInspectionNone More Realistic HW: Memory Address Alias Impact Figure 3.6 Change 2048 instr window, 64 instr issue,

35、 8K 2 level Prediction, 256 renaming registers NoneGlobal/Stack perf; heap conflicts PerfectInspec. Assem. FP: 4 - 45 (Fortran, no heap) Integer: 4 - 9 IPC 2/27/201226cs252-S12, Lecture11 New ModelModelPower 5 Instructions Issued per clock 64 (no restrictions) Infinite4 Instruction Window Size Infin

36、ite vs. 256, 128, 64, 32 Infinite200 Renaming Registers 64 Int + 64 FPInfinite48 integer + 40 Fl. Pt. Branch Prediction 1K 2-bitPerfectTournament CachePerfectPerfect64KI, 32KD, 1.92MB L2, 36 MB L3 Memory Alias HW disambiguation PerfectPerfect Limits to ILP HW Model comparison 2/27/201227cs252-S12, L

37、ecture11 ProgramProgram Instruction issues per cycleInstruction issues per cycle 0 10 20 30 40 50 60 gccexpressolifppppdoducdtomcatv 10 15 12 52 17 56 10 15 12 47 16 10 13 11 35 15 34 9 10 11 22 12 8 8 9 14 9 14 66 6 8 7 9 4 4 4 5 4 6 3 2 3 3 3 3 45 22 Infinite25612864321684 Realistic HW: Window Imp

38、act (Figure 3.7) Perfect disambiguation (HW), 1K Selective Prediction, 16 entry return, 64 registers, issue as many as window 6416256Infinite3212884 Integer: 6 - 12 FP: 8 - 45 IPC 2/27/201228cs252-S12, Lecture11 How to Exceed ILP Limits of this study? These are not laws of physics; just practical li

39、mits for today, and perhaps overcome via research Compiler and ISA advances could change results WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage Can get conflicts via allocation of stack frames as a called procedure reuses the mem

40、ory addresses of a previous frame on the stack 2/27/201229cs252-S12, Lecture11 HW v. SW to increase ILP Memory disambiguation: HW best Speculation: HW best when dynamic branch prediction better than compile time prediction Exceptions easier for HW HW doesnt need bookkeeping code or compensation code

41、 Very complicated to get right Scheduling: SW can look ahead to schedule better Compiler independence: does not require new compiler, recompilation to run well 2/27/201230cs252-S12, Lecture11 “Complexity-effective superscalar processors”, Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Severa

42、l data structures analyzed for complexity WRT issue width Rename: Roughly Linear in IW, steeper slope for smaller feature size Wakeup: Roughly Linear in IW, but quadratic in window size Bypass: Strongly quadratic in IW Overall results: Bypass significant at high window size/issue width Wakeup+Select

43、 delay dominates otherwise Proposed Complexity-effective design: Replace issue window with FIFOs/steer dependent Insts to same FIFO Discussion of papers: Complexity-effective superscalar processors 2/27/201231cs252-S12, Lecture113/2/2011cs252-S11, Lecture 1231 Discussion of SPARCLE paper Example of

44、close coupling between processor and memory controller (CMMU) All of features mentioned in this paper implemented by combination of processor and memory controller Some functions implemented as special “coprocessor” instructions Others implemented as “Tagged” loads/stores/swaps Course Grained Multit

45、hreading Using SPARC register windows Automatic synchronous trap on cache miss Fast handling of all other traps/interrupts (great for message interface!) Multithreading half in hardware/half software (hence 14 cycles) Fine Grained Synchronization Full-Empty bit/32 bit word (effectively 33 bits) Grou

46、ps of 4 words/cache line F/E bits put into memory TAG Fast TRAP on bad condition Multiple instructions. Examples: LDT (load/trap if empty) LDET (load/set empty/trap if empty) STF (Store/set full) STFT (store/set full/trap if full) 2/27/201232cs252-S12, Lecture113/2/2011cs252-S11, Lecture 1232 Discus

47、sion of Papers: Sparcle (Cont) Message Interface Closely couple with processor Interface at speed of first-level cache Atomic message launch: Describe message (including DMA ops) with simple stio insts Atomic launch instruction (ipilaunch) Message Reception Possible interrupt on message receive: use

48、 fast context switch Examine message with simple ldio instructions Discard in pieces, possibly with DMA Free message (ipicst, i.e “coherent storeback”) We will talk about message interface in greater detail 2/27/201233cs252-S12, Lecture11 Performance beyond single thread ILP There can be much higher

49、 natural parallelism in some applications (e.g., Database or Scientific codes) Explicit Thread Level Parallelism or Data Level Parallelism Thread: instruction stream with own PC and data thread may be a process part of a parallel program of multiple processes, or it may be an independent program Eac

50、h thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute Data Level Parallelism: Perform identical operations on data, and lots of data 2/27/201234cs252-S12, Lecture11 Thread Level Parallelism (TLP) ILP exploits implicit parallel operations with

51、in a loop or straight-line code segment TLP explicitly represented by the use of multiple threads of execution that are inherently parallel Goal: Use multiple instruction streams to improve 1. Throughput of computers that run many programs 2. Execution time of multi-threaded programs TLP could be mo

52、re cost-effective to exploit than ILP 2/27/201235cs252-S12, Lecture11 Another Approach: Multithreaded Execution Multithreading: multiple threads to share the functional units of 1 processor via overlapping processor must duplicate independent state of each thread e.g., a separate copy of register fi

53、le, a separate PC, and for running independent programs, a separate page table memory shared through the virtual memory mechanisms, which already support multiple processes HW for fast thread switch; much faster than full process switch 100s to 1000s of clocks When switch? Alternate instruction per

54、thread (fine grain) When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain) 2/27/201236cs252-S12, Lecture11 Fine-Grained Multithreading Switches between threads on each instruction, causing the execution of multiples threads to be interleaved Usually done in

55、 a round-robin fashion, skipping any stalled threads CPU must be able to switch threads every clock Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls Disadvantage is it slows down execution of individual threads, since a thread

56、 ready to execute without stalls will be delayed by instructions from other threads Used on Suns Niagara (will see later) 2/27/201237cs252-S12, Lecture11 Course-Grained Multithreading Switches threads only on costly stalls, such as L2 cache misses Advantages Relieves need to have very fast thread-sw

57、itching Doesnt slow down thread, since instructions from other threads issued only when the thread encounters a costly stall Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs Since CPU issues instructions from 1 thread, when a stall occurs, the pi

58、peline must be emptied or frozen New thread must fill pipeline before instructions can complete Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill stall time Used in IBM AS/400, Alewife 2/27/201238cs252-S12, Lect

59、ure11 For most apps: most execution units lie idle From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism, ISCA 1995. For an 8-way superscalar. 2/27/201239cs252-S12, Lecture11 Do both ILP and TLP? TLP and ILP exploit two different kinds of parallel structure in

60、 a program Could a processor oriented at ILP to exploit TLP? functional units are often idle in data path designed for ILP because of either stalls or dependences in the code Could the TLP be used as a source of independent instructions that might keep the processor busy during stalls? Could TLP be

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

lec11-multithreading

文档简介

温馨提示

最新文档

评论

lec11-multithreading

文档简介

温馨提示

最新文档

评论

相关文档