超标量流水线 (Superscalar Pipelines) — lecture slides

Pipelining to Superscalar
Microprocessor Architecture and Design (微处理器结构与设计)

Forecast:
- Limits of pipelining
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

Limits of Pipelining
IBM RISC experience (P91; Tilak Agerwala and John Cocke, 1987): a fundamental limit.
- Control and data dependences add 15%: best-case CPI of 1.15, i.e. IPC of 0.87
- Deeper pipelines (higher frequency) magnify dependence penalties
- This analysis assumes 100% cache hit rates (the memory problem):
  - Hit rates approach 100% for some programs, but many important programs have much worse hit rates (more on this later)

Processor Performance (P17)
Performance = 1 / (Time/Program), where
Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle
- In the 1980s (the decade of pipelining), CPI went from 5.0 to 1.15
- In the 1990s (the decade of superscalar), CPI went from 1.15 to 0.5 (best case)

Amdahl's Law (P18)
- h = fraction of time in serial code
- f = fraction that is vectorizable
- v = speedup for f
- Overall speedup: Speedup = 1 / (1 - f + f/v)
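Using the definitions above (f and v are from the slide; the function name is an invention for illustration), the speedup can be sketched as:

```python
def amdahl_speedup(f, v):
    """Overall speedup when a fraction f of the work is sped up by v
    (the remaining 1 - f runs at the original speed)."""
    return 1.0 / ((1.0 - f) + f / v)

# Even with enormous v, speedup is capped by the serial part:
# f = 0.8 can never give more than 1 / (1 - 0.8) = 5x.
print(amdahl_speedup(0.8, 10))    # fast part sped up 10x
print(amdahl_speedup(0.8, 1e9))   # approaches the 5x ceiling
```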

Revisit Amdahl's Law
- Sequential bottleneck: even if v is infinite, performance is limited by the nonvectorizable portion (1 - f); in the limit, Speedup -> 1 / (1 - f)

Pipelined Performance Model (Harold Stone, 1987; P19)
- g = fraction of time the pipeline is filled
- 1 - g = fraction of time the pipeline is not filled (stalled)
- Three phases:
  1. Fill: N instructions enter the pipeline
  2. Full: assuming no stalls caused by pipeline hazards, this is the pipeline's peak performance
  3. Drain: no new instructions enter, and the instructions still in the pipeline finish executing

Tyranny of Amdahl's Law (Bob Colwell)
- When g is even slightly below 100%, a big performance hit results
- Stalled cycles are the key adversary and must be minimized as much as possible

Motivation for Superscalar (Agerwala and Cocke, P23)
- In the typical range, speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when s = 2 instead of s = 1 (scalar)

Superscalar Proposal
- Moderate the tyranny of Amdahl's Law: ease the sequential bottleneck
- More generally applicable, and robust (less sensitive to f)
- Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/N), where s is the speedup on the sequential portion

Limits on Instruction-Level Parallelism (ILP)
- Measured ILP varies with benchmarks, machine models, and cache latency
- Example: combining ops from two cache lines; if both lines hit, four ops can issue every cycle
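The jump quoted on the motivation slide can be checked numerically. Assuming the revised formula above, with s the issue rate on the sequential portion and N the speedup on the parallel portion (the function name is mine):

```python
def superscalar_speedup(f, n, s=1):
    """Revised Amdahl speedup: the serial fraction (1 - f) runs at
    issue rate s, the parallel fraction f at rate n."""
    return 1.0 / ((1.0 - f) / s + f / n)

base = superscalar_speedup(f=0.8, n=6, s=1)   # scalar pipeline
wide = superscalar_speedup(f=0.8, n=6, s=2)   # 2-issue superscalar
print(round(base, 1), round(wide, 1))         # 3.0 4.3
```

Doubling the sequential issue rate lifts the whole curve, which is why superscalar is less sensitive to f than pure vectorization.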

Issues in Decoding
- Primary tasks:
  - Identify individual instructions (!)
  - Determine instruction types
  - Determine dependences between instructions
- Two important factors: instruction set architecture and pipeline width
- RISC vs. CISC:
  - RISC: fixed length, regular format, easier
  - CISC: decode can take multiple stages (lots of work); on the P6, I$-to-decode is 5 cycles, and instructions are often translated into internal RISC-like uops (ROPs)

Decode Stage
- A superscalar processor is organized as an in-order issue front-end, an out-of-order core, and in-order retirement
- Instruction delivery: the fetch and decode stages need higher bandwidth than the execution stages
- The delivery task is to keep the instruction window full; the deeper the prefetch, the more instructions can be issued to the functional units
- Because of mispredicted branch paths, roughly 1.4x to 2x as many instructions are fetched and decoded as are ultimately committed
- Fetch width usually equals decode width

Decoding Variable-Length Instructions
- Fixed-length ISAs generally support multi-instruction fetch and decode directly
- Variable-length CISC instruction sets, such as the Intel x86 ISA, need a multistage decode:
  - First stage (delimiting): find the instruction boundaries in the byte stream and pass length-determined instructions to the second stage
  - Second stage: decode each instruction into one or more micro-ops
- AMD K-series (a complex CISC ISA): complex CISC instructions are split into micro-ops that resemble ordinary RISC instructions; a micro-op sequence may be a few simple instructions, or a stream built from simple instructions
- CISC compared with RISC: higher code density, but a more complex decode structure

Pentium Pro Fetch/Decode
- 16 B/cycle delivered from the I$ into a FIFO instruction buffer
- Decoder 0 is fully general; the remaining decoders handle only simple instructions
- Needs a good branch predictor
- Other option: predecode bits

Pre-decoding
- If the opcode permits, the fetch stage can analyze part of the operation and use it for prediction
- Predecoding is done as instructions are transferred from memory into the I-cache, which makes the decode stage simpler
- MIPS R10000: 32-bit instructions are predecoded into a 36-bit format stored in the instruction cache
  - The 4 extension bits indicate which functional unit will execute the instruction
  - The operand-select and destination-register-select fields of every instruction are rearranged into the same positions
  - The opcode is modified to simplify decoding of integer vs. floating-point destination registers
  - The decoder handles this extended format far faster than the original instruction format

Predecoding in the AMD K5
- K5: notoriously late and slow, but still interesting (AMD's first non-clone x86 processor)
- The I$ is 50% larger; predecode bits are generated as instructions are fetched from memory on a cache miss: a powerful principle in architecture, memoization!
- Predecode records the start and end of x86 ops, the number of ROPs, and the location of opcodes

Instruction Issue
- Issue: when an instruction enters a functional unit to execute
- Centralized reservation station: an efficient, shared resource, but it has scaling problems (later)

Distributed Reservation Stations
- Distributed, with localized control (easy win: break up based on data type, i.e. FP vs. integer)
- Less efficient utilization, but each unit is smaller since it can be single-ported
- Must tune for proper utilization: "must make 1000 little decisions (juggle 100 ping pong balls)"

Issues in Instruction Execution
- Current trends: more parallelism (bypassing becomes very challenging), deeper pipelines, more diversity
- Functional unit types: integer, floating point, load/store (the most difficult to make parallel), branch, specialized units (media)
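The K5-style idea, computing boundary information once on an I-cache fill and reusing it on every hit, can be sketched with the same toy opcode conventions (all names here are invented):

```python
# Predecode-on-fill: boundary bits are computed once per cache line and
# stored alongside it, so fetch never re-scans the bytes on a hit.
LENGTHS = {0x01: 1, 0x02: 2, 0x03: 3}            # toy opcode -> length

def predecode(line):
    """Mark which byte positions start an instruction (one bit per byte)."""
    starts = [False] * len(line)
    i = 0
    while i < len(line):
        starts[i] = True
        i += LENGTHS[line[i]]
    return starts

icache = {}                                      # addr -> (line, start bits)

def fetch(addr, memory):
    if addr not in icache:                       # miss: fill + predecode once
        line = memory[addr]
        icache[addr] = (line, predecode(line))   # memoized result
    return icache[addr]                          # hit: bits come for free

memory = {0x100: bytes([0x02, 0xAA, 0x01, 0x03, 0xBB, 0xCC])}
line, starts = fetch(0x100, memory)
print(starts)   # [True, False, True, True, False, False]
```

The trade-off matches the K5 slide: the cached line grows (here, one extra bit per byte) in exchange for a much simpler per-fetch decode.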

Bypass Networks
- O(n²) interconnect from/to the functional-unit inputs and outputs
- Associative tag-match to find operands
- Solutions (hurt IPC, help cycle time):
  - Use the register file only, with no bypass network (IBM Power4)
  - Decompose into clusters (Alpha 21264)

Specialized Units: Staggered Adders
- Intel Pentium 4 staggered adders ("Fireball") run at 2x the clock frequency
- Two 16-bit bitslices; dependent ops execute on half-cycle boundaries
- The full result is not available until a full cycle later
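A functional sketch of the staggered bitslice add (this models only the dataflow, not the circuit timing; the function name is mine): the low 16 bits are produced first, and the carry is handed to the high half a half-cycle later.

```python
MASK16 = 0xFFFF

def staggered_add32(a, b):
    """32-bit add as two dependent 16-bit slices.
    Half-cycle 1: low slice; half-cycle 2: high slice plus carry."""
    lo = (a & MASK16) + (b & MASK16)
    carry = lo >> 16
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    return ((hi & MASK16) << 16) | (lo & MASK16)

# A dependent op can start consuming the low half as soon as the first
# slice is done; it never has to wait for the full 32-bit result.
print(hex(staggered_add32(0x0001FFFF, 0x00000001)))  # 0x20000
```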

Specialized Units: Fused Multiply-Add
- FP multiply-accumulate: R = (A x B) + C doubles the FLOPs per instruction
- Loses RISC instruction-format symmetry: 3 source operands
- Widely used: IBM POWER/PowerPC has FMA (or MAF), with 3 source operands (a loss of regularity in the ISA); MIPS R8000 also had it
- MIPS R10000 (OOO) gave up on it: decode cracks an FMA into an M op and an A op

Media Data Types
- Subword-parallel vector extensions: media data (pixels, quantized samples) are often 1-2 bytes
- Several operands are packed in a single 32/64-bit register, e.g. a,b,c,d and e,f,g,h stored in two 32-bit registers
- Vector instructions operate on 4/8 operands in parallel
- New instructions, e.g. motion estimation: me = |a - e| + |b - f| + |c - g| + |d - h|
- Substantial throughput improvement, but usually requires hand-coding of critical loops

Media Processors and Multimedia Units
- Use subword parallelism based on single-instruction, multiple-data (SIMD) instructions: several small data items are processed, and several results produced, in a single cycle (e.g. four 16-bit computations)
- Multimedia units provide SIMD instructions, saturation arithmetic, and additional arithmetic instructions, e.g. masking and selection, reordering, and conversion

Issues in Completion/Retirement
- Out-of-order execution: ALU instructions, load/store instructions
- In-order completion/retirement: precise exceptions; memory coherence and consistency
- Solutions: reorder buffer, store buffer, load-queue snooping (later)

A Dynamic Superscalar Processor: Superscalar Overview
- Instruction flow: branches, jumps, calls (predict target and direction); fetch alignment; instruction cache misses
- Register data flow: register renaming (RAW/WAR/WAW)
- Memory data flow: in-order stores (WAR/WAW); store queue (RAW); data cache misses

Superscalar vs. VLIW: Technical Characteristics
- Superscalar machines are distinguished by their ability to (dynamically) issue multiple instructions each clock cycle from a conventional linear instruction stream
- VLIW processors use a long instruction word that contains a usually fixed number of instructions, which are fetched, decoded, issued, and executed synchronously

Superscalar vs. VLIW: Instruction Stream and Scheduling
- Superscalar: instructions are issued from a sequential stream of normal instructions, and the issued instructions are scheduled dynamically by the hardware
- VLIW: a sequential stream of instruction tuples (instruction groups or packets) is used, and the processor relies on static scheduling by the compiler

Superscalar vs. VLIW: Instructions Issued per Cycle
- Superscalar: more than one instruction can be issued each cycle (motivating the term superscalar instead of scalar); the number of issued instructions is determined dynamically by hardware, so the actual number issued in a single cycle ranges from zero up to the maximum instruction issue bandwidth
- VLIW: the number of scheduled instructions is fixed, because instruction slots are padded with no-ops whenever the full issue bandwidth cannot be met

Superscalar vs. VLIW: Instruction Scheduling
- Dynamic issue in a superscalar processor can be in-order, or it can allow issue out of program order; only in-order issue is possible with VLIW processors
- Dynamic instruction issue complicates the hardware scheduler of a superscalar processor compared with a VLIW
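The no-op padding contrast above can be made concrete; the three-slot bundle format and the instruction strings are invented for illustration:

```python
# Toy VLIW packing: each cycle's bundle has a fixed 3 slots
# (e.g. alu / mem / branch); empty slots are padded with no-ops.
SLOTS = ("alu", "mem", "branch")

def pack_bundles(schedule):
    """schedule: list of per-cycle dicts mapping slot -> instruction."""
    bundles = []
    for cycle in schedule:
        bundles.append(tuple(cycle.get(s, "nop") for s in SLOTS))
    return bundles

schedule = [
    {"alu": "add r1,r2,r3", "mem": "ld r4,0(r5)"},
    {"alu": "sub r6,r1,r4"},             # nothing for mem or branch
]
bundles = pack_bundles(schedule)
for b in bundles:
    print(b)

# A superscalar would fetch the 3 real instructions as a plain
# sequential stream; here 3 of the 6 issued slots are no-op padding.
nops = sum(op == "nop" for b in bundles for op in b)
print(nops)   # 3
```

This is also the mechanical source of the VLIW code-bloat problem mentioned later: the padding is stored in the binary, not synthesized at run time.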

Superscalar vs. VLIW: Scheduling Complexity
- Scheduler complexity increases when multiple instructions are issued out of order from a large instruction window
- Superscalar presumes that multiple functional units are available; the number of FUs is at least the maximum issue bandwidth, and often higher, to reduce potential resource conflicts
- The superscalar technique is a microarchitecture technique, not an architecture technique

Advantages of Superscalar
- Superscalar preserves full binary code compatibility
- The alternative techniques all require either a brand-new instruction encoding or compilation with special compilers or software tools; in an era when software matters this much, compatibility is superscalar's strongest card

Limitations of Superscalar
- Exploitation of more ILP is limited, mainly by the control dependences in programs: building an accurate dynamic instruction window large enough (roughly 200-500 instructions) to supply sufficient ILP is all but impossible
- Implementation complexity grows rapidly with issue width; control complexity in particular grows roughly quadratically. This growth not only consumes large amounts of resources but can stretch the whole cycle time and hurt overall performance
- Utilization of the issue bandwidth and functional units is low, generally only around 20%, an enormous waste of compute resources
- The ever-growing processor-memory performance gap severely affects superscalar performance

Problems Superscalar Must Solve
- Taming control complexity; the key is the complexity of the issue logic, which checks inter-instruction dependences, and of the out-of-order machinery that preserves the program's sequential semantics

- Dynamically extracting more ILP from the program; this requires support from more sophisticated branch handling, cache techniques, and so on
- Using compiler support, or combining with compilation techniques, to raise the fraction of instructions in the dynamic window that can execute in parallel, possibly enabling even wider superscalar execution

Advantages and Advances of VLIW
- VLIW's greatest advantage is its potential to exploit more ILP with simple hardware, which matches the trend of VLSI
- VLIW advances:
  - Breaking the restriction to a single control flow: handling several control flows at once raises the available ILP
  - Tree-VLIW techniques can handle multiway branches
  - Dynamic code translation (binary translation) can solve the binary-compatibility problem to some extent

Limitations of VLIW
- VLIW's exploitation of ILP is fundamentally static, so it copes poorly with the many dynamic uncertainties in programs, such as the dynamic behavior of branches and variable memory latency. VLIW compilers rely heavily on prediction, but when actual execution diverges from the prediction, the machine must fall back on interlocks to preserve correct program semantics, at a large performance cost. When program behavior is poorly predictable, VLIW performance therefore drops sharply; this inflexibility is the fundamental limit on its performance
- Code compatibility remains a major problem: VLIW code is incompatible not only with existing architectures but also across VLIW generations, which is the biggest obstacle to VLIW adoption; dynamic binary translation causes severe code bloat and is not an ideal technique
- Low resource utilization is also severe in VLIW: a waste of resources and a major constraint on performance gains

Architecture and ISA
- The architecture of a processor is defined as its instruction set architecture (ISA)
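Looping back to the media-unit slides: the subword motion-estimation operation and the saturating arithmetic described there can be sketched as follows (the packing layout and function names are assumptions, not a real ISA):

```python
def unpack8(reg32):
    """Split a 32-bit register into four packed 8-bit subwords."""
    return [(reg32 >> (8 * i)) & 0xFF for i in range(4)]

def sad(reg_a, reg_e):
    """Motion estimation me = |a-e| + |b-f| + |c-g| + |d-h|,
    computed over all four subword pairs of the two registers."""
    return sum(abs(x - y) for x, y in zip(unpack8(reg_a), unpack8(reg_e)))

def sat_add8(x, y):
    """Saturating 8-bit add: clamp at 255 instead of wrapping around."""
    return min(x + y, 255)

a = 0x04030201              # packs the subwords 1, 2, 3, 4
e = 0x01010101              # packs the subwords 1, 1, 1, 1
print(sad(a, e))            # |1-1| + |2-1| + |3-1| + |4-1| = 6
print(sat_add8(200, 100))   # 255, not the wrapped (300 & 0xFF) = 44
```

A real multimedia unit does the four subword operations in one cycle on one register pair; the Python loop only models the arithmetic, not the parallelism.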
