计算机体系结构PPT教学课件-第三章流水线.ppt

上传人：伐*** IP属地：宁夏上传时间：2019-04-07 格式：PPT 页数：142 大小：2.40MB 积分：20 举报 版权申诉

已阅读5页，还剩137页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

computer architecture -a quantitative approach,计算机体系结构,计算机体系结构,chapter 3 pipelining basic and intermediate concepts 流水线：基本概念,王奕 estelle.ywanggmail. com,a. 流水线基本定义,what is pipelining? (什么是流水线) how is the pipelining implemented? (流水线是怎么实现的) what makes pipelining hard to implement? (流水线实现的困难是什么) basic methods improving pipelining (提高流水线性能的一些基本方法),what is pipelining ?,pipelining(流水线定义): “a technique designed into some computers to increase speed by starting the execution of one instruction before completing the previous one.” -modern english-chinese dictionary implementation technique whereby different instructions are overlapped in execution at the same time.（多条指令同时重叠执行） implementation technique to make fast cpus,why pipelining: its nature(流水线的本质),laundry(洗衣店) ann, brian, cathy, dave each have one load of clothes to wash, dry, and fold washer(洗衣机) takes 30 minutes dryer(干衣机) takes 40 minutes “folder”(叠衣人) takes 20 minutes,顺序执行的洗衣,sequential laundry takes 6 hours for 4 loads if they learned pipelining, how long would laundry take?,t a s k o r d e r,流水线的洗衣 -开始工作越快越好,pipelined laundry takes 3.5 hours for 4 loads,why pipelining : overlapped(重叠),latches(锁存器), called pipeline registers break up computation into 5 stages deal 5 tasks at the same time.,only deal one task each time. this task takes “ such a long time” ns: nanosecond 纳秒即十亿分之一秒,why pipelining: more faster,can “launch(开始)” a new computation every 100ns in this structure can finish 107 computations per second,can launch a new computation every 20ns in pipelined structure can finish 5107 computations per second,why pipelining : conclusion,the key implementation technique used to make fast cpu: decrease cputime.(减少cpu时间的关键技术) improving of throughput ( rather than individual execution time) (提高吞吐量，而不是单条指令的执行时间) improving of efficiency for resources (functional unit) (提高了资源的有限利用率),what is a pipeline ?,a pipeline is like an auto assemble line a pipeline has many stages(流水线有多个阶段称流水级或流水节拍) each stage carries out a different part of instruction or operation (每一个流水线完成指令操作的不同部分) the stages, which cooperates at a synchronized clock, are connected to form a pipe (各个流水级在同步时钟的控制下连成一条流水线) an instruction or operation enters through one end and progresses through the stages and exit through the other end (一条指令从一流水线的一端进入从另一端出来，完成整条指令的操作) pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream (流水线挖掘了指令流中的并行性),吞吐量,throughput (吞吐量) number of items (cars, instructions) exiting the pipeline per unit time. the throughput of the assembly line is the number of products completed per hour. the throughput of a cpu pipeline is the number of instructions completed per second. (单位时间内完成的指令数),流水线特征 - 时钟周期 vs. 机器周期,clock cycle(时钟周期) everything in a cpu moves in lockstep(n.步伐一致), synchronized by the clock ( the heartbeat of cpu ) machine cycle (processor cycle, stage time) time required to complete a single pipeline stage. the pipeline designers goal is to balance the length of each pipeline stage. (机器周期-完成一个流水级所需要的时间) in many instances, machine cycle = max ( times for all stages). (机器周期=各个流水级完成时间的最大值) a machine cycle is usually one, sometimes two, clock cycles long, but rarely more. (一般是一到两个时间周期),流水线的描述,spatio-temporal(时空的) chart relation of pipeline and tasks in sequence,fill,balanced,drain,4 n1,流水节拍,时间，以机器周期为单位,流水线的理想性能,speedup assume:stages:k tasks:n tk(k+(n-1)p 流水线实现所需要的时钟数 t1 n kup 非流水线实现所需要的时钟数,n speedupk,k级流水线完成n个任务（指令）的加速比,流水线的理想性能,if the stages are perfectly balanced, the time per instruction on the pipelined processor equal to(在理想状态下，流水线中每条指令所需的时间为) time per instruction on unpipelined machine number of pipe stages so, ideal speedup equal to number of pipe stages. （理想状态下，加速比等于流水线的级数）,how mips instruction set is implemented without pipelining ?,five phases （mips指令执行的五个阶段-即五个流水级） if: instruction fetch cycle （取指令周期） id: instruction decode/ register fetch cycle （指令解码/寄存器取数据周期） ex: execution/ effective address cycle （指令执行/有效地址计算周期） memory reference r-r/ r-i alu branch mem: memory access （存储器访问周期） wb: write-back cycle （写结果周期）,单周期实现,seldom used !,多周期实现,流水线的mips指令集,since there are five separate stages, we can have a pipeline in which one instruction is in each stage. cpi is decreased to 1, since one instruction will be issued (or finished) each cycle. (cpi为1，每一周期流出一条指令) during any cycle, one instruction is present in each stage. (在任何时候，每一个流水级里都有一条指令在执行) ideally, performance is increased five fold !,pipeline timing chart,5-stage version of mips datapath(数据通路),how pipelining decrease the execution time ?(流水线如何减少执行时间),if your starting point is (如果你的出发点是.) a single clock cycle per instruction machine then(单周期每指令的机器，则) pipelining decreases cycle time. (流水线减少周期时间) a multiple clock cycle per instruction machine then (多周期每指令的机器，则) pipelining decreases cpi. (流水线减少每条指令的平均周期数即cpi),单周期vs.流水线,single cycle implementation: cpi=1, long clock cycle,clk,load,store,waste,cycle 1,cycle 2,pipeline implementation: cpi=1, clock cycle long clock cycle/5,多周期 vs. 流水线,load,store,r-type,multip-cycle implementation: cpi=5,pipeline implementation: cpi=1,what we knew about pipeline,pipelining implementation technique to execute instructions in a overlapped way to make fast cpus(decrease cputime, improve throughput) (流水线通过让指令重叠执行的方式提高了吞吐量) ideal speedup of pipeline equal to -number of pipe stages(理想状态下流水线加速比是流水级数) if the starting point is a multiple clock cycle per instruction machine then pipelining decreases cpi. (在多周期指令计算机中，可以认为流水线降低了cpi),works in the mips 5 stage pipeline,if (instruction fetch cycle) irmempc; npc pc=pc+4; id (instruction decode/register fetch cycle) a regsrs; b regsrt; imm sign-extended immediate field of ir; note: the first two stages of mips pipeline do the same functions for all kinds of instructions.,the third stage of mips pipeline,ex (execution/effective address cycle) memory reference: aluoutput a+imm register-register alu instruction: aluoutput a func b; register-immediate alu instruction: aluoutput a op imm; branch: aluoutput npc+(imm 2 ); cond (a=0),the last two stages of mips pipeline,mem(memory acces/branch completion cycle) memory reference(load或store指令起作用): lmd memaluoutput or (注：lmdload memory data register) memaluoutput b branch: if (cond) pc aluoutput wb (write back cycle) register-register alu instruction regsrd aluoutput; register-immediate alu instruction regsrt aluoutput; load instruction: regsrt lmd;,table: events on every stage,the mips pipelining,problems that pipelining introduces,focus: no different operations with the same data path resource on the same clock cycle. (structure hazard)(在同一时钟周期里，在同一资源上不会发生多个操作) there is conflict(冲突) about the memory !,separate instruction and data memories,use split instruction and data cache the memory system must deliver 5 times the bandwidth over the unpipelined version.,the conflict about the registers !,sometimes we can redesign the resource,allow write-then-read in one clock cycle (double pump-前半拍寄存器组写入，后半拍寄存器组读出) two reads and one write required per clock. need to provide two read port and one write port. what happens when a read and a write occur to the same register ? ( data hazard ),conflict occurs when pc update,must increment and store the pc every clock. what happens when meet a branch ? branches change the value of the pc - but the condition is not evaluated until id ! if the branch is taken, the instructions fetched behind the branch are invalid ! this is clearly a serious problem ( control hazard ) that needs to be addressed.,pipeline hazard(流水线竞争): the major hurdle(流水线的主要障碍),a hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage taxonomy of hazard(流水线竞争的分类) structural hazards (结构竞争) these are conflicts over hardware resources. (硬件资料不够用导致的竞争) data hazards (数据竞争) instruction depends on result of prior computation which is not ready (computed or stored) yet (指令之间的相互依赖性导致的竞争) control hazards (控制竞争) branch condition and the branch pc are not available in time to fetch an instruction on the next clock (转移条件或转移地址未能及时知道而导致的竞争),hazards can always be resolved by stall(停顿),the simplest way to “fix” hazards is to stall the pipeline. （处理竞争最简单的办法是暂停流水线的执行） stall means suspending the pipeline for some instructions by one or more clock cycles. the stall delays all instructions issued after the instruction that was stalled, while other instructions in the pipeline go on proceeding. (停顿指令之后的所有指令都被暂停，而其之前的指令继续执行) a pipeline stall is also called a pipeline bubble(流水汽泡) or simply bubble. no new instructions are fetched during a stall . (停顿时，不会再取新的指令),performance of pipeline with stalls,pipeline stalls decrease performance from the ideal (流水线停顿降低性能) recall the speedup formula:,assumptions for calculation,the ideal cpi on a pipelined processor is almost always 1. so ignore the overhead of pipelining clock cycle. pipe stages are ideal balanced. two different implementation single-cycle implementation multi-cycle implementation,case of single-cycle implementation,cpi unpipelined = 1, clock cycle pipelined =,clock cycle unpipelined pipeline depth,case of multi-cycle implementation,clock cycle unpipelined = clock cycle pipelining cpl unpipelined = pipeline depth,structural hazard: pipe stage contention(争夺),structural hazards(结构竞争) occurs when two or more instructions want to use the same hardware resource in the same cycle(多条指令同时需要使用相同硬件资源时产生) causes bubble (stall) in pipelined machines(引起停顿) overcome by replicating hardware resources multiple accesses to the register file multiple accesses to memory some functional unit is not fully pipelined.,multi access to the register file,simply insert a stall , speedup will be decreased. we have resolved it with “ double bump”,double bump works !,multi access to single memory port,insert stall provide another memory port split instruction memory and data memory use instruction buffer,insert stall,i n s t r. o r d e r,time (clock cycles),ld/st,instr 1,instr 2,instr 3,reg,mem,reg,stall,split instruction and data memory,split instruction and data memory / multiple memory port / instruction buffer means: fetch the instruction and data inference using different hardware resources.,not fully pipelined function unit : may cause structural hazard,machine without structural hazards will always have a lower cpi,example (how much the load structural hazard might cost) data reference constitute 40% of the mix ideal cpi ignoring the structural hazard is 1 the processor with the structural hazard has a clock rate that is 1.05 times higher than that of a processor without structural hazard. answer average instruction time = cpiclock cycle time =(1+0.4 1) ccideal/1.05 = 1.3 ccideal clearly, the processor without the structural hazard is faster.,why allow machine with structural hazard ?,the primary reason is to reduce cost . (降低成本) i.e. adding split caches, requires twice the memory bandwidth. also fully pipelined floating point units costs lots of gates. it is not worth the cost if the hazard does not occur very often. to reduce latency of the unit. (减少部件延迟) making functional units pipelined adds delay an unpipelined version may require fewer clocks per operation.,example: impact of structural hazard to performance,example many machines have unpipelined float-point multiplier. the function unit time of fp multiplier is 6 clock cycles fp multiply has a frequency of 14% in a specfp benchmark will the structural hzard have a large performance impact on the specfp benchmark?,answer to the example,in the best case: fp multiplies are distributed uniformly. there is one multiply in every 7 clock. 1/14% then there will be no structural hazard,then there is no performance penalty at all. in the worst case: the multiplies are all clustered with no intervening instructions. then every multiply instruction have to stall 5 clock cycles to wait for the multiplier be released. the cpi will increase 70% to 1.7(注:1+5*0.14), if the ideal cpi is 1. experiment result: this structural hazard increase execution time by less than 3%.,taxonomy of hazards -structural hazards these are conflicts over hardware resources. ok, maybe add extra hardware resources according to amdahls law; or full pipelined the functional units; otherwise still have to stall -data hazards instruction depends on result of prior computation which is not ready (computed or stored) yet -control hazards branch condition and the branch pc are not available in time to fetch an instruction on the next clock,summary of structural hazard,pipelining hazards,taxonomy of hazards structural hazards these are conflicts over hardware resources. data hazards instruction depends on result of prior computation which is not ready (computed or stored) yet control hazards branch condition and the branch pc are not available in time to fetch an instruction on the next clock,data hazard(数据竞争),data hazards occur when the pipeline changes the order of read/write accesses to operands comparing with that in sequential executing .(当流水导致读写操作数的顺序发生改变时可能发生此竞争) lets see an example dadd r1, r1, r3 dsub r4, r1, r5 and r6, r1, r7 or r8, r1, r9 xor r10, r1, r11,data hazard(数据竞争),basic structure(数据竞争的基本情形) an instruction in flight(在执行过程当中) wants to use a data value thats not “done” yet “done” means “its been computed” and “its located where i would normally expect to go look in the pipe hardware to find it” basic cause (数据竞争产生的基本原因) you are used to assuming a purely sequential model of instruction execution instruction n finishes before instruction n+k, for k = 1 there are dependencies now between “nearby” instructions (“near” in sequential order of fetch from memory),coping with data hazards:example,somecases “double bump” can do !,how do we stall ? insert nop by compiler(编译时插入空操作指令),how do we stall? add hardware interlock ! (采用流水线内锁电路),add extra hardware to detect stall situations watches the instruction field bits (监视指令操作数域) looks for “read versus write” conflicts in particular pipe stages (查找指令间的“读写”冲突) add extra hardware to push bubbles thru pipe actually, relatively easy can just let the instruction you want to stall go forward through the pipe(让要停顿的指令继续执行) but, turn off the bits that allow any results to get written into the machine state(但不允许其执行结果保存下来) so, the instruction “executes” (it does the work), but doesnt “save”(“执行”但不“保存”),interlock: insert stalls,empty slots in the pipe called bubbles; means no real instruction work getting saved here,how the interlock is implementated ?,forwarding: (提前又称旁路技术) reduce data hazard stalls,if the result you need does not exist at all yet, you are out of luck, sorry.(如果你需要的结果值不存在，那是没办法的) but, what if the result exists, but is not stored back yet? (如果需要的结果值存在但还没存放到相应的位置上，怎么办呢) instead of stalling until the result is stored back in its “natural” home grab the result “on the fly” from “inside” the pipe, and send it to the other instruction (another pipe stage) that wants to use it (把结果值从当前位置上取出来，发送给目前需要使用这个值的指令),forwarding,generic name: forwarding ( bypass, short-circuiting)(常用的名称-提前电路，旁路电路，短路电路) instead of waiting to store the result, we forward it immediately (more or less) to the instruction that wants it(把结果立即转发给需要它的指令,而不是等待直至其放到相应的位置上) mechanically, we add buses to the data path to move these values (在实现上，在数据通路上增加一些传输总线) and these buses always “point backwards” in the datapath, from later stages to earlier stages (这些总线总是往回指，从后面流水节拍往前面的节拍指),forwarding: reduce data hazard stalls,data may be already computed - just not in the register file,ex/mem.aluoutput alu input port mem/wb.aluoutput alu input port,hardware change for forwarding 旁路技术的硬件实现,forwarding path to other input entry,mem/wb.lmd dm input,forwarding isnt always available,so we have to insert stall: load stall,time (clock cycles),i n s t r. o r d e r,lw r1, 0(r2),sub r4,r1,r6,and r6,r1,r7,or r8,r1,r9,example of forwarding and load delay,why forwarding? add r4, r5, r2 lw r15, 0(r4) sw r15, 4(r2) why load delay? add r4, r5, r2 lw r15, 0(r4) sw r15, 4(r2),solution (without forwarding),solution (with forwarding),instruction reordering by compiler to avoid load stall,try producing fast code for a = b + c; d = e f; assuming a, b, c, d ,e, and f in memory. slow code: lw rb,b lw rc,c add ra,rb,rc sw a,ra lw re,e lw rf,f sub rd,re,rf sw d,rd,fast code: lw rb,b lw rc,c lw re,e add ra,rb,rc lw rf,f sw a,ra sub rd,re,rf sw d,rd,summary of data hazard,taxonomy of hazards structural hazards these are conflicts over hardware resources. data hazards instruction depends on result of prior computation which is not ready (computed or stored) yet ok, we did these, double bump, forwarding path, software scheduling, otherwise have to stall control hazards branch condition and the branch pc are not available in time to fetch an instruction on the next clock,the control hazard（控制竞争）,cause(产生原因) branch condition and the branch pc are not available in time to fetch an instruction on the next clock(转移条件及转移地址未能及时知晓) the next pc takes time to compute(转移地址计算要花时间) for conditional branches, the branch direction takes time to compute.(对于条件转移，转移条件的计算要也花时间) control hazards can cause a greater and greater performance loss for mips pipeline than do data hazards. (控制竞争会导致流水线性能的大量损失!),example: branches,branches of basic pipelined datapath,control hazard,b=3,dealing with the control hazard,four simple solutions（四种简单的解决办法） freeze or flush the pipeline（冻结或冲洗流水线） predict-not-taken (predict-untaken) （预测转移不成功） treat every branch as not taken predict-taken （预测转移成功） treat every branch as taken delayed branch （延迟转移技术）,solve the hazard by inserting stalls,48 or 72,the pipeline status,flushing the pipeline(流水线冲洗),simplest hardware: holding or deleting any instruction after branch until the branch destination is know. (保持/删除转移指令之后所取的指令，直至转移目标确定) penalty is fixed.(损失固定) can not be r

人人文库> 全部分类> 专业文献 > 工程机械

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

计算机体系结构PPT教学课件-第三章流水线.ppt

文档简介

温馨提示

最新文档

评论

计算机体系结构PPT教学课件-第三章流水线.ppt

文档简介

温馨提示

最新文档

评论

相关文档