计算机系统结构02时间并行指令级

上传人：我*** IP属地：北京上传时间：2022-02-15 格式：DOCX 页数：49 大小：1.30MB 积分：9.6 举报 版权申诉

已阅读5页，还剩44页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、计算机系统结构（时间指令级并行）邮电大学计算机学院邝坚2009年2月从指令重叠到流水线nT = å(t取指令i + t分析i + t执行i）i=1如果每段时间均为t，则执行n条指令所用的时间为：T3nt优点：简单，成本低；缺点：执行速度慢，部件利用率低。取指令k分析k执行k取指令k+1分析k+1执行k+1从指令重叠到流水线取指k分析k执行k取指k+1分析k+1执行k+1取指k+2分析k+2执行k+2T(2n + 1) tT(n + 2) t优点和缺点？流水线的表示n 连接图 - 逻辑输入输出123456保存结果执行取操作数形成操作数地址译码取指令取指k分析k执行k取指k+1分析k+

2、1执行k+1取指k+2分析k+2执行k+2流水线的表示时空图-时间空间规格化尾数加对阶求阶差0t1t2t3t4t5t6t7t8流水线的锁存器ClockInput指令分析器分析k+1流水锁存器指令执行部件执行k流水锁存器t1t2规格化1规格化2规格化3规格化4规格化5尾数加1尾数加2尾数加3尾数加4尾数加5对阶1对阶2对阶3对阶4对阶5求阶差1求阶差2求阶差3求阶差4求阶差5各段执行时间不相等的流水线空间0t1t2t3t4t5t6t7t8流水线的特点n 各功能段的时间尽量相等n 执行时间最长的n 效率瓶颈n 连续不断地使用（同一种功能）n 装入时间第一个任务从进入到流出n 排空时间最后一个

3、任务从进入到流出12123123123流水线的性能指标n 吞吐率（Throughput）时间流水线所完成的任务数量，或输出结果的数量空间S4 S3 S2S1流水线的性能指标n 计算流水线吞吐率的最基本公式：nTkP =n 各段执行时间相等，输入连续任务情况下：完成n个连续任务需要的总时间为：Tk（kn1）Dt其中：k 为流水线的，Dt为时钟周期。nn 吞吐率为：P =(k + n -1)Dt123n-1n123n-1n123n-1n123n-1Nk·Dt（n1）Dtn·Dt(k1)DtT流水线的性能指标n 最大吞吐率为：n1=P max = Limn ®

4、65;(k + n - 1)DtDtn 各段执行时间不相等，输入连续任务n 吞吐率：P =nkåi =11, D t 2,× × ×, Dtk )n 最大吞吐率：1P max =max( D t1, D t 2,× × ×, D tk )流水线的性能指标n 解决“瓶颈”输入输出DtDtDtDtDtDt输入输出Dt1DtDt3DtDt4DtS2-1Dt23DtS4S3S2-2S1S2-3S4S3S2-3S2-2S2-1S1流水线的性能指标比（Speedup）n 统一任务，不使用流水线所使用时间与使用流水线所用时间比顺序执行时间

5、T 0S =流水线执行时间Tkn 各段执行时间相等，输入连续任务k × n × Dt(k + n - 1)Dtk × nk + n -1比：S =n 最大比为：k × nS max = Limn®¥= kk + n - 1流水线的性能指标比 S10加8速k=10 段6k=6 段比4211248163264128n任务个数流水线的性能指标n 各段执行时间不相等，输入连续任务情况下，实际比：kn × åDtii =1S =kåi =1Dt1, Dt2,×× ×, Dtk）流

6、水线的性能指标n 效率（Efficiency）E = n个任务任务占用有效对应总面积n 各流水段执行时间相等，输入n个连续任务n 流水线的效率为：E =k × n × Dt=nk × (k + n - 1) × Dtk + n - 1n 流水线的最高效率为：nE max = Limn®¥= 1k + n - 1流水线的性能指标n 流水线的吞吐率、比与效率的：S =k × nE =nP =nk + n - 1k + n - 1(k + n -1)DtEP·Dt，Sk·E最佳流水最佳值(k0)流水线k性能价格

7、比 PCR峰值流水线的相关性分析Dependences/Hazards参考模型Pipeline HazardsHazard that prevent the next instruction in the instruction stream from executing during its designated clock cycle.Structural hazardsData hazards Control hazards结构相关Arise from resources when the HWcannot support all possible combination of inst

8、ructions simultaneously in overlapped execution.某些指令组合在流水线重叠执行过程中，如果硬件资源满足不了指令重叠执行的要求，便会产生资源n 在同一个时钟周期用同一个功能部件结构相关n 指令与数据的分离（分时）存取处理：n 分离cache、双端口RAM等数据相关n An instruction j is data dependence on instruction i if either of the following holds:n instruction i produce a result that may be used by instr

9、uction j, orn instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.n 指令之间数据供求水机制所表现的上的依赖性，因流。I(LOAD)IFIDEXMEWBI+1IFIDEXMEWBI+2IFIDEXMEWBI+3stallIFIDEXMEWBI(LOAD)IFIDEXMEWBI+1IFIDEXMEWBI+2IFIDEXMEWBI+3IFIDEXMEWB数据相关举例LOADADD STORER1, 0(R10)R3, R1, R2

10、R3, 5(R10)数据相关类型n RAW(n j tries toafter write) 写后读a source before i write itn be the most common casen WAW(write after write) 写后写n j tries to write a operand before it is written by In Example?n WAR(write after) 读后写n j tries to write a destination before it isn Example?by ILOADIFIDEXMEWB写R1ADDIFID读R

11、1EXMEWB写R3STOREIFID读R3EXMEWB数据相关的典型情况Forwardingn ADD的运算结果实际在EX段的末尾已形成，只是到WB段才写入R1ADD R1,R2,R3IFIDEXMEWBSUB R4,R1,R5IFIDEXMEWBAND R6,R1,R7IFIDEXMEWBOR R8,R1,R9IFIDEXMEWBXOR R10,R1,R11IFIDEXMEWBADD R1,R2,R3IFIDEXMEWBSUB R4,R1,R5IFIDEXMEWBAND R6,R1,R7IFIDEXMEWBOR R8,R1,R9IFIDEXMEWBXOR R10,R1,R11IFIDEXM

12、EWBBypass/ short-circuitingPipeline interlockLOAD R1,0(R2)IFIDEXEXWBADD R4,R1,R7IFIDstallEXMEWBSUB R6,R1,R8IFstallIDEXMEWBOR R3,R1,R9stallIFIDEXMELOAD R1,0(R2)IFIDEXMEWBADD R4,R1,R7IFIDEXMEWBSUB R6,R1,R8IFIDEXMEWBOR R3,R1,R9IFIDEXMEWBControl Dependences相关是因为程序执行转移类指令而引起的相关。if p1 s1;if p2 s2;n s1 is

13、 control dependent on p1,and s2 is control dependent on p2 but not on p1.Control Dependences1. An instruction that is dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.2. An instruction that is not control dependent on a branch cannot

14、 be moved after the branch so that its execution is controlled by the branch.Control (Branch) hazards局部相关如数据相关，仅仅影响到本指令的附近少数指令；全局相关如相关，其影响的范围要大得多，流水线性能损失更大。Control (Branch) hazardsPC normally not changed until the end of ID 计算转移目标地址放在ID段，需要在ID段增加一个加法器。BRANCHIFIDEXMEWBi+1stallIFIDEXMEWBzeBRANCHIFID

15、EXMEWBi+1stallstallstallIFIDEXMEWBControl (Branch) hazardsn Treat every branch as not taken, orn Treat every branch as not takenControl (Branch) hazardsn Delayed branch a SW method.branch instsequential successor branch target if takenn Sequential successor instruction is executed whether or not the

16、 branch is takenBranch delay slotTaken BranchIFIDEXMEWBi+1IFtargetIFIDEXMEWBt+1IFIDEXMEWBUntaken BranchIFIDEXMEWBi+1IFIDEXMEWBi+2IFIDEXMEWBi+3IFIDEXMEWBControl (Branch) hazardsControl (Branch) hazardsTaken BranchIFIDEXMEWBBranch delayIFIDEXMEWBtargetIFIDEXMEWBt+1IFIDEXMEWBUntaken BranchIFIDEXMEWBBra

17、nch delayIFIDEXMEWBi+2IFIDEXMEWBi+3IFIDEXMEWBControl (Branch) hazardsn 两个指令缓冲栈Control (Branch) hazardsn 改进循环Branch process in PowerPCPredict By OpcodePipelining in MIPS 5 stagesTypical 5 stages pipelining - MIPSIF (Instruction Fetch cycle)IR ¬ MemPC; NPC ¬ PC + 4;ID (Instruction Decode/reg

18、ister fetch cycle)A ¬ Regsrs; B ¬ Regsrt;Imm ¬ sign-extend immediate field of IR;EX (EXecution/effective address cycle)1.2.3.Memory reference: ALUOutput ¬ A + Imm; (form effective address)Register-Register: ALUOutput ¬ A func B; Register-Immediate: ALUOutput ¬ A op Imm;

19、 Branch:ALUOutput ¬ NPC + (Imm << 2); (word offset)Cond ¬ (A = 0) (branch condition)Typical 5 stages pipelining - MIPS4.MEM (MEMory access/branch completion cycle)Memory reference: (Load/Store) LMD ¬ MemALUOutput or MemALUOutput ¬ BBranch: if (cond) PC ¬ ALUOutput;5.WB

20、 (Write-Back cycle) Register-Register: Regsrd ¬ ALUOutput; Register-Immediate: Regsrt ¬ ALUOutput; Load instructure: Regsrt ¬ LMDBasic Pipeline for MIPSForwardingID zero test for branchDynamic HW predictionn Branch-prediction buffer or branch history tableDynamic HW predictionn Branch

21、 target bufferFrom in-order to out-of-ordern In-order execution Instructions are issued in program order, and if an instruction is stalled, no later instructions can processed.n Out-of-order execution ID in order (checking structural / data hazard), but we want an instruction to begin execution as s

22、oon as its data operand is available.From in-order to out-of-ordern Out-of-order execution introduces the possibility of RW and WW hazardsn ADDn ANDn MULR1,R2,R10R2,R2,R1 R1,R5,R3From in-order to out-of-ordern Need to modifyn Having multiple instructions in execution at once requires multiple functi

23、on unitsn Split ID pipe stage into two stages:n Issue Decode, check for structural hazardsoperands wait until do data hazards, then operandsWR - forwardingijijn ADDn ANDn MULR1,R2,R10R2,R2,R1 R1,R5,R3WW register renaijijn ADDn ANDn MULR1,R2,R10 R2,R2,R1R1,R5,R3CACABBBCACABBRW - register renaijjin AD

24、Dn ANDn MULR1,R2,R10 R2,R2,R1R1,R5,R3Static schedulingFDIEEEWn 执行段(E)为多功能部件，含LOAD/STOREn 3个执行段不能停顿n 发射段(I)能预约资源，提供互锁，与(W)重叠写n 每个功能部件均有的写回部件，只要不向同一目标写回，则无BCACABBStatic scheduling116LOAD LOAD ADD STORE LOAD LOAD ADDSTORER1,M(A)R2,M(B)FDIEEEWFDIEEEWR3,R1,R2FDIEEEWR3,M(C)R4,M(X)R5,M(Y)FDIEEEWR6,R4,R5R6,

25、M(Z)后续4条指令的时空图？共需要多少个周期？Static scheduling118LOAD LOAD LOAD LOAD ADD ADD STORESTORER1,M(A)R2,M(B)R4,M(X)R5,M(Y)FDIEEEWFDIEEEWFDIEEEWFDIEEEWR3,R1,R2FDIEEEWR6,R4,R5FDIEEEWR3,M(C)R6,M(Z)FDIEEEWFDIEEEW此类调度由谁来完成？Dynamic Schedulingn All instructions pass through the issue stage in order; however, they can

26、be stalled or bypass each other whening operands and thus enter execution out of order.Dynamic Schedulingn 动态调度算法主要有两种n CDC计分牌（scorebord）算法，最先在CDC6600大型机中采用，采用集中n Tomasulo算法最早在大型机。360/91处理机的浮点处理部件中被采用。又称为公共数据总线（CDB：Common Data Bus）法，或令牌法等。采用分散的办法处理数据相关。Dynamic Scheduling流水线发射(I)段的功能有所变化。一条译码后的指令若它所需

27、的功能部件可用，并且目标寄存器也不是其他功能部件已预约的目标寄存器，那么这条指令就可发射，否则等待直到条件满足再发射。这样首先杜绝了WW相关。至于取寄存器操作数已不是I段必须完成的功能了，能取即取，不能取可在执行水线停顿减少。取操作，这样会使流在取寄存器操作数时要判测是否有WR相关。若先前发射出的指令以某寄存器为目标寄存器，则只有该指令标寄存器写入后(目标寄存器表中此项清除)，此寄存器才作为其他指令的源寄存器就绪，从而消除WR相关。在写回(W)段要判测是否有RW相关。发射出的指令若以本指令预定的目标寄存器为源寄存器，而还没有的话，则本指令的写回操作要推迟，直到RW相关清除再写回。因为指令是按

28、序发射，并按发射顺序登记在指令状态表中，故指令发射的先后顺序容易断定。Dynamic Scheduling11620LOAD LOAD ADD STORE LOAD LOAD ADDSTORER1,M(A)R2,M(B)FDIEEEWFDIEEEWR3,R1,R2FDIEEEWR3,M(C)R4,M(X)R5,M(Y)FDIEEEWFDIEEEWFDIEEEWR6,R4,R5FDIEEEWR6,M(Z)FDIEEEW动/静结合的时空图？Instruction-level ParallelismILP(of a program) a measure of the average number o

29、f instructions in a program that a processor might be able to execute at the same timen Mostly determined by the number of data/control dependencies in relation to the number of other instructionsLOAD ADDSTORER1,M(A) R3,R1,R2R3,M(B)LOAD ADDSTORER1,M(A) R3,R4,R5R2,M(B)Machine ParallelismMachine paral

30、lelism (of a processor) a measure of the ability of the processor to take advantage of the ILP of the programn Determined by the number of instructions that can be fetched and executed at the same timeTo achieve high performance, needboth ILP and machine parallelismMachine Parallelismn Options:1.Inc

31、rease the depth of the pipeline to increase the clock rateFetch (and execute) more than one instructions at one time (expand every pipeline stage to accommodate multiple instructions) multiple-issueMix 1 & 22.3.Machine ParallelismMachine typeScalar base machine of k pipe stageSuperscalar machine

32、 of degree mSuperpipelined machine of degree nSuperpipelined Superscalar machine of degree (m,n)Machine pipeline cycle111/n1/nInstruction issue rate1m1mInstruction issue latency111/n1/nILP to fully utilize the pipeline1mnmnSuperscalarn A superscalar machine of degree m m instructions are issued per

33、cycle and ILP should be m in order to fully utilize the pipeline.Superscalar Pipelinen Superscalar Pipeline In an m-issue superscalar machine, the decode and execution resources are increased to form essentially m pipelines operating concurrently, the functional units may be shared by multiple pipel

34、ines.FDEWFDEWFDEWFDEWFDEWFDEWMultiple-Issuen Must handle, with a combination of hardware and software fixes, the fundamental limitations ofn data dependenciesn Limitation more severe due to (usually) low ILPn control dependenciesn Even more severen Use dynamic branch prediction to help resolve the I

35、LP issuen Resources (structural hazards)n Has a much larger number of potential resourcesn Functional units may have to arbitrate for result buses and register file write portsn Resources can be eliminated by duplicating theresource or by pipelining the resourceMultipipeline schedulingn Issue &

36、completion policies:n In-order issue with in-order-completionn In-order issue and out-of-order completionn Out-of-order issue and out-of-order completionSuperscalar Pipelinedual-pipeline, four functional units in execution stageDemo codeI1.I2.I3.I4.I5.I6.LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3

37、,R4 R4,R4,R5 R6,R6R6,R6,R7相关性?Issue & completion110LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,R7In-order issue with in-order-completionIssue & completion110LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,R7In-order issue and out-of-order complet

38、ionf1d1e2s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s1f1d1e1s2f2d2m1m2m3s2f1d1e2s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s2f1d1e1s1f2d2m1m2m3s2Issue & completionn Out-of-Order Issue with Out-of- Order Completionn Fetch and decode instructions beyond the ed one, store them in an instructionbuffer (as long as theres

39、room), and flag those instructions in the buffer that donthave resources or data dependenciesn Flagged instructions are then issued from the buffer without regard to their program orderIssue & completionIssue & completion19LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,

40、R7Out-of-order issue and out-of-ordercompletionIssue & completion17I5.I6.I1.I3.I2.I4.NEG MUL LOAD ADD ADDMULR6,R6 R6,R6,R7 R1,M(A) R3,R3,R4 R2,R2,R1R4,R4,R5Out-of-order issue and out-of-order completion (with SW support)f1d1e2s1f2d2m1m2m3s2f1d1e1s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s2f1d1e2s1f2d2a1a2

41、s2f1d1a1a2s1f2d2m1m2m3s2f1d1e1s1f2d2m1m2m3s2dependenciesn when a later instruction (that completes earlier) produces a data value that destroys a data value used as a source in an earlier instruction (that issues later)MULADD XORR1,R1,R5R3,R1,10 R1,R5,1Superscalar Performancen N instructions, k stag

42、es, m-issuen The ideal execution timeT (m,1) = æ k + N - m öDtç÷ènøn The ideal speedupT (1,1) = m(N + k -1)S (m,1) =N + m(k -1)T (m,1)n When N?How about PowerPC ?Superpipelined Processordegree n = 3n breaking up the pipe stage into many small pieces, so that perhaps mor

43、e then one instructions are in various phases of execution at one time.n In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.n Another kind of superpipeline: stages 8.Superpipelined Processorn That have been around for a lone timen Crey1 & CDC7600 (both Cr

44、ays) , n = 3;n MIPS R4000 (SGI), n = 2.n Superpipelining is not possible without a high-speed clocking mechanism.n The internal pipeline of the R4000 processor operates at twice the frequency of the master clock.MIPS R4000 Pipeline StagesP.44 MIPS R4000 Microprocessor User's ManualMIPS R4000 Pip

45、elinen IF - Instruction Fetch, First Halfn IS - Instruction Fetch, Second Halfn RF - Register Fetchn EX - Executionn DF - Data Fetch, First Halfn DS - Data Fetch, Second Halfn TC - Tag Checkn WB - Write BackMIPS R4000 Pipeline指令/数据分立Cache，每个Pclock可以一次，因此在一个MClock内可以从指令Cache中读出两条指令，从数据Cache中读出或写入两个数据

46、。Superpipeline Performancen The minimum time required to execute N instructions for a superpipelined processor of degree n with k stages isT (1, n) = (k + N -1)Dtnn Thus, the potential speedup over the base machine isk + N -1= n(k + N -1)T (1,1)S (1, n) =T (1, n)k + (N -1) / nnk + N -1S (1, n) ¾N¾®¾¥® nSuperpipelined superscalerdegree m = n = 3n The machine executes m instruc

人人文库> 全部分类> 应用文书

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

计算机系统结构02时间并行指令级

文档简介

温馨提示

最新文档

评论

计算机系统结构02时间并行指令级

文档简介

温馨提示

最新文档

评论

相关文档