版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、计算机系统结构(时间指令级并行)邮电大学计算机学院邝 坚2009年2月从指令重叠到流水线nT = å(t取指令i + t分析i + t执行i)i=1如果每段时间均为t,则执行n条指令所用的时间为:T3nt优点:简单,成本低;缺点:执行速度慢,部件利用率低。取指令k分析k执行k取指令k+1分析k+1执行k+1从指令重叠到流水线取指k分析k执行k取指k+1分析k+1执行k+1取指k+2分析k+2执行k+2T(2n + 1) tT(n + 2) t优点和缺点?流水线的表示n 连接图 - 逻辑输入输出123456保存结果执行取操作数形成操作数地址译码取指令取指k分析k执行k取指k+1分析k+
2、1执行k+1取指k+2分析k+2执行k+2流水线的表示时空图-时间空间规格化尾数加对阶 求阶差0t1t2t3t4t5t6t7t8流水线的锁存器ClockInput指令分析器分析k+1流水锁存器指令执行部件执行k流水锁存器t1t2规格化1规格化2规格化3规格化4规格化5尾数加1尾数加2尾数加3尾数加4尾数加5对阶1对阶2对阶3对阶4对阶5求阶差1求阶差2求阶差3求阶差4求阶差5各段执行时间不相等的流水线空间0t1t2t3t4t5t6t7t8流水线的特点n 各功能段的时间尽量相等n 执行时间最长的n 效率瓶颈n 连续不断地使用(同一种功能)n 装入时间 第一个任务从进入到流出n 排空时间 最后一个
3、任务从进入到流出12123123123流水线的性能指标n 吞吐率(Throughput)时间流水线所完成的任务数量,或输出结果的数量空间S4 S3 S2S1流水线的性能指标n 计算流水线吞吐率的最基本公式:nTkP =n 各段执行时间相等,输入连续任务情况下: 完成n个连续任务需要的总时间为:Tk(kn1)Dt其中:k 为流水线的,Dt为时钟周期。nn 吞吐率为:P =(k + n -1)Dt123n-1n123n-1n123n-1n123n-1Nk·Dt(n1)Dtn·Dt(k1)DtT流水线的性能指标n 最大吞吐率为:n1=P max = Limn ®
4、65;(k + n - 1)DtDtn 各段执行时间不相等,输入连续任务n 吞吐率:P =nkåi =11, D t 2,× × ×, Dtk )n 最大吞吐率:1P max =max( D t1, D t 2,× × ×, D tk )流水线的性能指标n 解决“瓶颈”输入输出DtDtDtDtDtDt输入输出Dt1DtDt3DtDt4DtS2-1Dt23DtS4S3S2-2S1S2-3S4S3S2-3S2-2S2-1S1流水线的性能指标比(Speedup)n 统一任务,不使用流水线所使用时间与使用流水线所用时间比顺序执行时间
5、T 0S =流水线执行时间Tkn 各段执行时间相等,输入连续任务k × n × Dt(k + n - 1)Dtk × nk + n -1比:S =n 最大比为:k × nS max = Limn®¥= kk + n - 1流水线的性能指标比 S10加8速k=10 段6k=6 段比4211248163264128n任 务 个 数流水线的性能指标n 各段执行时间不相等,输入连续任务情况下,实际比:kn × åDtii =1S =kåi =1Dt1, Dt2,×× ×, Dtk)流
6、水线的性能指标n 效率(Efficiency)E = n个任务任务占用有效对应总面积n 各流水段执行时间相等,输入n个连续任务n 流水线的效率为:E =k × n × Dt=nk × (k + n - 1) × Dtk + n - 1n 流水线的最高效率为:nE max = Limn®¥= 1k + n - 1流水线的性能指标n 流水线的吞吐率、比与效率的:S =k × nE =nP =nk + n - 1k + n - 1(k + n -1)DtEP·Dt,Sk·E最佳流水最佳值(k0)流水线k性能价格
7、比 PCR峰值流水线的相关性分析Dependences/Hazards参考模型Pipeline HazardsHazard that prevent the next instruction in the instruction stream from executing during its designated clock cycle.Structural hazardsData hazards Control hazards结构相关Arise from resources when the HWcannot support all possible combination of inst
8、ructions simultaneously in overlapped execution.某些指令组合在流水线重叠执行过程中,如果硬件资源满足不了指令重叠执行的要求,便会产生资源n 在同一个时钟周期用同一个功能部件结构相关n 指令与数据的分离(分时)存取处理:n 分离cache、双端口RAM等数据相关n An instruction j is data dependence on instruction i if either of the following holds:n instruction i produce a result that may be used by instr
9、uction j, orn instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.n 指令之间数据供求水机制所表现的上的依赖性,因流。I(LOAD)IFIDEXMEWBI+1IFIDEXMEWBI+2IFIDEXMEWBI+3stallIFIDEXMEWBI(LOAD)IFIDEXMEWBI+1IFIDEXMEWBI+2IFIDEXMEWBI+3IFIDEXMEWB数据相关举例LOADADD STORER1, 0(R10)R3, R1, R2
10、R3, 5(R10)数据相关类型n RAW(n j tries toafter write) 写后读a source before i write itn be the most common casen WAW(write after write) 写后写n j tries to write a operand before it is written by In Example?n WAR(write after) 读后写n j tries to write a destination before it isn Example?by ILOADIFIDEXMEWB写R1ADDIFID读R
11、1EXMEWB写R3STOREIFID读R3EXMEWB数据相关的典型情况Forwardingn ADD的运算结果实际在EX段的末尾已形成,只是到WB段才写入R1ADD R1,R2,R3IFIDEXMEWBSUB R4,R1,R5IFIDEXMEWBAND R6,R1,R7IFIDEXMEWBOR R8,R1,R9IFIDEXMEWBXOR R10,R1,R11IFIDEXMEWBADD R1,R2,R3IFIDEXMEWBSUB R4,R1,R5IFIDEXMEWBAND R6,R1,R7IFIDEXMEWBOR R8,R1,R9IFIDEXMEWBXOR R10,R1,R11IFIDEXM
12、EWBBypass/ short-circuitingPipeline interlockLOAD R1,0(R2)IFIDEXEXWBADD R4,R1,R7IFIDstallEXMEWBSUB R6,R1,R8IFstallIDEXMEWBOR R3,R1,R9stallIFIDEXMELOAD R1,0(R2)IFIDEXMEWBADD R4,R1,R7IFIDEXMEWBSUB R6,R1,R8IFIDEXMEWBOR R3,R1,R9IFIDEXMEWBControl Dependences相关 是因为程序执行转移类指令而引起的相关。if p1 s1;if p2 s2;n s1 is
13、 control dependent on p1,and s2 is control dependent on p2 but not on p1.Control Dependences1. An instruction that is dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.2. An instruction that is not control dependent on a branch cannot
14、 be moved after the branch so that its execution is controlled by the branch.Control (Branch) hazards局部相关 如数据相关,仅仅影响到本指令的附近少数指令;全局相关 如相关,其影响的范围要大得多,流水线性能损失更大。Control (Branch) hazardsPC normally not changed until the end of ID 计算转移目标地址放在ID段,需要在ID段增加一个加法器。BRANCHIFIDEXMEWBi+1stallIFIDEXMEWBzeBRANCHIFID
15、EXMEWBi+1stallstallstallIFIDEXMEWBControl (Branch) hazardsn Treat every branch as not taken, orn Treat every branch as not takenControl (Branch) hazardsn Delayed branch a SW method.branch instsequential successor branch target if takenn Sequential successor instruction is executed whether or not the
16、 branch is takenBranch delay slotTaken BranchIFIDEXMEWBi+1IFtargetIFIDEXMEWBt+1IFIDEXMEWBUntaken BranchIFIDEXMEWBi+1IFIDEXMEWBi+2IFIDEXMEWBi+3IFIDEXMEWBControl (Branch) hazardsControl (Branch) hazardsTaken BranchIFIDEXMEWBBranch delayIFIDEXMEWBtargetIFIDEXMEWBt+1IFIDEXMEWBUntaken BranchIFIDEXMEWBBra
17、nch delayIFIDEXMEWBi+2IFIDEXMEWBi+3IFIDEXMEWBControl (Branch) hazardsn 两个指令缓冲栈Control (Branch) hazardsn 改进循环Branch process in PowerPCPredict By OpcodePipelining in MIPS 5 stagesTypical 5 stages pipelining - MIPSIF (Instruction Fetch cycle)IR ¬ MemPC; NPC ¬ PC + 4;ID (Instruction Decode/reg
18、ister fetch cycle)A ¬ Regsrs; B ¬ Regsrt;Imm ¬ sign-extend immediate field of IR;EX (EXecution/effective address cycle)1.2.3.Memory reference: ALUOutput ¬ A + Imm; (form effective address)Register-Register: ALUOutput ¬ A func B; Register-Immediate: ALUOutput ¬ A op Imm;
19、 Branch:ALUOutput ¬ NPC + (Imm << 2); (word offset)Cond ¬ (A = 0) (branch condition)Typical 5 stages pipelining - MIPS4.MEM (MEMory access/branch completion cycle)Memory reference: (Load/Store) LMD ¬ MemALUOutput or MemALUOutput ¬ BBranch: if (cond) PC ¬ ALUOutput;5.WB
20、 (Write-Back cycle) Register-Register: Regsrd ¬ ALUOutput; Register-Immediate: Regsrt ¬ ALUOutput; Load instructure: Regsrt ¬ LMDBasic Pipeline for MIPSForwardingID zero test for branchDynamic HW predictionn Branch-prediction buffer or branch history tableDynamic HW predictionn Branch
21、 target bufferFrom in-order to out-of-ordern In-order execution Instructions are issued in program order, and if an instruction is stalled, no later instructions can processed.n Out-of-order execution ID in order (checking structural / data hazard), but we want an instruction to begin execution as s
22、oon as its data operand is available.From in-order to out-of-ordern Out-of-order execution introduces the possibility of RW and WW hazardsn ADDn ANDn MULR1,R2,R10R2,R2,R1 R1,R5,R3From in-order to out-of-ordern Need to modifyn Having multiple instructions in execution at once requires multiple functi
23、on unitsn Split ID pipe stage into two stages:n Issue Decode, check for structural hazardsoperands wait until do data hazards, then operandsWR - forwardingijijn ADDn ANDn MULR1,R2,R10R2,R2,R1 R1,R5,R3WW register renaijijn ADDn ANDn MULR1,R2,R10 R2,R2,R1R1,R5,R3CACABBBCACABBRW - register renaijjin AD
24、Dn ANDn MULR1,R2,R10 R2,R2,R1R1,R5,R3Static schedulingFDIEEEWn 执行段(E)为多功能部件,含LOAD/STOREn 3个执行段不能停顿n 发射段(I)能预约资源,提供互锁,与(W)重叠写n 每个功能部件均有的写回部件,只要不向同一目标写回,则无BCACABBStatic scheduling116LOAD LOAD ADD STORE LOAD LOAD ADDSTORER1,M(A)R2,M(B)FDIEEEWFDIEEEWR3,R1,R2FDIEEEWR3,M(C)R4,M(X)R5,M(Y)FDIEEEWR6,R4,R5R6,
25、M(Z)后续4条指令的时空图?共需要多少个周期?Static scheduling118LOAD LOAD LOAD LOAD ADD ADD STORESTORER1,M(A)R2,M(B)R4,M(X)R5,M(Y)FDIEEEWFDIEEEWFDIEEEWFDIEEEWR3,R1,R2FDIEEEWR6,R4,R5FDIEEEWR3,M(C)R6,M(Z)FDIEEEWFDIEEEW此类调度由谁来完成?Dynamic Schedulingn All instructions pass through the issue stage in order; however, they can
26、be stalled or bypass each other whening operands and thus enter execution out of order.Dynamic Schedulingn 动态调度算法主要有两种n CDC计分牌(scorebord)算法,最先在CDC6600大型机中采用,采用集中n Tomasulo算法最早在大型机。360/91处理机的浮点处理部件中被采用。又称为公共数据总线(CDB:Common Data Bus)法,或令牌法等。采用分散的办法处理数据相关。Dynamic Scheduling流水线发射(I)段的功能有所变化。一条译码后的指令若它所需
27、的功能部件可用,并且目标寄存器也不是其他功能部件已预约的目标寄存器,那么这条指令就可发射, 否则等待直到条件满足再发射。这样首先杜绝了WW相关。至于取寄存器操作数已不是I段必须完成的功能了,能取即取,不能取可在执行水线停顿减少。取操作,这样会使流在取寄存器操作数时要判测是否有WR相关。若先前发射出的指令以某寄存器为目标寄存器,则只有该指令标寄存器写入后(目标寄存器表中此项清除),此寄存器才作为其他指令的源寄存器就绪,从而消除WR相关。在写回(W)段要判测是否有RW相关。发射出的指令若以本指令预定的目标寄存器为源寄存器,而还没有的话,则本指令的写回操作要推迟,直到RW相关清除再写回。因为指令是按
28、序发射,并按发射顺序登记在指令状态表中,故指令发射的先后顺序容易断定。Dynamic Scheduling11620LOAD LOAD ADD STORE LOAD LOAD ADDSTORER1,M(A)R2,M(B)FDIEEEWFDIEEEWR3,R1,R2FDIEEEWR3,M(C)R4,M(X)R5,M(Y)FDIEEEWFDIEEEWFDIEEEWR6,R4,R5FDIEEEWR6,M(Z)FDIEEEW动/静结合的时空图?Instruction-level ParallelismILP(of a program) a measure of the average number o
29、f instructions in a program that a processor might be able to execute at the same timen Mostly determined by the number of data/control dependencies in relation to the number of other instructionsLOAD ADDSTORER1,M(A) R3,R1,R2R3,M(B)LOAD ADDSTORER1,M(A) R3,R4,R5R2,M(B)Machine ParallelismMachine paral
30、lelism (of a processor) a measure of the ability of the processor to take advantage of the ILP of the programn Determined by the number of instructions that can be fetched and executed at the same timeTo achieve high performance, needboth ILP and machine parallelismMachine Parallelismn Options:1.Inc
31、rease the depth of the pipeline to increase the clock rateFetch (and execute) more than one instructions at one time (expand every pipeline stage to accommodate multiple instructions) multiple-issueMix 1 & 22.3.Machine ParallelismMachine typeScalar base machine of k pipe stageSuperscalar machine
32、 of degree mSuperpipelined machine of degree nSuperpipelined Superscalar machine of degree (m,n)Machine pipeline cycle111/n1/nInstruction issue rate1m1mInstruction issue latency111/n1/nILP to fully utilize the pipeline1mnmnSuperscalarn A superscalar machine of degree m m instructions are issued per
33、cycle and ILP should be m in order to fully utilize the pipeline.Superscalar Pipelinen Superscalar Pipeline In an m-issue superscalar machine, the decode and execution resources are increased to form essentially m pipelines operating concurrently, the functional units may be shared by multiple pipel
34、ines.FDEWFDEWFDEWFDEWFDEWFDEWMultiple-Issuen Must handle, with a combination of hardware and software fixes, the fundamental limitations ofn data dependenciesn Limitation more severe due to (usually) low ILPn control dependenciesn Even more severen Use dynamic branch prediction to help resolve the I
35、LP issuen Resources (structural hazards)n Has a much larger number of potential resourcesn Functional units may have to arbitrate for result buses and register file write portsn Resources can be eliminated by duplicating theresource or by pipelining the resourceMultipipeline schedulingn Issue &
36、completion policies:n In-order issue with in-order-completionn In-order issue and out-of-order completionn Out-of-order issue and out-of-order completionSuperscalar Pipelinedual-pipeline, four functional units in execution stageDemo codeI1.I2.I3.I4.I5.I6.LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3
37、,R4 R4,R4,R5 R6,R6R6,R6,R7相关性?Issue & completion110LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,R7In-order issue with in-order-completionIssue & completion110LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,R7In-order issue and out-of-order complet
38、ionf1d1e2s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s1f1d1e1s2f2d2m1m2m3s2f1d1e2s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s2f1d1e1s1f2d2m1m2m3s2Issue & completionn Out-of-Order Issue with Out-of- Order Completionn Fetch and decode instructions beyond the ed one, store them in an instructionbuffer (as long as theres
39、room), and flag those instructions in the buffer that donthave resources or data dependenciesn Flagged instructions are then issued from the buffer without regard to their program orderIssue & completionIssue & completion19LOAD ADD ADD MUL NEGMULR1,M(A) R2,R2,R1 R3,R3,R4 R4,R4,R5 R6,R6R6,R6,
40、R7Out-of-order issue and out-of-ordercompletionIssue & completion17I5.I6.I1.I3.I2.I4.NEG MUL LOAD ADD ADDMULR6,R6 R6,R6,R7 R1,M(A) R3,R3,R4 R2,R2,R1R4,R4,R5Out-of-order issue and out-of-order completion (with SW support)f1d1e2s1f2d2m1m2m3s2f1d1e1s1f2d2a1a2s2f1d1a1a2s1f2d2m1m2m3s2f1d1e2s1f2d2a1a2
41、s2f1d1a1a2s1f2d2m1m2m3s2f1d1e1s1f2d2m1m2m3s2dependenciesn when a later instruction (that completes earlier) produces a data value that destroys a data value used as a source in an earlier instruction (that issues later)MULADD XORR1,R1,R5R3,R1,10 R1,R5,1Superscalar Performancen N instructions, k stag
42、es, m-issuen The ideal execution timeT (m,1) = æ k + N - m öDtç÷ènøn The ideal speedupT (1,1) = m(N + k -1)S (m,1) =N + m(k -1)T (m,1)n When N?How about PowerPC ?Superpipelined Processordegree n = 3n breaking up the pipe stage into many small pieces, so that perhaps mor
43、e then one instructions are in various phases of execution at one time.n In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.n Another kind of superpipeline: stages 8.Superpipelined Processorn That have been around for a lone timen Crey1 & CDC7600 (both Cr
44、ays) , n = 3;n MIPS R4000 (SGI), n = 2.n Superpipelining is not possible without a high-speed clocking mechanism.n The internal pipeline of the R4000 processor operates at twice the frequency of the master clock.MIPS R4000 Pipeline StagesP.44 MIPS R4000 Microprocessor User's ManualMIPS R4000 Pip
45、elinen IF - Instruction Fetch, First Halfn IS - Instruction Fetch, Second Halfn RF - Register Fetchn EX - Executionn DF - Data Fetch, First Halfn DS - Data Fetch, Second Halfn TC - Tag Checkn WB - Write BackMIPS R4000 Pipeline指令/数据分立Cache,每个Pclock可以一次,因此在一个MClock内可以从指令Cache中读出两条指令,从数据Cache中读出或写入两个数据
46、。Superpipeline Performancen The minimum time required to execute N instructions for a superpipelined processor of degree n with k stages isT (1, n) = (k + N -1)Dtnn Thus, the potential speedup over the base machine isk + N -1= n(k + N -1)T (1,1)S (1, n) =T (1, n)k + (N -1) / nnk + N -1S (1, n) ¾N¾®¾¥® nSuperpipelined superscalerdegree m = n = 3n The machine executes m instruc
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026年桥梁经济分析与投资评估
- 2026春招:行政专员面试题及答案
- 2026春招:销售代表真题及答案
- 2026春招:西部航空试题及答案
- 货运安全课件
- 心理咨询部服务模式改进
- 医疗信息录入员礼仪与职业操守
- 医药销售代表礼仪培训内容
- 医疗大数据与临床决策支持
- 护理团队建设与护理文化建设探索
- 雨水管网改造改造设计方案
- 《高速公路服务区开放设置技术要求》
- 2024-2030年全球与中国巡飞弹系统行业发展战略及投资前景预测报告
- QBT 1619-2018 票夹行业标准
- 代建项目全过程运营管理及风险防控课件
- 广东省佛山市南海区2023-2024学年七年级上学期期末数学试卷+
- 基于区块链的供应链金融平台实施方案
- 牛津版小学英语教材梳理
- 风机安装工程施工强制性条文执行记录表
- GB/T 1355-2021小麦粉
- GB 5135.11-2006自动喷水灭火系统第11部分:沟槽式管接件
评论
0/150
提交评论