




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、计算机系统结构习题讲解 (1)内容提要n基础概念n指令系统n流水线处理机n指令级并行基础概念n1-2. 如有一个经解释实现的计算机,可以按功能划分成4级。每一级为了执行一条指令需要下一级的N条指令解释。若执行第一级的一条指令需要K ns时间,那么执行第2、3、4级的一条指令各需要用多少时间?分析与解答:NK ns、N2K ns、N3K ns4321第i+1级第i级.n个T(i+1)=n*T(i)基础概念n1.6 Dhrystone is a well-known integer benchmark. Computer A is measured to perform DA executions
2、 of the Dhrystone benchmark per second, and to achieve a millions of instructions per second rate of MIPSA while doing Dhrystone. Computer B is measured to perform DB executions of the Dhrystone benchmark per second, and to achieve a millions of instructions per second rate of MIPSB while doing Dhry
3、stone. nQ. What is the fallacy in calculating the MIPS rating of computer B as MIPSB=MIPSA(DB/DA)?基础概念nAnswer The proposed formulation for MIPSB can be rewritten as:MIPSA MIPSB - = - DA DBExamining the units of each factor, we haveComputer A instructions/second Computer B instructions/second- = -Dhr
4、ystoneA/secondDhrystoneB/second基础概念The time units factor out, revealing that the formulation is founded upon the assumption thatComputer A instructions Computer B instructions- = -DhrystoneA DhrystoneBUnless Computer A and B have the same instruction set architecture and execute identically compiled
5、 Dhrystone executables, this assumption is likely false. If so, the formulation for MIPSB is also incorrect.基础概念n1-10. 实现软件移植的主要途径有哪些?它们存在什么问题?适用于什么场合?n1统一高级语言统一高级语言n语言用途不同运行在不同系统结构上基本结构难以统一(MIPS,ARM)n2采用系列机思想采用系列机思想n只能在相同系统结构的机器间实现软件移植,兼容性往往会限制系统结构的变革n3模拟与仿真模拟与仿真n模拟的运行速度慢而系统结构差别大则难以进行仿真基础概念n 1-11.
6、想在系列中发展一种新型号机器,你认为下列想在系列中发展一种新型号机器,你认为下列哪些设想可以考虑,哪些行不通,为什么?哪些设想可以考虑,哪些行不通,为什么?(1)新增加字符数据类型和若干条字符处理指令,以支持事务处理程序的编译;(2)为增强中断处理功能,将中断分析由原来的4级增加到5级,并重新调整中断响应的优先次序;(3)在CPU和主存之间增设Cache存储器,以克服因主存访问速率过低而造成的系统性能瓶颈;基础概念(4)为增加寻址灵活性和减少平均指令字长,将原来全部采用等长操作码的指令改成有3类不同码长的扩展操作码;并将源操作数寻址方式由原来的操作码指明改成增加一个如VAX-11那样的寻址方式
7、位字段来指明;(5)将CPU与主存之间的数据通路宽度由16位扩到32位,以加快主机内部信息的传送;(6)为了减少使用公用总线的冲突,将单总线改为双总线;(7)把原来的0号通用寄存器改作为专用的堆栈指示器。指令系统n*2-1. 数据类型、数据表示和数据结构之间的关系是什么?在设计一个计算机系统时,确定数据表示的原则主要有哪几个?n数据类型指程序设计语言中所允许的变量的种类,包括定点数、浮点数、布尔数、字符、树、图、表等n数据表示指硬件能直接识别和引用的数据类型n数据结构是带有结构的数据元素集合指令系统n数据表示和数据结构都是数据类型的子集,由硬件实现的数据类型即数据表示,而数据结构是由软件实现的
8、数据类型,另外还有软硬件共同实现的。n数据类型由硬件实现速度快,而软件实现节省硬件成本,如何取舍:a.缩短程序运行时间;b.减少CPU与主存之间的通信量;c.这种数据表示的通用性和利用率。指令系统n*2-10.假设有A和B两种不同类型的处理机,A处理机中的数据不带标志符,其指令字长和数据字长均为32位。B处理机的数据带有标志符,每个数据的字长增加至36位,其中有4位是标志符,它的指令条数由最多256条减少至不到64条。如果每执行一条指令平均要访问两个操作数,每个存放在存储器中的操作数平均要被访问8次。对于一个由1000条指令组成的程序,分别计算这个程序在A处理机和B处理机中占用的存储空间大小(
9、包括指令和数据),从中得到什么启发?指令系统n分析:nSOP代表操作码位数nSOD代表操作数地址码位数nSI代表指令位数nSI=SOP+SODnSF代表标志符位数nSV代表数据值位数nSD代表数据位数nSD=SF+SVnSP代表程序的二进制位数nSP=SI+SD=SOP+SOD+SF+SV指令系统nA256条指令n指令32位:n数据32位:nB64条指令n指令30位:n数据36位:操作码(8位)地址码(24位)数据(32位)操作码(6位)地址码(24位)标志符(4位)数据(32位) OPFOPS (B)+S (B)S (A)指令系统n(1) 处理机A nSp(A )=SI(A )+SD(A )
10、n =3210002321000/840000(bit)n处理机BnSP(B)=SI(B)+SD(B)n =3010002361000/839000(bit)OPFOPS (B)+S (B)S (A)OPFOPS (B)+S (B)S (A)指令系统n S0,Sp(B)Sp(A)n启示:n若1 NR 1, S0;n若NR4,则 1, S0;n实际执行时,经测量有NR 10,0.85,S0SOD(A) =SOD(B),Sv(A) =Sv(B)程序存储容量的增量S=Sp(B)Sp(A) =SOP(B)+SF(B)-SOP(A)SOP(A)=8 NI,SOP(B)+SF(B)=6NI+ =8 NI
11、(0.75+ ) 0.75 + =0.875R1 I2: A2*B2-R2 I3: A3*B3-R3 I4: A4*B4-R4 I5: A5*B5-R5 I6: A6*B6-R6 I7: R1+R2-R7 I8: R3+R4-R8 I9: R5+R6-R9 I10: R7+R8-R10 I11: R9+R10-R11标量处理机n时空图S61234567891011S5123456S4123456S37891011S27891011S1123456789101112345678910111213141516171819202122标量处理机吞吐率为:Tp=11/22t=1/2t加速比:S=11*
12、4t/22t=2效率:E=(11*4t)/(6*22t)=1/3流水线基础nAppendix A.2Use the following code fragment:Loop: LDF0,0(R2) LD F4,0(R3) MUL.D F0,F0,F4 ADD.D F2,F0,F2 DADDUIR2,R2,#8 DADDUIR3,R3,#8 DSUBUR5,R4,R2 BNEZR5,LoopnAssume that the initial value of R4 is R2+792792/8 = 99(step is 8)iteration = 99 times流水线基础nFor this ex
13、ercise assume the standard five stage integer pipeline and the MIPS FP pipeline as described in section A.5. If structural hazards are due to write-back contention, assume the earliest instruction gets priority and other instructions are stalled.Qa. Show the timing of this instruction sequence for t
14、he MIPS FP pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle “forward” through the register file. Assume that the branch is handled by flushing the pipeline. If all memory references hit in the cache, how many cycles does this loop
15、 take to execute?流水线基础nQb. Show the timing of this instruction sequence for the MIPS FP pipeline with normal forwarding or bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references hit in the cache, how many cycles does this loop take to execute?nS
16、ee 3 Hazards & ForwardingnStructural HazardnData HazardnControl Hazard流水线基础nAnswer a. (without forwarding)instructionClock cycle1 2 3 45 6 7 8 13 1415 1617 1819 2021 2223 2425 26 27LD F0,0(R2)F D E MWLD F4,0(R3) F D E M WMUL.D F0,F0,F4 F Ds s E E E MWADD.D F2,F0,F2 Fs s D s s ss E E E E MWDADDUI
17、 R2,R2,#8 F s s ss D E MWDADDUI R3,R3,#8 F D EM WDSUBU R5,R4,R2 F Ds EM WBNEZ R5,Loop Fs Ds rL.D F0,0(R2) Fs s F D E M W流水线基础nAnswer a. (without forwarding)instructionClock cycle1 2 3 45 6 7 8 13 1415 1617 1819 2021 2223 2425 26 27LD F0,0(R2)F D E MWLD F4,0(R3) F D E M WMUL.D F0,F0,F4 F Ds s E E E M
18、WADD.D F2,F0,F2 Fs s D s s ss E E E E MWDADDUI R2,R2,#8 F s s ss D E MWDADDUI R3,R3,#8 F D EM WDSUBU R5,R4,R2 F Ds EM WBNEZ R5,Loop Fs Ds rL.D F0,0(R2) Fs s F D E M WLoop 1Loop 2 流水线基础nAnswer a. total loop execution time= 2299 = 2178 clock cycles流水线基础nAnswer b. (with normal forwarding) instructionCl
19、ock cycle1 2 3 45 6 7 .12 13 1415 1617 1819 2021 2223LD F0,0(R2)F D E MWLD F4,0(R3) F D E M WMUL.D F0,F0,F4 F Ds E E . E M WADD.D F2,F0,F2 Fs D s . s E EE E M W DADDUI R2,R2,#8 F s . s D EM WDADDUI R3,R3,#8 F DE MWDSUBU R5,R4,R2 FD sE MWBNEZ R5,LoopF sD rL.D F0,0(R2)F sF DE M W流水线基础nAnswer b. (with
20、normal forwarding) instructionClock cycle1 2 3 45 6 7 .12 13 1415 1617 1819 2021 2223LD F0,0(R2)F D E MWLD F4,0(R3) F D E M WMUL.D F0,F0,F4 F Ds E E . E M WWADD.D F2,F0,F2 Fs D s . s E EE E M W DADDUI R2,R2,#8 F s . s D EM WDADDUI R3,R3,#8 F DE MWDSUBU R5,R4,R2 FD sE MWBNEZ R5,LoopF sD rL.D F0,0(R2)
21、F sF DE M WLoop 1Loop 2 流水线基础nAnswer b. total loop execution time= 1898 + 19 = 1783 clock cycles流水线基础nAppendix A.3Suppose the branch frequencies (as percentage of all instructions) are as follows:nConditional branches15%nJumps & Calls1%nConditional branches60% are takenWe are examining a four-de
22、ep pipeline where the branch is resolved at the end of the second cycle for unconditional branches and at the end of the third cycle for conditional branches. Assuming that only the first pipe stage can always be done independent of whether the branch goes and ignoring other pipeline stalls, how muc
23、h faster would the machine be without any branch hazards?流水线基础nAnswernPipeline CPI = Ideal pipeline CPI + (Structural Hazard Stalls + Data Hazard Stalls + Control Hazard Stalls) Pipeline Depth Pipeline speedup=- 1 + Pipeline stalls nNo Control Hazard:Pipeline speedupideal = 4/(1+0) = 4流水线基础nHaving C
24、ontrol Hazard:Assume 4 stage: IF,ID,EX and WBnHandle Jump & Call:InstructionClock cycle123456Jump or CallIFIDEXWBi+1IFIFIDEXi+2stallIFIDi+3stallIF流水线基础nHandle taken conditional branch:InstructionClock cycle123456Taken BranchIFIDEXWBi+1IFstallIFIDi+2stallstallIFi+3stallstall流水线基础nHandle not-taken
25、 conditional branch:InstructionClock cycle123456Not-taken BranchIFIDEXWBi+1IFstallIDEXi+2stallIFIDi+3stallIF流水线基础nSummary of above 3 control flow instructions:Pipeline Stallreal = (11%) + (29%) +(16%) = 0.25Pipeline Speedupreal= 4/(1+0.25) = 3.2Control flow typeFrequency (per instruction)Stall (cycl
26、es)Jump & Call1%1Conditional (taken)15%60%=9%2Conditional (not taken)15%40%=6%1流水线基础nPipeline Speedupwithout control hazard= 4/3.2 = 1.25 25% speedup流水线基础nA.4A reduced hardware implementation of the classic five-stage RISC pipeline might use the EX stage hardware to perform a branch instruction
27、comparison and the actually deliver the branch target PC to the IF stage until the clock cycle in which the branch instruction reaches the MEM stage. 流水线基础nA.4Control hazard stalls can be reduced by resolving branch instructions in ID, but improving performance in one respect may reduce performance
28、in other circumstances. How does determining branch outcome in the ID stage have the potential to increase data hazard stall cycles?流水线基础nAnswerregister value computed , comparison performed in EX stage1234567cmptIFIDEXMEMWBbranchIFIDEXMEMWBtargetssIFIDEXregister value computed , comparison performe
29、d in ID stage1234567cmptIFIDEXMEMWBbranchIFID(s)IDEXMEMWBtargetsIFIDEXneed register value流水线基础nAnswerregister value loaded , comparison performed in EX stagen1234567cmptIFIDEXMEMWBbranchIFIDsEXMEMWBtargetsssIFIDregister value loaded , comparison performed in ID stage1234567cmptIFIDEXMEMWBbranchIFssI
30、DEXMEMtargetssIFID指令级并行n3.2 Consider the following four MIPS code fragments each containing two instructions:i.DADDI R1,R1,#4LD R2,7(R1)ii.DADD R3,R1,R2SD R2,7(R1)iii.SD R2,7(R1)SD F2,200(R7)iv.BEZR1,placeSDR1,7(R1)指令级并行na. For each fragment (i) to (iv) identify each type of dependence that exists o
31、r that may exist (a fragment may have no dependence) and describe what data flow, name reuse, or control structure causes or would cause the dependence. For a dependence that may exist, describe the source of the ambiguity and identify the time at which that uncertainty is resolved.nb. For each code
32、 fragment, discuss whether dynamic scheduling is, may be, or is not sufficient to allow out-of-order execution of the fragment.指令级并行nAnswer Code fragmentData Dependence? Dynamic scheduling sufficient for out-of-order execution?DADDI R1,R1,4LD R2,7(R1)True dependence of R1No. Changing instruction ord
33、er will break program semanticsDADD R3,R1,R2SD R2,7(R1)NoneYesSD R2,7(R1)SD F2,200(R7)Output dependence may existMaybe. If the hardware computes the effective addresses early enough, then the store order may be exchanged.BEZ R1,placeSD R1,7(R1)NoneNo. Changing instruction order is speculative until
34、the branch resolved指令级并行3.5List all the dependences (output, anti, and true) in the following code fragment. Indicate whether the true dependences are loop carried or not. Show why the loop is not parallel.n for (i=2;i 40%PB1PB2PB1PB2PB1PB2PB1PB2NTTTNTNTNTNTTTTTNTNTNTNTTCorrect prediction?-no-no-yes
35、-no-yes-no-yes-noPB1PB1PB1PB1NTTTNTNTTTNTCorrect prediction?-no-no-no-no-指令级并行nAnswer b.nPrediction Accuracy decreases: 100% - 0%PB1PB2PB1PB2PB1PB2PB1PB2NTTTNTNTTTNTNTTTNTNTTTNTCorrect prediction?-no-no-no-no-no-no-no-noPB1PB1PB1PB1NTTTTTTTTCorrect prediction?-no-yes-yes-yes-指令级并行n3.14Suppose we hav
36、e a deeply pipeline processor. For which we implement a branch-target buffer for the conditional branches only. Assume that the misprediction penalty is always 4 cycles and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy and 15% branch frequency.Q. How much faster is
37、 the processor with the branch-target buffer versus a processor that has a fixed 2 cycle branch penalty? Assume a base CPI without branch stall of 1.指令级并行nBranch-Target Buffer (BTB):Address of branch index to get prediction AND branch address (if taken)指令级并行nAnswernCPIBTB - System with a branch-targ
38、et bufferCPINBTB - System without a branch-target buffer CPINBTB CPIbase + StallNBTB nSpeedup = - = - CPIBTB CPIbase + StallBTBCPIbase = 1 exercise statement指令级并行Stall = sStall FrequencysPenaltys StallNBTB = 15%2=0.3 StallBTB= 1.5%3 + 1.3%4 = 0.097BTB resultBTB predictionFrequency (per instruction)
39、Penalty (cycle)Miss-15%10%=1.5%3HitCorrect15%90%90%=12.1%0HitIncorrect15%90%10%=1.3%4Assume 90% hit rate and 90% accuracy and 15% branch frequency指令级并行 CPIbase + StallNBTB 1+0.3Speedup = - = - = 1.2 CPIbase + StallBTB 1+0.097 20% faster指令级并行n3.20 When an instruction is correctly speculated, what is
40、the effect on the three factors comprising the CPU time formula: dynamic instruction count, average clocks per instruction, and clock cycle time? When speculation is incorrect, it is possible for CPU time to increase. Which factor of the CPU time formula best model this increase and why?nAnalysis: t
41、he CPU time=IC*CPI*Clock cycle time指令级并行nAnswer When speculation is correct, it allows an instruction that should execute earlier by reducing or elimination stalls that would occur if execution were delayed until the instruction was no longer speculative. Early execution of a required instruction ha
42、s no effect on instruction count or clock cycle. The reduction in stall cycles improves CPI.指令级并行nAnswer When speculation is incorrect, instructions that are not on the path of execution are executed and their results ignored. There is no effect on clock cycle time, but the dynamic instruction count
43、 increases. The mix of instructions executed may change and lead to a minor effect on CPI, but the majority of the increase in CPU time will be due to the cycles spent on incorrectly speculated instructions, which is best modeled as an increase in IC.存储系统n4.6In systems with a write-through L1 cache
44、backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified. Q. Explain how this can be done. 存储系统nMerging Write BuffernIf the write buffer is empty, the data and the full address are written in the buffernWrite merging: If the write buffer contains other modifie
45、d blocks, the addresses can be checked to see if the address of this new data matches the address of a valid write buffer entry. if so, the new data are combined with that entry.nIf the buffer is full and there is no address match, the cache (and CPU) must wait until the buffer has an empty entry.存储
46、系统nAnswerThe merging write buffer links the CPU to the write-back L2 cache. Two CPU writes cannot merge if they are to different sets in L2. So, each new entry into the buffer a quick check on only those address bits that determine the L2 set number need be performed at first. If there is no match i
47、n this “screening” test, then the new entry is not merged. If there is a set number match, then all address bits can be checked for a definitive result.存储系统n4.8 一个1616的矩阵,要示在一个存储器周期内实现按行,按列,按对角线和按反对角线的无冲突访问。至少需要多少个存储体?写出矩阵的各元素在各个存储体中存放的位置。nAnswer:对NN数组,同列相邻元素地址距离为1,同行相邻元素地址距离为2.则m取成22p+1.实现无冲突访问的充分条
48、件是使1=2p, 2=1 本题中m=17,p=2, 1=4, 2=1存储系统1234567891011121314151617000102030405060708090100110120130140151011.2021.存储系统n4.19 假设在一个采用组相联映象方式的Cache中,主存由B0-B7共8块组成,Cache有2组,每组2块,每块的大小为16个字节,采用LRU块替换算法。n在一个程序执行过程中依次访问这个Cache的块地址流如下:B6,B2,B4,B1,B4,B6,B3,B0,B4,B5,B7,B3n假设主存与Cache之间的各个块的映象的对应关系如下:C0 C1 C2 C3B0
49、B1B2B3B4B5B6B7区0区11. 写出主存地址的格式,并标出各字段的长度2. 写出Cache的地址格式,并标出各字段的长度3. 如果Cache的各个块号为C0,C1,C2,C3,列出程序执行过程中Cache的块地址流情况4. 如果采用LRU替换算法,计算Cache的块命中率5. 如果采用FIFO替换算法,计算Cache的块命中率存储系统区号1组号1块号1块内地址(4位)组号1块号1块内地址(4位)2.写出Cache的地址格式,并标出各字段的长度1. 写出主存地址的格式,并标出各字段的长度存储系统存储系统3. 如果Cache的各个块号为C0,C1,C2,C3,列出程序执行过程中Cache
50、的块地址流情况B6B2B4B1B4B6B3B0B4B5B7B3C0B4B4*B4B4*B4*B4*B4B4*B4*B4*C1B1B1*B1*B1*B0B0*B5B5*B5*C2B6B6*B6*B6*B6*B6B6*B6*B6*B6*B7B7*C3B2B2*B2*B2*B2*B3B3*B3*B3*B3*B3B6,B2, B4, B1,B4, B6, B3, B0,B4,B5, B7,B3C2C3 C0 C1C0 C2 C3 C1C0C1 C2C3n4. 如果采用LRU替换算法,计算Cache的块命中率B6B2B4B1B4B6B3B0B4B5B7B3C0B4B4*B4B4*B4*B4*B4B4*B
51、4*B4*C1B1B1*B1*B1*B0B0*B5B5*B5*C2B6B6*B6*B6*B6*B6B6*B6*B6*B6*B7B7*C3B2B2*B2*B2*B2*B3B3*B3*B3*B3*B3命中率为:命中率为:4/12=1/3=33.3%存储系统存储系统B6B2B4B1B4B6B3B0B4B5B7B3C0B4B4*B4B4*B4*B0B0*B5B5*B5*C1B1B1*B1*B1*B1*B4B4*B4*B4*C2B6B6*B6*B6*B6*B6B3B3*B3*B3*B3*B3C3B2B2*B2*B2*B2*B2*B2*B2*B2*B7B7*命中率为:命中率为:3/12=1/4=25%5.
52、如果采用FIFO替换算法,计算Cache的块命中率存储系统n4.20Some memory systems handle TLB missed in software (as an exception), while others use hardware for TLB (Translation Lookaside Buffer) misses.存储系统Qa. What are the trade-off between two methods for handling TLB misses?Qb. Will TLB miss handling in software always be s
53、lower than TLB miss handling in hardware? Explain.Qc. Are there page table structures that would be difficult to handle in hardware, but possible in software? Are there any such structure that would be difficult for software to handle but easy for hardware to manage?存储系统Qd. Use the data from Figure
54、5.45 to calculate the penalty to CPI for TLB misses on the following workload assuming hardware TLB handlers require 10 cycles per miss and software TLB handlers takes 30 cycles per miss: (50% gcc, 25% perl, 25%ijpeg), (30% swim, 30% wave5, 20% hydro2d, 10% gcc). Qe. Are the TLB miss times in part(d
55、) realistic? Discuss.Qf. Why are TLB miss rate for floating-point program generally higher than those for integer program?存储系统nAnswer a. Software is slower because of the overhead of a context switch to the handler code, but the replacement algorithm can be higher than hardware and a wider variety of virtual memory organizations can be readily accommodated.Hardware - faster but less flexiblenAnswer b. Factors affecting on the ha
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
评论
0/150
提交评论