Superscalar Pipelines (89 slides in total, on instruction-related topics)

Uploaded by: 可*** · IP location: Jiangxi · Upload date: 2022-11-27 · Format: PPTX · Pages: 89 · Size: 35 MB

Document summary

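Slide 1: Superscalar Pipelines

Slide 2: Pipelining to Superscalar
- Forecast
  - Limits of pipelining
  - The case for superscalar
  - Instruction-level parallel machines
  - Superscalar pipeline organization
  - Superscalar pipeline design

Slide 3: Limits of Pipelining
- IBM RISC Experience (p. 91; Tilak Agerwala and John Cocke, 1987) (a limit of principle)
  - Control and data dependences add 15% (CPI of about 1.15)
  - Deeper pipelines (higher frequency) magnify dependence penalties
- This analysis assumes 100% cache hit rates (a memory-system limit)
  - Hit rates approach 100% for some programs
  - Many important programs have much worse hit rates
  - Later!

Slide 4: Processor Performance (p. 17)
- In the 1980s (decade of pipelining):
- In the 1990s (decade of superscalar): CPI: 1.15 => 0.5 (best case)

$$\text{Processor Performance} = \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

The three factors are code size, CPI, and cycle time.

Slide 5: Amdahl's Law (p. 18)
- h = fraction of time in serial code
- f = fraction that is vectorizable
- v = speedup for f
- Overall speedup:

$$\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{v}}$$

[Figure: number of processors (1 to N) versus time, with segments h, 1-h, f, and 1-f]

Slide 6: Revisit Amdahl's Law
- Sequential bottleneck
- Even if v is infinite, performance is limited by the nonvectorizable portion (1-f)

[Figure: number of processors (1 to N) versus time, with segments h, 1-h, f, and 1-f]

Slide 7: Pipelined Performance Model (Harold Stone, 1987, p. 19)
- g = fraction of time the pipeline is filled
- 1-g = fraction of time the pipeline is not filled (stalled)
- Three phases:
  - First: N instructions enter the pipeline
  - Second: the pipeline is full; assuming no stalls from pipeline interference, this is the pipeline's best performance
  - Third: the pipeline drains; no new instructions enter, and the instructions already in the pipeline finish executing

Slide 8: Pipelined Performance Model
- Tyranny of Amdahl's Law [Bob Colwell]
  - When g is even slightly below 100%, a big performance hit will result
  - Stalled cycles are the key adversary and must be minimized as much as possible

[Figure: pipeline depth (1 to N) versus time, with segments 1-g and g]

Slide 9: Motivation for Superscalar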
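As a rough check on the arithmetic behind the superscalar motivation, the sketch below evaluates Amdahl's Law as defined on the slides above. The function names are illustrative, and the revised form with a speedup `s` on the non-vectorizable portion is an assumption, reconstructed so that N = 6 and f = 0.8 reproduce the jump from 3 to 4.3 quoted for this slide.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Speedup when a fraction f of the work runs n times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - f) + f / n)


def revised_speedup(f: float, n: float, s: float) -> float:
    """Assumed revised form: the remaining fraction (1 - f) is also sped up, by a factor s."""
    return 1.0 / ((1.0 - f) / s + f / n)


if __name__ == "__main__":
    # Scalar baseline (s = 1) versus a machine that runs the serial part twice as fast (s = 2).
    print(round(amdahl_speedup(0.8, 6), 2))      # 3.0
    print(round(revised_speedup(0.8, 6, 2), 2))  # 4.29
```

The same `amdahl_speedup` function also covers the pipelined performance model if `f` is read as g (fraction of time the pipeline is filled) and `n` as the pipeline depth N.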

- [Agerwala and Cocke] (p. 23)
- Speedup jumps from 3 to 4.3 for N=6, f=0.8, but s=2 instead of s=1 (scalar)

[Figure: speedup curves, with the "Typical Range" of f marked]

Slide 10: Superscalar Proposal
- Moderate the tyranny of Amdahl's Law
  - Ease the sequential bottleneck
  - More generally applicable
  - Robust (less sensitive to f)
- Revised Amdahl's Law:

$$\text{Speedup} = \frac{1}{\dfrac{1 - f}{s} + \dfrac{f}{N}}$$

Slide 11: Limits on Instruction Level Parallelism (ILP)

  Study                          ILP     Note
  Weiss and Smith [1984]         1.58
  Sohi and Vajapeyam [1987]      1.81
  Tjaden and Flynn [1970]        1.86    Flynn's bottleneck
  Tjaden and Flynn [1973]        1.96
  Uht [1986]                     2.00
  Smith et al. [1989]            2.00
  Jouppi and Wall [1988]         2.40
  Johnson [1991]                 2.50
  Acosta et al. [1986]           2.79
  Wedig [1982]                   3.00
  Butler et al. [1991]           5.8
  Melvin and Patt [1991]         6
  Wall [1991]                    7       Jouppi disagreed
  Kuck et al. [1972]             8
  Riseman and Foster [1972]      51      no control dependences
  Nicolau and Fisher [1984]      90      Fisher's optimism

- Variance due to: benchmarks, machine models, cache latency and hit rate, compilers, religious bias, general-purpose vs. special-purpose/scientific, C vs. Fortran
- Not monotonic with time

Slide 12: Superscalar Proposal
- Go beyond the single-instruction pipeline, achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared for sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)
  - Not 100 or 1000 degrees of parallelism, but 2-3-4
  - Fine-grained vs. medium-grained (loop iterations) vs. coarse-grained (threads)

Slide 13: Classifying ILP Machines [Jouppi, DECWRL 1991]
- Baseline scalar RISC
  - Issue parallelism = IP = 1
  - Operation latency = OP = 1
  - Peak IPC = 1
- IP = max instructions/cycle
- Op latency = number of cycles until the result is available
- Issuing latency

Slide 14: Classifying ILP Machines [Jouppi, DECWRL 1991]
- Superpipelined: cycle time = 1/m of baseline
  - Issue parallelism = IP = 1 inst / minor cycle
  - Operation latency = OP = m minor cycles
  - Peak IPC = m instr / major cycle (m x speedup?)

Slide 15: Classifying ILP Machines [Jouppi, DECWRL 1991]
- Superscalar:
  - Issue parallelism = IP = n inst / cycle
  - Operation latency = OP = 1 cycle
  - Peak IPC = n instr / cycle (n x speedup?)

Slide 16: Classifying ILP Machines [Jouppi, DECWRL 1991]
- VLIW: Very Long Instruction Word
  - Issue parallelism = IP = n inst / cycle
  - Operation latency = OP = 1 cycle
  - Peak IPC = n instr / cycle = 1 VLIW / cycle
- Characteristics: parallelism packaged by the compiler, hazards managed by the compiler, no (or few) hardware interlocks, low code density (NOPs), clean/regular hardware

Slide 17: Classifying ILP Machines [Jouppi, DECWRL 1991]
- Superpipelined-Superscalar
  - Issue parallelism = IP = n inst / minor cycle
  - Operation latency = OP = m minor cycles
  - Peak IPC = n x m instr / major cycle

Slide 18: Superscalar vs. Superpipelined
- Roughly equivalent performance
  - If n = m then both have about the same IPC
  - Parallelism exposed in space vs. time

Slide 19: Superpipelining: Result Latency
[Figure only]

Slide 20: Superscalar Challenges
[Figure only]

Slide 21: Limitations of Scalar Pipelines
- Scalar upper bound on throughput
  - IPC <= 1 or CPI >= 1
- Inefficient unified pipeline
  - Long latency for each instruction
- Rigid pipeline stall policy
  - One stalled instruction stalls all newer instructions

Slide 22: Parallel Pipelines
- Temporal vs. spatial vs. both

Slide 23: Intel Pentium Parallel Pipeline
- 486 pipeline on the left: 2 decode stages due to the complex ISA
- Pentium parallel pipeline: the U pipe is universal (can handle any op), the V pipe can't handle the most complex ops
- Stages: fetch and align; decode and generate control word; decode control word and generate memory address; ALU or D$
- Used branch prediction

Slide 24: Diversified Pipelines
- Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense. With multiple issue, specialized pipelines make much more sense.
- Note that all instructions are treated identically in IF, ID (also RD, more or less), and WB. Why? Because they behave very much the same way.

Slide 25: Power4 Diversified Pipelines
[Figure labels: PC; I-Cache; BR Scan; BR Predict; Fetch Q; Decode; Reorder Buffer; BR/CR Issue Q; CR Unit; BR Unit; FX/LD 1 Issue Q; FX1 Unit; LD1 Unit; FX/LD 2 Issue Q; LD2 Unit; FX2 Unit; FP Issue Q; FP1 Unit; FP2 Unit; StQ; D-Cache]

Slide 26: Rigid Pipeline Stall Policy
[Figure labels: Bypassing of Stalled Instruction; Stalled Instruction; Backward Propagation of Stalling; Not Allowed]

Slide 27: Dynamic Pipelines
- In-order front end, dynamic execution in a micro-dataflow machine, in-order back end
- Interlock hardware (later) maintains dependences
- Reorder buffer tracks completion and exceptions, and provides precise interrupts: drain the pipeline, restart
- In-order machine state follows the sequential execution model inherited from nonpipelined/pipelined machines

Slide 28: Interstage Buffers
- Scalar pipe: just pipeline latches or flip-flops
- In-order superscalar pipe: just wider ones
- Out-of-order: start to look more like register files, with random access necessary, or shift registers
  - May require an effective crossbar between slots before/after the buffer
  - May need to be a multiported CAM

Slide 29: Superscalar Pipeline Stages
[Figure labels: In Program Order; In Program Order; Out of Order]

Slide 30: Limitations of Scalar Pipelines
- Scalar upper bound on throughput
  - IPC <= 1 or CPI >= 1
  - Solution: wide (superscalar) pipeline
- Inefficient unified pipeline
  - Long latency for each instruction
  - Solution: diversified, specialized pipelines
- Rigid pipeline stall policy
  - One stalled instruction stalls all newer instructions
  - Solution: out-of-order execution, distributed execution pipelines

Slide 31: Some Typical Superscalar Processors
- In the early 1990s, superscalar processors first appeared as dual-issue designs that fetch, decode, issue, execute, and write back multiple instructions in the same clock cycle.
- The first commercially successful superscalar microprocessor, the Intel i960 RISC processor, came to market in 1990.
- First-generation dual-issue superscalar processors include the Motorola 88110, Alpha 21064, and HP PA-7100 (all RISC), and the Pentium.

Slide 32: Some Typical Superscalar Processors
- From the mid-1990s:
  - IBM POWER2 RISC System/6000 processor; PowerPC 601, 603, 604, 750 (G3), 620; IBM POWER
  - DEC Alpha 21164, Alpha 21264
  - Sun UltraSPARC, UltraSPARC-II, IIi, III
  - HP PA-8000, PA-8500
  - MIPS R10000, MIPS R12000
- 4-issue and 6-issue

Slide 33: Some Typical Superscalar Processors
- Intel, the dominant superscalar microprocessor vendor, produces the Intel x86 ISA family:
  - the dual-issue Pentium of 1993
  - the Pentium Pro and Pentium II, followed by the newer Celeron, Pentium III, and Pentium 4
- Because of their ISA, Intel microprocessors are considered CISC microprocessors.
- Other companies designed Intel-compatible processors, such as AMD's K5, K6, K6-2, and K6-3, and Cyrix's 6x86, MII, and MXI.
- These CISC microprocessors have additional pipeline stages that translate x86 instructions into what are called RISC86 operations or micro-ops, so their pipelines are more complex than those of superscalar RISC processors.

Slide 34: Components of a Superscalar Processor
[Figure labels: I-cache; D-cache; Bus Interface Unit; Branch Unit; Instruction Fetch Unit; Reorder Buffer; Instruction Issue Unit; Retire Unit; Load/Store Unit; Integer Unit(s); Floating-Point Unit(s); Rename Registers; General Purpose Registers; Floating-Point Registers; BTAC; BHT; MMU; MMU; 32 (64) Data Bus; 32 (64) Address Bus; Control Bus; Instruction Buffer; Instruction Decode and Register Rename Unit]

Slide 35: Components of a Superscalar Processor
- A superscalar RISC microprocessor typically has a load/store architecture with fixed-length 32-bit instructions.
- The processor contains the following units:
  - Instruction fetch unit (including the branch unit)
  - Decode unit
  - Register rename unit
  - Issue unit
  - Several independent execution functional units (FUs)

Slide 36: Components of a Superscalar Processor
  - Retire unit
  - 32 general-purpose registers, 32 floating-point registers, and additional physical rename registers
  - Bus interface, with the external memory bus connecting to the second-level cache
  - Instruction cache
  - Data cache
  - Additional internal buffers (such as the instruction buffer and the reorder buffer)

Slide 37: Functional Units
- Load/store unit
- Floating-point unit
- Integer unit
- Multimedia unit
- Branch unit
- The types and number of functional units depend on the particular processor.

Slide 38: Superscalar Pipeline Design
- Instruction fetching issues
- Instruction decoding issues
- Instruction dispatching issues
- Instruction execution issues
- Instruction completion and retiring issues

Slide 39: Instruction Flow
- Objective: fetch multiple instructions per cycle
  - Don't starve the pipeline: n/cycle
  - Must fetch n/cycle from IF
- Challenges:
  - Branches: control dependences
  - Branch target misalignment
  - Instruction cache misses
- Solutions:
  - Code alignment (static vs. dynamic)
  - Prediction/speculation

[Figure labels: Instruction Memory; PC; 3 instructions fetched]

Slide 40: I-Cache Access and Instruction Fetch
- Harvard architecture: separate instruction and data memory and access paths
- The I-cache is less complicated to control than the D-cache, because
  - it is read-only
  - it is not subject to cache coherence, in contrast to the D-cache
  - of the MESI protocol states, only Shared and Invalid are needed
- Sometimes instructions are predecoded on their way from the memory interface to the I-cache to simplify the decode stage (PowerPC 620)

Slide 41: Instruction Fetch (1)
- The main problem for the instruction fetch unit is handling jump, branch, call, return, and interrupt instructions
  - They interrupt sequential fetching
  - The interruption can fall in the middle of a fetch block or right at its end; every instruction after the break point must be discarded
- Wallace and Bagherzadeh showed that in an 8-issue superscalar, simple fetch hardware delivers no more than 4 valid instructions per cycle (SPECint95)
- If the start address in the PC is not the address of a cache line, fewer instructions than the fetch width are returned to the decode unit
- If the fetch packet contains a branch, the instructions after the branch are automatically invalid

Slide 42: Instruction Fetch (2)
- Fetching multiple cache lines from different locations may be needed in the future (how wide should fetch be?) for:
  - very wide-issue processors, where a single contiguous fetch block will often contain more than one branch
  - eager execution of both sides of a branch
  - multithreaded processors

Slide 43: Instruction Fetch (3)
- Another problem: the address of the target instruction may not be aligned with a cache-line address (where should fetch start?)
- Hardware solution: a self-aligned instruction cache
  - reads two adjacent cache lines in one cycle
  - ensures that the fetch bandwidth can be met
- Implementation: either a dual-port I-cache, two separate cache accesses in a single cycle, or a two-banked I-cache (preferred)

Slide 44: Prefetching and Instruction Fetch Prediction
- Prefetching improves instruction fetch performance, but fetching is still limited because instructions after a control transfer must be invalidated
- Instruction fetch prediction helps to determine the next instructions to be fetched from the memory subsystem
- Instruction fetch prediction is applied in conjunction with branch prediction
- Open questions: new prediction-based replacement algorithms for the instruction cache? Analysis of the main-memory address stream produced by instruction cache accesses?

Slide 45: I-Cache Organization
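The fetch rules from the Instruction Fetch slides above (a simple fetch stops at the cache-line boundary, a self-aligned I-cache may continue into the next sequential line, and instructions after a control transfer are invalid) can be sketched as a small model. This is an illustrative sketch under stated assumptions, not a description of any particular processor: the function name and parameters are hypothetical, addresses are counted in instructions rather than bytes, and the fetch width is assumed not to exceed the line size.

```python
def fetched_instructions(pc, width, line_size, taken_branches=frozenset(), self_aligned=False):
    """Return the instruction indices delivered to decode in one fetch cycle.

    pc             -- index of the first instruction to fetch
    width          -- fetch width, in instructions
    line_size      -- instructions per cache line
    taken_branches -- indices of instructions that redirect control flow
    self_aligned   -- if True, fetch may run on into the next sequential line
    """
    if self_aligned:
        limit = pc + width
    else:
        # A simple fetch cannot cross the end of the current cache line.
        line_end = (pc // line_size + 1) * line_size
        limit = min(pc + width, line_end)

    group = []
    for i in range(pc, limit):
        group.append(i)
        if i in taken_branches:
            break  # everything after the control transfer is invalid
    return group


if __name__ == "__main__":
    print(fetched_instructions(0, 4, 4))                     # [0, 1, 2, 3]
    print(fetched_instructions(2, 4, 4))                     # [2, 3]
    print(fetched_instructions(2, 4, 4, self_aligned=True))  # [2, 3, 4, 5]
    print(fetched_instructions(0, 4, 4, taken_branches={1})) # [0, 1]
```

The second call shows the bandwidth lost to misalignment (two of four slots), and the last call shows the loss caused by a control-flow instruction inside the fetch group.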

[Figure: two I-cache layouts, each with a row decoder, tag arrays, and an address input; in one, 1 cache line = 1 physical row; in the other, 1 cache line = 2 physical rows]

Two factors keep fetch from reaching the maximum number of instructions per cycle:
- Fetch alignment
- The presence of control-flow changing instructions in the fetch group

Slide 46: Fetch Alignment
- Fetch size n=4: fetch bandwidth is lost if the fetch is not aligned

Slide 47: Solution for Fetch Misalignment Problem
- Static/compiler: align branch targets at 00 (may need to pad with NOPs)
  - implementation specific
- Using hardware at run time

Slide 48: RIOS-I Fetch Hardware
- 1989 design used in the first IBM RS/6000 (POWER or RIOS-I)

Slide 49: RIOS-I Fetch Hardware (1)
- 1989 design used in the first IBM RS/6000 (POWER or RIOS-I):
  - 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue)
  - 2-way set-associative; a 64 B line spans 4 physical rows, with each instruction word interleaved
  - Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13
  - T-logic detects misalignment and chooses the appropriate index

Slide 50: RIOS-I Fetch Hardware (2)
- The I-buffer network rotates instructions so they leave in program order
- "Interleaved sequential" improves on this by interleaving the tag array; it allows combining of ops from two cache lines. If both hit, 4 instructions can be fetched every cycle.

Slide 51: Issues in Decoding
- Primary tasks
  - Identify individual instructions (!)
  - Determine instruction types
  - Determine dependences between instructions
- Two important factors
  - Instruction set architecture
  - Pipeline width
- RISC vs. CISC
  - RISC: fixed length, regular format, easier
  - CISC: can be multiple stages (lots of work); P6: I$ => decode is 5 cycles; often translates into internal RISC-like uops or ROPs

Slide 52: Decode Stage
- Superscalar processor: an in-order issue front end, an out-of-order core, and an in-order retirement unit
- Instruction delivery: the fetch and decode stages have higher bandwidth than the execute stage (because of mispredicted branch paths)
- Delivery task: keep the instruction window full at all times
  - The deeper the instruction lookahead, the more instructions can be issued to the functional units
- Normally the fetch width equals the decode width

Slide 53: Decoding Variable-Length Instructions
- Processors with fixed-length instructions generally support fetching and decoding multiple instructions
- Variable instruction length: CISC instruction sets such as the Intel x86 ISA
  - A multistage decode is necessary
  - First stage (delimiting): determines the instruction boundaries in the instruction stream and passes instructions of known length to the second stage
  - Second stage (decoding into micro-ops): decodes each instruction and generates one or more micro-ops
- AMD K-series: complex CISC instruction set architecture
  - Complex CISC instructions are split into micro-ops which resemble ordinary RISC instructions
  - The micro-ops can be a few simple instructions, or a flow of simple instructions
- CISC compared with RISC instruction sets:
  - Advantage: higher code density
  - Disadvantage: more complex instruction decoding

Slide 54: Pentium Pro Fetch/Decode
- 16 B/cycle delivered from the I$ into a FIFO instruction buffer
- Decoder 0 is fully general; decoders 1 and 2 can handle only simple uops
- Instructions enter the centralized window (reservation station) and wait there until operands are ready and structural hazards are resolved
- Why is this bad? Branch penalty; need a good branch predictor
- Other option: predecode bits

Slide 55: Pre-decoding
- If the instruction opcode allows it, the fetch stage can analyze part of the operation and use it for prediction
- Pre-decode: done as instructions are transferred from memory to the I-cache
  - The decode stage becomes simpler
- MIPS R10000: predecodes each 32-bit instruction into a 36-bit format stored in the instruction cache
  - The 4 extra bits indicate which functional unit will execute the instruction
  - The operand-select and destination-register-select fields of each instruction are rearranged into the same positions, and the opcode is modified to simplify decoding of integer or floating-point destination registers
  - The decoder handles these expanded instructions much faster than the original instruction format

Slide 56: Predecoding in the AMD K5
- K5: notoriously late and slow, but still interesting (AMD's first non-clone x86 processor)
- ~50% larger I$; predecode bits are generated as instructions are fetched from memory on a cache miss
  - Powerful principle in architecture: memoization!
- Predecode records the start and end of x86 ops, the number of ROPs, and the location of opcodes and prefixes
- Up to 4 ROPs per cycle
- Also useful in RISCs:
  - PPC 620 used 7 bits/inst
  - PA8000, MIPS R10000 used 4/5 bits/inst
  - These were used to identify branches early and reduce the branch penalty

Slide 57: Instruction Dispatch and Issue
- Parallel pipeline
  - Centralized instruction fetch
  - Centralized instruction decode
- Diversified pipeline
  - Distributed instruction execution

Slide 58: Issue and Dispatch
- The instruction window comprises all the waiting stations between the decode and execute stages
  - It isolates the execute stages from the decode stages, but it is not an additional pipeline stage
- Instruction issue: the process of initiating instruction execution in the processor's functional units
  - issue: to a FU or a reservation station
  - dispatch: if a second issue stage exists, denotes when an instruction starts to execute in the functional unit
- The instruction-issue policy is the protocol used to issue instructions
- The processor's lookahead capability is its ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute, so that later independent instructions can be sent to execution

Slide 59: Necessity of Instruction Dispatch
- Must have complex interstage buffers to hold instructions, to avoid a rigid pipeline

Slide 60: Instruction Window Organizations (3-1)
- A central instruction window, corresponding to single-stage issue
  - All instructions bound for all functional units are placed in one common instruction window buffer
- Drawback: issuing from a large central instruction window limits the processor's clock frequency
  - It strains the ability to update the window and to check the relevant resources (functional unit selection, reorder buffer selection)
  - The larger the window, the faster the complexity of update and selection grows

Slide 61: Instruction Window Organizations (3-2)
- Solutions:
  - Multi-stage issue: operand availability and resource availability checking is split into two separate stages
    - Resource-dependent issue first places instructions in reservation stations (one per functional unit or per group of functional units)
    - When the operands are ready and execution is allowed, the second stage dispatches the instruction to the functional unit
  - Decoupling of instruction windows: provide a set of instruction windows or reservation stations
    - Each instruction window is shared by a group of (usually related) functional units
    - Most common: separate floating-point window and integer window

Slide 62: Instruction Window Organizations (3-3)
- Combination of multi-stage issue and decoupling of instruction windows
- Instructions can leave the instruction window either in order or out of order
- In a two-stage issue scheme with resource-dependent issue preceding the data-dependent dispatch:
  - the first stage is done in order
  - the second stage is performed out of order

Slide 63: The Common Issue Schemes
- Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor

[Figure labels: Decode and Rename; Issue and Dispatch; Functional Units]

Slide 64: Single-level, Two-window Issue
- Single-level issue with instruction window decoupling using two separate windows
- Most common: separate floating-point and integer windows, as in the HP 8000 processor

[Figure labels: Decode and Rename; Issue and Dispatch; Functional Units; Functional Units]

Slide 65: Two-level Issue with Multiple Windows
- Two-level issue with a centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors)

[Figure labels: Decode and Rename; Dispatch; Reservation Stations; Issue; Functional Unit (x4)]

Slide 66: Centralized Reservation Station
- Dispatch: based on type
- Issue: when the instruction enters the functional unit to execute (same thing here)
- Centralized: efficient, shared resource; has scaling problems (later)

Slide 67: Distributed Reservation Station
- Distributed, with localized control (easy win: break up based on data type, i.e. FP vs. integer)
- Less efficient utilization, but each unit is smaller since it can be single-ported
- Must tune for proper utilization
- Must make 1000 little decisions (juggle 100 ping pong balls)

Slide 68: Issues in Instruction Execution
- Current trends
  - More parallelism: bypassing is very challenging
  - Deeper pipelines
  - More diversity
- Functional unit types
  - Integer
  - Floating point
  - Load/store: most difficult to make parallel
  - Branch
  - Specialized units (media)

Slide 69: Bypass Networks
- O(n^2) interconnect from/to FU inputs and outputs
- Associative tag match to find operands
- Solutions (hurt IPC, help cycle time)
  - Use the RF only (IBM Power4), with no bypass network
  - Decompose into clusters (Alpha 21264)

[Figure: Power4 pipeline, with the same labels as on the Power4 Diversified Pipelines slide]

Slide 70: Specialized Units
- Intel Pentium 4 staggered adders
  - Fireball
- Run at 2x clock frequency
- Two 16-bit bitslices
- Dependent ops execute on half-cycle boundaries
- Full result not available until a full cycle later

Slide 71: Specialized Units
- FP multiply-accumulate: R = (A x B) + C
  - Doubles FLOP/instruction
  - Loses RISC instruction format symmetry: 3 source operands
  - Widely used
- IBM POWER/PowerPC: FMA or MAF, 3 source operands (loss of regularity in the ISA)
- MIPS R8000 also had this
- MIPS R10000 (OOO) gave up on it; decode cracks FMA into M and A

Slide 72: Media Data Types
- Subword parallel vector extensions
  - Media data (pixels, quantized datum) are often 1-2 bytes
  - Several operands packed in a single 32/64b register: {a,b,c,d} and {e,f,g,h} stored in two 32b registers
  - Vector instructions operate on 4/8 operands in parallel
  - New instructions, e.g. motion estimation: me = |a - e| + |b - f| + |c - g| + |d - h|
- Substantial throughput improvement
  - Usually requires hand-coding of critical loops

[Figure: two packed registers holding e f g h and a b c d]

Slide 73: Media Processors and Multimedia Units
- Use subword parallelism based on single-instruction, multiple-data operation (data-parallel instructions, SIMD)
- Process several groups of small data items in a single cycle and produce several results
- Multimedia units provide:
  - SIMD instructions
  - Saturation arithmetic
  - Additional arithmetic instructions, e.g. masking and selection instructions, reordering and conversion

[Figure: R1 = (x1, x2, x3, x4), R2 = (y1, y2, y3, y4), R3 = (x1*y1, x2*y2, x3*y3, x4*y4); four 16-bit operations computed at once]

Slide 74: Issues in Completion/Retirement
- Out-of-order execution
  - ALU instructions
  - Load/store instructions
- In-order completion/retirement
  - Precise exceptions
  - Memory coherence and consistency
- Solutions
  - Reorder buffer
  - Store buffer
  - Load queue snooping (later)

Slide 75: A Dynamic Superscalar Processor
[Figure only]

Slide 76: Superscalar Overview
- Instruction flow
  - Branches, jumps, calls: predict target, direction
  - Fetch alignment
  - Instruction cache misses
- Register data flow
  - Register renaming: RAW/WAR/WAW
- Memory data flow
  - In-order stores: WAR/WAW
  - Store queue: RAW
  - Data cache misses

Slide 77: Superscalar vs. VLIW: Technical Characteristics
- Superscalar machines are distinguished by their ability to (dynamically) issue multiple instructions each clock cycle from a conventional linear instruction stream.
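To make "dynamically issue multiple instructions each clock cycle from a conventional linear instruction stream" concrete, here is a minimal sketch of in-order multiple issue. The model is hypothetical and deliberately simple: the `Instr` type, the function name, unit latency, and the rule that an instruction cannot issue in the same cycle as an earlier instruction it depends on (RAW or WAW) are all assumptions for illustration, not a description of a real issue unit.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers


def issue_in_order(stream: List[Instr], width: int) -> List[List[int]]:
    """Group a linear instruction stream into per-cycle issue groups.

    Each cycle issues up to `width` instructions in program order and stops
    at the first instruction that reads or overwrites a register written by
    an instruction already issued in the same cycle.
    """
    cycles: List[List[int]] = []
    i = 0
    while i < len(stream):
        group: List[int] = []
        written = set()
        while i < len(stream) and len(group) < width:
            ins = stream[i]
            if written.intersection(ins.srcs) or ins.dest in written:
                break  # dependence inside the group: wait for the next cycle
            group.append(i)
            written.add(ins.dest)
            i += 1
        cycles.append(group)
    return cycles


if __name__ == "__main__":
    program = [
        Instr("r1", ("r2", "r3")),
        Instr("r4", ("r1", "r5")),  # reads r1 from the previous instruction
        Instr("r6", ("r7", "r8")),
        Instr("r9", ("r6", "r4")),  # reads r6 and r4
    ]
    print(issue_in_order(program, width=2))  # [[0], [1, 2], [3]]
```

The number of instructions issued per cycle varies with the dependences found at run time, which is the property that separates this style of issue from a fixed-width instruction word.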

- VLIW processors use a long instruction word that contains a usually fixed number of instructions that are fetched, decoded, issued, and executed synchronously.

Slide 78: Superscalar vs. VLIW: Technical Characteristics
- Superscalar: instructions are issued from a sequential stream of normal instructions
  - VLIW: a sequential stream of instruction tuples (instruction groups or packets) is used
- Superscalar: the instructions that are issued are scheduled dynamically by the hardware
  - VLIW: processors rely on static scheduling by the compiler

Slide 79: Superscalar vs. VLIW: Number of Instructions Executed Simultaneously
- Superscalar
  - More than one instruction can be issued each cycle (motivating the term superscalar instead of scalar)
  - The number of issued instructions is determined dynamically by hardware; that is, the actual number of instructions issued in a single cycle can range from zero up to the maximum instruction issue bandwidth
- VLIW
  - The number of scheduled instructions is fixed, because instruction words are padded with no-ops whenever the full issue bandwidth cannot be met

Slide 80: Superscalar vs. VLIW: Instruction Scheduling
- Dynamic issue in superscalar processors can issue instructions either in order or out of program order
- Only in-order issue is possible with VLIW processors
- Dynamic instruction issue makes the hardware scheduler of a superscalar processor more complicated than that of a VLIW

Slide 81: Superscalar vs. VLIW: Instruction Scheduling Techniques
- Scheduler complexity increases when multiple instructions are issued out of order from a large instruction window
- It is a presumption of superscalar that multiple FUs are available
- The number of available FUs is at least the maximum issue bandwidth ...
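The fixed-width padding described for VLIW above can be illustrated with a small packing sketch. It is an assumption-laden toy, not a real VLIW compiler: the `Instr` tuple layout, the `NOP` marker, and `pack_vliw` are hypothetical names, instructions are kept in program order, and a word is closed as soon as the next instruction depends on one already in it. A real compiler would also reorder independent instructions to fill slots.

```python
from typing import List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)
NOP: Instr = ("-", ())               # placeholder for an unused slot


def pack_vliw(stream: List[Instr], width: int) -> List[List[Instr]]:
    """Statically pack a sequential stream into fixed-width instruction words.

    Every word has exactly `width` slots; slots that cannot be filled with an
    independent instruction are padded with NOPs.
    """
    words: List[List[Instr]] = []
    i = 0
    while i < len(stream):
        word: List[Instr] = []
        written = set()
        while i < len(stream) and len(word) < width:
            dest, srcs = stream[i]
            if written.intersection(srcs) or dest in written:
                break  # dependent on an instruction in this word: start a new word
            word.append(stream[i])
            written.add(dest)
            i += 1
        word.extend([NOP] * (width - len(word)))  # pad the unused slots
        words.append(word)
    return words


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r1", "r5")),
        ("r6", ("r7", "r8")),
        ("r9", ("r6", "r4")),
    ]
    words = pack_vliw(program, width=2)
    nops = sum(word.count(NOP) for word in words)
    print(len(words), nops)  # 3 2
```

Here four instructions occupy six slots, so a third of the encoded program is no-ops; this is the low code density listed among the VLIW characteristics earlier in the deck.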
