21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf_第1页
21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf_第2页
21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf_第3页
21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf_第4页
21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf_第5页
已阅读5页,还剩3页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

INSTRUCTIONISSUELOGICFORHIGH-PERFORMANCE,INTERRUPTABLEP1PELINEDPROCESSORSGurindarS.SohiandSriramVajapeyamComputerSciencesDepartmentUniversityofWisconsin.Madison1210WestDaytonStreetMadison,Wisconsin53706AbstractTheperformanceofpipelinedprocessorsisseverelylimitedbydatadependencies.Inordertoachievehighperformance,amechanismtoalleviatetheeffectsofdatadependenciesmustexist.IfapipelinedCPUwithmultiplefunc-tionalunitsistobeusedinthepresenceofavir-tualmemoryhierarchy,amechanismmustalsoexistfordeterminingthestateofthemachineprecisely.Inthispaper,wecombinetheissuesofdependency-resolutionandprecisenessofstate.Wepresentadesignforinstructionissuelogicthatresolvesdependenciesdynamicallyand,atthesametime,guaranteesaprecisestateofthemachine,withoutasignificanthardwareoverhead.Detailedsimulationstudiesfortheproposedmechanism,usingtheLawrenceLivermoreloopsasabenchmark,arepresented.1.INTRODUCTIONAsthedemandforprocessingpowerincreases,computersystemdesignersareforcedtousetechniquesthatresultinhigh-performanceprocessingunits.Awidelyusedtechniqueispipelining1,inwhichtheoveralllogicofthesystemissplitintoseveralstageswitheachstageperformingasub-taskofacompletetask.Considerableoverlapcanbeachievedbecauseeachstagecanperformasub-taskforadifferenttask.PipelinedCPUshavetwomajorimpedimentstotheirperfor-mance:i)datadependenciesandii)branchinstructions.Aninstructioncannotbeginexecutionuntilitsoperandsareavail-able.Ifanoperandistheresultofapreviousinstruction,theinstructionmustwaittillthepreviousinstructionhascom-pletedexecution,therebydegradingperformance.Theperfor-mancedegradationduetobranchinstructionsisevenmoresevere.Notonlymustaconditionalbranchinstructionwaltforitsconditiontobeknown(resultinginbubblesinthepipe-line),anadditionalpenaltyisincurredinfetchinganinstruc-tionfromthetakenbranchpathtotheinstructiondecodeandissuestage.Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.Amajorproblemthatarisesinpipelinedcomputerdesignisthataninterruptcanbeimprecise2,3.Thisproblemisespeciallysevereinmultiplefunctionalunitcomputersinwhichinstructionscancompleteexecutionoutofprogramorder2,4.Forahigh-performance,pipelinedCPU,anade-quatesolutionmustbefoundfortheimpreciseinterruptprob-lemandmeansmustbeprovidedforovercomingtheperformance-degradingfactors.1.1.BackgroundandPreviousWorkThedetrimentaleffectsofbranchinstructionscanbealleviatedbyusingdelayedbranchinstructions.However,theutilityofdelayedbranchinstructionsislimitedforlongpipe-lines.Insuchcases,othermeansmustexisttoalleviatethedetrimentaleffects.Acommonapproachistousebranchprediction5,6.Usingpredictiontechniques,theprobableexecutionpathofabranchinstructionisdetermined.Instruc-tionsfromthepredictedpathcanthenbefetchedintoinstruc-tionbuffersorevenexecutedinaconditionalmodel2,7,8.Whiletheconditionalmodeofexecutionwillresultinahigherpipelinethroughput,especiallyiftheoutcomeofthebranchesispredictedcorrectly,ahardwaremechanismmustexistwhichwillallowthemachinetorecoverfromanincorrectsequenceofconditionalinstructions.Bothhardwareandsoftwaresolutionsexisttothedatadependencyproblem.Softwaresolutionsusecodeschedulingtechniques(combinedwithalargesetofregisters)toincreasethedependencydistanceandtoprovideinterlocks9.Hardwaresolutionsemploywaitingstationsorreservationsta-tionswhereaninstructioncanwaitforitsoperandsandallowsubsequentinstructionstoproceed10.Inapipelinedmachine,impreciseinterruptscanbecausedbyinstruction-generatedtrapssuchasarithmeticexcep-tionsandpagefaults.Animpreciseinterruptcanleavethemachineinanirrecoverablestate.Whiletheoccurrenceofarithmeticexceptionsisrare,theoccurrenceofpagefaultsinamachinethatsupportsvualmemoryisnot.Therefore,ifvir-tualmemoryistobeusedwithapipelinedCPU,itiscrucialthatinterruptsbeprecise.Severalhardwaresolutionstotheproblemaredescribedin3.Weareunawareofanysoftwaresolutionstotheimpreciseinterruptproblemformultiplefunc-tionalunitcomputers.Asoftwaresolutionwillbeextremelydifficult,ifnotimpossible.Notonlymustthesoftwareallowfortheworst-caseexecutiontimeforanyinstruction,itmustalsokeeptrackofinstructionsthathavecompletedoutofpre-1987ACM0084-7495/877527gramorderandgeneratetheappropriatecodesequencetoundotheeffectsofthoseinstructions.Ineithercase,somehardwaresupportmustbeprovidedtomaintainruntimeinformation.1.2.OutlineofthePaperInthispaper,wetreattheproblemsofdependencyresolu-tionandimpreciseinterruptssimultaneously.Sinceahardwaremechanismmustexistforimplementingpreciseinterrupts,whynotextendthismechanismtoresolvedependenciesandallowout-of-orderinstructionexecution?Insection2,wediscussTomasulosdependency-resolutionalgorithmandextendit,givingseveralvariations,sothatthecostofimplementingitisnotprohibitiveevenforalargenumberofregisters.Insection3,wediscusstheproblemofimpreciseinterruptsandpresentsolutions.Section4describesaunitthatresolvesdependenciesaswellasimple-mentspreciseinterrupts.Thepreciseinterruptanddependency-resolutionmechanismsmutuallyaidandsimplifyeachother.AsimulationanalysisoftheproposedmechanismusingseveralLivermoreloopsasbenchmarksiscarriedoutinsection5.Finally,wediscusshowourmechanismmightbeusedtoalleviatethedegradationduetobranchinstructions.Throughoutthepaper,wediscussincrementalmodificationstothebasicprinciples.Datasupportingourclaimsforsuchmodificationshavebeenomittedforreasonsofconciseness.However,wedopresentdetailedsimulationdataforourfinaldesign.1.3.ModelArchitectureThemodelarchitecturethatweuseforourstudiesispresentedinFigure1.IthasthesamecapabilitiesandexecutesthesameinstructionsetasthescalarunitoftheCRAY-14,11.However,thereisamajordifference.Inourarchitec-ture,allinstructions,whethertheyarecomposedofIparcel(16bits)or2parcels(32bits)canissueinasinglecycleifissueconditionsarefavorable.Therefore,thebest-caseexecu-tiontimeofaconditionalbranchinstructionis4clockcyclesaftertheconditionisknownasopposedto5clockcyclesfortheCRAY-111.TheCRAY-1waschosenbecauseitrepresentsastate-of-the-artscalarunitanditsexecutioncanbemodeledprecisely.TheauthorsalsohadeasyaccesstotoolsthatcouldbeusedtogenerateinstructiontracesfortheCRAY-1scalarunit12.Themodelmachine,therefore,con-sistsofseveralfunctionalunitsconnectedtoacommonresultbus.Onlyonefunctioncanoutputdataontotheresultbusinanyclockcycle.InstructionsarefetchedbytheInstructionFetchUnitanddecodedandissuedbytheDecodeandIssueUnit.Oncedependencieshavebeenresolvedinthedecodeandissueunit,instructionsareforwardedtothefunctionalunitsforexecution.Theresultsofthefunctionalunitsarewrit-tendirect2yintotheregisterfile.Theregisterfileconsistsof8A,8S,64Band64Tregisters.2.DEPENDENCYRESOLUTION:OUT-OF-ORDERINSTRUCTIONEXECUTIONWhenaninstructionreachesthedecodeandissuestageinthepipeline,checksmustbemadetodetermineiftheoperandsfortheinstructionareavailable,i.e.,ifalldependenciesforthisinstructionhavebeenresolved.Ifanoperandisnotavailable,theinstructionmustwait.Consequently,subsequentinstruc-tionscannotproceedeventhoughtheymaybereadytoexe-FunctionalUnitsiFromMemoryIRegisterInstructionFetchUnit,IrFileli,IZRestBusFigure1.TheBasicArchitecturecute.Subsequentinstructionscanproceedifthewaitinginstructionstepsaside,andallowsotherinstructionstobypassitwhileitwaitsforitsoperanfs.Reservationstationspermitaninstructiontodothis10.2.1.TomasulosAlgorithmTomasulosdependency-resolutionalgorithmwasfirstpresentedforthefloating-pointunitoftheIBM360/9110.AnextensionofthisalgorithmforthescalarunitoftheCRAY-1ispresentedin13.Thealgorithmoperatesasfol-lows.AninstructionwhoseoperandsarenotavailablewhenitentersthedecodeandissuestageisforwardedtoaReservationStation(RS)associatedwiththefunctionalunitthatitwillbeusing.ItwaitsintheRSuntilitsdatadependencieshavebeenresolved,i.e.,itsoperandsareavailable.Onceatareservationstation,aninstructioncanresolveitsdependenciesbymonitor-ingtheCommonDataBus(theResultBusinourmodelarchi-tecture).Whenalltheoperandsforaninstructionareavail-able,itisdispatchedtotheappropriatefunctionalunitforexe-cution.Theresultbuscanbereservedeitherwhentheinstruc-tionisdispatchedtothefunctionalunit13orsoonbeforeitisabouttheleavethefunctionalunit10.Eachsourceregisterisassignedabitthatdeterminesifrtheregisterisbusy.Aregisterisbusyifitisthedestinationofaninstructionthatisstillinexecution.Adestinationregisterisalsocalledasinkregister10.Eachsinkregisterisassignedatagwhichidentifiestheresultthatmustbewrittenintotheregister.Sinceanyregisterintheregisterfilecanbeasink,eachregistermustbeassignedatag.Eachreservationstationhasthefollowingfields:SourceOperand1SourceOperand2Destination28Ifasourceregisterisbusywhentheinstructionreachestheissuestage,thetagforthesourceregisterisobtainedandtheinstructionisforwardedtoareservationstation.Ifthesinkregisterisbusy,theinstructionfetchesanewtag,updatesthetagofthesinkregisterandproceedstoareservationstation.Theregistersaswellasthereservationstationsmonitortheresultbusandupdatetheircontentswhenamatchingtagisfound.Memoryistreatedasaspecialfunctionalunit.Detailsofthealgorithmcanbefoundin10and13.Whilethisalgorithmisstraightforwardandeffective,itisexpensivetoimplementbecauseeachregisterneedstobetaggedandeachtagneedsassociativecomparisonhardwaretocarryoutthetag-matchingprocess.Thismaynotbepracticalifthenumberofpossiblesinkfields,i.e.,thenumberofregis-tersislarge.Forourmodelarchitecturewhichhas8A,8S,64Band64Tregisters,clearlytheuseof144tag-matchinghardwareunitsisimpractical.2.2.ExtensionstoTomasulosAlgorithm2.2.1.ASeparateTagUnitOncloserinspectionweseethatveryfewofallpossiblesinkregistersmayactuallybeactive,i.e.,bewaitingforaresultatanygiventime.Therefore,ifweassociateatagwitheachpossiblesinkregister,alotofassociativetag-matchinghardwarewillbeidleatanygiventime.Whynothaveacom-montagpoolandassignatagonlytoacurrentlyactivesinkregisterratherthanassociatingatagwitheachpossiblesinkfield?InTomasulosalgorithm,acurrentlyactiveregisterisonewhosebusybitison.WeconsolidatethetagsfromallcurrentlyactiveregistersintoaTagUnit(TU).Eachregisternowhasonlyasinglebusybit.Atinstructionissuetime,ifasourceregisterisbusy,theTUisqueriedforthecurrenttagoftheappropriateregisterandthetagisforwardedtothereservationstations.Anewtagisobtainedforthedestinationregisteroftheinstruction.Ifthedestinationregisterisnotbusy,acquiringsuchatagfromtheTUissaightforward.Ifthedestinationregisterisbusy,i.e.,theTUalreadyholdsatagfortheregister,anewtagisobtainedandtheinstructionholdingtheoldtagisinformedthat,whileitmayupdatetheregister,itmaynotunlocktheregister,i.e.,clearthebusybit.Instructionissueblocksifnotagcanbeobtained,i.e.,theTUisfull.Asbefore,theinstructionalongwithitsassociatedtags/operandsisforwardedtoareservationstationwhereitwaitsforitsoperandstobecomeready.Theresultfromafunctionalunit(alongwithitstag)isbroadcasttoallreservationstationsandisalsoforwardedtotheTU.Reservationsta-tionsmonitortheresultbusandgateintheresultifthetagofthedataontheresultbusmatchesthetagstoredinthereserva-tionstation.TheTUforwardstheresulttotheregisterspecifiedintheappropriateslotoftheTU.Allregistersare,therefore,updatedonlybytheTUwhentheirdataisavailableandnodirectconnectionisneededbetweenthefunctionalunitsandtheregisterfile.WhentheregisterhasbeenupdatedbytheTU,thecorrespondingtagisreleasedandismarkedfreeintheTU.Inordertoensurecorrectoperation,i.e.,onlythelatesttagforeachregisterisusedbyallsubsequentinstructionsandonlythelatestinstructionupdatesthebusybitoftheregister,weassociateanotherbitwitheachTUentry.Thisbitindicatesifthetagisthelatesttagfortheregisterandiftheinstructionhasakeytounlocktheregister,i.e.,clearthebusybit.ThemodifiedarchitecturethatincorporatesaTagUnitandreserva-tionstationsassociatedwitheachfunctionalunitisshowninFigure2.ThereservationstationsaremodifiedsothattheresultcanbeforwardedtotheappropriateslotintheTU.Thenewreservationstationhasthefollowingfields:SourceOperand1SourceOperand2DestinationIstationsIIFunctionalIUnitsResultBusIFromMemoryRegisterIFileInstructionFetchUnitIIDecodeandIssueUnit,flIl.IIVFigure2.IssueLogicwithaTagUnitandDistributedReservationStations.ExampleTheoperationoftheTagUnitisbestillustratedbyanexample.ConsideraTUthathas6entriesasshowninFigure3.EachentryintheTUhasabitindicatingifthetagisfree,i.e.,availableforusebytheissuelogic,abitindicatingifthetagisthelatesttagfortheregisterandafieldforthenumberofthedestinationregister.TheTUisindexedbythetagnumber.Considertheexe-cutionofaninstruction11thataddsthecontentsofregistersSOand$7andputtheresultin$4.AssumethatthestateoftheTUisasshowninFigure3.Whentheissuelogicdecodes11,itattemptstogetanewtagforthedestinationregister$4fromtheTUandobtainstag3.SincetheTUalreadyhasatagfor$4,theoldtag(4)isupdatedtoindicatethatitnolongerrepresentsthelatestcopyoftheregister.SinceS7scontentsarevalid,theycanbereadfromtheregisterfileandforwardedtothereservationstationsdirectly.However,sincethecontentsofSOarenotvalid,thelatesttagforSO(tag2)mustbeobtainedfromtheTU.Theissueunitforwardsapackettothereservationstationassociatedwiththeaddfunctionalunit.Thepacketcontainsthecontentsof$7,atag(2)forSOandatag(3)forthedestinationregister$4.WhenItcompletesexe-cution,i.e.,leavestheaddfunctionalunit,theresultisfor-wardedtoallreservationstationsthathaveamatchingtag(3)andalsototheTU.TheTUforwardstheresulttotheregister29TagRegisterTagNumberNumberFree1A0N2SON3NILY4$4N5SON6$3NLatestCoYYYYNYFigure3:ATagUnitfiletobewritteninto$4.Sincetag3isthelatesttagfor$4,S4sbusybitcanberesetwhenthedatahasbeenwritteninto$4.Tag3isthenmarkedfree,i.e.,isavailableforreusebytheissuelogic.2.2.2.MergingtheReservationStationsIfeachfunctionalunithasaseparatesetofreservationstations,itislikelythatsomefunctionalunitwillrunoutofreservationstationswhilethereservationstationsassociatedwithanotherfunctionalunitareidle.Assuggestedin13,wecancombineallthereservationstationsintoacommonRSPoolratherthanhavingdisjointpoolsofreservationstationsassociatedwitheachfunctionalunit.AllinstructionsthatwerepreviouslyissuedtodistributedreservationstationsassociatedwiththefunctionalunitsnowgotothecommonRSPool.Instructionissueisblockedifnofreereservationstationisavailable,i.e.,iftheRSPoolisfull.AsinstructionsbecomereadyintheRSPool,theyareissuedtothefunctionalunits.Alltheotherfunctionsareasbefore.Anorganizationwithmergedreservationstationsdoeshaveonedisadvantageoverdistributedreservationstations-onlyoneinstructioncanissuefromtheRSPooltothefunc-tionalunitsunlessmultiplepathsareprovidedbetweentheRSPoolandthefunctionalunits.Ontheotherhand,abetteruseofthereservationsstations,resultssincethereservationstationscanbesharedamongstseveralfunctionalunits.WechosetoprovideonlyasinglepathfromtheRSPooltothefunctionalunitsbecauseoursimulationsshowedthatmultiplepathsbetweentheRSPoolandthefunctionalunitswouldnothaveasignificantimpactonperformance.Ratherthanpresentdetailedsimulationstosupportourdecision,weuseanargu-mentbasedoninstructionflowtoconvincethereader.TheRSPoolisessentiallyareservoirofinstructionsthatisfilledbythedecodeandissuelogicanddrainedbythefunctionalunits.Sincethedecodeandissuelogiccanfillthisreservoiratamaximumrateof1instructionpercycle,havingadrainthatiscapableofdrainingmorethan1instructionpercyclewillnotbeveryusefulinasteadystate.2.2.3.MergingtheRSPoolwiththeTagUnitIntheTagUnit,thereisoneentryforeveryinstructionthatispresentineithertheRSPoolorinthefunctionalunits.Therefore,atanytime,thereisaone-to-onecorrespondencebetweentheentriesintheTUandthenumberofinstructionsinthereservationstationsorthefunctionalunits.ThissuggeststhatwecancombinetheRSPoolandtheTagUnitintoasingleRSTagUnit(RSTU).IntheRSTU,areservationstationisreservedatthesametimethatatagisreserved.Ofcourse,areservationstationiswastedifitisassociatedwithaninstruc-tionthatisinafunctionalunit.However,asweshallseeinsection4,thisorganizationcaneasilybeextendedtoallowfortheimplementationofpreciseinterrupts.Whenaninstructionissues,itobtainsatagfromtheRSTUandindoingsoautomat-icallyreservesareservationstation.Alltheotherfunctionsareasbefore.EachentryintheRSTUisasfollows:TagTagLatestSourceOperand1NumberFreeCopyrIrY+oIIY+oIIR+yiTa+icontIISourceOperand2Destination3.IMPLEMENTATIONOFPRECISEINTERRUPTSNowweaddresstheissueofpreciseinterrupts.Acom-pletedescriptionofseveralschemesthatimplementpreciseinterruptsisgivenin3.Theschemeofinteresttousisthereorderbuffer.Thereorderbufferallowsinstructionstofinishexecutionoutoforderbutupdatesthestateofthemachine(registers,memory,etc.),i.e.,commitstheinstructionsintheorderthattheinstructionswerepresentintheprogram,therebyassuringthataprecisestateofthemachineisrecoverableatanytime.Byforcinganorderingofcommitmentamongsttheinstructions,thereorderbufferaggravatesdatadependencies-thevalueofaregistercannotbereadtillithasbeenupdatedbythereorderbuffer,eventhoughtheinstructionthatcomputedavaluefortheregistermayhavecompletedalready.Analternativetoasimplereorderbufferistoassociatebypasslogicwiththereorderbuffer.Insuchanorganization,aninstructiondoesnothavetowaitforthereorderbuffertoupdateasourceregister;itcanfetchthevaluefromthereorderbuffer(ifitisavailable)andcanissue.Withabypassmechan-ism,theissuerateofthemachineisnotdegradedconsiderablyifthesizeofthebufferisreasonablylarge3.However,abypassmechanismisexpensivetoimplementsinceitrequiresasearchcapabilityandadditionaldatapathsforeachbufferentry.4.MERGINGDEPENDENCYRESOLUTIONANDPRE-CISEINTERRUPTSWenotethattheRSTUofsection2.2.3canbemodifiedtobehavelikeareorderbufferifitisforcedtoupdatethestateofthemachineintheorderthattheinstructionsareencoun-tered.ThisiseasilyaccomplishedbymanagingtheRSTUasaqueue.Therefore,allthatwehavetodotoimplementpreciseinterruptsinanarchitecturewithaRSTUistomanagetheRSTUlikeaqueue.ThemodifiedlogiciscalledtheRegisterUpdateUnit(RUU).TheRUUisessentiallytheRSTUcon-strainedtocommitinstructionsintheorderthattheinstructionswerereceivedbythedecodeandissuelogic(andconsequentlybytheRUU).Thefunctionalunitsremainunchanged.ThemodifiedarchitecturethatusesaRUUtoexecuteinstructionsoutofprogramorderandtoensureaprecisestateofthemachineisgiveninFigure4.30FunctionalIFromMemoryInstructionFetchUnitjmRegisterUpdateUnit,RegisterFileITLoadRegistersUnitsResultBusFigure4.TheModifiedArchitecturewithaRUUNotetheabsenceofadirectpathbetweenthedecodeandissuelogicandthefunctionalunits.Inordertoimplementpre-ciseinterrupts,everyinstructionmustreserveanentryintheRUU.SinceeveryinstructionmustpassthroughtheRUU,nodirectconnectionisneededbetweenthedecodeandissuelogicandthefunctionalunits.AlsonotethattheCPUsinteractionswiththememoryfunctionalunithavebeendepictedinmoredetail.Inthenextfewsections,wedescribeinsomedetailtheoperationofthemodifiedarchitecturewithaRUU.4.1.DecodeandIssueUnitWhenaninstructionisdecoded,theissuelogicrequestsanentryintheRUU.Ifnofreeentryisavailable,i.e.,theRUUisfull,instructionissueisblocked.Ifanentryisavail-able,theissuelogicobtainsthepositionoftheentry(anindexintotheRUU).Itthenforwardsthecontentsofthesourceregisters(iftheyareavailable)oraregisteridentifier(theregisternumberappendedwithsomeextracontrolbitstobeusedasatag)totheselectedreservationstationintheRUU.Controlbitsforthedestinationregister(acompletedescriptionofwhichfollowsinsection4.2.2)intheregisterfileareupdatedandtheidentifierforthedestinationregisterforwardedtotheRUU.4.2.TheRegisterUpdateUnitTheRUUistheunitthat(i)determineswhichinstructionshouldbeissuedtothefunctionalunitsforexecution,reservestheresultbusanddispatchestheinstructiontothefunctionalunit,(ii)determineswhichinstructioncancommit,i.e.,updatethestateofthemachine,(iii)monitorstheresultbustoresolvedependenciesand(iv)providestagstoandacceptsnewinstructionsfromthedecodeandissuelogic.TheRUUismanagedlikeaqueueusingRUU_HeadandRUU_Tailpointers.RUUslotsthatdonotliebetweenRUU_HeadandRUU_Tailarefree.IfRUU_Head=RUU_Tail,theRUUisfullandcannotacceptanymoreinstructionsfromthedecodeandissuelogic.IndesigningtheRUU,wekeepinmindthat(i)itshouldnotinvolvealargeamountofcomparisonhardwareand(ii)itshouldnotaffecttheclockspeedtoanintolerableextent.Inthenextfewsections,wedescribethecomponentsoftheRUUinsomemoredetail.4.2.1.SourceOperandFieldsThedesignofthesourceoperandfieldsisstraightforward.Eachsourceoperandfieldhasareadybit,atagsub-fieldandacontentsub-fieldasbelow:SourceOperandIRoadyIT.gICoeIIftheoperandisnotready,thetagsub-fieldmonitorstheresultbusforamatchingtag.Ifamatchisdetected,thedataonthebusisgatedintothecontentfield.4.2.2.DestinationFieldRecallthatintheRSTUofsection2.2.3,theissuelogicneededtosearchtheTUtoobtainthecorrecttagforthesourceoperandandtoupdatethelatestcopyfieldforthedestinationregister.Suchawideassociativesearchmaynotacceptablebecauseofthelargeamountofhardwarerequired.Ifmultipleinstancesofthesamedestinationregisteraredisallowed,noassociativelogicisnecessary.Aninstanceofaregisterisanewcopyoftheregister.Byprovidinganewinstanceforabusydestinationregister,thearchitecturecanprocessseveralinstructionsthatwriteintothesameregistersimultaneously.Unfortunately,disallowingmultipleinstancesofadestinationregisterdegradesperformance13.However,allisnotlost.Asnotedin10,itispossibletoeliminatetheassociativesearchanduseacountertoprovidemultipleinstancesforeachregisterifwecanguaranteethatresultsreturntotheregistersinorder.Thisisexactlythegoaloftheprecise-interruptmechanism.Theimplementationofpreciseinterrupts,there-fore,simplifiesthedesignofthedependencyresolutionmechanism.Theschemeweuseassociates2n-bitcounters(controlbits)witheachregisterintheregisterfile.Thereisnobusybit.Thecounters,theNumberofInstances(NI)andtheLatestInstance(LI),representthenumberofinstancesofaregisterintheRUUandthenumberofthelatestinstance,respectively.WhenaninstructionthatwritesintoregisterRiisissuedtotheRUU,bothNIandLIareincremented.LIisincrementedmodulon.Upto2n-1instancesofaregistercanbepresentintheRUUatanytime;issueisblockedifNIforadestinationregisteris2n-1.WhenaninstructionleavestheRUUandupdatesthevalueofRi,theassociatedNIisdecremented.AregisterisfreeifNI=0,i.e.,thereisnoinstructionintheRUUthatisgoingtowriteintotheregister.TheregistertagsenttotheRUUnowconsistsoftheregisternumberRiappendedwiththeLIcounter.Thisguaran-teesthatfutureinstructionsaccessthelatestinstance,i.e.,obtainthelatestcopyoftheregistercontentsandthatinstruc-tionsalreadypresentintheRUUgetthecorrectversionofthedata.Inourexperiments,eachofthesecounterswas3bitswide.A3-bitcounterensuredthat,forourbenchmarkpro-grams,aninstructionneverblockedinthedecodeandissuestagebecauseaninstanceofaregisterwasunavailable.Sincewehadatotalof144registers,thetagfieldwas11(8+3)bitswide.ThereisnoneedforaLatestCopyfieldintheRUUand31noassociativelogicisneededtosearchwithintheRUU.4.2.3.BypassLogicintheRUUOneoftheprimarydrawbacksofthesimplereorderbufferpresentedin3isthatperformancemaybedegradedbecauseinstructionissueisblockedifasourceregisterisbusyeventhoughitsresultmaybepresentinthereorderbuffer.Thisperformance-degradingproblemiseasilyrectifiedifbypasslogicisprovidedsothatasourceoperandcouldbereaddirectlyfromthereorderbufferbeforeitiswrittenintotheregisterfile.Suchbypasslogicthoughsimple,iscumbersomeandexpensivetoimplement.DoestheRUUneedsuchlogic?ConsideraninstructionIithatusestheresultofaprevi-ousinstructionlj.Recallthatthereservationstationsassoci-atedwiththeRUUalreadyhavethecapabilitytomonitortheresultbus.Therefore,ifljcompletesexecutionafterIiisissuedtotheRUU,licangateintheresultfromljwhenitappearsontheresultbus.Inthiscase,nobypasslogicisneeded.Theonlycasethatbypasslogicmightbehelpfuliswhenljhascompletedexecutionbuthasnotcommitted,i.e.,updatedtheregisterfile,whenIiisissuedtotheRUU.Ratherthatprovidingbypasslogicforthiscase,weextendthemonitoringcapabilitiesofthereservationstationstomonitorboththeresultbusandtheRUUtoregisterbus.Thiscanbeaccomplishedwithoutasubstantialincreaseinhardware.Therefore,IisdependencyonljisresolvedwhenIjputsitsresultontheRUUtoregisterbusifljhascompletedexecutionbeforeliisissuedtotheRUU.IfliisissuedtotheRUUbeforeljcompletes,lisdependencyonljcanberesolvedwhenljcompletesandputsitsresultontheresultbus.Therefore,instructionIineedstowaitinthedecodeandissuestageonlyiftheRUUisfull.4.3.InteractionswithMemoryInstructionsthatinteractwiththememory,i.e.,load/storeinstructions,arehandledinaspecialmanner.RatherthanusingLoadaddresses,aStoredatabufferandaConflictbufferasin13,wekeepasetofLoadRegisterstoresolvedepen-denciesinthememoryfunctionalunit.Thereservationsta-tionsforload/storeinstructionsareprovidedbytheRUU.Theloadregisterscontaintheaddressesofcurrentlyactivememorylocations.EachloadregisterhastheLIandNIcounterstoallowformultipleinstancesofamemoryaddress.Iftheaddressofaload/storeoperationisunavailable,subsequentload/storeinstructionsintheRUUarenotallowedtoproceed.Whenaloadinstructionisallowedtoproceed,itcheckstoseeiftheaddressfortheloadoperationmatchesanaddressstoredintheloadregisters.Ifamatchoccursandtheloadregisterisnotfree(NIisnonzero)theloadinstructionsimplyforwardsatagtotheRUU.Theloadoperationisnotsubmittedtothememory.ThetagisthenumberoftheloadregisterappendedwiththeLIcounter.Amatchcanoccurifthereiseitherapendingloadorapendingstoreoperation.Ineithercase,theloadneednotbesubmittedtomemorysincethedesireddatacanbeobtainedwhenthependingloadorstoreoperationcompletes.Ifamatchoccursforastoreinstruction,theNIandLIcountersareincrementedandthenewtagfor-wardedtotheRUU.Ifnomatchoccursforeitheroperation,afreeloadregis-terisobtained.Aloadregisterisfreeiftherearenopendingloadorstoreinstructionstothememoryaddressheldintheloadregister,i.e.,NI=0.TheNIcounterissetto1andtheLIcounterissetto0.Theloadrequestissubmittedtomemory.ThecorrespondingtagisalsosubmittedtomemorysothatthedatasuppliedbythememorymaybereadbytheappropriatesourceoperandsintheRUU.Load/storeinstructionsarenotissuedbytheRUUifafreeloadregisterisnotavailable.WhentheresultforaloadoperationreturnsfromthememoryorthestoreoperationiscommittedbytheRUU,NIisdecre-mented.Thedataandtheaddressareforwardedtothememoryincaseofastoreoperation.Notethatdecodeandissueunitlogicneedstosearchtheloadregistersassociativelyformemoryaddresses.However,thehardwareneededforthiscomparisonisnotverygreatforasmallnumberofloadregisters.Inoursimulations,weused6loadregistersthough4weresufficientformostcases.4.4.OperationoftheRUUIneachclockcycle,theRUUcarriesout4distincttasks:(i)itacceptsaninstructionfromtheissuelogic,(ii)itcommitsaninstruction,i.e.,updatestheregisterfile,(iii)itissuesaninstructiontothefunctionalunitsand(iv)itmonitorsthebussesformatchingtags.Thisconstitutesalotofwork;how-ever,eachofthesetaskscanbecardedoutinparallel.Acceptinganewtaskisstraightforward.IfanentryintheRUUisfree,theissuelogicupdatesthefieldsoftheselectedentry.IftheinstructionattheheadoftheRUUhasfinishedexecution,itsresultsareforwardedtotheregisterfile.IftheoperandsofaninstructionintheRUUareready,theinstruc-tioncanissuetothefunctionalunits.Priorityisfirstgiventoload/storeinstructionsandthentoaninstructionwhichenteredtheRUUearlier.TheRUUreservestheresultbuswhenitissuesaninstructiontothefunctionalunits.Thefinaltaskofmonitoringthebussesislefttothetag-matchinglogicinthesource-operandfields.EachentryintheRUUis,therefore:SourceOperand1SourceOperand2DestinationExecutedProgramCounterIRogisr,IcontIIIIcootITheProgramCounterfieldisneededfortheimplementationofpreciseinterrupts3.Forthesakeofbrevity,wehaveomittedthedetailsofextrainformationthatmustbecardedaroundwitheachinstruction(suchastagsandRUUentrynumbers).Thedetailsofsuchinformationareobvious.5.SIMULATIONRESULTSInordertoevaluatetheeffectivenessoftheRUU,wecar-riedouttrace-drivensimulations.ThebenchmarkprogramsusedforalloursimulationsweretheLawrenceLivermoreloops14.Thefirst14loopswerechosenbecausetheywerereadilyavailable.Henceforth,weshallrefertothemasLL1,LL2.LL14.Thesimulationswerecardedoutasfollows.32Thebenchmarkprograms,ascompiledbytheCFTcompilerforthescalarunit,werefedintotheCRAY-1simulator12.TheCRAY-1simulatorgeneratesaninstructiontraceforeachprogram.Vectorinstructionsarenotused.EachinstructiontracewasthenfedintooursimulatortocalculatetheexecutiontimeandtherelativespeedupfordifferentRUUsizes.Oursimulatorconverts2parcelinstructionsto1parcelinstructionswhentheyareencountered.Inoursimulations,theLIandNIcounterswereeach3bitswidetherebyallowingupto7instancesofaregisterintheRUU.Thiswasusefulinloops7,8,9and14whichupdatedthecontentsofregistersfrequently.Weused6loadregisterssothattheissueofaload/storeinstructionisneverblockedbecausealoadregisterisunavailable.Furthermore,aninstruc-tionlefttheRUUonlywhenitwasexecutedcompletely.Specifically,loadinstructionsdidnotleavetheRUUforatleast10cyclesaftertheywereissuedtothememory(thetimetakenfortheresulttocomebackfromthememory).Table1presentsthespeedupsforaRUUwithbypasslogicoverasimpleCRAY-likeinstructionissuemechan-ism13fordifferentsizesoftheRUU.Aspeedupofgreaterthan1impliesthattheinstructionissuemechanismusingaRUUisfasterthanthesimpleCRAY-likeinstructionissuemechanism.NotethattheCRAY-likeinstructionissuemechanismdoesnotimplementpreciseinterrupts.Theaver-agecolumnistheaverageforall14loops.Theresultsarequiteencouraging.ARUUwithareasonablenumberofentries(8-12),notonlyspeedsupexecution,italsoprovidespreciseinterrupts.WewouldliketopointoutthatwehaveassumedthattheclockperiodforourmechanismisthesameastheclockperiodforthesimpleCRAY-likeinstructionissuemechanism.Unfortunately,wecannotverifythisassumptiontillahardwareimplementationisactuallyrealized.Iftheclockperiodsareindeeddifferent,thespeedupfactorswouldhavetobenormalizedaccordingly.Sincebypasslogicisexpensive,wedecidedtoevaluateaRUUthatdidnothaveanybypasslogicbutitsreservationsta-tionsmonitoredboththeresultbusandtheRUUtoregisterbusasdiscussedinsection4.2.3.TheresultsarepresentedinTable2.Formanycases,thepresenceofbypasslogicmadeanegligibledifference,ifany.Ontheaverage,aRUUwithnobypasslogicisstillabletospeeduptheexecutiontimeand,atthesametime,implementpreciseinterrupts.TheRUUisspe-ciallyabletospeeduploopsthatmakeheavyuseoftheBandTregisterfiles(loops3,4and8).Fromtables1and2,itmayseemthatareasonablylargesizedRUUisneededtoachieveaperformanceimprovement.ThemainreasonforthelargeRUUsizeisthat,inoursimula-tions,loadinstructionsdidnotfreeaslotintheRUUtilltheinstructionwascompletelyexecuted(10cycles).Conse-quently,instructionissueisblockedbecauseofunavailableRUUslots.If,asin3,wehadallowedloadinstructionstofreeRUUslotsassoonasitwasdeterminedthattheywouldnotcauseexceptions,muchsmallerRUUsizeswouldbeneeded.Evenforthepresentedresultswenotethatanarchi-tecturewithaRUUofsize10hascomparablehardwarerequirementstoanarchitecturethatassociatesonlyasinglereservationstationwitheachofthefunctionalunitsanddoesnotassociateanytagswiththeregisters.6.BRANCHPREDICTIONANDCONDITIONALINSTRUCTIONSAsmentionedearlier,theperformancedegradationduetobranchescanbereducedbyconditionallyexecutinginstruc-tionsfromapredictedbranchpath.Severalarchitecturesemploythisapproach2,8,15.Toallowconditionalexecu-tionofinstructions,ahardwaremechanismisneededthatwouldallowthemachinetorecoverfromanincorrectbranchprediction.TheRUUprovidesaverypowerfulmechanismfornulli-fyinginstructions,betheinstructionsvalidinstructionsorinstructionsthatexecutedinaconditionalmode.Validinstructionsmaybenullifiedbecauseofaninterruptcausedbyapreviousinstruction;conditionallyexecutedinstructionsmaybenullifiediftheyarefromanincorrectexecutionpath.Therefore,theconditionalexecutionofinstructionswithaRUUisveryeasy.Ifthedecodeandissueunitpredictstheoutcomeofbranchesandactuallyexecutesinstructionsfromapredictedpathinaconditionalmode,recoveryfromincorrectbranchpredictionscanbeachievedveryeasilywithoutdupli-catingtheregisterfile.WecanidentifysuchinstructionsthroughtheuseofanadditionalfieldintheRUUandpreventthemfrombeingcommitteduntiltheyareproventobefromacorrectpath.Furthermore,thereisnohardlimittothenumberofbranchesthatcanbepredictedinabranchpath;theRUUcanprovidemultipleinstancesofaregisterforthedifferentpaths.Thisisincontrasttotheapproachtakenin15.ExtendingtheRUUtoaccommodatebranchpredictionandconditionalexecutionisatopicforfutureresearch.7.CONCLUSIONInthispaper,wehavecombinedtheissuesofhardwaredependency-resolutionandimplementationofpreciseinter-rupts.Wedevisedaschemethatcanresolvedependenciesandtherebyallowout-of-orderinstructionexecutionwithoutasso-ciatingtag-matchinghardwarewitheachregister.Suchaschemecan,therefore,beusedeveninthepresenceofalargenumberofregisterswithoutasubstantialhardwarecost.Thenweextendedtheschemetoincorporatepreciseinterrupts.Thepreciseinterruptandthedependency-resolutionmechanismsmutuallyaidandsimplifyeachother.Weevaluatedtheper-formanceoftheresultinghardwarethatallowsout-of-orderinstructionexecutionandalsoimplementspreciseinterruptsusingseveralLivermoreloopsasthebenchmark.Theresultsarequiteencouraging.Thecombinedmechanism,calledtheRUU,isabletoimplementpreciseinterruptsandisabletoachieveasignificantperformanceimprovementoverasimpleinstructionissuemechanismwithoutasubstantialcostinhardware.Wenotedthatthismechanismcaneasilybeextendedtosupportconditionalexecutionofinstructionsfromapredictedbranchpath.AcknowledgmentsThisworkwassupp

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论