会员注册 | 登录 | 微信快捷登录 支付宝快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf21-Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors.pdf -- 5 元

宽屏显示 收藏 分享

资源预览需要最新版本的Flash Player支持。
您尚未安装或版本过低,建议您

INSTRUCTIONISSUELOGICFORHIGHPERFORMANCE,INTERRUPTABLEP1PELINEDPROCESSORSGurindarS.SohiandSriramVajapeyamComputerSciencesDepartmentUniversityofWisconsin.Madison1210WestDaytonStreetMadison,Wisconsin53706AbstractTheperformanceofpipelinedprocessorsisseverelylimitedbydatadependencies.Inordertoachievehighperformance,amechanismtoalleviatetheeffectsofdatadependenciesmustexist.IfapipelinedCPUwithmultiplefunctionalunitsistobeusedinthepresenceofavirtualmemoryhierarchy,amechanismmustalsoexistfordeterminingthestateofthemachineprecisely.Inthispaper,wecombinetheissuesofdependencyresolutionandprecisenessofstate.Wepresentadesignforinstructionissuelogicthatresolvesdependenciesdynamicallyand,atthesametime,guaranteesaprecisestateofthemachine,withoutasignificanthardwareoverhead.Detailedsimulationstudiesfortheproposedmechanism,usingtheLawrenceLivermoreloopsasabenchmark,arepresented.1.INTRODUCTIONAsthedemandforprocessingpowerincreases,computersystemdesignersareforcedtousetechniquesthatresultinhighperformanceprocessingunits.Awidelyusedtechniqueispipelining\1\,inwhichtheoveralllogicofthesystemissplitintoseveralstageswitheachstageperformingasubtaskofacompletetask.Considerableoverlapcanbeachievedbecauseeachstagecanperformasubtaskforadifferenttask.PipelinedCPUshavetwomajorimpedimentstotheirperformanceidatadependenciesandiibranchinstructions.Aninstructioncannotbeginexecutionuntilitsoperandsareavailable.Ifanoperandistheresultofapreviousinstruction,theinstructionmustwaittillthepreviousinstructionhascompletedexecution,therebydegradingperformance.Theperformancedegradationduetobranchinstructionsisevenmoresevere.Notonlymustaconditionalbranchinstructionwaltforitsconditiontobeknownresultinginbubblesinthepipeline,anadditionalpenaltyisincurredinfetchinganinstructionfromthetakenbranchpathtotheinstructiondecodeandissuestage.Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.Amajorproblemthatarisesinpipelinedcomputerdesignisthataninterruptcanbeimprecise\2,3\.Thisproblemisespeciallysevereinmultiplefunctionalunitcomputersinwhichinstructionscancompleteexecutionoutofprogramorder\2,4\.Forahighperformance,pipelinedCPU,anadequatesolutionmustbefoundfortheimpreciseinterruptproblemandmeansmustbeprovidedforovercomingtheperformancedegradingfactors.1.1.BackgroundandPreviousWorkThedetrimentaleffectsofbranchinstructionscanbealleviatedbyusingdelayedbranchinstructions.However,theutilityofdelayedbranchinstructionsislimitedforlongpipelines.Insuchcases,othermeansmustexisttoalleviatethedetrimentaleffects.Acommonapproachistousebranchprediction\5,6\.Usingpredictiontechniques,theprobableexecutionpathofabranchinstructionisdetermined.Instructionsfromthepredictedpathcanthenbefetchedintoinstructionbuffersorevenexecutedinaconditionalmodel2,7,8\.Whiletheconditionalmodeofexecutionwillresultinahigherpipelinethroughput,especiallyiftheoutcomeofthebranchesispredictedcorrectly,ahardwaremechanismmustexistwhichwillallowthemachinetorecoverfromanincorrectsequenceofconditionalinstructions.Bothhardwareandsoftwaresolutionsexisttothedatadependencyproblem.Softwaresolutionsusecodeschedulingtechniquescombinedwithalargesetofregisterstoincreasethedependencydistanceandtoprovideinterlocks\9\.Hardwaresolutionsemploywaitingstationsorreservationstationswhereaninstructioncanwaitforitsoperandsandallowsubsequentinstructionstoproceed\10\.Inapipelinedmachine,impreciseinterruptscanbecausedbyinstructiongeneratedtrapssuchasarithmeticexceptionsandpagefaults.Animpreciseinterruptcanleavethemachineinanirrecoverablestate.Whiletheoccurrenceofarithmeticexceptionsisrare,theoccurrenceofpagefaultsinamachinethatsupportsvualmemoryisnot.Therefore,ifvirtualmemoryistobeusedwithapipelinedCPU,itiscrucialthatinterruptsbeprecise.Severalhardwaresolutionstotheproblemaredescribedin\3\.Weareunawareofanysoftwaresolutionstotheimpreciseinterruptproblemformultiplefunctionalunitcomputers.Asoftwaresolutionwillbeextremelydifficult,ifnotimpossible.Notonlymustthesoftwareallowfortheworstcaseexecutiontimeforanyinstruction,itmustalsokeeptrackofinstructionsthathavecompletedoutofpre©1987ACM00847495/87/06000027500,7527gramorderandgeneratetheappropriatecodesequencetoundotheeffectsofthoseinstructions.Ineithercase,somehardwaresupportmustbeprovidedtomaintainruntimeinformation.1.2.OutlineofthePaperInthispaper,wetreattheproblemsofdependencyresolutionandimpreciseinterruptssimultaneously.Sinceahardwaremechanismmustexistforimplementingpreciseinterrupts,whynotextendthismechanismtoresolvedependenciesandallowoutoforderinstructionexecutionInsection2,wediscussTomasulosdependencyresolutionalgorithmandextendit,givingseveralvariations,sothatthecostofimplementingitisnotprohibitiveevenforalargenumberofregisters.Insection3,wediscusstheproblemofimpreciseinterruptsandpresentsolutions.Section4describesaunitthatresolvesdependenciesaswellasimplementspreciseinterrupts.Thepreciseinterruptanddependencyresolutionmechanismsmutuallyaidandsimplifyeachother.AsimulationanalysisoftheproposedmechanismusingseveralLivermoreloopsasbenchmarksiscarriedoutinsection5.Finally,wediscusshowourmechanismmightbeusedtoalleviatethedegradationduetobranchinstructions.Throughoutthepaper,wediscussincrementalmodificationstothebasicprinciples.Datasupportingourclaimsforsuchmodificationshavebeenomittedforreasonsofconciseness.However,wedopresentdetailedsimulationdataforourfinaldesign.1.3.ModelArchitectureThemodelarchitecturethatweuseforourstudiesispresentedinFigure1.IthasthesamecapabilitiesandexecutesthesameinstructionsetasthescalarunitoftheCRAY1\4,11\.However,thereisamajordifference.Inourarchitecture,allinstructions,whethertheyarecomposedofIparcel16bitsor2parcels32bitscanissueinasinglecycleifissueconditionsarefavorable.Therefore,thebestcaseexecutiontimeofaconditionalbranchinstructionis4clockcyclesaftertheconditionisknownasopposedto5clockcyclesfortheCRAY1\11\.TheCRAY1waschosenbecauseitrepresentsastateoftheartscalarunitanditsexecutioncanbemodeledprecisely.TheauthorsalsohadeasyaccesstotoolsthatcouldbeusedtogenerateinstructiontracesfortheCRAY1scalarunit\12\.Themodelmachine,therefore,consistsofseveralfunctionalunitsconnectedtoacommonresultbus.Onlyonefunctioncanoutputdataontotheresultbusinanyclockcycle.InstructionsarefetchedbytheInstructionFetchUnitanddecodedandissuedbytheDecodeandIssueUnit.Oncedependencieshavebeenresolvedinthedecodeandissueunit,instructionsareforwardedtothefunctionalunitsforexecution.Theresultsofthefunctionalunitsarewrittendirect2yintotheregisterfile.Theregisterfileconsistsof8A,8S,64Band64Tregisters.2.DEPENDENCYRESOLUTIONOUTOFORDERINSTRUCTIONEXECUTIONWhenaninstructionreachesthedecodeandissuestageinthepipeline,checksmustbemadetodetermineiftheoperandsfortheinstructionareavailable,i.e.,ifalldependenciesforthisinstructionhavebeenresolved.Ifanoperandisnotavailable,theinstructionmustwait.Consequently,subsequentinstructionscannotproceedeventhoughtheymaybereadytoexeFunctionalUnitsiFromMemoryIRegisterInstructionFetchUnit,IrFileli\,IZRestBusFigure1.TheBasicArchitecturecute.Subsequentinstructionscanproceedifthewaitinginstructionstepsaside,andallowsotherinstructionstobypassitwhileitwaitsforitsoperanfs.Reservationstationspermitaninstructiontodothis\10\.2.1.TomasulosAlgorithmTomasulosdependencyresolutionalgorithmwasfirstpresentedforthefloatingpointunitoftheIBM360/91\10\.AnextensionofthisalgorithmforthescalarunitoftheCRAY1ispresentedin\13\.Thealgorithmoperatesasfollows.AninstructionwhoseoperandsarenotavailablewhenitentersthedecodeandissuestageisforwardedtoaReservationStationRSassociatedwiththefunctionalunitthatitwillbeusing.ItwaitsintheRSuntilitsdatadependencieshavebeenresolved,i.e.,itsoperandsareavailable.Onceatareservationstation,aninstructioncanresolveitsdependenciesbymonitoringtheCommonDataBustheResultBusinourmodelarchitecture.Whenalltheoperandsforaninstructionareavailable,itisdispatchedtotheappropriatefunctionalunitforexecution.Theresultbuscanbereservedeitherwhentheinstructionisdispatchedtothefunctionalunit\13\orsoonbeforeitisabouttheleavethefunctionalunit\10\.Eachsourceregisterisassignedabitthatdeterminesifrtheregisterisbusy.Aregisterisbusyifitisthedestinationofaninstructionthatisstillinexecution.Adestinationregisterisalsocalledasinkregister\10\.Eachsinkregisterisassignedatagwhichidentifiestheresultthatmustbewrittenintotheregister.Sinceanyregisterintheregisterfilecanbeasink,eachregistermustbeassignedatag.EachreservationstationhasthefollowingfieldsSourceOperand1SourceOperand2Destination28Ifasourceregisterisbusywhentheinstructionreachestheissuestage,thetagforthesourceregisterisobtainedandtheinstructionisforwardedtoareservationstation.Ifthesinkregisterisbusy,theinstructionfetchesanewtag,updatesthetagofthesinkregisterandproceedstoareservationstation.Theregistersaswellasthereservationstationsmonitortheresultbusandupdatetheircontentswhenamatchingtagisfound.Memoryistreatedasaspecialfunctionalunit.Detailsofthealgorithmcanbefoundin\10\and\13\.Whilethisalgorithmisstraightforwardandeffective,itisexpensivetoimplementbecauseeachregisterneedstobetaggedandeachtagneedsassociativecomparisonhardwaretocarryoutthetagmatchingprocess.Thismaynotbepracticalifthenumberofpossiblesinkfields,i.e.,thenumberofregistersislarge.Forourmodelarchitecturewhichhas8A,8S,64Band64Tregisters,clearlytheuseof144tagmatchinghardwareunitsisimpractical.2.2.ExtensionstoTomasulosAlgorithm2.2.1.ASeparateTagUnitOncloserinspectionweseethatveryfewofallpossiblesinkregistersmayactuallybeactive,i.e.,bewaitingforaresultatanygiventime.Therefore,ifweassociateatagwitheachpossiblesinkregister,alotofassociativetagmatchinghardwarewillbeidleatanygiventime.WhynothaveacommontagpoolandassignatagonlytoacurrentlyactivesinkregisterratherthanassociatingatagwitheachpossiblesinkfieldInTomasulosalgorithm,acurrentlyactiveregisterisonewhosebusybitison.WeconsolidatethetagsfromallcurrentlyactiveregistersintoaTagUnitTU.Eachregisternowhasonlyasinglebusybit.Atinstructionissuetime,ifasourceregisterisbusy,theTUisqueriedforthecurrenttagoftheappropriateregisterandthetagisforwardedtothereservationstations.Anewtagisobtainedforthedestinationregisteroftheinstruction.Ifthedestinationregisterisnotbusy,acquiringsuchatagfromtheTUissaightforward.Ifthedestinationregisterisbusy,i.e.,theTUalreadyholdsatagfortheregister,anewtagisobtainedandtheinstructionholdingtheoldtagisinformedthat,whileitmayupdatetheregister,itmaynotunlocktheregister,i.e.,clearthebusybit.Instructionissueblocksifnotagcanbeobtained,i.e.,theTUisfull.Asbefore,theinstructionalongwithitsassociatedtags/operandsisforwardedtoareservationstationwhereitwaitsforitsoperandstobecomeready.TheresultfromafunctionalunitalongwithitstagisbroadcasttoallreservationstationsandisalsoforwardedtotheTU.Reservationstationsmonitortheresultbusandgateintheresultifthetagofthedataontheresultbusmatchesthetagstoredinthereservationstation.TheTUforwardstheresulttotheregisterspecifiedintheappropriateslotoftheTU.Allregistersare,therefore,updatedonlybytheTUwhentheirdataisavailableandnodirectconnectionisneededbetweenthefunctionalunitsandtheregisterfile.WhentheregisterhasbeenupdatedbytheTU,thecorrespondingtagisreleasedandismarkedfreeintheTU.Inordertoensurecorrectoperation,i.e.,onlythelatesttagforeachregisterisusedbyallsubsequentinstructionsandonlythelatestinstructionupdatesthebusybitoftheregister,weassociateanotherbitwitheachTUentry.Thisbitindicatesifthetagisthelatesttagfortheregisterandiftheinstructionhasakeytounlocktheregister,i.e.,clearthebusybit.ThemodifiedarchitecturethatincorporatesaTagUnitandreservationstationsassociatedwitheachfunctionalunitisshowninFigure2.ThereservationstationsaremodifiedsothattheresultcanbeforwardedtotheappropriateslotintheTU.ThenewreservationstationhasthefollowingfieldsSourceOperand1SourceOperand2DestinationIstationsIIFunctionalIUnitsResultBusIFromMemoryRegisterIFileInstructionFetchUnitIIDecodeandIssueUnit\\\\,flIl...IIVFigure2.IssueLogicwithaTagUnitandDistributedReservationStations2.2.1.1.ExampleTheoperationoftheTagUnitisbestillustratedbyanexample.ConsideraTUthathas6entriesasshowninFigure3.EachentryintheTUhasabitindicatingifthetagisfree,i.e.,availableforusebytheissuelogic,abitindicatingifthetagisthelatesttagfortheregisterandafieldforthenumberofthedestinationregister.TheTUisindexedbythetagnumber.Considertheexecutionofaninstruction11thataddsthecontentsofregistersSOand7andputtheresultin4.AssumethatthestateoftheTUisasshowninFigure3.Whentheissuelogicdecodes11,itattemptstogetanewtagforthedestinationregister4fromtheTUandobtainstag3.SincetheTUalreadyhasatagfor4,theoldtag4isupdatedtoindicatethatitnolongerrepresentsthelatestcopyoftheregister.SinceS7scontentsarevalid,theycanbereadfromtheregisterfileandforwardedtothereservationstationsdirectly.However,sincethecontentsofSOarenotvalid,thelatesttagforSOtag2mustbeobtainedfromtheTU.Theissueunitforwardsapackettothereservationstationassociatedwiththeaddfunctionalunit.Thepacketcontainsthecontentsof7,atag2forSOandatag3forthedestinationregister4.WhenItcompletesexecution,i.e.,leavestheaddfunctionalunit,theresultisforwardedtoallreservationstationsthathaveamatchingtag3andalsototheTU.TheTUforwardstheresulttotheregister29TagRegisterTagNumberNumberFree1A0N2SON3NILY44N5SON63NLatestCoYYYYNYFigure3ATagUnitfiletobewritteninto4.Sincetag3isthelatesttagfor4,S4sbusybitcanberesetwhenthedatahasbeenwritteninto4.Tag3isthenmarkedfree,i.e.,isavailableforreusebytheissuelogic.2.2.2.MergingtheReservationStationsIfeachfunctionalunithasaseparatesetofreservationstations,itislikelythatsomefunctionalunitwillrunoutofreservationstationswhilethereservationstationsassociatedwithanotherfunctionalunitareidle.Assuggestedin\13\,wecancombineallthereservationstationsintoacommonRSPoolratherthanhavingdisjointpoolsofreservationstationsassociatedwitheachfunctionalunit.AllinstructionsthatwerepreviouslyissuedtodistributedreservationstationsassociatedwiththefunctionalunitsnowgotothecommonRSPool.Instructionissueisblockedifnofreereservationstationisavailable,i.e.,iftheRSPoolisfull.AsinstructionsbecomereadyintheRSPool,theyareissuedtothefunctionalunits.Alltheotherfunctionsareasbefore.AnorganizationwithmergedreservationstationsdoeshaveonedisadvantageoverdistributedreservationstationsonlyoneinstructioncanissuefromtheRSPooltothefunctionalunitsunlessmultiplepathsareprovidedbetweentheRSPoolandthefunctionalunits.Ontheotherhand,abetteruseofthereservationsstations,resultssincethereservationstationscanbesharedamongstseveralfunctionalunits.WechosetoprovideonlyasinglepathfromtheRSPooltothefunctionalunitsbecauseoursimulationsshowedthatmultiplepathsbetweentheRSPoolandthefunctionalunitswouldnothaveasignificantimpactonperformance.Ratherthanpresentdetailedsimulationstosupportourdecision,weuseanargumentbasedoninstructionflowtoconvincethereader.TheRSPoolisessentiallyareservoirofinstructionsthatisfilledbythedecodeandissuelogicanddrainedbythefunctionalunits.Sincethedecodeandissuelogiccanfillthisreservoiratamaximumrateof1instructionpercycle,havingadrainthatiscapableofdrainingmorethan1instructionpercyclewillnotbeveryusefulinasteadystate.2.2.3.MergingtheRSPoolwiththeTagUnitIntheTagUnit,thereisoneentryforeveryinstructionthatispresentineithertheRSPoolorinthefunctionalunits.Therefore,atanytime,thereisaonetoonecorrespondencebetweentheentriesintheTUandthenumberofinstructionsinthereservationstationsorthefunctionalunits.ThissuggeststhatwecancombinetheRSPoolandtheTagUnitintoasingleRSTagUnitRSTU.IntheRSTU,areservationstationisreservedatthesametimethatatagisreserved.Ofcourse,areservationstationiswastedifitisassociatedwithaninstructionthatisinafunctionalunit.However,asweshallseeinsection4,thisorganizationcaneasilybeextendedtoallowfortheimplementationofpreciseinterrupts.Whenaninstructionissues,itobtainsatagfromtheRSTUandindoingsoautomaticallyreservesareservationstation.Alltheotherfunctionsareasbefore.EachentryintheRSTUisasfollowsTagTagLatestSourceOperand1NumberFreeCopyrIrY°oIIYoIIRyiTaicontIISourceOperand2Destination3.IMPLEMENTATIONOFPRECISEINTERRUPTSNowweaddresstheissueofpreciseinterrupts.Acompletedescriptionofseveralschemesthatimplementpreciseinterruptsisgivenin\3\.Theschemeofinteresttousisthereorderbuffer.Thereorderbufferallowsinstructionstofinishexecutionoutoforderbutupdatesthestateofthemachineregisters,memory,etc.,i.e.,commitstheinstructionsintheorderthattheinstructionswerepresentintheprogram,therebyassuringthataprecisestateofthemachineisrecoverableatanytime.Byforcinganorderingofcommitmentamongsttheinstructions,thereorderbufferaggravatesdatadependenciesthevalueofaregistercannotbereadtillithasbeenupdatedbythereorderbuffer,eventhoughtheinstructionthatcomputedavaluefortheregistermayhavecompletedalready.Analternativetoasimplereorderbufferistoassociatebypasslogicwiththereorderbuffer.Insuchanorganization,aninstructiondoesnothavetowaitforthereorderbuffertoupdateasourceregisteritcanfetchthevaluefromthereorderbufferifitisavailableandcanissue.Withabypassmechanism,theissuerateofthemachineisnotdegradedconsiderablyifthesizeofthebufferisreasonablylarge\3\.However,abypassmechanismisexpensivetoimplementsinceitrequiresasearchcapabilityandadditionaldatapathsforeachbufferentry.4.MERGINGDEPENDENCYRESOLUTIONANDPRECISEINTERRUPTSWenotethattheRSTUofsection2.2.3canbemodifiedtobehavelikeareorderbufferifitisforcedtoupdatethestateofthemachineintheorderthattheinstructionsareencountered.ThisiseasilyaccomplishedbymanagingtheRSTUasaqueue.Therefore,allthatwehavetodotoimplementpreciseinterruptsinanarchitecturewithaRSTUistomanagetheRSTUlikeaqueue.ThemodifiedlogiciscalledtheRegisterUpdateUnitRUU.TheRUUisessentiallytheRSTUconstrainedtocommitinstructionsintheorderthattheinstructionswerereceivedbythedecodeandissuelogicandconsequentlybytheRUU.Thefunctionalunitsremainunchanged.ThemodifiedarchitecturethatusesaRUUtoexecuteinstructionsoutofprogramorderandtoensureaprecisestateofthemachineisgiveninFigure4.30FunctionalIFromMemoryInstructionFetchUnitjmRegisterUpdateUnit\,RegisterFileITLoadRegistersUnitsResultBusFigure4.TheModifiedArchitecturewithaRUUNotetheabsenceofadirectpathbetweenthedecodeandissuelogicandthefunctionalunits.Inordertoimplementpreciseinterrupts,everyinstructionmustreserveanentryintheRUU.SinceeveryinstructionmustpassthroughtheRUU,nodirectconnectionisneededbetweenthedecodeandissuelogicandthefunctionalunits.AlsonotethattheCPUsinteractionswiththememoryfunctionalunithavebeendepictedinmoredetail.Inthenextfewsections,wedescribeinsomedetailtheoperationofthemodifiedarchitecturewithaRUU.4.1.DecodeandIssueUnitWhenaninstructionisdecoded,theissuelogicrequestsanentryintheRUU.Ifnofreeentryisavailable,i.e.,theRUUisfull,instructionissueisblocked.Ifanentryisavailable,theissuelogicobtainsthepositionoftheentryanindexintotheRUU.ItthenforwardsthecontentsofthesourceregistersiftheyareavailableoraregisteridentifiertheregisternumberappendedwithsomeextracontrolbitstobeusedasatagtotheselectedreservationstationintheRUU.Controlbitsforthedestinationregisteracompletedescriptionofwhichfollowsinsection4.2.2intheregisterfileareupdatedandtheidentifierforthedestinationregisterforwardedtotheRUU.4.2.TheRegisterUpdateUnitTheRUUistheunitthatidetermineswhichinstructionshouldbeissuedtothefunctionalunitsforexecution,reservestheresultbusanddispatchestheinstructiontothefunctionalunit,iidetermineswhichinstructioncancommit,i.e.,updatethestateofthemachine,iiimonitorstheresultbustoresolvedependenciesandivprovidestagstoandacceptsnewinstructionsfromthedecodeandissuelogic.TheRUUismanagedlikeaqueueusingRUU_HeadandRUU_Tailpointers.RUUslotsthatdonotliebetweenRUU_HeadandRUU_Tailarefree.IfRUU_HeadRUU_Tail,theRUUisfullandcannotacceptanymoreinstructionsfromthedecodeandissuelogic.IndesigningtheRUU,wekeepinmindthatiitshouldnotinvolvealargeamountofcomparisonhardwareandiiitshouldnotaffecttheclockspeedtoanintolerableextent.Inthenextfewsections,wedescribethecomponentsoftheRUUinsomemoredetail.4.2.1.SourceOperandFieldsThedesignofthesourceoperandfieldsisstraightforward.Eachsourceoperandfieldhasareadybit,atagsubfieldandacontentsubfieldasbelowSourceOperandIRoadyIT.gICoeIIftheoperandisnotready,thetagsubfieldmonitorstheresultbusforamatchingtag.Ifamatchisdetected,thedataonthebusisgatedintothecontentfield.4.2.2.DestinationFieldRecallthatintheRSTUofsection2.2.3,theissuelogicneededtosearchtheTUtoobtainthecorrecttagforthesourceoperandandtoupdatethelatestcopyfieldforthedestinationregister.Suchawideassociativesearchmaynotacceptablebecauseofthelargeamountofhardwarerequired.Ifmultipleinstancesofthesamedestinationregisteraredisallowed,noassociativelogicisnecessary.Aninstanceofaregisterisanewcopyoftheregister.Byprovidinganewinstanceforabusydestinationregister,thearchitecturecanprocessseveralinstructionsthatwriteintothesameregistersimultaneously.Unfortunately,disallowingmultipleinstancesofadestinationregisterdegradesperformance\13\.However,allisnotlost.Asnotedin\10\,itispossibletoeliminatetheassociativesearchanduseacountertoprovidemultipleinstancesforeachregisterifwecanguaranteethatresultsreturntotheregistersinorder.Thisisexactlythegoalofthepreciseinterruptmechanism.Theimplementationofpreciseinterrupts,therefore,simplifiesthedesignofthedependencyresolutionmechanism.Theschemeweuseassociates2nbitcounterscontrolbitswitheachregisterintheregisterfile.Thereisnobusybit.Thecounters,theNumberofInstancesNIandtheLatestInstanceLI,representthenumberofinstancesofaregisterintheRUUandthenumberofthelatestinstance,respectively.WhenaninstructionthatwritesintoregisterRiisissuedtotheRUU,bothNIandLIareincremented.LIisincrementedmodulon.Upto2n1instancesofaregistercanbepresentintheRUUatanytimeissueisblockedifNIforadestinationregisteris2n1.WhenaninstructionleavestheRUUandupdatesthevalueofRi,theassociatedNIisdecremented.AregisterisfreeifNI0,i.e.,thereisnoinstructionintheRUUthatisgoingtowriteintotheregister.TheregistertagsenttotheRUUnowconsistsoftheregisternumberRiappendedwiththeLIcounter.Thisguaranteesthatfutureinstructionsaccessthelatestinstance,i.e.,obtainthelatestcopyoftheregistercontentsandthatinstructionsalreadypresentintheRUUgetthecorrectversionofthedata.Inourexperiments,eachofthesecounterswas3bitswide.A3bitcounterensuredthat,forourbenchmarkprograms,aninstructionneverblockedinthedecodeandissuestagebecauseaninstanceofaregisterwasunavailable.Sincewehadatotalof144registers,thetagfieldwas1183bitswide.ThereisnoneedforaLatestCopyfieldintheRUUand31noassociativelogicisneededtosearchwithintheRUU.4.2.3.BypassLogicintheRUUOneoftheprimarydrawbacksofthesimplereorderbufferpresentedin\3\isthatperformancemaybedegradedbecauseinstructionissueisblockedifasourceregisterisbusyeventhoughitsresultmaybepresentinthereorderbuffer.Thisperformancedegradingproblemiseasilyrectifiedifbypasslogicisprovidedsothatasourceoperandcouldbereaddirectlyfromthereorderbufferbeforeitiswrittenintotheregisterfile.Suchbypasslogicthoughsimple,iscumbersomeandexpensivetoimplement.DoestheRUUneedsuchlogicConsideraninstructionIithatusestheresultofapreviousinstructionlj.RecallthatthereservationstationsassociatedwiththeRUUalreadyhavethecapabilitytomonitortheresultbus.Therefore,ifljcompletesexecutionafterIiisissuedtotheRUU,licangateintheresultfromljwhenitappearsontheresultbus.Inthiscase,nobypasslogicisneeded.Theonlycasethatbypasslogicmightbehelpfuliswhenljhascompletedexecutionbuthasnotcommitted,i.e.,updatedtheregisterfile,whenIiisissuedtotheRUU.Ratherthatprovidingbypasslogicforthiscase,weextendthemonitoringcapabilitiesofthereservationstationstomonitorboththeresultbusandtheRUUtoregisterbus.Thiscanbeaccomplishedwithoutasubstantialincreaseinhardware.Therefore,IisdependencyonljisresolvedwhenIjputsitsresultontheRUUtoregisterbusifljhascompletedexecutionbeforeliisissuedtotheRUU.IfliisissuedtotheRUUbeforeljcompletes,lisdependencyonljcanberesolvedwhenljcompletesandputsitsresultontheresultbus.Therefore,instructionIineedstowaitinthedecodeandissuestageonlyiftheRUUisfull.4.3.InteractionswithMemoryInstructionsthatinteractwiththememory,i.e.,load/storeinstructions,arehandledinaspecialmanner.RatherthanusingLoadaddresses,aStoredatabufferandaConflictbufferasin\13\,wekeepasetofLoadRegisterstoresolvedependenciesinthememoryfunctionalunit.Thereservationstationsforload/storeinstructionsareprovidedbytheRUU.Theloadregisterscontaintheaddressesofcurrentlyactivememorylocations.EachloadregisterhastheLIandNIcounterstoallowformultipleinstancesofamemoryaddress.Iftheaddressofaload/storeoperationisunavailable,subsequentload/storeinstructionsintheRUUarenotallowedtoproceed.Whenaloadinstructionisallowedtoproceed,itcheckstoseeiftheaddressfortheloadoperationmatchesanaddressstoredintheloadregisters.IfamatchoccursandtheloadregisterisnotfreeNIisnonzerotheloadinstructionsimplyforwardsatagtotheRUU.Theloadoperationisnotsubmittedtothememory.ThetagisthenumberoftheloadregisterappendedwiththeLIcounter.Amatchcanoccurifthereiseitherapendingloadorapendingstoreoperation.Ineithercase,theloadneednotbesubmittedtomemorysincethedesireddatacanbeobtainedwhenthependingloadorstoreoperationcompletes.Ifamatchoccursforastoreinstruction,theNIandLIcountersareincrementedandthenewtagforwardedtotheRUU.Ifnomatchoccursforeitheroperation,afreeloadregisterisobtained.Aloadregisterisfreeiftherearenopendingloadorstoreinstructionstothememoryaddressheldintheloadregister,i.e.,NI0.TheNIcounterissetto1andtheLIcounterissetto0.Theloadrequestissubmittedtomemory.ThecorrespondingtagisalsosubmittedtomemorysothatthedatasuppliedbythememorymaybereadbytheappropriatesourceoperandsintheRUU.Load/storeinstructionsarenotissuedbytheRUUifafreeloadregisterisnotavailable.WhentheresultforaloadoperationreturnsfromthememoryorthestoreoperationiscommittedbytheRUU,NIisdecremented.Thedataandtheaddressareforwardedtothememoryincaseofastoreoperation.Notethatdecodeandissueunitlogicneedstosearchtheloadregistersassociativelyformemoryaddresses.However,thehardwareneededforthiscomparisonisnotverygreatforasmallnumberofloadregisters.Inoursimulations,weused6loadregistersthough4weresufficientformostcases.4.4.OperationoftheRUUIneachclockcycle,theRUUcarriesout4distincttasksiitacceptsaninstructionfromtheissuelogic,iiitcommitsaninstruction,i.e.,updatestheregisterfile,iiiitissuesaninstructiontothefunctionalunitsandivitmonitorsthebussesformatchingtags.Thisconstitutesalotofworkhowever,eachofthesetaskscanbecardedoutinparallel.Acceptinganewtaskisstraightforward.IfanentryintheRUUisfree,theissuelogicupdatesthefieldsoftheselectedentry.IftheinstructionattheheadoftheRUUhasfinishedexecution,itsresultsareforwardedtotheregisterfile.IftheoperandsofaninstructionintheRUUareready,theinstructioncanissuetothefunctionalunits.Priorityisfirstgiventoload/storeinstructionsandthentoaninstructionwhichenteredtheRUUearlier.TheRUUreservestheresultbuswhenitissuesaninstructiontothefunctionalunits.Thefinaltaskofmonitoringthebussesislefttothetagmatchinglogicinthesourceoperandfields.EachentryintheRUUis,thereforeSourceOperand1SourceOperand2DestinationExecutedProgramCounterIRogisr,IcontIIIIcootITheProgramCounterfieldisneededfortheimplementationofpreciseinterrupts\3\.Forthesakeofbrevity,wehaveomittedthedetailsofextrainformationthatmustbecardedaroundwitheachinstructionsuchastagsandRUUentrynumbers.Thedetailsofsuchinformationareobvious.5.SIMULATIONRESULTSInordertoevaluatetheeffectivenessoftheRUU,wecarriedouttracedrivensimulations.ThebenchmarkprogramsusedforalloursimulationsweretheLawrenceLivermoreloops\14\.Thefirst14loopswerechosenbecausetheywerereadilyavailable.Henceforth,weshallrefertothemasLL1,LL2.....LL14.Thesimulationswerecardedoutasfollows.32Thebenchmarkprograms,ascompiledbytheCFTcompilerforthescalarunit,werefedintotheCRAY1simulator\12\.TheCRAY1simulatorgeneratesaninstructiontraceforeachprogram.Vectorinstructionsarenotused.EachinstructiontracewasthenfedintooursimulatortocalculatetheexecutiontimeandtherelativespeedupfordifferentRUUsizes.Oursimulatorconverts2parcelinstructionsto1parcelinstructionswhentheyareencountered.Inoursimulations,theLIandNIcounterswereeach3bitswidetherebyallowingupto7instancesofaregisterintheRUU.Thiswasusefulinloops7,8,9and14whichupdatedthecontentsofregistersfrequently.Weused6loadregisterssothattheissueofaload/storeinstructionisneverblockedbecausealoadregisterisunavailable.Furthermore,aninstructionlefttheRUUonlywhenitwasexecutedcompletely.Specifically,loadinstructionsdidnotleavetheRUUforatleast10cyclesaftertheywereissuedtothememorythetimetakenfortheresulttocomebackfromthememory.Table1presentsthespeedupsforaRUUwithbypasslogicoverasimpleCRAYlikeinstructionissuemechanism\13\fordifferentsizesoftheRUU.Aspeedupofgreaterthan1impliesthattheinstructionissuemechanismusingaRUUisfasterthanthesimpleCRAYlikeinstructionissuemechanism.NotethattheCRAYlikeinstructionissuemechanismdoesnotimplementpreciseinterrupts.Theaveragecolumnistheaverageforall14loops.Theresultsarequiteencouraging.ARUUwithareasonablenumberofentries812,notonlyspeedsupexecution,italsoprovidespreciseinterrupts.WewouldliketopointoutthatwehaveassumedthattheclockperiodforourmechanismisthesameastheclockperiodforthesimpleCRAYlikeinstructionissuemechanism.Unfortunately,wecannotverifythisassumptiontillahardwareimplementationisactuallyrealized.Iftheclockperiodsareindeeddifferent,thespeedupfactorswouldhavetobenormalizedaccordingly.Sincebypasslogicisexpensive,wedecidedtoevaluateaRUUthatdidnothaveanybypasslogicbutitsreservationstationsmonitoredboththeresultbusandtheRUUtoregisterbusasdiscussedinsection4.2.3.TheresultsarepresentedinTable2.Formanycases,thepresenceofbypasslogicmadeanegligibledifference,ifany.Ontheaverage,aRUUwithnobypasslogicisstillabletospeeduptheexecutiontimeand,atthesametime,implementpreciseinterrupts.TheRUUisspeciallyabletospeeduploopsthatmakeheavyuseoftheBandTregisterfilesloops3,4and8.Fromtables1and2,itmayseemthatareasonablylargesizedRUUisneededtoachieveaperformanceimprovement.ThemainreasonforthelargeRUUsizeisthat,inoursimulations,loadinstructionsdidnotfreeaslotintheRUUtilltheinstructionwascompletelyexecuted10cycles.Consequently,instructionissueisblockedbecauseofunavailableRUUslots.If,asin\3\,wehadallowedloadinstructionstofreeRUUslotsassoonasitwasdeterminedthattheywouldnotcauseexceptions,muchsmallerRUUsizeswouldbeneeded.EvenforthepresentedresultswenotethatanarchitecturewithaRUUofsize10hascomparablehardwarerequirementstoanarchitecturethatassociatesonlyasinglereservationstationwitheachofthefunctionalunitsanddoesnotassociateanytagswiththeregisters.6.BRANCHPREDICTIONANDCONDITIONALINSTRUCTIONSAsmentionedearlier,theperformancedegradationduetobranchescanbereducedbyconditionallyexecutinginstructionsfromapredictedbranchpath.Severalarchitecturesemploythisapproach\2,8,15\.Toallowconditionalexecutionofinstructions,ahardwaremechanismisneededthatwouldallowthemachinetorecoverfromanincorrectbranchprediction.TheRUUprovidesaverypowerfulmechanismfornullifyinginstructions,betheinstructionsvalidinstructionsorinstructionsthatexecutedinaconditionalmode.Validinstructionsmaybenullifiedbecauseofaninterruptcausedbyapreviousinstructionconditionallyexecutedinstructionsmaybenullifiediftheyarefromanincorrectexecutionpath.Therefore,theconditionalexecutionofinstructionswithaRUUisveryeasy.Ifthedecodeandissueunitpredictstheoutcomeofbranchesandactuallyexecutesinstructionsfromapredictedpathinaconditionalmode,recoveryfromincorrectbranchpredictionscanbeachievedveryeasilywithoutduplicatingtheregisterfile.WecanidentifysuchinstructionsthroughtheuseofanadditionalfieldintheRUUandpreventthemfrombeingcommitteduntiltheyareproventobefromacorrectpath.Furthermore,thereisnohardlimittothenumberofbranchesthatcanbepredictedinabranchpaththeRUUcanprovidemultipleinstancesofaregisterforthedifferentpaths.Thisisincontrasttotheapproachtakenin\15\.ExtendingtheRUUtoaccommodatebranchpredictionandconditionalexecutionisatopicforfutureresearch.7.CONCLUSIONInthispaper,wehavecombinedtheissuesofhardwaredependencyresolutionandimplementationofpreciseinterrupts.Wedevisedaschemethatcanresolvedependenciesandtherebyallowoutoforderinstructionexecutionwithoutassociatingtagmatchinghardwarewitheachregister.Suchaschemecan,therefore,beusedeveninthepresenceofalargenumberofregisterswithoutasubstantialhardwarecost.Thenweextendedtheschemetoincorporatepreciseinterrupts.Thepreciseinterruptandthedependencyresolutionmechanismsmutuallyaidandsimplifyeachother.WeevaluatedtheperformanceoftheresultinghardwarethatallowsoutoforderinstructionexecutionandalsoimplementspreciseinterruptsusingseveralLivermoreloopsasthebenchmark.Theresultsarequiteencouraging.Thecombinedmechanism,calledtheRUU,isabletoimplementpreciseinterruptsandisabletoachieveasignificantperformanceimprovementoverasimpleinstructionissuemechanismwithoutasubstantialcostinhardware.Wenotedthatthismechanismcaneasilybeextendedtosupportconditionalexecutionofinstructionsfromapredictedbranchpath.AcknowledgmentsThisworkwassupportedinpartbytheUniversityofWisconsinGraduateResearchCommittee.TheauthorswouldliketothankJimGoodman,AndyPleszkun,JimSmithandtheanonymousreviewersfortheirusefulcomments.33Table1RelativeSpeedupswithBypassLogicRUUSizeBenchmarkLL1LL2LL3LL4LL5LL6LL7LL8LL9LL10LL11LL12LL13LL14Average46810121416182030500.951.041.261.481.591.781.781.781.781.781.940.760.921.041.201.221.221.221.221.221.221.701.001.051.271.421.761.761.841.942.052.052.051.021.131.201.281.371.781.781.781.781.781.780.870.981.191.261.341.401.441.441.441.442.020.811.011.171.261.341.401.461.491.501.502.040.811.101.431.601.731.841.941.971.911.962.010.540.780.931.041.091.151.161.161.151.241.600.700.901.061.161.221.271.321.421.421.411.800.831.011.061.171.201.201.201.201.201.201.750.710.750.931.031.031.031.031.031.031.031.390.931.001.331.471.551.551.551.551.551.551.690.821.091.181.3511.411.501.521.531.571.681.790.901.131.311.381.431.551.831.891.952.062.09I0.850.991.17I1.29\1.371.441.481.501.511.531.81Table2RelativeSpeedupswithNoBypassLogicBenchmarkLL1LL2LL3LL4LL5LL6LL7LL8LL9LL10LL11LL12LL13LL14AverageRUUSize46810l121416182030500.911.021.22,1.301.301.341.341.341.341.341.870.740.881.01\1.171.191.191.191.191.191.191.681.001.051.2711.421.761.761.841.942.052.05i2.050.971.021.081.141.361.461.461.521.541.471.770.820.951.061.101.101.101.101.101.101.101.900.770.931.011.061.101.101.101.101.081.192.050.811.051.211.261.311.241.241.271.281.622.000.540.760.910.980.991.051.061.061.091.111.560.690.881.041.111.141.191.181.171.191.201.800.831.001.031.141.131.141.141.141.141.161.750.690.750.931.031.031.031.031.031.031.031.390.931.001.331.471.551.551.551.551.551.551.690.821.021.161.271.391.351.441.451.451.391.700.830.951.051.041.071.081.201.181.231.531.98ititiiiiii0.820.941.101.181.251.261.291.301.311.351.79References\1\P.M.Kogge,TheArchitectureofPipelinedComputers.NewYork\9\McGrawHill,1981.\2\D.W.Anderson,F.J.Sparacio,andR.M.Tomasulo,TheIBMSystem/360Model91MachinePhilosophyandInstructionHandling,IBMJournalofResearchandDevelopment,pp.824,\10\January1967.\3\J.E.SmithandA.R.Pleszkun,ImplementationofPreciseInterruptsinPipelinedProcessors,Proc.12thAnnualSymposiumon\11\ComputerArchitecture,pp.3644,June1985.\4\R.MRussel,TheCRAY1ComputerSystem,CACM,vol.21,\12\pp.6372,January1978.\5\J.E.Smith,AStudyofBranchPredictionStrategies,Proc.8th\13\InternationalSymposiumonComputerArchitecture,pp.135148,May1981.\6\J.K.F.LeeandA.J.Smith,BranchPredictionStrategiesand\14\BranchTargetBufferDesign,IEEEComputer,vol.17,pp.622,January1984.\15\\7\P.Y.T.HsuandE.S.Davidson,HighlyConcurrentScalarProcessing,Proc.13thAnnualSymposiumonComputerArchitecture,pp.386395,June1986.\8\A.Pleszkun,J.Goodman,W.C.Hsu,R.Joersz,G.Bier,P.Woest,andP.Schecter,WISQARestartableArchitectureUsingQueues,inProc.14thAnnualSymposiumonComputerArchitecture,Pittsburgh,PA,June,1987.J.Hennessy,N.Jouppi,F.Baskett,T.Gross,andJ.Gill,Hardware/SoftwareTradeoffsforIncreasedPerformance,Proc.Int.Symp.onArch.SupportforProg.Lang.andOperatingSys.,pp.211,March1982.R.M.Tomasulo,AnEfficientAlgorithmforExploitingMultipleArithmeticUnits,IBMJournalofResearchandDevelopment,pp.2533,January1967.CRAY1ComputerSystems,HardwareReferenceManual.ChippewaFalls,WICrayResearch,Inc.,1982.N.PangandJ.E.Smith,CRAY1SimulationTools,Tech.ReportECE8311,UniversityofWisconsinMadison,Dec.1983.S.WeissandJ.E.Smith,InstructionIssueLogicforPipelinedSupercomputers,Proc.11thAnnualSymposiumonComputerArchitecture,pp.110118,June1984.F.H.McMahon,FORTRANCPUPerformanceAnalysis.LawrenceLivermoreLaboratories,1972.W.HwuandY.NPatt,HPSm,aHighPerformanceRestrictedDataFlowArchitectureHavingMinimalFunctionality,Proc.13thAnnualSymposiumonComputerArchitecture,pp.297307,June1986.34
编号:201401051948126804    大小:795.51KB    格式:PDF    上传时间:2014-01-05
  【编辑】
5
关 键 词:
工业、机械、能源、设计、建模、模具、工学
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
  人人文库网所有资源均是用户自行上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作他用。
0条评论

还可以输入200字符

暂无评论,赶快抢占沙发吧。

当前资源信息

4.0
 
(2人评价)
浏览:18次
baixue100上传于2014-01-05

官方联系方式

客服手机:13961746681   
2:不支持迅雷下载,请使用浏览器下载   
3:不支持QQ浏览器下载,请用其他浏览器   
4:下载后的文档和图纸-无水印   
5:文档经过压缩,下载后原文更清晰   

相关资源

相关资源

相关搜索

工业、机械、能源、设计、建模、模具、工学  
关于我们 - 网站声明 - 网站地图 - 友情链接 - 网站客服客服 - 联系我们
copyright@ 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5