28-Exploiting Choice - Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor.pdf


Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer†, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm†

Dept of Computer Science and Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350
†Digital Equipment Corporation, HLO2-3/J3, 77 Reed Road, Hudson, MA 01749

In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996.

Abstract

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the best instructions to the processor.

1 Introduction

Simultaneous multithreading (SMT) is a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units. SMT combines the multiple-instruction-issue features of modern superscalars with the latency-hiding ability of multithreaded architectures. Unlike conventional multithreaded architectures [1, 2, 15, 23], which depend on fast context switching to share processor execution resources, all hardware contexts in an SMT processor are active simultaneously, competing each cycle for all available resources. This dynamic sharing of the functional units allows simultaneous multithreading to substantially increase throughput, attacking the two major impediments to processor utilization: long latencies and limited per-thread parallelism. Tullsen, et al. [27] showed the potential of an SMT processor to achieve significantly higher throughput than either a wide superscalar or a multithreaded processor. That paper also demonstrated the advantages of simultaneous multithreading over multiple processors on a single chip, due to SMT's ability to dynamically assign execution resources where needed each cycle.

Those results showed SMT's potential based on a somewhat idealized model. This paper extends that work in four significant ways. First, we demonstrate that the throughput gains of simultaneous multithreading are possible without extensive changes to a conventional, wide-issue superscalar processor. We propose an architecture that is more comprehensive, realistic, and heavily leveraged off existing superscalar technology. Our simulations show that a minimal implementation of simultaneous multithreading achieves throughput 1.8 times that of the unmodified superscalar; small tuning of this architecture increases that gain to 2.5 (reaching throughput as high as 5.4 instructions per cycle). Second, we show that SMT need not compromise single-thread performance. Third, we use our more detailed architectural model to analyze and relieve bottlenecks that did not exist in the more idealized model. Fourth, we show how simultaneous multithreading creates an advantage previously unexploitable in other architectures: namely, the ability to choose the best instructions, from all threads, for both fetch and issue each cycle. By favoring the threads most efficiently using the processor, we can boost the throughput of our limited resources. We present several simple heuristics for this selection process, and demonstrate how such heuristics, when applied to the fetch mechanism, can increase throughput by as much as 37%.

This paper is organized as follows. Section 2 presents our baseline simultaneous multithreading architecture, comparing it with existing superscalar technology. Section 3 describes our simulator and our workload, and Section 4 shows the performance of the baseline architecture. In Section 5, we examine the instruction fetch process, present several heuristics for improving it based on intelligent instruction selection, and give performance results to differentiate those heuristics. Section 6 examines the instruction issue process in a similar way. We then use the best designs chosen from our fetch and issue studies in Section 7 as a basis to discover bottlenecks for further performance improvement. We discuss related work in Section 8 and summarize our results in Section 9.

This research was supported by ONR grants N00014-92-J-1395 and N00014-94-1-1136, NSF grants CCR-9200832 and CDA-9123308, NSF PYI Award MIP-9058439, the Washington Technology Center, Digital Equipment Corporation, and fellowships from Intel and the Computer Measurement Group.

[Figure 1: Our base simultaneous multithreading hardware architecture. (Fetch Unit with PCs, Instruction Cache (8 instructions/cycle), Decode, Register Renaming; integer and floating point instruction queues; integer and fp registers; int/ld-store units and fp units; Data Cache.)]

2 A Simultaneous Multithreading Processor Architecture

In this section we present the architecture of our simultaneous multithreading processor. We show that the throughput gains provided by simultaneous multithreading are possible without adding undue complexity to a conventional superscalar processor design.

Our SMT architecture is derived from a high-performance, out-of-order, superscalar architecture (Figure 1, without the extra program counters) which represents a projection of current superscalar design trends 3-5 years into the future. This superscalar processor fetches up to eight instructions per cycle; fetching is controlled by a conventional system of branch target buffer, branch prediction, and subroutine return stacks. Fetched instructions are then decoded and passed to the register renaming logic, which maps logical registers onto a pool of physical registers, removing false dependences. Instructions are then placed in one of two instruction queues. Those instruction queues are similar to the ones used by the MIPS R10000 [20] and the HP PA-8000 [21], in this case holding instructions until they are issued. Instructions are issued to the functional units out-of-order when their operands are available. After completing execution, instructions are retired in-order, freeing physical registers that are no longer needed.

Our SMT architecture is a straightforward extension to this conventional superscalar design. We made changes only when necessary to enable simultaneous multithreading, and in general, structures were not replicated or resized to support SMT or a multithreaded workload. Thus, nearly all hardware resources remain completely available even when there is only a single thread in the system. The changes necessary to support simultaneous multithreading on that architecture are: multiple program counters and some mechanism by which the fetch unit selects one each cycle; a separate return stack for each thread for predicting subroutine return destinations; per-thread instruction retirement, instruction queue flush, and trap mechanisms; a thread id with each branch target buffer entry to avoid predicting phantom branches; and a larger register file, to support logical registers for all threads plus additional registers for register renaming. The size of the register file affects the pipeline (we add two extra stages) and the scheduling of load-dependent instructions, which we discuss later in this section.

Noticeably absent from this list is a mechanism to enable simultaneous multithreaded scheduling of instructions onto the functional units. Because any apparent dependences between instructions from different threads are removed by the register renaming phase, a conventional instruction queue (IQ) designed for dynamic scheduling contains all of the functionality necessary for simultaneous multithreading. The instruction queue is shared by all threads and an instruction from any thread in the queue can issue when its operands are available.

We fetch from one program counter (PC) each cycle. The PC is chosen, in round-robin order, from among those threads not already experiencing an I cache miss. This scheme provides simultaneous multithreading at the point of issue, but only fine-grain multithreading of the fetch unit. We will look in Section 5 at ways to extend simultaneous multithreading to the fetch unit. We also investigate alternative thread priority mechanisms for fetching.

A primary impact of multithreading on our architecture is on the size of the register file. We have a single register file, as thread-specific logical registers are mapped onto a completely shared physical register file by the register renaming. To support eight threads, we need a minimum of 8*32 = 256 physical integer registers (for a 32-register instruction set architecture), plus more to enable register renaming. Access to such a large register file will be slow, almost certainly affecting the cycle time of the machine.

To account for the size of the register file, we take two cycles to read registers instead of one. In the first cycle values are read into a buffer closer to the functional units. The instruction is sent to a similar buffer at the same time. The next cycle the data is sent to a functional unit for execution. Writes to the register file are treated similarly, requiring an extra register write stage. Figure 2 shows the pipeline modified for two-phase register access, compared to the pipeline of the original superscalar.

[Figure 2: The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. (a) Fetch, Decode, Rename, Queue, Reg Read, Exec, Commit: misfetch penalty 2 cycles, register usage 4-cycle minimum, mispredict penalty 6 cycles. (b) Fetch, Decode, Rename, Queue, Reg Read, Reg Read, Exec, Reg Write, Commit: misfetch penalty 2 cycles, register usage 6-cycle minimum, mispredict penalty 7 cycles, misqueue penalty 4 cycles.]

The two-stage register access has several ramifications on our architecture. First, it increases the pipeline distance between fetch and exec, increasing the branch misprediction penalty by 1 cycle. Second, it takes an extra cycle to write back results, requiring an extra level of bypass logic. Third, increasing the distance between queue and exec increases the period during which wrong-path instructions remain in the pipeline after a misprediction is discovered (the misqueue penalty in Figure 2). Wrong-path instructions are those instructions brought into the processor as a result of a branch misprediction. Those instructions consume instruction queue slots, renaming registers and possibly issue slots, all of which, on an SMT processor, could be used by other threads.

This pipeline does not increase the inter-instruction latency between most instructions. Dependent (single-cycle latency) instructions can still be issued on consecutive cycles, for example, as long as inter-instruction latencies are predetermined. That is the case for all instructions but loads. Since we are scheduling instructions a cycle earlier (relative to the exec cycle), load hit latency increases by one cycle (to two cycles). Rather than suffer this penalty, we schedule load-dependent instructions assuming a 1-cycle data latency, but squash those instructions in the case of an L1 cache miss or a bank conflict. There are two performance costs to this solution, which we call optimistic issue. Optimistically issued instructions that get squashed waste issue slots, and optimistic instructions must still be held in the IQ an extra cycle after they are issued, until it is known that they won't be squashed.

The last implication of the two-phase register access is that there are two more stages between rename and commit, thus increasing the minimum time that a physical register is held by an in-flight instruction. This increases the pressure on the renaming register pool. We assume, for each machine size, enough physical registers to support all active threads, plus 100 more registers to enable renaming, both for the integer file and the floating point file (i.e., for the single-thread results, we model 132 physical integer registers, and for an 8-thread machine, 356). We expect that in the 3-5 year time-frame, the scheme we have described will remove register file access from the critical path for a 4-thread machine, but 8 threads will still be a significant challenge. Nonetheless, extending our results to an 8-thread machine allows us to see trends beyond the 4-thread numbers and anticipates other solutions to this problem. The number of registers available for renaming determines the number of instructions that can be in the processor between the rename stage and the commit stage.

This architecture allows us to address several concerns about simultaneous multithreaded processor design. In particular, this paper shows that:

- Instruction scheduling is no more complex than on a dynamically scheduled superscalar.
- Register file datapaths are no more complex than in the superscalar, and the performance implications of the register file and its extended pipeline are small.
- The required instruction fetch throughput is attainable, even without any increase in fetch bandwidth.
- Unmodified (for an SMT workload) cache and branch prediction structures do not thrash on that workload.
- Even aggressive superscalar technologies, such as dynamic scheduling and speculative execution, are not sufficient to take full advantage of a wide-issue processor without simultaneous multithreading.

We have only presented an outline of the hardware architecture to this point; the next section provides more detail.

2.1 Hardware Details

The processor contains 3 floating point functional units and 6 integer units; four of the six integer units also execute loads and stores. The peak issue bandwidth out of the two instruction queues is therefore nine; however, the throughput of the machine is bounded by the peak fetch and decode bandwidths, which are eight instructions per cycle. We assume that all functional units are completely pipelined. Table 1 shows the instruction latencies, which are derived from the Alpha 21164 [8].

Table 1: Simulated instruction latencies

    Instruction Class    Latency
    integer multiply     8, 16
    conditional move     2
    compare              0
    all other integer    1
    FP divide            17, 30
    all other FP         4
    load (cache hit)     1

We assume a 32-entry integer instruction queue (which handles integer instructions and all load/store operations) and a 32-entry floating point queue, not significantly larger than the HP PA-8000 [21], which has two 28-entry queues.

The caches (Table 2) are multi-ported by interleaving them into banks, similar to the design of Sohi and Franklin [26]. We model lockup-free caches and TLBs. TLB misses require two full memory accesses and no execution resources. We model the memory subsystem in great detail, simulating bandwidth limitations and access conflicts at multiple levels of the hierarchy, to address the concern that memory throughput could be a limiting condition for simultaneous multithreading.

Table 2: Details of the cache hierarchy

                             ICache     DCache    L2       L3
    Size                     32 KB      32 KB     256 KB   2 MB
    Associativity            DM         DM        4-way    DM
    Line Size                64         64        64       64
    Banks                    8          8         8        1
    Transfer time            1 cycle    1         1        4
    Accesses/cycle           var (1-4)  4         1        1/4
    Cache fill time          2 cycles   2         2        8
    Latency to next level    6          6         12       62

Each cycle, one thread is given control of the fetch unit, chosen from among those not stalled for an instruction cache (I cache) miss. If we fetch from multiple threads, we never attempt to fetch from threads that conflict (on an I cache bank) with each other, although they may conflict with other I cache activity (cache fills).

Branch prediction is provided by a decoupled branch target buffer (BTB) and pattern history table (PHT) scheme [4]. We use a 256-entry BTB, organized as four-way set associative. The 2K x 2-bit PHT is accessed by the XOR of the lower bits of the address and the global history register [18, 30]. Return destinations are predicted with a 12-entry return stack (per context).

We assume an efficient, but not perfect, implementation of dynamic memory disambiguation. This is emulated by using only part of the address (10 bits) to disambiguate memory references, so that it is occasionally over-conservative.

3 Methodology

The methodology in this paper closely follows the simulation and measurement methodology of [27]. Our simulator uses emulation-based, instruction-level simulation, and borrows significantly from MIPSI [22], a MIPS-based simulator. The simulator executes unmodified Alpha object code and models the execution pipelines, memory hierarchy, TLBs, and the branch prediction logic of the processor described in Section 2.

In an SMT processor a branch misprediction introduces wrong-path instructions that interact with instructions from other threads. To model this behavior, we fetch down wrong paths, introduce those instructions into the instruction queues, track their dependences, and issue them. We eventually squash all wrong-path instructions a cycle after a branch misprediction is discovered in the exec stage. Our throughput results only count useful instructions.

Our workload comes primarily from the SPEC92 benchmark suite [7]. We use five floating point programs (alvinn, doduc, fpppp, ora, and tomcatv) and two integer programs (espresso and xlisp) from that suite, and the document typesetting program TeX. We assign a distinct program to each thread in the processor: the multiprogrammed workload stresses our architecture more than a parallel program by presenting threads with widely varying program characteristics and with no overlap of cache, TLB or branch prediction usage. To eliminate the effects of benchmark differences, a single data point is composed of 8 runs, each T * 300 million instructions in length, where T is the number of threads. Each of the 8 runs uses a different combination of the benchmarks.

We compile each program with the Multiflow trace scheduling compiler [17], modified to produce Alpha code. In contrast to [27], we turn off trace scheduling in the compiler for this study, for two reasons. In our measurements, we want to differentiate between useful and useless speculative instructions, which is easy with hardware speculation, but not possible for software speculation with our system. Also, software speculation is not as beneficial on an architecture which features hardware speculation, and in some cases is harmful. However, the Multiflow compiler is still a good choice for our compilation engine, because of the high quality of the loop unrolling, instruction scheduling and alignment, and other optimizations, as well as the ease with which the machine model can be changed. The benchmarks are compiled to optimize single-thread performance on the base hardware.

4 Performance of the Base Hardware Design

In this section we examine the performance of the base architecture and identify opportunities for improvement. Figure 3 shows that with only a single thread running on our SMT architecture, the throughput is less than 2% below a superscalar without SMT support. The drop in throughput is due to the longer pipeline (described in Section 2) used by the SMT processor. Its peak throughput is 84% higher than the superscalar. This gain is achieved with virtually no tuning of the base architecture for simultaneous multithreading. This design combines low single-thread impact with high speedup for even a few threads, enabling simultaneous multithreading to reap benefits even in an environment where multiple processes are running only a small fraction of the time. We also note, however, that the throughput peaks before 8 threads, and the processor utilization, at less than 50% of the 8-issue processor, is well short of the potential shown in [27].

[Figure 3: Instruction throughput (instructions per cycle) vs. number of threads for the base hardware architecture, compared with the unmodified superscalar.]

We make several conclusions about the potential bottlenecks of this system as we approach 8 threads, aided by Figure 3 and Table 3. Issue bandwidth is clearly not a bottleneck, as the throughput represents a fraction of available issue bandwidth, and our data shows that no functional unit type is being overloaded. We appear to have enough physical registers. The caches and branch prediction logic are being stressed more heavily at 8 threads, but we expect the latency-hiding potential of the additional threads to make up for those drops. The culprit appears to be one or more of the following three problems: (1) IQ size: IQ-full conditions are common, 12% to 21% of cycles total for the two queues; (2) fetch throughput: even in those cycles where we don't experience an IQ-full condition, our data shows that we are sustaining only 4.2 useful instructions fetched per cycle (4.5 including wrong-path); and (3) lack of parallelism: although the queues are reasonably full, we find fewer than four out of, on average, 27 instructions per cycle to issue. We expect eight threads to provide more parallelism, so perhaps we have the wrong instructions in the instruction queues.

Table 3: The result of increased multithreading on some low-level metrics for the base architecture.

                                           Number of Threads
    Metric                                 1       4       8
    out of registers (% of cycles)         3       7       3
    I cache miss rate (%)                  2.5     7.8     14.1
      misses per thousand instructions     6       17      29
    D cache miss rate (%)                  3.1     6.5     11.3
      misses per thousand instructions     12      25      43
    L2 cache miss rate (%)                 17.6    15.0    12.5
      misses per thousand instructions     3       5       9
    L3 cache miss rate (%)                 55.1    33.6    45.4
      misses per thousand instructions     1       3       4
    branch misprediction rate (%)          5.0     7.4     9.1
    jump misprediction rate (%)            2.2     6.4     12.9
    integer IQ-full (% of cycles)          7       10      9
    fp IQ-full (% of cycles)               14      9       3
    avg (combined) queue population        25      25      27
    wrong-path instructions fetched (%)    24      7       7
    wrong-path instructions issued (%)     9       4       3

The rest of this paper focuses on improving this base architecture. The next section addresses each of the problems identified here with different fetch policies and IQ configurations. Section 6 examines ways to prevent issue waste, and Section 7 re-examines the improved architecture for new bottlenecks, identifying directions for further improvement.

5 The Fetch Unit: In Search of Useful Instructions

In this section we examine ways to improve fetch throughput without increasing the fetch bandwidth. Our SMT architecture shares a single fetch unit among eight threads. We can exploit the high level of competition for the fetch unit in two ways not possible with single-threaded processors: (1) the fetch unit can fetch from multiple threads at once, increasing our utilization of the fetch bandwidth, and (2) it can be selective about which thread or threads to fetch from. Because not all paths provide equally useful instructions in a particular cycle, an SMT processor can benefit by fetching from the threads that will provide the best instructions.

We examine a variety of fetch architectures and fetch policies that exploit those advantages. Specifically, they attempt to improve fetch throughput by addressing three factors: fetch efficiency, by partitioning the fetch unit among threads (Section 5.1); fetch effectiveness, by improving the quality of the instructions fetched (Section 5.2); and fetch availability, by eliminating conditions that block the fetch unit (Section 5.3).

5.1 Partitioning the Fetch Unit

Recall that our baseline architecture fetches up to eight instructions from one thread each cycle. The frequency of branches in typical instruction streams and the misalignment of branch destinations make it difficult to fill the entire fetch bandwidth from one thread, even for smaller block sizes [5, 24]. In this processor, we can spread the burden of filling the fetch bandwidth among multiple threads. For example, the probability of finding four instructions from each of two threads should be greater than that of finding eight from one thread.

In this section, we attempt to reduce fetch block fragmentation (our term for the various factors that prevent us from fetching the maximum number of instructions) by fetching from multiple threads each cycle, while keeping the maximum fetch bandwidth (but not necessarily the I cache bandwidth) constant. We evaluate several fetching schemes, which are labeled alg.num1.num2, where alg is the fetch selection method (in this section threads are always selected using a round-robin priority scheme), num1 is the number of threads that can fetch in 1 cycle, and num2 is the maximum number of instructions fetched per thread in 1 cycle. The maximum number of total instructions fetched is always limited to eight. For each of the fetch partitioning policies, the cache is always 32 kilobytes organized into 8 data banks; a given bank can do just one access per cycle.

RR.1.8: This is the baseline scheme from Section 4. Each cycle one thread fetches as many as eight instructions. The thread is determined by a round-robin priority scheme from among those not currently suffering an I cache miss. In this scheme the I cache is indistinguishable from that on a single-threaded superscalar. Each cache bank has its own address decoder and output drivers; each cycle, only one of the banks drives the cache output bus, which is 8 instructions (32 bytes) wide.

RR.2.4, RR.4.2: These schemes fetch fewer instructions per thread from more threads (four each from two threads, or two each from four threads). If we try to partition the fetch bandwidth too finely, however, we may suffer thread shortage, where fewer threads are available than are required to fill the fetch bandwidth. For these schemes, multiple cache addresses are driven to each cache data bank, each of which now has a multiplexer before its address decoder, to select one cache index per cycle. Since the cache banks are single-ported, bank-conflict logic is needed to ensure that each address targets a separate bank. RR.2.4 has two cache output buses, each four instructions wide, while RR.4.2 has four output buses, each two instructions wide. For both schemes, the total width of the output buses is 8 instructions (identical to that in RR.1.8), but additional circuitry is needed so each bank is capable of driving any of the multiple (now smaller) output buses, and is able to select one or none to drive in a given cycle. Also, the cache tag store logic must be replicated or multiple-ported in order to calculate hit/miss for each address looked up per cycle. Thus, the hardware additions are: the address mux; multiple address buses; selection logic on the output drivers; the bank conflict logic; and multiple hit/miss calculations. The changes required for RR.2.4 would have a negligible impact on area and cache access time. The changes for RR.4.2 are more extensive, and would be more difficult to do without affecting area or access time. These schemes actually reduce the latency in the decode and rename stages, as the maximum length of dependency chains among fetched instructions is reduced by a factor of 2 and 4, respectively.

RR.2.8: This scheme attacks fetch block fragmentation without suffering from thread shortage by fetching eight instructions more flexibly from two threads. This can be implemented by reading an eight-instruction block for each thread (16 instructions total), then combining them. We take as many instructions as possible from the first thread, then fill in with instructions from the second, up to eight total. Like RR.2.4, two addresses must be routed to each cache bank, then multiplexed before the decoder; bank-conflict logic and two hit/miss calculations per cycle are necessary; and each bank drives one of the two output buses. Now, however, each output bus is eight instructions wide, which doubles the bandwidth out of the cache compared to any of the previous schemes. This could be done without greatly affecting area or cycle time, as the additional bussing could probably be done without expanding the cache layout. In addition, logic to select and combine the instructions is necessary, which might or might not require an additional pipe stage. Our simulations assume it does not.

[Figure 4: Instruction throughput for the different instruction cache interfaces (RR.1.8, RR.2.4, RR.4.2, RR.2.8) with round-robin instruction scheduling.]

Figure 4 shows that we can get higher maximum throughput by splitting the fetch over multiple threads. For example, the RR.2.4 scheme outperforms RR.1.8 at 8 threads by 9%. However, better maximum throughput comes at the cost of a 12% single-thread penalty; in fact, RR.2.4 does not surpass RR.1.8 until 4 threads. The RR.4.2 scheme needs 6 threads to surpass RR.1.8 and never catches the 2-thread schemes, suffering from thread shortage. The RR.2.8 scheme provides the best of both worlds: few-threads performance like RR.1.8 and many-threads performance like RR.2.4. However, the higher throughput of this scheme puts more pressure on the instruction queues, causing IQ-full conditions at a rate of 18% (integer) and 8% (fp) with 8 threads.

With the RR.2.8 scheme we have improved the maximum throughput by 10% without compromising single-thread performance. This was achieved by a combination of (1) partitioning the fetch bandwidth over multiple threads, and (2) making that partition flexible. This is the same approach (although in a more limited fashion here) that simultaneous multithreading uses to improve the throughput of the functional units [27].

5.2 Exploiting Thread Choice in the Fetch Unit

The efficiency of the entire processor is affected by the quality of instructions fetched. A multithreaded processor has a unique ability to control that factor. In this section, we examine fetching policies aimed at identifying the best thread or threads available to fetch each cycle. Two factors make one thread less desirable than another. The first is the probability that a thread is following a wrong path as a result of an earlier branch misprediction. Wrong-path instructions consume not only fetch bandwidth, but also registers, IQ space, and possibly issue bandwidth. The second factor is the length of time the fetched instructions will be in the queue before becoming issuable. We maximize the throughput of a queue of bounded size by feeding it instructions that will spend the least time in the queue. If we fetch too many instructions that block for a long time, we eventually fill the IQ with unissuable instructions, a condition we call IQ clog. This restricts both fetch and issue throughput, causing the fetch unit to go idle and preventing issuable instructions from getting into the IQ. Both of these factors (wrong-path probability and expected queue time) improve over time, so a thread becomes more desirable as we delay fetching it.

We define several fetch policies, each of which attempts to improve on the round-robin priority policy using feedback from other parts of the processor. The first attacks wrong-path fetching, the others attack IQ clog. They are:

BRCOUNT: Here we attempt to give highest priority to those threads that are least likely to be on a wrong path. We do this by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.

MISSCOUNT: This policy detects an important cause of IQ clog. A long memory latency can cause dependent instructions to back up in the IQ waiting for the load to complete, eventually filling the queue with blocked instructions from one thread. This policy prevents that by giving priority to those threads that have the fewest outstanding D cache misses.

ICOUNT: This is a more general solution to IQ clog. Here priority is given to threads with the fewest instructions in decode, rename, and the instruction queues. This achieves three purposes: (1) it prevents any one thread from filling the IQ, (2) it gives highest priority to threads that are moving instructions through the IQ most efficiently, and (3) it provides a more even mix of instructions from the available threads, maximizing the parallelism in the queues. If cache misses are the dominant cause of IQ clog, MISSCOUNT may perform better, since it gets cache-miss feedback to the fetch unit more quickly. If the causes are more varied, ICOUNT should perform better.

IQPOSN: Like ICOUNT, IQPOSN strives to minimize IQ clog and bias toward efficient threads. It gives lowest priority to those threads with instructions closest to the head of either the integer or floating point instruction queues (the oldest instruction is at the head of the queue). Threads with the oldest instructions will be most prone to IQ clog, and those making the best progress will have instructions farthest from the head of the queue. This policy does not require a counter for each thread, as do the previous three policies.

[Figure 5: Instruction throughput for fetching based on several priority heuristics (BRCOUNT, MISSCOUNT, ICOUNT, IQPOSN, each with the 1.8 and 2.8 partitioning schemes), all compared to the baseline round-robin scheme. The results for 1 thread are the same for all schemes, and thus not shown.]

Like any control system, the efficiency of these mechanisms is limited by the feedback latency resulting, in this case, from feeding data from later pipeline stages back to the fetch stage. For example, by the time instructions enter the queue stage or the exec stage, the information used to fetch them is three or (at least) six cycles old, respectively.

Both the branch-counting and the miss-counting policies tend to produce frequent ties. In those cases, the tie-breaker is round-robin priority.

Figure 5 shows that all of the fetch heuristics provide speedup over round-robin. Branch counting and cache-miss counting provide moderate speedups, but only when the processor is saturated with many threads. Instruction counting, in contrast, produces more significant improvements regardless of the number of threads. IQPOSN provides similar results to ICOUNT, being within 4% at all times, but never exceeding it.

The branch-counting heuristic does everything we ask of it. It reduces wrong-path instructions, from 8.2% of fetched instructions to 3.6%, and from 3.6% of issued instructions to 0.8% (RR.1.8 vs. BRCOUNT.1.8 with eight threads). And it improves throughput by as much as 8%. Its weakness is that the wrong-path problem it solves is not large on this processor, which has already attacked the problem with simultaneous multithreading. Even with the RR scheme, simultaneous multithreading reduces fetched wrong-path instructions from 16% with one thread to 8% with 8 threads.

Cache-miss counting also achieves throughput gains as high as 8% over RR, but in general the gains are much lower. It is not particularly effective at reducing IQ clog, as we get IQ-full conditions 12% of the time on the integer queue and 14% on the floating point queue (for MISSCOUNT.2.8 with 8 threads). These results indicate that IQ clog is more than simply the result of long memory latencies.

Table 4: Some low-level metrics for the round-robin and instruction-counting priority policies (and the 2.8 fetch partitioning scheme).

                                      1 Thread   8 Threads
    Metric                                       RR      ICOUNT
    integer IQ-full (% of cycles)     7          18      6
    fp IQ-full (% of cycles)          14         8       1
    avg queue population              25         38      30
    out of registers (% of cycles)    3          8       5

The instruction-counting heuristic provides instruction throughput as high as 5.3 instructions per cycle, a throughput gain over the unmodified superscalar of 2.5. It outperforms the best round-robin result by 23%. Instruction counting is as effective at 2 and 4 threads (in benefit over round-robin) as it is at 8 threads. It nearly eliminates IQ clog (see IQ-full results in Table 4) and greatly improves the mix of instructions in the queues (we are finding more issuable instructions despite having fewer instructions in the two queues). Intelligent fetching with this heuristic is of greater benefit than partitioning the fetch unit, as the ICOUNT.1.8 scheme consistently outperforms RR.2.8.

Table 4 points to a surprising result. As a result of simultaneous multithreaded instruction issue and the ICOUNT fetch heuristics, we actually put less pressure on the same instruction queue with eight threads than with one, having sharply reduced IQ-full conditions. It also reduces pressure on the register file (vs. RR) by keeping fewer instructions in the queue.

BRCOUNT and ICOUNT each solve different problems, and perhaps the best performance could be achieved from a weighted combination of them; however, the complexity of the feedback mechanism increases as a result. By itself, instruction counting clearly provides the best gains.

Given our measurement methodology, it is possible that the throughput increases could be overstated if a fetch policy simply favors those threads with the most inherent instruction-level parallelism or the best cache behavior, thus achieving improvements that would not be seen in practice. However, with the ICOUNT.2.8 policy, the opposite happens. Our results show that this scheme favors threads with lower single-thread ILP, thus its results include a higher sample of instructions from the slow threads than either the superscalar results or the RR results. If anything, then, the ICOUNT.2.8 improvements are understated.

In summary, we have identified a simple heuristic that is very successful at identifying the best threads to fetch. Instruction counting dynamically biases toward threads that will use processor resources most efficiently, thereby improving processor throughput as well as relieving pressure on scarce processor resources: the instruction queues and the registers.

5.3 Unblocking the Fetch Unit

By fetching from multiple threads and using intelligent fetch heuristics, we have significantly increased fetch throughput and efficiency. The more efficiently we are using the fetch unit, the more we stand to lose when it becomes blocked. In this section we examine schemes that prevent two conditions that cause the fetch unit to miss fetch opportunities, specifically IQ-full conditions and I cache misses. The two schemes are:

BIGQ: The primary restriction on IQ size is not the chip area, but the time to search it; therefore we can increase its size as long as we don't increase the search space. In this scheme, we double the sizes of the instruction queues, but only search the first 32 entries for issue. This scheme allows the queues to buffer instructions from the fetch unit when the IQ overflows.

ITAG: When a thread is selected for fetching but experiences a cache miss, we lose the opportunity to fetch that cycle. If we do the I cache tag lookups a cycle early, we can fetch around cache misses: cache miss accesses are still started immediately, but only non-missing threads are chosen for fetch. Because we need to have the fetch address a cycle early, we essentially add a stage to the front of the pipeline, increasing the misfetch and mispredict penalties. This scheme requires one or more additional ports on the I cache tags, so that potential replacement threads can be looked up at the same time.

[Figure 6: Instruction throughput for the 64-entry queue (BIGQ) and early I cache tag lookup (ITAG), when coupled with the ICOUNT fetch policy (1.8 and 2.8 partitioning schemes).]

Although the BIGQ scheme improves the performance of the round-robin scheme (not shown), 1.5-2% across the board, Figure 6 shows that the bigger queues add no significant improvement to the ICOUNT policy. In fact, it is actually detrimental for several thread configurations. This is because the buffering effect of the big queue scheme brings instructions into the issuable part of the instruction queue that may have been fetched many cycles
clesearlier,usingpriorityinformationthatisnowoutofdate.Theresultsindicatethatusinguptodatepriorityinformationismoreimportantthanbuffering.Theseresultsshowthatintelligentfetchheuristicshavemadetheextrainstructionqueuehardwareunnecessary.ThebiggerqueuebyitselfisactuallylesseffectiveatreducingIQclogthantheICOUNTscheme.With8threads,thebiggerqueuesaloneBIGQ,RR.2.8reduceIQfullconditionsto11integerand0fp,whileinstructioncountingaloneICOUNT.2.8reducesthemto6and1.CombiningBIGQandICOUNTdropsthemto3and0.EarlyIcachetaglookupbooststhroughputasmuchas8IssueNumberofThreadsUselessInstructionsMethod12468wrongpathoptimisticOLDEST2.103.304.625.095.2943OPTLAST2.073.304.595.095.2942SPECLAST2.103.314.595.095.2943BRANCHFIRST2.073.294.585.085.2846Table5Instructionthroughputinstructionspercyclefortheissuepriorityschemes,andthepercentageofuselessinstructionsissuedwhenrunningwith8threads.overICOUNT.ItismosteffectivewhenfetchingonethreadICOUNT.1.8,wherethecostofalostfetchslotisgreater.However,itimprovestheICOUNT.2.8resultsnomorethan2,astheflexibilityofthe2.8schemealreadyhidessomeofthelostfetchbandwidth.Inaddition,ITAGlowersthroughputwithfewthreads,wherecompetitionforthefetchslotsislowandthecostofthelongermispredictionpenaltyishighest.Usingacombinationofpartitioningthefetchunit,intelligentfetching,andearlyIcachetaglookups,wehaveraisedthepeakperformanceofthebaseSMTarchitectureby375.4instructionspercyclevs.3.9.Ourmaximumspeeduprelativetoaconventionalsuperscalarhasgoneupproportionately,from1.8to2.5timesthethroughput.Thatgaincomesfromexploitingcharacteristicsofasimultaneousmultithreadingprocessornotavailabletoasinglethreadedmachine.Highfetchthroughputmakesissuebandwidthamorecriticalresource.Wefocusonthisfactorinthenextsection.6ChoosingInstructionsForIssueMuchasthefetchunitinasimultaneousmultithreadingprocessorcantakeadvantageoftheabilitytochoosewhichthreadstofetch,theissuelogichastheabilitytochooseinstructionsforissue.Adynamicallyscheduledsinglethreadedprocessormayhaveenoughreadyinstructionstobeable
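As a software illustration of the fetch mechanism summarized above — a minimal sketch, not the paper's hardware priority logic; the function names, the `available` supply limit, and the `excluded` set (standing in for ITAG's early tag lookup) are assumptions made for this example — an ICOUNT.2.8-style fetch cycle might be modeled as:

```python
def icount_select(iq_counts, excluded=(), k=2):
    """Return the k highest-priority threads: those holding the fewest
    instructions in the decode, rename, and instruction-queue stages.
    Threads with an I-cache miss in flight (`excluded`) are skipped,
    mimicking the ITAG early tag lookup."""
    ready = [t for t in range(len(iq_counts)) if t not in excluded]
    return sorted(ready, key=lambda t: iq_counts[t])[:k]

def fetch_cycle(iq_counts, available, excluded=(), slots=8):
    """One fetch cycle of an ICOUNT.2.8-style policy: up to 8 instruction
    slots are shared by the 2 highest-priority threads, the second thread
    filling whatever the first could not supply."""
    fetched = {}
    for t in icount_select(iq_counts, excluded):
        # A thread's supply is limited by cache-line boundaries and
        # taken branches, captured here as `available[t]`.
        n = min(slots, available[t])
        if n:
            fetched[t] = n
            slots -= n
        if slots == 0:
            break
    return fetched
```

The ".2" is the number of threads selected per cycle and the ".8" the total fetch slots they share; the hardware computes the same priority with counters that are updated as instructions enter and leave the pre-issue stages.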
6 Choosing Instructions For Issue

Much as the fetch unit in a simultaneous multithreading processor can take advantage of the ability to choose which threads to fetch, the issue logic has the ability to choose instructions for issue. A dynamically scheduled single-threaded processor may have enough ready instructions to be able to choose among them, but in an SMT processor the options are more diverse. Also, because we have higher throughput than a single-threaded superscalar processor, the issue bandwidth is potentially a more critical resource, so avoiding issue-slot waste may be more beneficial. In this section, we examine issue priority policies aimed at preventing issue waste.

Issue-slot waste comes from two sources: wrong-path instructions (resulting from mispredicted branches) and optimistically issued instructions. Recall from Section 2 that we optimistically issue load-dependent instructions a cycle before we have D-cache hit information; in the case of a cache miss or bank conflict, we have to squash the optimistically issued instruction, wasting that issue slot.

In a single-threaded processor, choosing the instructions least likely to be on a wrong path is always achieved by selecting the oldest instructions, those deepest into the instruction queue. In a simultaneous multithreading processor, the position of an instruction in the queue is no longer the best indicator of its level of speculation, as right-path instructions are intermingled in the queues with wrong-path ones.

The policies we examine are OLDEST_FIRST, our default issue algorithm up to this point; OPT_LAST and SPEC_LAST, which issue optimistic and speculative instructions (more specifically, any instruction behind a branch from the same thread in the instruction queue), respectively, only after all others have been issued; and BRANCH_FIRST, which issues branches as early as possible in order to identify mispredicted branches quickly. The default fetch algorithm for each of these schemes is ICOUNT.2.8.

The strong message of Table 5 is that issue bandwidth is not yet a bottleneck. Even when it does become a critical resource, the amount of improvement we get from not wasting it is likely to be bounded by the percentage of our issue bandwidth given to useless instructions, which currently stands at 7% (4% wrong-path instructions, 3% squashed optimistic instructions). Because we don't often have more issuable instructions than functional units, we aren't able to (and don't need to) reduce that significantly. The SPEC_LAST scheme is unable to reduce the number of useless instructions at all, while OPT_LAST brings it down to 6%. BRANCH_FIRST actually increases it to 10%, as branch instructions are often load-dependent; issuing them as early as possible therefore often means issuing them optimistically.
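To make the four policies concrete, here is a hypothetical sketch — the `ReadyInstr` fields and sort keys are illustrative assumptions, not the paper's implementation — of the ordering each policy imposes on the ready instructions in the queue:

```python
from dataclasses import dataclass

@dataclass
class ReadyInstr:
    age: int           # cycles since fetch; higher = deeper in the queue
    is_branch: bool    # branch instruction
    optimistic: bool   # load-dependent, issued before D-cache hit is known
    speculative: bool  # behind an unresolved branch of the same thread

def issue_order(ready, policy="OLDEST_FIRST"):
    """Sort ready instructions under each issue priority policy; the issue
    logic would then grant functional units in this order."""
    keys = {
        # Oldest (deepest in the queue) first.
        "OLDEST_FIRST": lambda i: -i.age,
        # Defer optimistically issued instructions behind all others.
        "OPT_LAST":     lambda i: (i.optimistic, -i.age),
        # Defer any instruction behind an unresolved same-thread branch.
        "SPEC_LAST":    lambda i: (i.speculative, -i.age),
        # Issue branches first, to expose mispredictions quickly.
        "BRANCH_FIRST": lambda i: (not i.is_branch, -i.age),
    }
    return sorted(ready, key=keys[policy])
```

Each alternative is a single extra sort key layered over age, which is why, as noted above, the alternate schemes amount to additional passes over the IQ search.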
A combined scheme (OPT_LAST and BRANCH_FIRST) might reduce that side effect, but would be unlikely to have much effect on throughput. Since each of the alternate schemes potentially introduces multiple passes over the IQ search, it is convenient that the simplest mechanism still works well.

7 Where Are the Bottlenecks Now?

We have shown that proposed changes to the instruction queues and the issue logic are unnecessary to achieve the best performance with this architecture, but that significant gains can be produced by moderate changes to the instruction fetch mechanisms. Here we examine that architecture more closely (using ICOUNT.2.8 as our new baseline), identifying likely directions for further improvement. In this section we present the results of experiments intended to identify bottlenecks in the new design. For components that are potential bottlenecks, we quantify the size of the bottleneck by measuring the impact of relieving it. For some of the components that are not bottlenecks, we examine whether it is possible to simplify them without creating a bottleneck. Because we are identifying bottlenecks rather than proposing architectures, we are no longer bound by implementation practicalities in these experiments.

The Issue Bandwidth — The experiments in Section 6 indicate that issue bandwidth is not a bottleneck. In fact, we found that even an infinite number of functional units increases throughput by only 0.5% at 8 threads.

Instruction Queue Size — Results in Section 5 would, similarly, seem to imply that the size of the instruction queues is not a bottleneck, particularly with instruction counting; however, the schemes we examined are not the same as larger, searchable queues, which would also increase available parallelism. Nonetheless, the exper-
