




Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer†, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm†

Dept. of Computer Science and Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350
†Digital Equipment Corporation, HLO2-3/J3, 77 Reed Road, Hudson, MA 01749

Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996.

Abstract

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.

1 Introduction

Simultaneous multithreading (SMT) is a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units. SMT combines the multiple-instruction-issue features of modern superscalars with the latency-hiding ability of multithreaded architectures. Unlike conventional multithreaded architectures [1, 2, 15, 23], which depend on fast context switching to share processor execution resources, all hardware contexts in an SMT processor are active simultaneously, competing each cycle for all available resources. This dynamic sharing of the functional units allows simultaneous multithreading to substantially increase throughput, attacking the two major impediments to processor utilization: long latencies and limited per-thread parallelism. Tullsen, et al. [27] showed the potential of an SMT processor to achieve significantly higher throughput than either a wide superscalar or a multithreaded processor. That paper also demonstrated the advantages of simultaneous multithreading over multiple processors on a single chip, due to SMT's ability to dynamically assign execution resources where needed each cycle.

Those results showed SMT's potential based on a somewhat idealized model. This paper extends that work in four significant ways. First, we demonstrate that the throughput gains of simultaneous multithreading are possible without extensive changes to a conventional, wide-issue superscalar processor. We propose an architecture that is more comprehensive, realistic, and heavily leveraged off existing superscalar technology. Our simulations show that a minimal implementation of simultaneous multithreading achieves throughput 1.8 times that of the unmodified superscalar; small tuning of this architecture increases that gain to 2.5 (reaching throughput as high as 5.4 instructions per cycle). Second, we show that SMT need not compromise single-thread performance. Third, we use our more detailed architectural model to analyze and relieve bottlenecks that did not exist in the more idealized model. Fourth, we show how simultaneous multithreading creates an advantage previously unexploitable in other architectures: namely, the ability to choose the "best" instructions, from all threads, for both fetch and issue each cycle. By favoring the threads most efficiently using the processor,
we can boost the throughput of our limited resources. We present several simple heuristics for this selection process, and demonstrate how such heuristics, when applied to the fetch mechanism, can increase throughput by as much as 37%.

This paper is organized as follows. Section 2 presents our baseline simultaneous multithreading architecture, comparing it with existing superscalar technology. Section 3 describes our simulator and our workload, and Section 4 shows the performance of the baseline architecture. In Section 5, we examine the instruction fetch process, present several heuristics for improving it based on intelligent instruction selection, and give performance results to differentiate those heuristics. Section 6 examines the instruction issue process in a similar way. We then use the best designs chosen from our fetch and issue studies in Section 7 as a basis to discover bottlenecks for further performance improvement. We discuss related work in Section 8 and summarize our results in Section 9.

(This research was supported by ONR grants N00014-92-J-1395 and N00014-94-1-1136, NSF grants CCR-9200832 and CDA-9123308, NSF PYI Award MIP-9058439, the Washington Technology Center, Digital Equipment Corporation, and fellowships from Intel and the Computer Measurement Group.)

Figure 1: Our base simultaneous multithreading hardware architecture. (The figure shows per-thread PCs feeding the fetch unit and instruction cache, an 8-wide decode stage, register renaming, separate integer and floating point instruction queues and register files, the fp units, the int/ld-store units, and the data cache.)

2 A Simultaneous Multithreading Processor Architecture

In this section we present the architecture of our simultaneous multithreading processor. We show that the throughput gains provided by simultaneous multithreading are possible without adding undue complexity to a conventional superscalar processor design.

Our SMT architecture is derived from a high-performance, out-of-order, superscalar architecture (Figure 1, without the extra program counters) which represents a projection of current superscalar design trends 3-5 years into the future. This superscalar processor fetches up to eight instructions per cycle; fetching is controlled by a conventional system of branch target buffer, branch prediction, and subroutine return stacks. Fetched instructions are then decoded and passed to the register renaming logic, which maps logical registers onto a pool of physical registers, removing false dependences. Instructions are then placed in one of two instruction queues. Those instruction queues are similar to the ones used by the MIPS R10000 [20] and the HP PA-8000 [21], in this case holding instructions until they are issued. Instructions are issued to the functional units out-of-order when their operands are available. After completing execution, instructions are retired in-order, freeing physical registers that are no longer needed.

Our SMT architecture is a straightforward extension to this conventional superscalar design. We made changes only when necessary to enable simultaneous multithreading, and in general, structures were not replicated or resized to support SMT or a multithreaded workload. Thus, nearly all hardware resources remain completely available even when there is only a single thread in the system. The changes necessary to support simultaneous multithreading on that architecture are:

- multiple program counters and some mechanism by which the fetch unit selects one each cycle,
- a separate return stack for each thread for predicting subroutine return destinations,
- per-thread instruction retirement, instruction queue flush, and trap mechanisms,
- a thread id with each branch target buffer entry to avoid predicting phantom branches, and
- a larger register file, to support logical registers for all threads plus additional registers for register renaming. The size of the register file affects the pipeline (we add two extra stages) and the scheduling of load-dependent instructions, which we discuss later in this section.
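To make the preceding list concrete, the per-thread front-end state it implies can be collected in a small container. This is only an illustrative sketch with invented names (ThreadContext, BTBEntry, and their fields are ours, not the paper's); it simply gathers the items enumerated above.

```python
# Illustrative container for the per-thread state added for SMT,
# mirroring the list above. Names are assumptions, not from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ThreadContext:
    thread_id: int
    pc: int                                                   # one program counter per thread
    return_stack: List[int] = field(default_factory=list)     # per-thread subroutine return stack
    retire_ptr: int = 0                                        # per-thread in-order retirement state
    trap_pending: bool = False                                 # per-thread trap / queue-flush mechanism

@dataclass
class BTBEntry:
    tag: int
    target: int
    thread_id: int    # keeps one thread's entries from predicting "phantom branches" for another
```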
Noticeably absent from this list is a mechanism to enable simultaneous multithreaded scheduling of instructions onto the functional units. Because any apparent dependences between instructions from different threads are removed by the register renaming phase, a conventional instruction queue (IQ) designed for dynamic scheduling contains all of the functionality necessary for simultaneous multithreading. The instruction queue is shared by all threads, and an instruction from any thread in the queue can issue when its operands are available.

We fetch from one program counter (PC) each cycle. The PC is chosen, in round-robin order, from among those threads not already experiencing an I cache miss. This scheme provides simultaneous multithreading at the point of issue, but only fine-grain multithreading of the fetch unit. We will look in Section 5 at ways to extend simultaneous multithreading to the fetch unit. We also investigate alternative thread priority mechanisms for fetching.

A primary impact of multithreading on our architecture is on the size of the register file. We have a single register file, as thread-specific logical registers are mapped onto a completely shared physical register file by the register renaming. To support eight threads, we need a minimum of 8*32 = 256 physical integer registers (for a 32-register instruction set architecture), plus more to enable register renaming. Access to such a large register file will be slow, almost certainly affecting the cycle time of the machine.

To account for the size of the register file, we take two cycles to read registers instead of one. In the first cycle values are read into a buffer closer to the functional units. The instruction is sent to a similar buffer at the same time. The next cycle the data is sent to a functional unit for execution. Writes to the register file are treated similarly, requiring an extra register write stage. Figure 2 shows the pipeline modified for two-phase register access, compared to the pipeline of the original superscalar.

Figure 2: The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. (a) Fetch, Decode, Rename, Queue, Reg Read, Exec, Commit: misfetch penalty 2 cycles, register usage 4-cycle minimum, mispredict penalty 6 cycles. (b) Fetch, Decode, Rename, Queue, Reg Read, Reg Read, Exec, Reg Write, Commit: misfetch penalty 2 cycles, register usage 6-cycle minimum, mispredict penalty 7 cycles, misqueue penalty 4 cycles.

The two-stage register access has several ramifications on our architecture. First, it increases the pipeline distance between fetch and exec, increasing the branch misprediction penalty by 1 cycle. Second, it takes an extra cycle to write back results, requiring an extra level of bypass logic. Third, increasing the distance between queue and exec increases the period during which wrong-path instructions remain in the pipeline after a misprediction is discovered (the misqueue penalty in Figure 2). Wrong-path instructions are those instructions brought into the processor as a result of a branch misprediction. Those instructions consume instruction queue slots, renaming registers and possibly issue slots, all of which, on an SMT processor, could be used by other threads.

This pipeline does not increase the inter-instruction latency between most instructions. Dependent (single-cycle latency) instructions can still be issued on consecutive cycles, for example, as long as inter-instruction latencies are predetermined. That is the case for all instructions but loads. Since we are scheduling instructions a cycle earlier (relative to the exec cycle), load-hit latency increases by one cycle (to two cycles). Rather than suffer this penalty, we schedule load-dependent instructions assuming a 1-cycle data latency, but squash those instructions in the case of an L1 cache miss or a bank conflict. There are two performance costs to this solution, which we call optimistic issue. Optimistically issued instructions that get squashed waste issue slots, and optimistic instructions must still be held in the IQ an extra cycle after they are issued, until it is known that they won't be squashed.
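The load-hit trade-off behind optimistic issue can be sketched in a few lines. The following is a minimal illustration, not the simulator used in the paper; the class and function names are assumptions. It issues load dependents assuming the 1-cycle scheduled latency and squashes them if the load turns out to miss in the L1 or hit a bank conflict, as described above.

```python
# Minimal sketch of "optimistic issue" for load-dependent instructions.
# Hypothetical names; the actual pipeline is the one described in Section 2.

from dataclasses import dataclass

@dataclass
class Load:
    dest: str            # destination register
    l1_hit: bool         # known only once the cache access completes
    bank_conflict: bool = False

@dataclass
class Dependent:
    sources: tuple
    issued_optimistically: bool = False

def schedule_dependents(load, dependents, cycle):
    """Issue dependents assuming a 1-cycle load latency (optimistic issue)."""
    issued = []
    for instr in dependents:
        if load.dest in instr.sources:
            instr.issued_optimistically = True   # must stay in the IQ one extra cycle
            issued.append((cycle + 1, instr))    # scheduled back-to-back with the load
    return issued

def resolve(load, issued):
    """On an L1 miss or bank conflict, squash the optimistically issued dependents."""
    if load.l1_hit and not load.bank_conflict:
        return issued, []                         # optimism paid off
    squashed = [instr for _, instr in issued]     # wasted issue slots; reissue later
    return [], squashed
```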
The last implication of the two-phase register access is that there are two more stages between rename and commit, thus increasing the minimum time that a physical register is held by an in-flight instruction. This increases the pressure on the renaming register pool. We assume, for each machine size, enough physical registers to support all active threads, plus 100 more registers to enable renaming, both for the integer file and the floating point file; i.e., for the single-thread results, we model 132 physical integer registers, and for an 8-thread machine, 356. We expect that in the 3-5 year time-frame, the scheme we have described will remove register file access from the critical path for a 4-thread machine, but 8 threads will still be a significant challenge. Nonetheless, extending our results to an 8-thread machine allows us to see trends beyond the 4-thread numbers and anticipates other solutions to this problem. The number of registers available for renaming determines the number of instructions that can be in the processor between the rename stage and the commit stage.

This architecture allows us to address several concerns about simultaneous multithreaded processor design. In particular, this paper shows that:

- Instruction scheduling is no more complex than on a dynamically scheduled superscalar.
- Register file datapaths are no more complex than in the superscalar, and the performance implications of the register file and its extended pipeline are small.
- The required instruction fetch throughput is attainable, even without any increase in fetch bandwidth.
- Unmodified (for an SMT workload) cache and branch prediction structures do not thrash on that workload.
- Even aggressive superscalar technologies, such as dynamic scheduling and speculative execution, are not sufficient to take full advantage of a wide-issue processor without simultaneous multithreading.

We have only presented an outline of the hardware architecture to this point; the next section provides more detail.

2.1 Hardware Details

The processor contains 3 floating point functional units and 6 integer units; four of the six integer units also execute loads and stores. The peak issue bandwidth out of the two instruction queues is therefore nine; however, the throughput of the machine is bounded by the peak fetch and decode bandwidths, which are eight instructions per cycle. We assume that all functional units are completely pipelined. Table 1 shows the instruction latencies, which are derived from the Alpha 21164 [8].
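For reference, the machine parameters quoted so far can be gathered in one place. The sketch below only restates numbers given in this section (the functional-unit mix, fetch/decode width, and the register-file budget of 32 logical registers per thread plus 100 renaming registers per file); the variable names are ours, not the simulator's.

```python
# Summary of the simulated machine parameters quoted in Sections 2 and 2.1.
# Names are illustrative, not taken from the simulator.

LOGICAL_REGS_PER_THREAD = 32     # 32-register instruction set architecture
EXTRA_RENAMING_REGS     = 100    # per register file (integer and floating point)

def physical_registers(threads: int) -> int:
    """Physical registers per file: logical state for every thread plus renaming."""
    return threads * LOGICAL_REGS_PER_THREAD + EXTRA_RENAMING_REGS

assert physical_registers(1) == 132   # single-thread configuration modeled in the paper
assert physical_registers(8) == 356   # 8-thread configuration modeled in the paper

MACHINE = {
    "fp_units": 3,
    "int_units": 6,                 # four of the six also execute loads and stores
    "peak_issue_per_cycle": 9,      # out of the two instruction queues
    "fetch_decode_width": 8,        # the actual throughput bound
}
```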
er18,30.Returndestinationsarepredictedwitha12-entryreturnstack(percontext).Weassumeanefficient,butnotperfect,implementationofdy-namicmemorydisambiguation.Thisisemulatedbyusingonlypartoftheaddress(10bits)todisambiguatememoryreferences,sothatitisoccasionallyover-conservative.3MethodologyThemethodologyinthispapercloselyfollowsthesimulationandmeasurementmethodologyof27.Oursimulatorusesemulation-based,instruction-levelsimulation,andborrowssignificantlyfromMIPSI22,aMIPS-basedsimulator.Thesimulatorexecutesun-modifiedAlphaobjectcodeandmodelstheexecutionpipelines,memoryhierarchy,TLBs,andthebranchpredictionlogicoftheprocessordescribedinSection2.InanSMTprocessorabranchmispredictionintroduceswrong-pathinstructionsthatinteractwithinstructionsfromotherthreads.Tomodelthisbehavior,wefetchdownwrongpaths,introducethoseinstructionsintotheinstructionqueues,tracktheirdependences,andissuethem.Weeventuallysquashallwrong-pathinstructionsacycleafterabranchmispredictionisdiscoveredintheexecstage.Ourthroughputresultsonlycountusefulinstructions.OurworkloadcomesprimarilyfromtheSPEC92benchmarksuite7.Weusefivefloatingpointprograms(alvinn,doduc,fpppp,ora,andtomcatv)andtwointegerprograms(espressoandxlisp)fromthatsuite,andthedocumenttypesettingprogramTeX.Weassignadistinctprogramtoeachthreadintheprocessor:themulti-programmedworkloadstressesourarchitecturemorethanaparallelprogrambypresentingthreadswithwidelyvaryingprogramchar-acteristicsandwithnooverlapofcache,TLBorbranchpredictionusage.Toeliminatetheeffectsofbenchmarkdifferences,asingledatapointiscomposedof8runs,eachT*300millioninstructionsinlength,whereTisthenumberofthreads.Eachofthe8runsusesadifferentcombinationofthebenchmarks.WecompileeachprogramwiththeMultiflowtraceschedulingcompiler17,modifiedtoproduceAlphacode.Incontrastto27,weturnofftraceschedulinginthecompilerforthisstudy,fortworeasons.Inourmeasurements,wewanttodifferentiatebetweenusefulanduselessspeculativeinstructions,whichiseasywithhard-warespeculation,butnotpossibleforsoftwarespeculationwithoursystem.Also,softwarespeculationisnotasbeneficialonanar-chitecturewhichfeatureshardwarespeculation,andinsomecasesisharmful.However,theMultiflowcompilerisstillagoodchoiceforourcompilationengine,becauseofthehighqualityoftheloopunrolling,instructionschedulingandalignment,andotheropti-mizations,aswellastheeasewithwhichthemachinemodelcanbechanged.Thebenchmarksarecompiledtooptimizesingle-threadperformanceonthebasehardware.4PerformanceoftheBaseHardwareDesignInthissectionweexaminetheperformanceofthebasearchitec-tureandidentifyopportunitiesforimprovement.Figure3showsthatwithonlyasinglethreadrunningonourSMTarchitecture,thethroughputislessthan2%belowasuperscalarwithoutSMTsup-port.Thedropinthroughputisduetothelongerpipeline(describedinSection2)usedbytheSMTprocessor.Itspeakthroughputis84%higherthanthesuperscalar.Thisgainisachievedwithvirtuallynotuningofthebasearchitectureforsimultaneousmultithreading.Thisdesigncombineslowsingle-threadimpactwithhighspeedupforevenafewthreads,enablingsimultaneousmultithreadingtoreapbenefitseveninanenvironmentwheremultipleprocessesarerunningonlyasmallfractionofthetime.Wealsonote,however,thatthethroughputpeaksbefore8threads,andtheprocessoruti-lization,atlessthan50%ofthe8-issueprocessor,iswellshortofthepotentialshownin27.Wemakeseveralconclusionsaboutthepotentialbottlenecksofthissystemasweapproach8threads,aidedbyFigure3andTable3.Issuebandwidthisclearlynotabottleneck,asthethroughputrep-resentsafractionofavailableissuebandwidth,andourdatashowsthatnofunctionalunittypeisbeingoverloaded.Weappeartohaveenoughphysicalr
egisters.Thecachesandbranchpredictionlogicarebeingstressedmoreheavilyat8threads,butweexpectthelatency-hidingpotentialoftheadditionalthreadstomakeupforthosedrops.Theculpritappearstobeoneormoreofthefollowingthreeproblems:(1)IQsizeIQ-fullconditionsarecommon,12to21%ofcyclestotalforthetwoqueues;(2)fetchthroughputeveninthosecycleswherewedontexperienceanIQ-fullcondition,ourdatashowsthatwearesustainingonly4.2usefulinstructionsfetchedpercycle(4.5includingwrong-path);and(3)lackofpar-allelismalthoughthequeuesarereasonablyfull,wefindfewer12345Throughput(InstructionsPerCycle)2468NumberofThreadsUnmodifiedSuperscalarFigure3:Instructionthroughputforthebasehardwarearchi-tecture.thanfouroutof,onaverage,27instructionspercycletoissue.Weexpecteightthreadstoprovidemoreparallelism,soperhapswehavethewronginstructionsintheinstructionqueues.Therestofthispaperfocusesonimprovingthisbasearchitecture.ThenextsectionaddresseseachoftheproblemsidentifiedherewithdifferentfetchpoliciesandIQconfigurations.Section6examineswaystopreventissuewaste,andSection7re-examinestheimprovedarchitecturefornewbottlenecks,identifyingdirectionsforfurtherimprovement.5TheFetchUnitInSearchofUsefulIn-structionsInthissectionweexaminewaystoimprovefetchthroughputwithoutincreasingthefetchbandwidth.OurSMTarchitecturesharesasinglefetchunitamongeightthreads.Wecanexploitthehighlevelofcompetitionforthefetchunitintwowaysnotpossiblewithsingle-threadedprocessors:(1)thefetchunitcanfetchfrommultiplethreadsatonce,increasingourutilizationofthefetchbandwidth,and(2)itcanbeselectiveaboutwhichthreadorthreadstofetchfrom.Becausenotallpathsprovideequallyusefulinstructionsinaparticularcycle,anSMTprocessorcanbenefitbyfetchingfromthethread(s)thatwillprovidethebestinstructions.Weexamineavarietyoffetcharchitecturesandfetchpoliciesthatexploitthoseadvantages.Specifically,theyattempttoimprovefetchthroughputbyaddressingthreefactors:fetchefficiency,bypartitioningthefetchunitamongthreads(Section5.1);fetchef-fectiveness,byimprovingthequalityoftheinstructionsfetched(Section5.2);andfetchavailability,byeliminatingconditionsthatblockthefetchunit(Section5.3).5.1PartitioningtheFetchUnitRecallthatourbaselinearchitecturefetchesuptoeightinstructionsfromonethreadeachcycle.Thefrequencyofbranchesintypicalinstructionstreamsandthemisalignmentofbranchdestinationsmakeitdifficulttofilltheentirefetchbandwidthfromonethread,NumberofThreadsMetric148out-of-registers(%ofcycles)3%7%3%Icachemissrate2.5%7.8%14.1%-missesperthousandinstructions61729Dcachemissrate3.1%6.5%11.3%-missesperthousandinstructions122543L2cachemissrate17.6%15.0%12.5%-missesperthousandinstructions359L3cachemissrate55.1%33.6%45.4%-missesperthousandinstructions134branchmispredictionrate5.0%7.4%9.1%jumpmispredictionrate2.2%6.4%12.9%integerIQ-full(%ofcycles)7%10%9%fpIQ-full(%ofcycles)14%9%3%avg(combined)queuepopulation252527wrong-pathinstructionsfetched24%7%7%wrong-pathinstructionsissued9%4%3%Table3:Theresultofincreasedmultithreadingonsomelow-levelmetricsforthebasearchitecture.evenforsmallerblocksizes5,24.Inthisprocessor,wecanspreadtheburdenoffillingthefetchbandwidthamongmultiplethreads.Forexample,theprobabilityoffindingfourinstructionsfromeachoftwothreadsshouldbegreaterthanthatoffindingeightfromonethread.Inthissection,weattempttoreducefetchblockfragmentation(ourtermforthevariousfactorsthatpreventusfromfetchingthemaximumnumberofinstructions)byfetchingfrommultiplethreadseachcycle,whilekeepingthemaximumfetchbandwidth(butnotnecessarilytheIcachebandwidth)constant.Weevaluateseveralfetchingschemes,whicharelabeledalg.num1.num2,wherealgisthefetchse
lectionmethod(inthissectionthreadsarealwaysselectedusingaround-robinpriorityscheme),num1isthenumberofthreadsthatcanfetchin1cycle,andnum2isthemaximumnumberofinstructionsfetchedperthreadin1cycle.Themaximumnumberoftotalinstructionsfetchedisalwayslimitedtoeight.Foreachofthefetchpartitioningpolicies,thecacheisalways32kilobytesorganizedinto8databanks;agivenbankcandojustoneaccesspercycle.RR.1.8ThisisthebaselineschemefromSection4.Eachcycleonethreadfetchesasmanyaseightinstructions.Thethreadisdeterminedbyaround-robinpriorityschemefromamongthosenotcurrentlysufferinganIcachemiss.InthisschemetheIcacheisindistinguishablefromthatonasingle-threadedsuperscalar.Eachcachebankhasitsownaddressdecoderandoutputdrivers;eachcycle,onlyoneofthebanksdrivesthecacheoutputbus,whichis8instructions(32bytes)wide.RR.2.4,RR.4.2Theseschemesfetchfewerinstructionsperthreadfrommorethreads(foureachfromtwothreads,ortwoeachfromfourthreads).Ifwetrytopartitionthefetchbandwidthtoofinely,however,wemaysufferthreadshortage,wherefewerthreadsareavailablethanarerequiredtofillthefetchbandwidth.Fortheseschemes,multiplecacheaddressesaredriventoeachcachedatabank,eachofwhichnowhasamultiplexerbeforeitsaddressdecoder,toselectonecacheindexpercycle.Sincethecachebanksaresingle-ported,bank-conflictlogicisneededtoensurethateachaddresstargetsaseparatebank.RR.2.4hastwocacheoutput012345Throughput(IPC)12468NumberofThreadsRR.2.8RR.4.2RR.2.4RR.1.8Figure4:Instructionthroughputforthedifferentinstructioncacheinterfaceswithround-robininstructionscheduling.buses,eachfourinstructionswide,whileRR.4.2hasfouroutputbuses,eachtwoinstructionswide.Forbothschemes,thetotalwidthoftheoutputbusesis8instructions(identicaltothatinRR.1.8),butadditionalcircuitryisneededsoeachbankiscapableofdrivinganyofthemultiple(nowsmaller)outputbuses,andisabletoselectoneornonetodriveinagivencycle.Also,thecachetagstorelogicmustbereplicatedormultiple-portedinordertocalculatehit/missforeachaddresslookeduppercycle.Thus,thehardwareadditionsare:theaddressmux;multiplead-dressbuses;selectionlogicontheoutputdrivers;thebankconflictlogic;andmultiplehit/misscalculations.ThechangesrequiredforRR.2.4wouldhaveanegligibleimpactonareaandcacheaccesstime.ThechangesforRR.4.2aremoreextensive,andwouldbemoredifficulttodowithoutaffectingareaoraccesstime.Theseschemesactuallyreducethelatencyinthedecodeandrenamestages,asthemaximumlengthofdependencychainsamongfetchedin-structionsisreducedbyafactorof2and4,respectively.RR.2.8Thisschemeattacksfetchblockfragmentationwithoutsufferingfromthreadshortagebyfetchingeightinstructionsmoreflexiblyfromtwothreads.Thiscanbeimplementedbyreadinganeight-instructionblockforeachthread(16instructionstotal),thencombiningthem.Wetakeasmanyinstructionsaspossiblefromthefirstthread,thenfillinwithinstructionsfromthesecond,uptoeighttotal.LikeRR.2.4,twoaddressesmustberoutedtoeachcachebank,thenmultiplexedbeforethedecoder;bank-conflictlogicandtwohit/misscalculationspercyclearenecessary;andeachbankdrivesoneofthetwooutputbuses.Now,however,eachoutputbusiseightinstructionswide,whichdoublesthebandwidthoutofthecachecomparedtoanyofthepreviousschemes.Thiscouldbedonewithoutgreatlyaffectingareaorcycletime,astheadditionalbussingcouldprobablybedonewithoutexpandingthecachelayout.Inaddition,logictoselectandcombinetheinstructionsisnecessary,whichmightormightnotrequireanadditionalpipestage.Oursimulationsassumeitdoesnot.Figure4showsthatwecangethighermaximumthroughputbysplittingthefetchovermultiplethreads.Forexample,theRR.2.4schemeoutperformsRR.1.8at8threadsby9%.However,bettermaximumthroughputcomesatthecostofa12%single-thre
adpenalty;infact,RR.2.4doesnotsurpassRR.1.8until4threads.TheRR.4.2schemeneeds6threadstosurpassRR.1.8andnevercatchesthe2-threadschemes,sufferingfromthreadshortage.TheRR.2.8schemeprovidesthebestofbothworlds:few-threadsperformancelikeRR.1.8andmany-threadsperformancelikeRR.2.4.However,thehigherthroughputofthisschemeputsmorepressureontheinstructionqueues,causingIQ-fullconditionsatarateof18%(integer)and8%(fp)with8threads.WiththeRR.2.8schemewehaveimprovedthemaximumthroughputby10%withoutcompromisingsingle-threadperfor-mance.Thiswasachievedbyacombinationof(1)partitioningthefetchbandwidthovermultiplethreads,and(2)makingthatpar-titionflexible.Thisisthesameapproach(althoughinamorelimitedfashionhere)thatsimultaneousmultithreadingusestoimprovethethroughputofthefunctionalunits27.5.2ExploitingThreadChoiceintheFetchUnitTheefficiencyoftheentireprocessorisaffectedbythequalityofinstructionsfetched.Amultithreadedprocessorhasauniqueabilitytocontrolthatfactor.Inthissection,weexaminefetchingpoliciesaimedatidentifyingthe“best”threadorthreadsavailabletofetcheachcycle.Twofactorsmakeonethreadlessdesirablethananother.Thefirstistheprobabilitythatathreadisfollowingawrongpathasaresultofanearlierbranchmisprediction.Wrong-pathinstructionsconsumenotonlyfetchbandwidth,butalsoregisters,IQspace,andpossiblyissuebandwidth.Thesecondfactoristhelengthoftimethefetchedinstructionswillbeinthequeuebeforebecomingissuable.Wemaximizethethroughputofaqueueofboundedsizebyfeedingitinstructionsthatwillspendtheleasttimeinthequeue.Ifwefetchtoomanyinstructionsthatblockforalongtime,weeventuallyfilltheIQwithunissuableinstructions,aconditionwecallIQclog.Thisrestrictsbothfetchandissuethroughput,causingthefetchunittogoidleandpreventingissuableinstructionsfromgettingintotheIQ.Bothofthesefactors(wrong-pathprobabilityandexpectedqueuetime)improveovertime,soathreadbecomesmoredesirableaswedelayfetchingit.Wedefineseveralfetchpolicies,eachofwhichattemptstoim-proveontheround-robinprioritypolicyusingfeedbackfromotherpartsoftheprocessor.Thefirstattackswrong-pathfetching,theothersattackIQclog.Theyare:BRCOUNTHereweattempttogivehighestprioritytothosethreadsthatareleastlikelytobeonawrongpath.Wedothisbycountingbranchinstructionsthatareinthedecodestage,therenamestage,andtheinstructionqueues,favoringthosewiththefewestunresolvedbranches.MISSCOUNTThispolicydetectsanimportantcauseofIQclog.AlongmemorylatencycancausedependentinstructionstobackupintheIQwaitingfortheloadtocomplete,eventuallyfillingthequeuewithblockedinstructionsfromonethread.ThispolicypreventsthatbygivingprioritytothosethreadsthathavethefewestoutstandingDcachemisses.ICOUNTThisisamoregeneralsolutiontoIQclog.Herepriorityisgiventothreadswiththefewestinstructionsindecode,rename,andtheinstructionqueues.Thisachievesthreepurposes:(1)itpreventsanyonethreadfromfillingtheIQ,(2)itgiveshighest01234524680123452468Throughput(IPC)NumberofThreadsIQPOSN.1.8ICOUNT.1.8MISSCOUNT.1.8BRCOUNT.1.8RR.1.8IQPOSN.2.8ICOUNT.2.8MISSCOUNT.2.8BRCOUNT.2.8RR.2.8NumberofThreadsFigure5:Instructionthroughputforfetchingbasedonseveralpriorityheuristics,allcomparedtothebaselineround-robinscheme.Theresultsfor1threadarethesameforallschemes,andthusnotshown.prioritytothreadsthataremovinginstructionsthroughtheIQmostefficiently,and(3)itprovidesamoreevenmixofinstructionsfromtheavailablethreads,maximizingtheparallelisminthequeues.IfcachemissesarethedominantcauseofIQclog,MISSCOUNTmayperformbetter,sinceitgetscachemissfeedbacktothefetchunitmorequickly.Ifthecausesaremorevaried,ICOUNTshouldperformbetter.IQPOSNLikeICOUNT,IQPOSNstrivestominimizeIQclogandbiastoward
efficientthreads.Itgiveslowestprioritytothosethreadswithinstructionsclosesttotheheadofeithertheintegerorfloatingpointinstructionqueues(theoldestinstructionisattheheadofthequeue).ThreadswiththeoldestinstructionswillbemostpronetoIQclog,andthosemakingthebestprogresswillhaveinstructionsfarthestfromtheheadofthequeue.Thispolicydoesnotrequireacounterforeachthread,asdothepreviousthreepolicies.Likeanycontrolsystem,theefficiencyofthesemechanismsislimitedbythefeedbacklatencyresulting,inthiscase,fromfeedingdatafromlaterpipelinestagesbacktothefetchstage.Forexample,bythetimeinstructionsenterthequeuestageortheexecstage,theinformationusedtofetchthemisthreeor(atleast)sixcyclesold,respectively.Boththebranch-countingandthemiss-countingpoliciestendtoproducefrequentties.Inthosecases,thetie-breakerisround-robinpriority.Figure5showsthatallofthefetchheuristicsprovidespeedupoverround-robin.Branchcountingandcache-misscountingpro-videmoderatespeedups,butonlywhentheprocessorissaturatedwithmanythreads.Instructioncounting,incontrast,producesmoresignificantimprovementsregardlessofthenumberofthreads.IQ-POSNprovidessimilarresultstoICOUNT,beingwithin4%atalltimes,butneverexceedingit.Thebranch-countingheuristicdoeseverythingweaskofit.Itreduceswrong-pathinstructions,from8.2%offetchedinstructionsto3.6%,andfrom3.6%ofissuedinstructionsto0.8%(RR.1.8vs.BRCOUNT.1.8witheightthreads).Anditimprovesthroughputbyasmuchas8%.Itsweaknessisthatthewrong-pathproblemitsolvesisnotlargeonthisprocessor,whichhasalreadyattackedtheproblemwithsimultaneousmultithreading.EvenwiththeRRscheme,simultaneousmultithreadingreducesfetchedwrong-pathinstructionsfrom16%withonethreadto8%with8threads.Cachemisscountingalsoachievesthroughputgainsashighas8%overRR,butingeneralthegainsaremuchlower.ItisnotparticularlyeffectiveatreducingIQclog,aswegetIQ-fullcondi-tions12%ofthetimeontheintegerqueueand14%onthefloatingpointqueue(forMISSCOUNT.2.8with8threads).TheseresultsindicatethatIQclogismorethansimplytheresultoflongmemorylatencies.18ThreadsMetricThreadRRICOUNTintegerIQ-full(%ofcycles)7%18%6%fpIQ-full(%ofcycles)14%8%1%avgqueuepopulation253830out-of-registers(%ofcycles)3%8%5%Table4:Somelow-levelmetricsfortheround-robinandinstruction-countingprioritypolicies(andthe2.8fetchparti-tioningscheme).Theinstruction-countingheuristicprovidesinstructionthrough-putashighas5.3instructionspercycle,athroughputgainovertheunmodifiedsuperscalarof2.5.Itoutperformsthebestround-robinresultby23%.Instructioncountingisaseffectiveat2and4threads(inbenefitoverround-robin)asitisat8threads.Itnearlyelimi-natesIQclog(seeIQ-fullresultsinTable4)andgreatlyimprovesthemixofinstructionsinthequeues(wearefindingmoreissuableinstructionsdespitehavingfewerinstructionsinthetwoqueues).Intelligentfetchingwiththisheuristicisofgreaterbenefitthanpar-titioningthefetchunit,astheICOUNT.1.8schemeconsistently012345Throughput(IPC)12468NumberofThreads01234512468NumberofThreadsITAG,ICOUNT.1.8BIGQ,ICOUNT.1.8ICOUNT.1.8ITAG,ICOUNT.2.8BIGQ,ICOUNT.2.8ICOUNT.2.8Figure6:Instructionthroughputforthe64-entryqueueandearlyIcachetaglookup,whencoupledwiththeICOUNTfetchpolicy.outperformsRR.2.8.Table4pointstoasurprisingresult.AsaresultofsimultaneousmultithreadedinstructionissueandtheICOUNTfetchheuristics,weactuallyputlesspressureonthesameinstructionqueuewitheightthreadsthanwithone,havingsharplyreducedIQ-fullconditions.Italsoreducespressureontheregisterfile(vs.RR)bykeepingfewerinstructionsinthequeue.BRCOUNTandICOUNTeachsolvedifferentproblems,andperhapsthebestperformancecouldbeachievedfromaweightedcombinationofthem;however,thecomplexityofthefeedbac
kmechanismincreasesasaresult.Byitself,instructioncountingclearlyprovidesthebestgains.Givenourmeasurementmethodology,itispossiblethatthethroughputincreasescouldbeoverstatedifafetchpolicysimplyfavorsthosethreadswiththemostinherentinstruction-levelparal-lelismorthebestcachebehavior,thusachievingimprovementsthatwouldnotbeseeninpractice.However,withtheICOUNT.2.8pol-icy,theoppositehappens.Ourresultsshowthatthisschemefavorsthreadswithlowersingle-threadILP,thusitsresultsincludeahighersampleofinstructionsfromtheslowthreadsthaneitherthesuper-scalarresultsortheRRresults.Ifanything,then,theICOUNT.2.8improvementsareunderstated.Insummary,wehaveidentifiedasimpleheuristicthatisverysuc-cessfulatidentifyingthebestthreadstofetch.Instructioncountingdynamicallybiasestowardthreadsthatwilluseprocessorresourcesmostefficiently,therebyimprovingprocessorthroughputaswellasrelievingpressureonscarceprocessorresources:theinstructionqueuesandtheregisters.5.3UnblockingtheFetchUnitByfetchingfrommultiplethreadsandusingintelligentfetchheuris-tics,wehavesignificantlyincreasedfetchthroughputandefficiency.Themoreefficientlyweareusingthefetchunit,themorewestandtolosewhenitbecomesblocked.Inthissectionweexamineschemesthatpreventtwoconditionsthatcausethefetchunittomissfetchopportunities,specificallyIQ-fullconditionsandIcachemisses.Thetwoschemesare:BIGQTheprimaryrestrictiononIQsizeisnotthechiparea,butthetimetosearchit;thereforewecanincreaseitssizeaslongaswedontincreasethesearchspace.Inthisscheme,wedoublethesizesoftheinstructionqueues,butonlysearchthefirst32entriesforissue.ThisschemeallowsthequeuestobufferinstructionsfromthefetchunitwhentheIQoverflows.ITAGWhenathreadisselectedforfetchingbutexperiencesacachemiss,welosetheopportunitytofetchthatcycle.IfwedotheIcachetaglookupsacycleearly,wecanfetcharoundcachemisses:cachemissaccessesarestillstartedimmediately,butonlynon-missingthreadsarechosenforfetch.Becauseweneedtohavethefetchaddressacycleearly,weessentiallyaddastagetothefrontofthepipeline,increasingthemisfetchandmispredictpenalties.ThisschemerequiresoneormoreadditionalportsontheIcachetags,sothatpotentialreplacementthreadscanbelookedupatthesametime.AlthoughtheBIGQschemeimprovestheperformanceoftheround-robinscheme(notshown),1.5-2%acrosstheboard,Figure6showsthatthebiggerqueuesaddnosignificantimprovementtotheICOUNTpolicy.Infact,itisactuallydetrimentalforseveralthreadconfigurations.Thisisbecausethebufferingeffectofthebigqueueschemebringsinstructionsintotheissuablepartoftheinstructionqueuethatmayhavebeenfetchedmanycyclesearlier,usingpriorityinformationthatisnowout-of-date.Theresultsindicatethatusingup-to-datepriorityinformationismoreimportantthanbuffering.Theseresultsshowthatintelligentfetchheuristicshavemadetheextrainstructionqueuehardwareunnecessary.ThebiggerqueuebyitselfisactuallylesseffectiveatreducingIQclogthantheICOUNTscheme.With8threads,thebiggerqueuesalone(BIGQ,RR.2.8)reduceIQ-fullconditionsto11%(integer)and0%(fp),whilein-structioncountingalone(ICOUNT.2.8)reducesthemto6%and1%.CombiningBIGQandICOUNTdropsthemto3%and0%.EarlyIcachetaglookupbooststhroughputasmuchas8%IssueNumberofThreadsUselessInstructionsMethod12468wrong-pathoptimisticOLDEST2.103.304.625.095.294%3%OPTLAST2.073.304.595.095.294%2%SPECLAST2.103.314.595.095.294%3%BRANCHFIRST2.073.294.585.085.284%6%Table5:Instructionthroughput(instructionspercycle)fortheissuepriorityschemes,andthepercentageofuselessinstructionsissuedwhenrunningwith8threads.overICOUNT.Itismosteffectivewhenfetchingonethread(ICOUNT.1.8,wherethecostofalostfetchslotisgreater).How-ever,itimprovestheICOUNT.2.8resultsn
omorethan2%,astheflexibilityofthe2.8schemealreadyhidessomeofthelostfetchbandwidth.Inaddition,ITAGlowersthroughputwithfewthreads,wherecompetitionforthefetchslotsislowandthecostofthelongermispredictionpenaltyishighest.Usingacombinationofpartitioningthefetchunit,intelligentfetching,andearlyIcachetaglookups,wehaveraisedthepeakperformanceofthebaseSMTarchitectureby37%(5.4instructionspercyclevs.3.9).Ourmaximumspeeduprelativetoaconventionalsuperscalarhasgoneupproportionately,from1.8to2.5timesthethroughput.Thatgaincomesfromexploitingcharacteristicsofasimultaneousmultithreadingprocessornotavailabletoasingle-threadedmachine.