Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer†, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm†
Dept of Computer Science and Engineering, University of Washington, Box 352350, Seattle, WA 98195-2350
†Digital Equipment Corporation, HLO2-3/J3, 77 Reed Road, Hudson, MA 01749

In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996.

Abstract

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.

1 Introduction

Simultaneous multithreading (SMT) is a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units. SMT combines the multiple-instruction-issue features of modern superscalars with the latency-hiding ability of multithreaded architectures. Unlike conventional multithreaded architectures [1,2,15,23], which depend on fast context switching to share processor execution resources, all hardware contexts in an SMT processor are active simultaneously, competing each cycle for all available resources. This dynamic sharing of the functional units allows simultaneous multithreading to substantially increase throughput, attacking the two major impediments to processor utilization: long latencies and limited per-thread parallelism. Tullsen, et al. [27] showed the potential of an SMT processor to achieve significantly higher throughput than either a wide superscalar or a multithreaded processor. That paper also demonstrated the advantages of simultaneous multithreading over multiple processors on a single chip, due to SMT's ability to dynamically assign execution resources where needed each cycle.

Those results showed SMT's potential based on a somewhat idealized model. This paper extends that work in four significant ways. First, we demonstrate that the throughput gains of simultaneous multithreading are possible without extensive changes to a conventional, wide-issue superscalar processor. We propose an architecture that is more comprehensive, realistic, and heavily leveraged off existing superscalar technology. Our simulations show that a minimal implementation of simultaneous multithreading achieves throughput 1.8 times that of the unmodified superscalar; small tuning of this architecture increases that gain to 2.5 (reaching throughput as high as 5.4 instructions per cycle). Second, we show that SMT need not compromise single-thread performance. Third, we use our more detailed architectural model to analyze and relieve bottlenecks that did not exist in the more idealized model. Fourth, we show how simultaneous multithreading creates an advantage previously unexploitable in other architectures: namely, the ability to choose the "best" instructions, from all threads, for both fetch and issue each cycle. By favoring the threads most efficiently using the processor, we can boost the throughput of our limited resources.
We present several simple heuristics for this selection process, and demonstrate how such heuristics, when applied to the fetch mechanism, can increase throughput by as much as 37%.

This paper is organized as follows. Section 2 presents our baseline simultaneous multithreading architecture, comparing it with existing superscalar technology. Section 3 describes our simulator and our workload, and Section 4 shows the performance of the baseline architecture. In Section 5, we examine the instruction fetch process, present several heuristics for improving it based on intelligent instruction selection, and give performance results to differentiate those heuristics. Section 6 examines the instruction issue process in a similar way. We then use the best designs chosen from our fetch and issue studies in Section 7 as a basis to discover bottlenecks for further performance improvement. We discuss related work in Section 8 and summarize our results in Section 9.

(This research was supported by ONR grants N00014-92-J-1395 and N00014-94-1-1136, NSF grants CCR-9200832 and CDA-9123308, NSF PYI Award MIP-9058439, the Washington Technology Center, Digital Equipment Corporation, and fellowships from Intel and the Computer Measurement Group.)

[Figure 1: Our base simultaneous multithreading hardware architecture. The block diagram shows a fetch unit with program counters, an 8-wide instruction cache, decode, register renaming, integer and floating point instruction queues, integer and fp registers, fp units, int/ld-store units, and a data cache.]

2 A Simultaneous Multithreading Processor Architecture

In this section we present the architecture of our simultaneous multithreading processor. We show that the throughput gains provided by simultaneous multithreading are possible without adding undue complexity to a conventional superscalar processor design.

Our SMT architecture is derived from a high-performance, out-of-order, superscalar architecture (Figure 1, without the extra program counters) which represents a projection of current superscalar design trends 3-5 years into the future. This superscalar processor fetches up to eight instructions per cycle; fetching is controlled by a conventional system of branch target buffer, branch prediction, and subroutine return stacks. Fetched instructions are then decoded and passed to the register renaming logic, which maps logical registers onto a pool of physical registers, removing false dependences. Instructions are then placed in one of two instruction queues. Those instruction queues are similar to the ones used by the MIPS R10000 [20] and the HP PA-8000 [21], in this case holding instructions until they are issued. Instructions are issued to the functional units out of order when their operands are available. After completing execution, instructions are retired in order, freeing physical registers that are no longer needed.

Our SMT architecture is a straightforward extension to this conventional superscalar design. We made changes only when necessary to enable simultaneous multithreading, and in general, structures were not replicated or resized to support SMT or a multithreaded workload. Thus, nearly all hardware resources remain completely available even when there is only a single thread in the system. The changes necessary to support simultaneous multithreading on that architecture are:
- multiple program counters and some mechanism by which the fetch unit selects one each cycle,
- a separate return stack for each thread for predicting subroutine return destinations,
- per-thread instruction retirement, instruction queue flush, and trap mechanisms,
- a thread id with each branch target buffer entry to avoid predicting phantom branches, and
- a larger register file, to support logical registers for all threads plus additional registers for register renaming.

The size of the register file affects the pipeline (we add two extra stages) and the scheduling of load-dependent instructions, which we discuss later in this section.

Noticeably absent from this list is a mechanism to enable simultaneous multithreaded scheduling of instructions onto the functional units. Because any apparent dependences between instructions from different threads are removed by the register renaming phase, a conventional instruction queue (IQ) designed for dynamic scheduling contains all of the functionality necessary for simultaneous multithreading. The instruction queue is shared by all threads and an instruction from any thread in the queue can issue when its operands are available.
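The effect of renaming on inter-thread dependences can be illustrated with a short sketch. This is a minimal model, not the processor's renaming hardware, and the class and method names are invented for illustration: each (thread, logical register) pair is mapped to its own physical register from a single shared pool, so renamed instructions from different threads can never appear to depend on one another.

```python
# Minimal sketch (assumed, not the paper's implementation) of why renaming
# removes cross-thread dependences: each (thread, logical register) pair maps
# to its own physical register drawn from one shared pool.

class RenameTable:
    def __init__(self, num_physical):
        self.free = list(range(num_physical))   # shared physical register pool
        self.map = {}                           # (thread, logical) -> physical

    def write(self, thread, logical):
        """Rename a destination register; allocate a fresh physical register."""
        phys = self.free.pop(0)
        self.map[(thread, logical)] = phys
        return phys

    def read(self, thread, logical):
        """Source operands are looked up per thread."""
        return self.map[(thread, logical)]

rt = RenameTable(num_physical=356)    # 8 threads * 32 + 100 renaming registers
r_t0 = rt.write(thread=0, logical=1)  # thread 0 writes its r1
r_t1 = rt.write(thread=1, logical=1)  # thread 1 writes its r1: a different physical reg
assert r_t0 != r_t1                   # no false dependence between the threads
```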
We fetch from one program counter (PC) each cycle. The PC is chosen, in round-robin order, from among those threads not already experiencing an I cache miss. This scheme provides simultaneous multithreading at the point of issue, but only fine-grain multithreading of the fetch unit. We will look in Section 5 at ways to extend simultaneous multithreading to the fetch unit. We also investigate alternative thread priority mechanisms for fetching.

A primary impact of multithreading on our architecture is on the size of the register file. We have a single register file, as thread-specific logical registers are mapped onto a completely shared physical register file by the register renaming. To support eight threads, we need a minimum of 8*32 = 256 physical integer registers (for a 32-register instruction set architecture), plus more to enable register renaming. Access to such a large register file will be slow, almost certainly affecting the cycle time of the machine.

To account for the size of the register file, we take two cycles to read registers instead of one. In the first cycle values are read into a buffer closer to the functional units. The instruction is sent to a similar buffer at the same time. The next cycle the data is sent to a functional unit for execution. Writes to the register file are treated similarly, requiring an extra register write stage. Figure 2 shows the pipeline modified for two-phase register access, compared to the pipeline of the original superscalar.

[Figure 2: The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines. (a) Fetch, Decode, Rename, Queue, Reg Read, Exec, Commit: misfetch penalty 2 cycles, minimum register usage 4 cycles, mispredict penalty 6 cycles. (b) Fetch, Decode, Rename, Queue, Reg Read, Reg Read, Exec, Reg Write, Commit: misfetch penalty 2 cycles, minimum register usage 6 cycles, mispredict penalty 7 cycles, misqueue penalty 4 cycles.]

The two-stage register access has several ramifications on our architecture. First, it increases the pipeline distance between fetch and exec, increasing the branch misprediction penalty by 1 cycle. Second, it takes an extra cycle to write back results, requiring an extra level of bypass logic. Third, increasing the distance between queue and exec increases the period during which wrong-path instructions remain in the pipeline after a misprediction is discovered (the misqueue penalty in Figure 2). Wrong-path instructions are those instructions brought into the processor as a result of a branch misprediction. Those instructions consume instruction queue slots, renaming registers and possibly issue slots, all of which, on an SMT processor, could be used by other threads.

This pipeline does not increase the inter-instruction latency between most instructions. Dependent (single-cycle latency) instructions can still be issued on consecutive cycles, for example, as long as inter-instruction latencies are predetermined. That is the case for all instructions but loads. Since we are scheduling instructions a cycle earlier (relative to the exec cycle), load hit latency increases by one cycle (to two cycles). Rather than suffer this penalty, we schedule load-dependent instructions assuming a 1-cycle data latency, but squash those instructions in the case of an L1 cache miss or a bank conflict. There are two performance costs to this solution, which we call "optimistic" issue. Optimistically issued instructions that get squashed waste issue slots, and optimistic instructions must still be held in the IQ an extra cycle after they are issued, until it is known that they won't be squashed.
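The optimistic issue policy just described can be summarized by a small sketch (ours, not the simulator's code): dependents of a load are issued assuming a one-cycle hit latency and are squashed for re-issue when the load misses or hits a bank conflict, wasting their issue slots. Function and variable names are hypothetical.

```python
# A sketch (assumed) of "optimistic" issue: load-dependent instructions are
# issued assuming a 1-cycle (hit) load latency and are squashed back for
# re-issue if the load misses in the L1 D cache or suffers a bank conflict.

def issue_load_dependents(load_hit: bool, dependents: list[str]):
    """Return (completed, squashed) lists for one load's dependent instructions."""
    if load_hit:
        return list(dependents), []     # the optimistic schedule was correct
    # miss or bank conflict: the issue slots used by these instructions are
    # wasted this cycle, and they must be re-issued once the load data returns
    return [], list(dependents)

done, redo = issue_load_dependents(False, ["add r3,r1,r2", "sub r4,r3,r5"])
print("completed:", done, "squashed for re-issue:", redo)
```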
The last implication of the two-phase register access is that there are two more stages between rename and commit, thus increasing the minimum time that a physical register is held by an in-flight instruction. This increases the pressure on the renaming register pool. We assume, for each machine size, enough physical registers to support all active threads, plus 100 more registers to enable renaming, both for the integer file and the floating point file (i.e., for the single-thread results, we model 132 physical integer registers, and for an 8-thread machine, 356). We expect that in the 3-5 year time frame, the scheme we have described will remove register file access from the critical path for a 4-thread machine, but 8 threads will still be a significant challenge. Nonetheless, extending our results to an 8-thread machine allows us to see trends beyond the 4-thread numbers and anticipates other solutions to this problem. The number of registers available for renaming determines the number of instructions that can be in the processor between the rename stage and the commit stage.

This architecture allows us to address several concerns about simultaneous multithreaded processor design. In particular, this paper shows that:
- Instruction scheduling is no more complex than on a dynamically scheduled superscalar.
- Register file data paths are no more complex than in the superscalar, and the performance implications of the register file and its extended pipeline are small.
- The required instruction fetch throughput is attainable, even without any increase in fetch bandwidth.
- Unmodified (for an SMT workload) cache and branch prediction structures do not thrash on that workload.
- Even aggressive superscalar technologies, such as dynamic scheduling and speculative execution, are not sufficient to take full advantage of a wide-issue processor without simultaneous multithreading.

We have only presented an outline of the hardware architecture to this point; the next section provides more detail.

2.1 Hardware Details

The processor contains 3 floating point functional units and 6 integer units; four of the six integer units also execute loads and stores. The peak issue bandwidth out of the two instruction queues is therefore nine; however, the throughput of the machine is bounded by the peak fetch and decode bandwidths, which are eight instructions per cycle. We assume that all functional units are completely pipelined. Table 1 shows the instruction latencies, which are derived from the Alpha 21164 [8].

Table 1: Simulated instruction latencies
  Instruction Class    Latency
  integer multiply     8, 16
  conditional move     2
  compare              0
  all other integer    1
  FP divide            17, 30
  all other FP         4
  load (cache hit)     1

We assume a 32-entry integer instruction queue (which handles integer instructions and all load/store operations) and a 32-entry floating point queue, not significantly larger than the HP PA-8000 [21], which has two 28-entry queues.

The caches (Table 2) are multi-ported by interleaving them into banks, similar to the design of Sohi and Franklin [26]. We model lockup-free caches and TLBs. TLB misses require two full memory accesses and no execution resources. We model the memory subsystem in great detail, simulating bandwidth limitations and access conflicts at multiple levels of the hierarchy, to address the concern that memory throughput could be a limiting condition for simultaneous multithreading.

Table 2: Details of the cache hierarchy
                          ICache     DCache    L2        L3
  Size                    32 KB      32 KB     256 KB    2 MB
  Associativity           DM         DM        4-way     DM
  Line size               64         64        64        64
  Banks                   8          8         8         1
  Transfer time           1 cycle    1         1         4
  Accesses/cycle          var (1-4)  4         1         1/4
  Cache fill time         2 cycles   2         2         8
  Latency to next level   6          6         12        62

Each cycle, one thread is given control of the fetch unit, chosen from among those not stalled for an instruction cache (I cache) miss. If we fetch from multiple threads, we never attempt to fetch from threads that conflict (on an I cache bank) with each other, although they may conflict with other I cache activity (cache fills).

Branch prediction is provided by a decoupled branch target buffer (BTB) and pattern history table (PHT) scheme [4]. We use a 256-entry BTB, organized as four-way set associative. The 2K x 2-bit PHT is accessed by the XOR of the lower bits of the address and the global history register [18,30]. Return destinations are predicted with a 12-entry return stack per context.
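The PHT indexing described above (a gshare-style XOR of low address bits with the global history) can be sketched as follows. This is an illustrative model with assumed details (for example, which address bits are used and the initial counter state), not the exact predictor hardware.

```python
# Sketch of the PHT indexing described above: a 2K-entry table of 2-bit
# saturating counters, indexed by XORing low branch-address bits with the
# global history register. Details not stated in the text are assumptions.

PHT_ENTRIES = 2048                 # 2K entries of 2-bit counters
pht = [1] * PHT_ENTRIES            # start weakly not-taken (assumption)
global_history = 0                 # last 11 branch outcomes, one bit each

def pht_index(branch_pc: int) -> int:
    return ((branch_pc >> 2) ^ global_history) & (PHT_ENTRIES - 1)

def predict(branch_pc: int) -> bool:
    return pht[pht_index(branch_pc)] >= 2      # counters 2,3 predict taken

def update(branch_pc: int, taken: bool) -> None:
    global global_history
    i = pht_index(branch_pc)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)
    global_history = ((global_history << 1) | int(taken)) & (PHT_ENTRIES - 1)

update(0x1200, True)
print(predict(0x1200))
```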
We assume an efficient, but not perfect, implementation of dynamic memory disambiguation. This is emulated by using only part of the address (10 bits) to disambiguate memory references, so that it is occasionally over-conservative.

3 Methodology

The methodology in this paper closely follows the simulation and measurement methodology of [27]. Our simulator uses emulation-based, instruction-level simulation, and borrows significantly from MIPSI [22], a MIPS-based simulator. The simulator executes unmodified Alpha object code and models the execution pipelines, memory hierarchy, TLBs, and the branch prediction logic of the processor described in Section 2.

In an SMT processor a branch misprediction introduces wrong-path instructions that interact with instructions from other threads. To model this behavior, we fetch down wrong paths, introduce those instructions into the instruction queues, track their dependences, and issue them. We eventually squash all wrong-path instructions a cycle after a branch misprediction is discovered in the exec stage. Our throughput results only count useful instructions.

Our workload comes primarily from the SPEC92 benchmark suite [7]. We use five floating point programs (alvinn, doduc, fpppp, ora, and tomcatv) and two integer programs (espresso and xlisp) from that suite, and the document typesetting program TeX. We assign a distinct program to each thread in the processor: the multiprogrammed workload stresses our architecture more than a parallel program by presenting threads with widely varying program characteristics and with no overlap of cache, TLB or branch prediction usage. To eliminate the effects of benchmark differences, a single data point is composed of 8 runs, each T * 300 million instructions in length, where T is the number of threads. Each of the 8 runs uses a different combination of the benchmarks.

We compile each program with the Multiflow trace scheduling compiler [17], modified to produce Alpha code. In contrast to [27], we turn off trace scheduling in the compiler for this study, for two reasons. In our measurements, we want to differentiate between useful and useless speculative instructions, which is easy with hardware speculation, but not possible for software speculation with our system. Also, software speculation is not as beneficial on an architecture which features hardware speculation, and in some cases is harmful. However, the Multiflow compiler is still a good choice for our compilation engine, because of the high quality of the loop unrolling, instruction scheduling and alignment, and other optimizations, as well as the ease with which the machine model can be changed. The benchmarks are compiled to optimize single-thread performance on the base hardware.
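The composition of a data point described above can be sketched as below. The text specifies 8 runs of T * 300 million instructions, each with a different combination of the benchmarks, but not the combinations themselves; the rotation used here is only an assumed placeholder.

```python
# Sketch of how a T-thread data point could be assembled: 8 runs, each
# T * 300 million instructions, each run with a different benchmark
# combination. The rotation below is an illustrative assumption, not the
# combinations actually used in the study.

BENCHMARKS = ["alvinn", "doduc", "fpppp", "ora", "tomcatv",
              "espresso", "xlisp", "TeX"]

def runs_for(num_threads: int):
    for run in range(8):                       # 8 runs per data point
        combo = [BENCHMARKS[(run + i) % len(BENCHMARKS)]
                 for i in range(num_threads)]
        yield combo, num_threads * 300_000_000  # instructions simulated per run

for combo, length in runs_for(num_threads=4):
    print(combo, length)
```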
4 Performance of the Base Hardware Design

In this section we examine the performance of the base architecture and identify opportunities for improvement. Figure 3 shows that with only a single thread running on our SMT architecture, the throughput is less than 2% below a superscalar without SMT support. The drop in throughput is due to the longer pipeline (described in Section 2) used by the SMT processor. Its peak throughput is 84% higher than the superscalar. This gain is achieved with virtually no tuning of the base architecture for simultaneous multithreading. This design combines low single-thread impact with high speedup for even a few threads, enabling simultaneous multithreading to reap benefits even in an environment where multiple processes are running only a small fraction of the time. We also note, however, that the throughput peaks before 8 threads, and the processor utilization, at less than 50% of the 8-issue processor, is well short of the potential shown in [27].

[Figure 3: Instruction throughput for the base hardware architecture. Throughput (instructions per cycle) vs. number of threads, with the unmodified superscalar shown for comparison.]

We make several conclusions about the potential bottlenecks of this system as we approach 8 threads, aided by Figure 3 and Table 3. Issue bandwidth is clearly not a bottleneck, as the throughput represents a fraction of available issue bandwidth, and our data shows that no functional unit type is being overloaded. We appear to have enough physical registers. The caches and branch prediction logic are being stressed more heavily at 8 threads, but we expect the latency-hiding potential of the additional threads to make up for those drops. The culprit appears to be one or more of the following three problems: (1) IQ size: IQ-full conditions are common, 12 to 21% of cycles total for the two queues; (2) fetch throughput: even in those cycles where we don't experience an IQ-full condition, our data shows that we are sustaining only 4.2 useful instructions fetched per cycle (4.5 including wrong-path); and (3) lack of parallelism: although the queues are reasonably full, we find fewer than four out of, on average, 27 instructions per cycle to issue. We expect eight threads to provide more parallelism, so perhaps we have the wrong instructions in the instruction queues.

Table 3: The result of increased multithreading on some low-level metrics for the base architecture.
  Metric / Number of Threads                1      4      8
  out of registers (% of cycles)            3      7      3
  I cache miss rate (%)                     2.5    7.8    14.1
    misses per thousand instructions        6      17     29
  D cache miss rate (%)                     3.1    6.5    11.3
    misses per thousand instructions        12     25     43
  L2 cache miss rate (%)                    17.6   15.0   12.5
    misses per thousand instructions        3      5      9
  L3 cache miss rate (%)                    55.1   33.6   45.4
    misses per thousand instructions        1      3      4
  branch misprediction rate (%)             5.0    7.4    9.1
  jump misprediction rate (%)               2.2    6.4    12.9
  integer IQ-full (% of cycles)             7      10     9
  fp IQ-full (% of cycles)                  14     9      3
  avg (combined) queue population           25     25     27
  wrong-path instructions fetched (%)       24     7      7
  wrong-path instructions issued (%)        9      4      3

The rest of this paper focuses on improving this base architecture. The next section addresses each of the problems identified here with different fetch policies and IQ configurations. Section 6 examines ways to prevent issue waste, and Section 7 re-examines the improved architecture for new bottlenecks, identifying directions for further improvement.

5 The Fetch Unit: In Search of Useful Instructions

In this section we examine ways to improve fetch throughput without increasing the fetch bandwidth. Our SMT architecture shares a single fetch unit among eight threads. We can exploit the high level of competition for the fetch unit in two ways not possible with single-threaded processors: (1) the fetch unit can fetch from multiple threads at once, increasing our utilization of the fetch bandwidth, and (2) it can be selective about which thread or threads to fetch from. Because not all paths provide equally useful instructions in a particular cycle, an SMT processor can benefit by fetching from the threads that will provide the best instructions. We examine a variety of fetch architectures and fetch policies that exploit those advantages. Specifically, they attempt to improve fetch throughput by addressing three factors: fetch efficiency, by partitioning the fetch unit among threads (Section 5.1); fetch effectiveness, by improving the quality of the instructions fetched (Section 5.2); and fetch availability, by eliminating conditions that block the fetch unit (Section 5.3).

5.1 Partitioning the Fetch Unit

Recall that our baseline architecture fetches up to eight instructions from one thread each cycle. The frequency of branches in typical instruction streams and the misalignment of branch destinations make it difficult to fill the entire fetch bandwidth from one thread, even for smaller block sizes [5,24]. In this processor, we can spread the burden of filling the fetch bandwidth among multiple threads. For example, the probability of finding four instructions from each of two threads should be greater than that of finding eight from one thread.
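A toy Monte Carlo model (our assumption, not data from this study) illustrates the point: if each thread supplies a geometrically distributed run of useful instructions before a taken branch or misalignment ends the block, two 4-instruction fetches recover more of the 8-wide bandwidth than one 8-instruction fetch. The break probability below is an arbitrary illustrative value.

```python
# Toy illustration of fetch-block fragmentation: compare the expected number
# of instructions obtained by fetching up to 8 from one thread versus up to
# 4 from each of two threads. The geometric run-length model and p_break are
# assumptions made only for this example.

import random

def useful_run(p_break=0.15):
    """Useful fetch-run length for one thread (capped at 8 instructions)."""
    n = 0
    while n < 8 and random.random() > p_break:
        n += 1
    return n

random.seed(0)
trials = 100_000
one_thread  = sum(useful_run() for _ in range(trials)) / trials
two_threads = sum(min(4, useful_run()) + min(4, useful_run())
                  for _ in range(trials)) / trials
print(f"avg fetched, one thread x 8:  {one_thread:.2f}")
print(f"avg fetched, two threads x 4: {two_threads:.2f}")
```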
In this section, we attempt to reduce fetch block fragmentation (our term for the various factors that prevent us from fetching the maximum number of instructions) by fetching from multiple threads each cycle, while keeping the maximum fetch bandwidth (but not necessarily the I cache bandwidth) constant. We evaluate several fetching schemes, which are labeled alg.num1.num2, where alg is the fetch selection method (in this section threads are always selected using a round-robin priority scheme), num1 is the number of threads that can fetch in 1 cycle, and num2 is the maximum number of instructions fetched per thread in 1 cycle. The maximum number of total instructions fetched is always limited to eight. For each of the fetch partitioning policies, the cache is always 32 kilobytes organized into 8 data banks; a given bank can do just one access per cycle.

RR.1.8: This is the baseline scheme from Section 4. Each cycle one thread fetches as many as eight instructions. The thread is determined by a round-robin priority scheme from among those not currently suffering an I cache miss. In this scheme the I cache is indistinguishable from that on a single-threaded superscalar. Each cache bank has its own address decoder and output drivers; each cycle, only one of the banks drives the cache output bus, which is 8 instructions (32 bytes) wide.

RR.2.4, RR.4.2: These schemes fetch fewer instructions per thread from more threads (four each from two threads, or two each from four threads). If we try to partition the fetch bandwidth too finely, however, we may suffer thread shortage, where fewer threads are available than are required to fill the fetch bandwidth. For these schemes, multiple cache addresses are driven to each cache data bank, each of which now has a multiplexer before its address decoder, to select one cache index per cycle. Since the cache banks are single-ported, bank-conflict logic is needed to ensure that each address targets a separate bank. RR.2.4 has two cache output buses, each four instructions wide, while RR.4.2 has four output buses, each two instructions wide. For both schemes, the total width of the output buses is 8 instructions (identical to that in RR.1.8), but additional circuitry is needed so each bank is capable of driving any of the multiple (now smaller) output buses, and is able to select one or none to drive in a given cycle. Also, the cache tag store logic must be replicated or multiple-ported in order to calculate hit/miss for each address looked up per cycle. Thus, the hardware additions are: the address mux; multiple address buses; selection logic on the output drivers; the bank conflict logic; and multiple hit/miss calculations. The changes required for RR.2.4 would have a negligible impact on area and cache access time. The changes for RR.4.2 are more extensive, and would be more difficult to do without affecting area or access time. These schemes actually reduce the latency in the decode and rename stages, as the maximum length of dependency chains among fetched instructions is reduced by a factor of 2 and 4, respectively.

RR.2.8: This scheme attacks fetch block fragmentation without suffering from thread shortage by fetching eight instructions more flexibly from two threads. This can be implemented by reading an eight-instruction block for each thread (16 instructions total), then combining them. We take as many instructions as possible from the first thread, then fill in with instructions from the second, up to eight total. Like RR.2.4, two addresses must be routed to each cache bank, then multiplexed before the decoder; bank-conflict logic and two hit/miss calculations per cycle are necessary; and each bank drives one of the two output buses. Now, however, each output bus is eight instructions wide, which doubles the bandwidth out of the cache compared to any of the previous schemes. This could be done without greatly affecting area or cycle time, as the additional bussing could probably be done without expanding the cache layout. In addition, logic to select and combine the instructions is necessary, which might or might not require an additional pipe stage. Our simulations assume it does not.
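The combining step of RR.2.8 can be expressed in a few lines. This is a behavioral sketch of the selection logic described above, not a hardware design; the function name and block representation are ours.

```python
# Behavioral sketch of RR.2.8 combining: read an 8-instruction block from each
# of two threads, take as many useful instructions as possible from the first,
# then fill to eight from the second.

FETCH_WIDTH = 8

def combine_rr_2_8(block_a: list[str], block_b: list[str]) -> list[str]:
    """block_a / block_b hold the useful instructions in each thread's block."""
    fetched = block_a[:FETCH_WIDTH]
    fetched += block_b[:FETCH_WIDTH - len(fetched)]
    return fetched

# thread A's block ends early at a taken branch (3 useful instructions),
# so 5 of thread B's 8 instructions fill out the fetch bandwidth
print(combine_rr_2_8(["a0", "a1", "a2"], [f"b{i}" for i in range(8)]))
```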
[Figure 4: Instruction throughput (IPC) vs. number of threads for the different instruction cache interfaces (RR.1.8, RR.2.4, RR.4.2, RR.2.8) with round-robin instruction scheduling.]

Figure 4 shows that we can get higher maximum throughput by splitting the fetch over multiple threads. For example, the RR.2.4 scheme outperforms RR.1.8 at 8 threads by 9%. However, better maximum throughput comes at the cost of a 12% single-thread penalty; in fact, RR.2.4 does not surpass RR.1.8 until 4 threads. The RR.4.2 scheme needs 6 threads to surpass RR.1.8 and never catches the 2-thread schemes, suffering from thread shortage. The RR.2.8 scheme provides the best of both worlds: few-threads performance like RR.1.8 and many-threads performance like RR.2.4. However, the higher throughput of this scheme puts more pressure on the instruction queues, causing IQ-full conditions at a rate of 18% (integer) and 8% (fp) with 8 threads.

With the RR.2.8 scheme we have improved the maximum throughput by 10% without compromising single-thread performance. This was achieved by a combination of (1) partitioning the fetch bandwidth over multiple threads, and (2) making that partition flexible. This is the same approach (although in a more limited fashion here) that simultaneous multithreading uses to improve the throughput of the functional units [27].

5.2 Exploiting Thread Choice in the Fetch Unit

The efficiency of the entire processor is affected by the quality of instructions fetched. A multithreaded processor has a unique ability to control that factor. In this section, we examine fetching policies aimed at identifying the best thread or threads available to fetch each cycle.

Two factors make one thread less desirable than another. The first is the probability that a thread is following a wrong path as a result of an earlier branch misprediction. Wrong-path instructions consume not only fetch bandwidth, but also registers, IQ space, and possibly issue bandwidth. The second factor is the length of time the fetched instructions will be in the queue before becoming issuable. We maximize the throughput of a queue of bounded size by feeding it instructions that will spend the least time in the queue. If we fetch too many instructions that block for a long time, we eventually fill the IQ with unissuable instructions, a condition we call IQ clog. This restricts both fetch and issue throughput, causing the fetch unit to go idle and preventing issuable instructions from getting into the IQ. Both of these factors (wrong-path probability and expected queue time) improve over time, so a thread becomes more desirable as we delay fetching it.

We define several fetch policies, each of which attempts to improve on the round-robin priority policy using feedback from other parts of the processor. The first attacks wrong-path fetching, the others attack IQ clog. They are:

BRCOUNT: Here we attempt to give highest priority to those threads that are least likely to be on a wrong path. We do this by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.

MISSCOUNT: This policy detects an important cause of IQ clog. A long memory latency can cause dependent instructions to back up in the IQ waiting for the load to complete, eventually filling the queue with blocked instructions from one thread. This policy prevents that by giving priority to those threads that have the fewest outstanding D cache misses.

ICOUNT: This is a more general solution to IQ clog. Here priority is given to threads with the fewest instructions in decode, rename, and the instruction queues. This achieves three purposes: (1) it prevents any one thread from filling the IQ, (2) it gives highest priority to threads that are moving instructions through the IQ most efficiently, and (3) it provides a more even mix of instructions from the available threads, maximizing the parallelism in the queues. If cache misses are the dominant cause of IQ clog, MISSCOUNT may perform better, since it gets cache miss feedback to the fetch unit more quickly. If the causes are more varied, ICOUNT should perform better.

IQPOSN: Like ICOUNT, IQPOSN strives to minimize IQ clog and bias toward efficient threads. It gives lowest priority to those threads with instructions closest to the head of either the integer or floating point instruction queues (the oldest instruction is at the head of the queue). Threads with the oldest instructions will be most prone to IQ clog, and those making the best progress will have instructions farthest from the head of the queue. This policy does not require a counter for each thread, as do the previous three policies.
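The counter-based policies above all reduce to sorting the runnable threads by a per-thread count each cycle, with round-robin priority breaking ties. The sketch below is ours, not the fetch unit's logic; the per-thread statistics are hypothetical example values standing in for the pipeline feedback a real implementation would use.

```python
# Sketch of the fetch-priority selection step: each cycle, pick the thread(s)
# with the smallest per-thread count, breaking ties round-robin. The feedback
# dictionaries are hypothetical example values.

from typing import Callable

def pick_threads(threads, key: Callable[[int], int], last_pick: int, n: int = 2):
    """Select up to n threads: smallest key wins, round-robin breaks ties."""
    def priority(t):
        rr = (t - last_pick - 1) % len(threads)   # round-robin tie-breaker
        return (key(t), rr)
    return sorted(threads, key=priority)[:n]

# example per-thread feedback for 4 threads
unresolved_branches = {0: 3, 1: 0, 2: 1, 3: 2}    # BRCOUNT
outstanding_dmisses = {0: 2, 1: 1, 2: 0, 3: 0}    # MISSCOUNT
insts_in_front_end  = {0: 19, 1: 4, 2: 11, 3: 7}  # ICOUNT (decode+rename+IQ)

threads = [0, 1, 2, 3]
print("BRCOUNT  :", pick_threads(threads, unresolved_branches.__getitem__, last_pick=3))
print("MISSCOUNT:", pick_threads(threads, outstanding_dmisses.__getitem__, last_pick=3))
print("ICOUNT   :", pick_threads(threads, insts_in_front_end.__getitem__, last_pick=3))
```

In a real front end these counts would come from the decode, rename, and queue stages, which is why the feedback latency discussed next matters.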
Like any control system, the efficiency of these mechanisms is limited by the feedback latency resulting, in this case, from feeding data from later pipeline stages back to the fetch stage. For example, by the time instructions enter the queue stage or the exec stage, the information used to fetch them is three or (at least) six cycles old, respectively.

Both the branch counting and the miss counting policies tend to produce frequent ties. In those cases, the tie-breaker is round-robin priority.

[Figure 5: Instruction throughput (IPC) vs. number of threads for fetching based on the priority heuristics (BRCOUNT, MISSCOUNT, ICOUNT, IQPOSN), all compared to the baseline round-robin scheme, for both the 1.8 and 2.8 fetch partitions. The results for 1 thread are the same for all schemes, and thus not shown.]

Figure 5 shows that all of the fetch heuristics provide speedup over round-robin. Branch counting and cache miss counting provide moderate speedups, but only when the processor is saturated with many threads. Instruction counting, in contrast, produces more significant improvements regardless of the number of threads. IQPOSN provides similar results to ICOUNT, being within 4% at all times, but never exceeding it.

The branch counting heuristic does everything we ask of it. It reduces wrong-path instructions, from 8.2% of fetched instructions to 3.6%, and from 3.6% of issued instructions to 0.8% (RR.1.8 vs. BRCOUNT.1.8 with eight threads). And it improves throughput by as much as 8%. Its weakness is that the wrong-path problem it solves is not large on this processor, which has already attacked the problem with simultaneous multithreading. Even with the RR scheme, simultaneous multithreading reduces fetched wrong-path instructions from 16% with one thread to 8% with 8 threads.

Cache miss counting also achieves throughput gains as high as 8% over RR, but in general the gains are much lower. It is not particularly effective at reducing IQ clog, as we get IQ-full conditions 12% of the time on the integer queue and 14% on the floating point queue for MISSCOUNT.2.8 with 8 threads. These results indicate that IQ clog is more than simply the result of long memory latencies.

Table 4: Some low-level metrics for the round-robin and instruction-counting priority policies (and the 2.8 fetch partitioning scheme).
                                      1 Thread   8 Threads
  Metric                                         RR      ICOUNT
  integer IQ-full (% of cycles)       7          18      6
  fp IQ-full (% of cycles)            14         8       1
  avg queue population                25         38      30
  out of registers (% of cycles)      3          8       5

The instruction counting heuristic provides instruction throughput as high as 5.3 instructions per cycle, a throughput gain of 2.5 times over the unmodified superscalar. It outperforms the best round-robin result by 23%. Instruction counting is as effective at 2 and 4 threads (in benefit over round-robin) as it is at 8 threads. It nearly eliminates IQ clog (see the IQ-full results in Table 4) and greatly improves the mix of instructions in the queues (we are finding more issuable instructions despite having fewer instructions in the two queues). Intelligent fetching with this heuristic is of greater benefit than partitioning the fetch unit, as the ICOUNT.1.8 scheme consistently outperforms RR.2.8.

Table 4 points to a surprising result. As a result of simultaneous multithreaded instruction issue and the ICOUNT fetch heuristics, we actually put less pressure on the same instruction queue with eight threads than with one, having sharply reduced IQ-full conditions. It also reduces pressure on the register file (vs. RR) by keeping fewer instructions in the queue.

BRCOUNT and ICOUNT each solve different problems, and perhaps the best performance could be achieved from a weighted combination of them; however, the complexity of the feedback mechanism increases as a result. By itself, instruction counting clearly provides the best gains.

Given our measurement methodology, it is possible that the throughput increases could be overstated if a fetch policy simply favors those threads with the most inherent instruction-level parallelism or the best cache behavior, thus achieving improvements that would not be seen in practice. However, with the ICOUNT.2.8 policy, the opposite happens. Our results show that this scheme favors threads with lower single-thread ILP, thus its results include a higher sample of instructions from the slow threads than either the superscalar results or the RR results.
If anything, then, the ICOUNT.2.8 improvements are understated.

In summary, we have identified a simple heuristic that is very successful at identifying the best threads to fetch. Instruction counting dynamically biases toward threads that will use processor resources most efficiently, thereby improving processor throughput as well as relieving pressure on scarce processor resources: the instruction queues and the registers.

5.3 Unblocking the Fetch Unit

By fetching from multiple threads and using intelligent fetch heuristics, we have significantly increased fetch throughput and efficiency. The more efficiently we are using the fetch unit, the more we stand to lose when it becomes blocked. In this section we examine schemes that prevent two conditions that cause the fetch unit to miss fetch opportunities, specifically IQ-full conditions and I cache misses. The two schemes are:

BIGQ: The primary restriction on IQ size is not the chip area, but the time to search it; therefore we can increase its size as long as we don't increase the search space. In this scheme, we double the sizes of the instruction queues, but only search the first 32 entries for issue. This scheme allows the queues to buffer instructions from the fetch unit when the IQ overflows.

ITAG: When a thread is selected for fetching but experiences a cache miss, we lose the opportunity to fetch that cycle. If we do the I cache tag lookups a cycle early, we can fetch around cache misses: cache miss accesses are still started immediately, but only non-missing threads are chosen for fetch. Because we need to have the fetch address a cycle early, we essentially add a stage to the front of the pipeline, increasing the misfetch and mispredict penalties. This scheme requires one or more additional ports on the I cache tags, so that potential replacement threads can be looked up at the same time.
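The BIGQ idea, buffering more instructions than the issue logic searches, can be sketched as below. This is a behavioral illustration with invented names, not the queue's actual implementation.

```python
# Behavioral sketch of BIGQ: the instruction queue buffers up to 64 entries,
# but the issue logic searches only the first 32, so the wakeup/select search
# space (and hence its latency) is unchanged.

from collections import deque

class BigQ:
    CAPACITY, SEARCH_WINDOW = 64, 32

    def __init__(self):
        self.entries = deque()

    def insert(self, inst) -> bool:
        if len(self.entries) >= self.CAPACITY:
            return False                 # truly full: fetch must stall
        self.entries.append(inst)
        return True

    def issue_ready(self, is_ready) -> list:
        """Issue from the first 32 entries only; later entries just buffer."""
        window = list(self.entries)[:self.SEARCH_WINDOW]
        ready = [i for i in window if is_ready(i)]
        for i in ready:
            self.entries.remove(i)
        return ready

q = BigQ()
for n in range(40):                      # 40 in flight: 32 searchable + 8 buffered
    q.insert(("inst", n))
print(len(q.issue_ready(lambda inst: inst[1] % 2 == 0)))   # 16 ready in the window
```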
[Figure 6: Instruction throughput for the 64-entry queue (BIGQ) and early I cache tag lookup (ITAG), when coupled with the ICOUNT fetch policy, for both the 1.8 and 2.8 fetch partitions.]

Although the BIGQ scheme improves the performance of the round-robin scheme (not shown) by 1.5-2% across the board, Figure 6 shows that the bigger queues add no significant improvement to the ICOUNT policy. In fact, it is actually detrimental for several thread configurations. This is because the buffering effect of the big queue scheme brings instructions into the issuable part of the instruction queue that may have been fetched many cycles earlier, using priority information that is now out of date. The results indicate that using up-to-date priority information is more important than buffering.

These results show that intelligent fetch heuristics have made the extra instruction queue hardware unnecessary. The bigger queue by itself is actually less effective at reducing IQ clog than the ICOUNT scheme. With 8 threads, the bigger queues alone (BIGQ, RR.2.8) reduce IQ-full conditions to 11% (integer) and 0% (fp), while instruction counting alone (ICOUNT.2.8) reduces them to 6% and 1%. Combining BIGQ and ICOUNT drops them to 3% and 0%.

Early I cache tag lookup boosts throughput as much as 8% over ICOUNT. It is most effective when fetching one thread (ICOUNT.1.8), where the cost of a lost fetch slot is greater. However, it improves the ICOUNT.2.8 results no more than 2%, as the flexibility of the 2.8 scheme already hides some of the lost fetch bandwidth. In addition, ITAG lowers throughput with few threads, where competition for the fetch slots is low and the cost of the longer misprediction penalty is highest.

Using a combination of partitioning the fetch unit, intelligent fetching, and early I cache tag lookups, we have raised the peak performance of the base SMT architecture by 37% (5.4 instructions per cycle vs. 3.9). Our maximum speedup relative to a conventional superscalar has gone up proportionately, from 1.8 to 2.5 times the throughput. That gain comes from exploiting characteristics of a simultaneous multithreading processor not available to a single-threaded machine.

High fetch throughput makes issue bandwidth a more critical resource. We focus on this factor in the next section.

6 Choosing Instructions For Issue

Much as the fetch unit in a simultaneous multithreading processor can take advantage of the ability to choose which threads to fetch, the issue logic has the ability to choose instructions for issue. A dynamically scheduled single-threaded processor may have enough ready instructions to be able to choose between them, but with an SMT processor the options are more diverse. Also, because we have higher throughput than a single-threaded superscalar processor, the issue bandwidth is potentially a more critical resource, so avoiding issue slot waste may be more beneficial. In this section, we examine issue priority policies aimed at preventing issue waste.

Issue slot waste comes from two sources, wrong-path instructions (resulting from mispredicted branches) and optimistically issued instructions. Recall (from Section 2) that we optimistically issue load-dependent instructions a cycle before we have D cache hit information. In the case of a cache miss or bank conflict, we have to squash the optimistically issued instruction, wasting that issue slot.

In a single-threaded processor, choosing instructions least likely to be on a wrong path is always achieved by selecting the oldest instructions (those deepest into the instruction queue). In a simultaneous multithreading processor, the position of an instruction in the queue is no longer the best indicator of the level of speculation of that instruction, as right-path instructions are intermingled in the queues with wrong-path.

The policies we examine are OLDEST_FIRST, our default issue algorithm up to this point; OPT_LAST and SPEC_LAST, which only issue optimistic and speculative instructions (more specifically, any instruction behind a branch from the same thread in the instruction queue), respectively, after all others have been issued; and BRANCH_FIRST, which issues branches as early as possible in order to identify mispredicted branches quickly. The default fetch algorithm for each of these schemes is ICOUNT.2.8.
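Each issue policy amounts to a different sort key over the ready instructions when they outnumber the available issue slots. The sketch below is ours; the instruction records and their flags are hypothetical, chosen only to show the orderings the four policies would produce.

```python
# Sketch of the issue-priority policies as sort keys over the ready
# instructions. Instruction records and flags are illustrative only.

def issue_order(ready, policy: str):
    keys = {
        # default: deepest (oldest) instructions in the queue go first
        "OLDEST_FIRST": lambda i: i["age_rank"],
        # optimistically issued (load-dependent) instructions go last
        "OPT_LAST":     lambda i: (i["optimistic"], i["age_rank"]),
        # any instruction behind an unresolved branch of its thread goes last
        "SPEC_LAST":    lambda i: (i["speculative"], i["age_rank"]),
        # branches go first, to resolve mispredictions quickly
        "BRANCH_FIRST": lambda i: (not i["is_branch"], i["age_rank"]),
    }
    return sorted(ready, key=keys[policy])

ready = [
    {"name": "ld-use add", "age_rank": 0, "optimistic": True,  "speculative": False, "is_branch": False},
    {"name": "beq",        "age_rank": 1, "optimistic": False, "speculative": False, "is_branch": True},
    {"name": "spec mul",   "age_rank": 2, "optimistic": False, "speculative": True,  "is_branch": False},
]
for policy in ("OLDEST_FIRST", "OPT_LAST", "SPEC_LAST", "BRANCH_FIRST"):
    print(policy, [i["name"] for i in issue_order(ready, policy)])
```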
Table 5: Instruction throughput (instructions per cycle) for the issue priority schemes, and the percentage of useless instructions issued when running with 8 threads.
                   Number of Threads                 Useless Instructions (8 threads)
  Issue Method     1     2     4     6     8         wrong-path   optimistic
  OLDEST           2.10  3.30  4.62  5.09  5.29      4%           3%
  OPT_LAST         2.07  3.30  4.59  5.09  5.29      4%           2%
  SPEC_LAST        2.10  3.31  4.59  5.09  5.29      4%           3%
  BRANCH_FIRST     2.07  3.29  4.58  5.08  5.28      4%           6%

The strong message of Table 5 is that issue bandwidth is not yet a bottleneck. Even when it does become a critical resource, the amount of improvement we get from not wasting it is likely to be bounded by the percentage of our issue bandwidth given to useless instructions, which currently stands at 7% (4% wrong-path instructions, 3% squashed optimistic instructions). Because we don't often have more issuable instructions than functional units, we aren't able to (and don't need to) reduce that significantly. The SPEC_LAST scheme is unable to reduce the number of useless instructions at all, while OPT_LAST brings it down to 6%. BRANCH_FIRST actually increases it to 10%, as branch instructions are often load-dependent; therefore, issuing them as early as possible often means issuing them optimistically. A combined scheme (OPT_LAST and BRANCH_FIRST) might reduce that side effect, but is unlikely to have much effect on throughput.

Since each of the alternate schemes potentially introduces multiple passes to the IQ search, it is convenient that the simplest mechanism still works well.

7 Where Are the Bottlenecks Now?

We have shown that proposed changes to the instruction queues and the issue logic are unnecessary to achieve the best performance with this architecture, but that significant gains can be produced by moderate changes to the instruction fetch mechanisms. Here we examine that architecture more closely (using ICOUNT.2.8 as our new baseline), identifying likely directions for further improvements. In this section we present results of experiments intended to identify bottlenecks in the new design. For components that are potential bottlenecks, we quantify the size of the bottleneck by measuring the impact of relieving it. For some of the components that are not bottlenecks, we examine whether it is possible to simplify those components without creating a bottleneck. Because we are identifying bottlenecks rather than proposing architectures, we are no longer bound by implementation practicalities in these experiments.

The Issue Bandwidth. The experiments in Section 6 indicate that issue bandwidth is not a bottleneck. In fact, we found that even an infinite number of functional units increases throughput by only 0.5% at 8 threads.

Instruction Queue Size. Results in Section 5 would, similarly, seem to imply that the size of the instruction queues was not a bottleneck, particularly with instruction counting; however, the schemes we examined are not the same as larger, searchable queues, which would also increase available parallelism. Nonetheless, the exper