Architecture of the Pentium Microprocessor

Donald Alpert
Dror Avnon
Intel Corporation

The Pentium CPU is the latest in Intel's family of compatible microprocessors. It integrates 3.1 million transistors in 0.8-µm BiCMOS technology. We describe the techniques of pipelining, superscalar execution, and branch prediction used in the microprocessor's design.

The Pentium processor is Intel's next generation of compatible microprocessors following the popular i486 CPU family. The design started in early 1989 with the primary goal of maximizing performance while preserving software compatibility within the practical constraints of available technology. The Pentium processor integrates 3.1 million transistors in 0.8-µm BiCMOS technology and carries the Intel trademark. We describe the architecture and development process employed to achieve this goal.

Technology

The continual advancement of semiconductor technology promotes innovation in microprocessor design. Higher levels of integration, made possible by reduced feature sizes and increased interconnection layers, enable designers to deploy additional hardware resources for more parallel computation and deeper pipelining. Faster device speeds lead to higher clock rates and consequently to requirements for larger and more specialized on-chip memory buffers.

Table 1 summarizes the technology improvements associated with our three most recent microprocessor generations. The 0.8-µm BiCMOS technology of the Pentium microprocessor enables 2.5 times the number of transistors and twice the clock frequency of the original i486 CPU, which was implemented in 1.0-µm CMOS.

Compatibility

Since introduction of the 8086 microprocessor in 1978, the X86 architecture has evolved through several generations of substantial functional enhancements and technology improvements, including the 80286 and i386 CPUs. Each of these CPUs was supported by a corresponding floating-point unit. The i486 CPU, introduced in 1989, integrates the complete functionality of an integer processor, floating-point unit, and cache memory into a single circuit.

The X86 architecture greatly appealed to software developers because of its widespread application as the central processor of IBM-compatible personal computers. The success of the architecture in PCs
has in turn made the X86 popular for commercial server applications as well. Figure 1 shows some of the well-known software environments that are hosted on the architecture.

The common software environments allow the X86 architecture to exercise several operating modes. Applications developed for DOS use 16-bit real mode (or virtual 8086 mode); MS Windows and early versions of OS/2 use 16-bit protected mode; and applications for other popular environments use 32-bit flat (unsegmented) mode. The Pentium microprocessor employs general techniques for improving performance in all operating modes, as well as certain techniques for improving performance in specific operating modes. We focus on the 32-bit flat mode here, since this is the most appropriate mode for comparison with the other high-performance microprocessors described at the Hot Chips IV Conference.

0272-1732/93/0600-0011$03.00 © 1993 IEEE. IEEE Micro, June 1993.

Table 1. Technology for microprocessor development.

Microprocessor   Year   Technology                          No. of transistors   Frequency (MHz)
i386 CPU         1986   1.5-µm CMOS, two-layer metal        275K                 16
i486 CPU         1989   1.0-µm CMOS, two-layer metal        1.2M                 33
Pentium CPU      1993   0.8-µm BiCMOS, three-layer metal    3.1M                 66

Figure 1. Software environments. (Environments shown, spanning the 16-bit and 32-bit generations from the 1980s to 199x: DOS, MS-Windows, OS/2 1.x, Unix SVR4, SCO, Netware 3.11, OSF/1, NextStep, 32-bit OS/2, Solaris, Windows NT, Taligent, Univel.) (All figures, tables, and photographs published in this article are the property of Intel Corporation.)

Figure 2. Pentium processor block diagram. (Labeled blocks: 64-bit interface, prefetch buffers, U-pipe and V-pipe integer units, register set, data cache, and pipelined floating-point unit with adder, multiplier, and divider.)

The X86 architecture supports the IEEE-754 standard for floating-point arithmetic. In addition to required operations on single-precision and double-precision formats, the X86 floating-point architecture includes operations on 80-bit, extended-precision format and a set of basic transcendental functions.

Pentium CPU designers found numerous exciting technical challenges in developing a microarchitecture that maintained compatibility with such a diverse software base. Later in this article we present examples of techniques for supporting self-modifying code and the stack-oriented, floating-point register file.

Performance

A microprocessor's performance is a complex function of many parameters that vary between applications, compilers, and hardware systems. In developing the Pentium microprocessor, the design team addressed these aspects for each of the popular software environments. As a result, the Pentium CPU features tuned compilers and cache memory.

We focus on the performance of SPEC benchmarks for both the Pentium microprocessor and i486 CPU in systems with well-tuned compilers and cache memory. More specifically, the Pentium CPU achieves roughly two times the speedup on integer code and up to five times the speedup on floating-point vector code when compared with an i486 CPU of identical clock frequency.

Organization

Figure 2 shows the overall organization of the Pentium microprocessor. The core execution units are two integer pipelines and a floating-point pipeline with dedicated adder, multiplier, and divider. Separate on-chip instruction code and data caches supply the memory demands of the execution units, with a branch target buffer augmenting the instruction cache for dynamic branch prediction. The external interface includes separate address and 64-bit data buses.

Integer pipeline

The Pentium processor's integer pipeline is similar to that of the i486 CPU [3]. The pipeline has five stages (see Figure 3) with the following functions:

Prefetch. During the PF stage the CPU prefetches code from the instruction cache and aligns the code to the initial byte of the next instruction to be decoded. Because instructions are of variable length, this stage includes buffers to hold both the line containing the instruction being decoded and the next consecutive line.

First decode. In the D1 stage the CPU decodes the instruction to generate a control word. A single control word executes instructions directly; more complex instructions require microcoded control sequencing in D1.

Second decode. In the D2 stage the CPU decodes the control word from D1 for use in the E stage. In addition, the CPU generates addresses for data memory references.

Execute. In the E stage the CPU either accesses the data cache or calculates results in the ALU (arithmetic logic unit), barrel shifter, or other functional units in the data path.

Write back. In the WB stage the CPU updates the registers and flags with the instruction's results. All exceptional conditions must be resolved before an instruction can advance to WB.

Figure 3. Integer pipeline. (PF: fetch and align instruction. D1: decode instruction, generate control word. D2: decode control word, generate memory address. E: access data cache or calculate ALU result. WB: write result.)

Figure 4. Superscalar execution. (PF and D1 are shared; the U pipe and the V pipe each have their own D2, E, and WB stages.)

Compared to the integer pipeline of the i486 CPU, the Pentium microprocessor integrates additional hardware in several stages to speed instruction execution. For example, the i486 CPU requires two clocks to decode several instruction formats, but the Pentium CPU takes one clock and executes shift and multiply instructions faster. More significantly, the Pentium processor substantially enhances superscalar execution, branch prediction, and cache organization.

Superscalar execution. The Pentium CPU has a superscalar organization that enables two instructions to execute in parallel. Figure 4 shows that the resources for address generation and ALU functions have been replicated in independent integer pipelines, called U and V. (The pipeline names were selected because U and V were the first two consecutive letters of the alphabet neither of which was the initial of a functional unit in the design partitioning.) In the PF and D1 stages the CPU can fetch and decode two simple instructions in parallel and issue them to the U and V pipelines. Additionally, for complex instructions the CPU in D1 can generate microcode sequences that control both U and V pipelines.

Several techniques are used to resolve dependencies between instructions that might be executed in parallel. Most of the logic is contained in the instruction issue algorithm (see Figure 5) of D1.

Decode two consecutive instructions I1 and I2
If the following are all true
    I1 is a "simple" instruction
    I2 is a "simple" instruction
    I1 is not a jump instruction
    Destination of I1 ≠ source of I2
    Destination of I1 ≠ destination of I2
Then issue I1 to U pipe and I2 to V pipe
Else issue I1 to U pipe

Figure 5. Instruction issue algorithm.

Figure 6. Branch target buffer. (Each entry holds a branch instruction address, a history, and a branch destination address.)

Resource dependencies. A resource dependency occurs when two instructions require a single functional unit or data path. During the D1 stage, the CPU only issues two instructions for parallel execution if both are from a class of "simple" instructions, thereby eliminating most resource dependencies. The instructions must be directly executed, that is, not require microcode sequencing. The instruction being issued to the V pipe can be an ALU operation, memory reference, or jump. The instruction being issued to the U pipe can be from the same categories or from an additional set that uses a functional unit available only in the U pipe, such as the barrel shifter. Although the set of instructions identified as "simple" might seem restrictive, more than 90 percent of instructions executed in the Integer SPEC benchmark suite are simple.

Data dependencies. A data dependency occurs when one instruction writes a result that is read or written by another instruction. Logic in D1 ensures that the source and destination registers of the instruction issued to the V pipe differ from the destination register of the instruction issued to the U pipe. This arrangement eliminates read-after-write (RAW) and write-after-write (WAW) dependencies. Write-after-read (WAR) dependencies need not be checked because reads occur in an earlier stage of the pipelines than writes.

The design includes logic that enables instructions with certain special types of data dependency to be executed in parallel. For example, a conditional branch instruction that tests the flag results can be executed in parallel with a compare instruction that sets the flags.

Control dependencies. A control dependency occurs when the result of one instruction determines whether another instruction will be executed. When a jump instruction is issued to the U pipe, the CPU in D1 never issues an instruction to the V pipe, thereby eliminating control dependencies.

Note that resource dependencies and data dependencies between memory references are not resolved in D1. Dependent memory references can be issued to the two pipelines; we explain their resolution in the description of the data cache.

Branch prediction. The i486 CPU has a simple technique for handling branches. When a branch instruction is executed, the pipeline continues to fetch and decode instructions along the sequential path until the branch reaches the E stage. In E, the CPU fetches the branch destination, and the pipeline resolves whether or not a conditional branch is taken. If the branch is not taken, the CPU discards the fetched destination, and execution proceeds along the sequential path with no delay. If the branch is taken, the fetched destination is used to begin decoding along the target path with two clocks of delay. Taken branches are found to be 15 percent to 20 percent of instructions executed, representing an obvious area for improvement by the Pentium processor.

The Pentium CPU employs a branch target buffer (BTB), which is an associative memory used to improve performance of taken branch instructions (see Figure 6). When a branch instruction is first taken, the CPU allocates an entry in the branch target buffer to associate the branch instruction's address with its destination address and to initialize the history used in the prediction algorithm. As instructions are decoded, the CPU searches the branch target buffer to determine whether it holds an entry for a corresponding branch instruction. When there is a hit, the CPU uses the history to determine whether the branch should be taken. If it should, the microprocessor uses the target address to begin fetching and decoding instructions from the target path. The branch is resolved early in the WB stage, and if the prediction was incorrect, the CPU flushes the pipeline and resumes fetching along the correct path. The CPU updates the dual-ported history in the WB stage. The branch target buffer holds entries for predicting 256 branches in a four-way associative organization.

Using these techniques, the Pentium CPU executes correctly predicted branches with no delay. In addition, conditional branches can be executed in the V pipe paired with a compare or other instruction that sets the flags in the U pipe. Branching executes with full compatibility and no modification to existing software. (We explain aspects of interactions between branch prediction and self-modifying code later.)

Cache organization. The i486 CPU employs a single on-chip cache that is unified for code and data. The single-ported cache is multiplexed on a demand basis between sequential code prefetches of complete lines and data references to individual locations. As just explained, branch targets are prefetched in the E stage, effectively using the same hardware as data memory references. There are potential advantages for such an organization over one that separates code and data.

1) For a given size of cache memory, a unified cache has a higher hit rate than separate caches because it balances the total allocation of code and data lines automatically.
2) Only one cache needs to be designed.
3) Handling self-modifying code can be simpler.

Despite these potential advantages of a unified cache, all of which apply to the i486 CPU, the Pentium microprocessor uses separate code and data caches. The reason is that the superscalar design and branch prediction demand more bandwidth than a unified cache similar to that of the i486 CPU can provide. First, efficient branch prediction requires that the destination of a branch be accessed simultaneously with data references of previous instructions executing in the pipeline. Second, the parallel execution of data memory references requires simultaneous accesses for loads and stores. Third, in the context of the overall Pentium microprocessor design, handling self-modifying code for separate code and data caches is only marginally more complex than for a unified cache.

The instruction cache and data cache are each 8-Kbyte, two-way associative designs with 32-byte lines.

Programs executing on the i486 CPU typically generate more data memory references than when executing on RISC microprocessors. Measurements on Integer SPEC benchmarks show 0.5 to 0.6 data references per instruction for the i486 CPU [4] and only 0.17 to 0.33 for the Mips processor [5]. This difference results directly from the limited number (eight) of registers for the X86 architecture, as well as procedure-calling conventions that require passing all parameters in memory. A small data cache is adequate to capture the locality of the additional references. (After all, the additional references have sufficient locality to fit in the register file of the RISC microprocessors.) The Pentium microprocessor implements a data cache that supports dual accesses by the U pipe and V pipe to provide additional bandwidth and simplify compiler instruction scheduling algorithms.

Figure 7 shows that the address path to the translation look-aside buffer and data cache tags is a fully dual-ported structure. The data path, however, is single ported with eight-way interleaving of 32-bit-wide banks. When a bank conflict occurs, the U pipe assumes priority, and the V pipe stalls for a clock cycle. The bank conflict logic also serves to eliminate data dependencies between parallel memory references to a single location. For memory references to double-precision floating-point data, the CPU accesses consecutive banks in parallel, forming a single 64-bit path.

Figure 7. Dual-access data cache. (Blocks: dual-ported TLB, dual-ported cache tags, bank conflict detection, and single-ported, interleaved cache data.)

The design team considered a fully dual-ported structure for the data cache, but feasibility studies and performance simulations showed the interleaved structure to be more effective. The dual-ported structure eliminated bank conflicts, but the SRAM cell would have been larger than the cell used in the interleaved scheme, resulting in a smaller cache and lower hit ratio for the allocated area. Additionally, the handling of data dependencies would have been more complex.

With a write-through cache-consistency protocol and 32-bit data bus, the i486 DX2 CPU uses buses 80 percent of the time; 85 percent of all bus cycles are writes. (The i486 DX2 CPU has a core pipeline that operates at twice the bus clock's frequency.) For the Pentium microprocessor, with its higher performance core pipelines and 64-bit data bus, using a write-back protocol for cache consistency was an obvious enhancement. The write-back protocol uses four states: modified, exclusive, shared, and invalid (MESI).

Self-modifying code. One challenging aspect of the Pentium microprocessor's design was supporting self-modifying code compatibly. Compatibility requires that when an instruction is modified followed by execution of a taken branch instruction, subsequent executions of the modified instruction must use the updated value. This is a special form of dependency between data stores and instruction fetches.

The interaction between branch predictions and self-modifying code requires the most attention. The Pentium CPU fetches the target of a taken branch before previous instructions have completed stores, so dedicated logic checks for such conditions in the pipeline and flushes incorrectly fetched instructions when necessary. The CPU thoroughly verifies predicted branches to handle cases in which an instruction entered in the branch target buffer might be modified. The same mechanisms used for consistency with external memory maintain consistency between the code cache and data cache.

Floating-point pipeline

The i486 CPU integrated the floating-point unit (FPU) on chip, thus eliminating overhead of the communication protocol that resulted from using a coprocessor. Bringing the FPU on chip substantially boosted performance in the i486 CPU. Nevertheless, due to limited devices available for the FPU, its microarchitecture was based on a partial multiplier array and a shift-and-add data path controlled by microcode. Floating-point operations could not be pipelined with any other floating-point operations; that is, once a floating-point instruction is invoked, all other floating-point instructions stall until its completion.

The larger transistor budget available for the Pentium microprocessor permits a completely new approach in the design of the floating-point microarchitecture. The aggressive performance goals for the FPU presented an exciting challenge for the designers, even with more silicon resources available. Furthermore, maintaining full compatibility with previous products and with the IEEE standard for floating-point arithmetic was an uncompromising requirement.

Figure 8. Floating-point pipeline. (The figure shows the integer pipe and the floating-point pipe side by side.)

Floating-point pipeline stages. Pentium's floating-point pipeline consists of eight stages. The first two stages are processed by the common (integer pipeline) resources for prefetch and decode. In the third stage the floating-point hardware begins activating logic for instruction execution. All of the first five stages are matched with their counterpart integer pipeline stages for pipeline sequencing and synchronization (see Figure 8).

Prefetch. The PF stage is the same as in the integer pipeline.

First decode. The D1 stage is the same as in the integer pipeline.

Second decode. The D2 stage is the same as in the integer pipeline.

Operand fetch. In this E stage the FPU accesses both the data cache and the floating-point register file to fetch the operands necessary for the operation. When floating-point data is to be written to the data cache, the FPU converts internal data format into the appropriate memory representation. This stage matches the E stage of the integer pipeline.

First execute. In the X1 stage the FPU executes the first steps of the floating-point computation. When floating-point data is read from the data cache, the FPU writes the incoming data into the floating-point register file.

Second execute. In the X2 stage the FPU continues to execute the floating-point computation.

Write float. In the WF stage the FPU completes the execution of the floating-point computation and writes the result into the floating-point register file.

Error reporting. In the ER stage the FPU reports internal special situations that might require additional processing to complete execution and updates the floating-point status word.

The eight-stage pipeline in the FPU allows a single cycle throughput for most of the basic floating-point instructions such as floating-point add, subtract, multiply, and compare. This means that a sequence of basic floating-point instructions free from data dependencies would execute at a rate of one instruction per cycle, assuming instruction cache and data cache hits.

Data dependencies exist between floating-point instructions when a subsequent instruction uses the result of a preceding instruction. Since the actual computation of floating-point results takes place during X1, X2, and WF stages, special paths in the hardware allow other stages to be bypassed and present the result to the subsequent instruction upon generation. Consequently, the latency of the basic floating-point instructions is three cycles.

The X86 floating-point architecture supports single-precision (32-bit), double-precision (64-bit), and extended-precision (80-bit) floating-point operations. We chose to support all computation for the three precisions directly, by extending the data path width to support extended precision. Although this entailed using more devices for the implementation, it greatly simplified the microarchitecture while improving the performance. If smaller data paths were designed, special rerouting of the data within the FPU and several state machines or microcode sequencing would have been required for calculating the higher precision data.

Floating-point instructions execute in the U pipe and generally cannot be paired with any other integer or floating-point instructions (the one exception will be explained later). The design was tuned for instructions that use one 64-bit operand in memory with the other operand residing in the floating-point register file. Thus, these operations may execute at the maximum throughput rate, since a full stage (E stage) in the pipeline is dedicated to operand fetching. Although floating-point instructions use the U pipe during the E stage, the two ports to the data cache (which are used by the U pipe and the V pipe for integer operations) are used to bring 64-bit data to the FPU. Consequently, during intensive floating-point computation programs, the data cache access ports of the U pipe and V pipe operate concurrently with the floating-point computation. This behavior is similar to superscalar load-store RISC designs where load instructions execute in parallel with floating-point operations, and therefore deliver equivalent throughput of floating-point operations per cycle.

Microarchitecture overview. The floating-point unit of the Pentium microprocessor consists of six functional sections (see Figure 9).

The floating-point interface, register file, and control (FIRC) section is the only interface between the FPU and the rest of the CPU. Since the function of floating-point operations is usually self-contained within the floating-point computation core, concentrating all the interface logic in one section helped to create a modular design of the other sections. The FIRC section also contains most of the common floating-point resources: register file, centralized control logic, and safe instruction recognition logic (described later). FIRC can complete execution of instructions that do not need arithmetic computation. It dispatches the instructions requiring arithmetic computation to the arithmetic sections.

The floating-point exponent section (FEXP) calculates the exponent and the sign results for all the floating-point arithmetic operations. It interfaces with all the other arithmetic sections for all the necessary adjustments between the mantissa and the sign-and-exponent fields in the computation of floating-point results.

The floating-point multiplier section (FMUL) includes a full multiplier array to support single-precision (24-bit mantissa), double-precision (53-bit mantissa), and extended-precision (64-bit mantissa) multiplication and rounding within three cycles. FMUL executes all the floating-point multiplication operations. It is also used for integer multiplication, which is implemented through microcode control.

The floating-point adder section (FADD) executes all the "add" floating-point instructions, such as floating-point add, subtract, and compare. FADD also executes a large set of micro-operations that are used by microcode sequences in the calculation of complex instructions, such as binary coded decimal (BCD) operations, format conversions, and transcendental functions. The FADD section operates during the X1 and X2 stages of the floating-point pipeline and employs several wide adders and shifters to support high-speed arithmetic algorithms while maintaining maximum performance for all data precisions. The CPU achieves a latency of three cycles with a throughput of one cycle for all the operations directly executed by the FADD section for single-precision, double-precision, and extended-precision data.

The floating-point divider (FDIV) section executes the floating-point divide, remainder, and square-root instructions. It operates during the X1 and X2 pipeline stages and calculates two bits of the divide quotient every cycle. The overall instruction latency depends on the precision of the operation. FDIV uses its own sequencer for iterative computation during the X1 stage. The results are fully accurate in accordance with IEEE standard 754 and ready for rounding at the end of the X2 stage.

The floating-point rounder (FRND) section rounds the results delivered from the FADD and FDIV sections. It operates during the WF stage of the floating-point pipeline and delivers a rounded result according to the precision control and the rounding control, which are specified in the floating-point control word.

Figure 9. Floating-point unit block diagram. (Labeled elements include the path to and from the integer pipeline and cache, the FDIV, FADD, FMUL, and FRND sections, and the mantissa result and exponent result paths.)

Safe instruction recognition. Floating-point computation requires longer execution times than integer computation. Pentium's floating-point pipeline uses eight stages, while the integer pipeline uses only five stages. Compatibility requires in-order instruction execution as well as precise exception reporting. To meet these requirements in the Pentium processor, floating-point instructions should not proceed beyond the X1 stage, that is, allow subsequent instructions to proceed beyond the E stage, unless the floating-point instruction is guaranteed to complete without causing an exception. Otherwise, an instruction may change the state of the CPU, while an earlier floating-point instruction (which has not yet completed) might cause an exception that requires a trap to a software exception handler.

To avoid a substantial performance loss due to stalling instructions until the exception status of a previous floating-point instruction is known, Pentium's floating-point unit employs a mechanism called safe instruction recognition (SIR). This logic determines whether a floating-point instruction is guaranteed to complete without creating an exception and therefore is considered "safe." If an instruction is safe, there is no need to stall the pipeline, and the maximum throughput can be obtained. If, however, the instruction is not safe, the pipeline stalls for three cycles until the unsafe instruction reaches the ER stage and a final determination of the exception status is made.

Six possible exceptions can occur on the Pentium microprocessor's floating-point operations: invalid operation, divide by zero, denormal operand, overflow, underflow, and inexact. The SIR logic needs to determine early in the floating pipeline (in the X1 stage, before any computation takes place) whether the instruction is guaranteed to be exception free (safe) or not (unsafe). The first three of the six exceptions can be detected without any floating-point calculation. From the latter three exceptions, the inexact exception is usually "masked" by the operating system or the software application (using the precision mask, or PM, bit in the floating-point control word). Otherwise, a trap will occur whenever rounding of the result is necessary. When the precision (inexact) exception is masked, the pipeline delivers the correctly rounded result directly. For overflow and underflow exceptions SIR logic uses an algorithm that monitors the exponent fields of the input operands to conclude the exception status (safe or unsafe).

In the X86 architecture the CPU stores floating-point operands in the floating-point register file with an extended-precision exponent, regardless of the precision control in the floating-point control word. The extended-precision exponent supports much greater range than the double-precision format. Overflow and underflow exceptions caused by converting the data into double-precision or single-precision formats occur only when storing the data into external memory.

These characteristics of the X86 floating-point architecture give a unique advantage to the effectiveness of the SIR mechanism in the Pentium CPU, since the SIR algorithm can use the internal (extended-precision) exponent range. Thus, the occurrence of unsafe operations is extremely rare. Our evaluation of the SIR algorithm for the FPU design found no unsafe instructions in simulated execution of the SPEC89 floating-point benchmarks.

Register stack manipulation. The X86 floating-point instruction set uses the register file as a stack of eight registers in which the top of stack (TOS) acts as an accumulator of the results. Therefore, the top of the stack is used for the majority of the instructions as one of the source operands and, usually, as the destination register.

To improve the floating-point pipeline performance by optimizing the use of the floating-point register file, Pentium's FPU can execute the FXCH instruction in parallel with any basic floating-point operation. The FXCH instruction swaps the contents of the TOS register with another register in the floating-point register file. All the basic floating-point instructions may be paired with FXCH in the V pipe. The pair execute in parallel, even when data dependency between the two instructions in the pair exists. The use of parallel FXCH redirects the result of a floating-point operation to any selected register in the register file, while bringing a new operand to the top of the stack for immediate use by the next floating-point operation.

Figure 10. FXCH code example. (Cycle 1: FADD QWORD PTR [EAX] paired with FXCH ST(2). Cycle 2: FMUL QWORD PTR [EBX] paired with FXCH ST(3). The figure shows the contents of ST0 through ST7 before and after each cycle.)

The example shown in Figure 10 illustrates the use of parallel FXCH. The code in the example generates the results of two independent floating-point calculations. The floating-point register file contains initial values prior to code execution: register ST0 (TOS) contains the value A, register ST1 contains value B, register ST2 contains value C, and so on. The two operations are 1) floating-point addition of value A with the 64-bit floating-point operand addressed by the general register EAX, and 2) floating-point multiplication of value C by the 64-bit floating-point operand addressed by the general register EBX.

When the floating-point pipeline is fully loaded and these two operations are part of the code sequence, the parallel FXCH allows the calculation to maintain the maximum throughput of one cycle per operation. Within one cycle the Pentium CPU writes the result of the addition to ST2, while the operand for the next operation moves to the top of the stack. On the next cycle, the processor writes the result of the multiplication to ST3, while the top of the stack contains value D, which may be used for a subsequent operation.

Transcendental instructions. The CPU supports all eight transcendental instructions that are defined in the instruction set through direct execution of microcode sequences. The transcendental instructions are

1) FSIN: sine
2) FCOS: cosine
3) FSINCOS: sine and cosine
4) FPTAN: tangent
5) FPATAN: arctangent
6) F2XM1: 2^X - 1
7) FYL2X: Y*Log2(X)
8) FYL2XP1: Y*Log2(X+1)

We developed new, table-driven algorithms for the transcendental functions using polynomial approximation techniques. These algorithms substantially improved performance and accuracy over the i486 CPU implementation, which used the more traditional CORDIC algorithms. The approximation tables reside in an on-chip ROM along with the other special constants that are used for floating-point computation.

The performance improvement of the transcendental instructions on the Pentium processor ranges from two to three times over the same instructions on the i486 CPU at the same frequency. The worst-case error for all the transcendental instructions is less than 1 ulp (unit in the last place) when rounding to nearest even and less than 1.5 ulps when rounding in other modes. The functions are guaranteed to be monotonic, with respect to the input operands, throughout the domain supported by the instruction.

Development process

Developing a highly integrated microproc
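
As an illustration of the instruction issue algorithm of Figure 5, the following is a minimal Python sketch of the pairing decision made in D1. It is not Intel's logic: the Instr record, its field names, and the issue function are hypothetical, registers are modeled as plain strings, and the special cases described in the text (a compare paired with a conditional branch on the flags, instructions that can only go to the U pipe, and memory dependencies resolved at the data cache) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical model of a decoded instruction (names are assumptions)."""
    name: str
    simple: bool = True               # directly executed, no microcode sequencing
    is_jump: bool = False
    dests: frozenset = frozenset()    # registers written
    srcs: frozenset = frozenset()     # registers read


def issue(i1, i2):
    """Apply the Figure 5 rules to two consecutive instructions.

    Returns (u_pipe, v_pipe); v_pipe is None when I2 must wait for the
    next clock and only I1 is issued.
    """
    can_pair = (
        i1.simple
        and i2.simple
        and not i1.is_jump                # no instruction follows a jump into V
        and not (i1.dests & i2.srcs)      # read-after-write check
        and not (i1.dests & i2.dests)     # write-after-write check
    )
    return (i1, i2) if can_pair else (i1, None)


if __name__ == "__main__":
    load = Instr("mov eax, [mem]", dests=frozenset({"eax"}))
    independent = Instr("add ecx, edx", dests=frozenset({"ecx"}),
                        srcs=frozenset({"ecx", "edx"}))
    dependent = Instr("add ebx, eax", dests=frozenset({"ebx"}),
                      srcs=frozenset({"ebx", "eax"}))
    for second in (independent, dependent):
        u, v = issue(load, second)
        print(u.name, "|", v.name if v else "(V pipe idle)")
```

Write-after-read is not checked in the sketch for the reason the article gives: reads occur in an earlier pipeline stage than writes.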
