53-Architecture of the Pentium Microprocessor.pdf


Architecture of the Pentium Microprocessor

Donald Alpert
Dror Avnon
Intel Corporation

IEEE Micro, June 1993. 0272-1732/93/0600-0011$03.00 © 1993 IEEE

The Pentium CPU is the latest in Intel's family of compatible microprocessors. It integrates 3.1 million transistors in 0.8-µm BiCMOS technology. We describe the techniques of pipelining, superscalar execution, and branch prediction used in the microprocessor's design.

The Pentium processor is Intel's next generation of compatible microprocessors following the popular i486 CPU family. The design started in early 1989 with the primary goal of maximizing performance while preserving software compatibility within the practical constraints of available technology. The Pentium processor integrates 3.1 million transistors in 0.8-µm BiCMOS technology and carries the Intel trademark. We describe the architecture and development process employed to achieve this goal.

Technology

The continual advancement of semiconductor technology promotes innovation in microprocessor design. Higher levels of integration, made possible by reduced feature sizes and increased interconnection layers, enable designers to deploy additional hardware resources for more parallel computation and deeper pipelining. Faster device speeds lead to higher clock rates and consequently to requirements for larger and more specialized on-chip memory buffers.

Table 1 summarizes the technology improvements associated with our three most recent microprocessor generations. The 0.8-µm BiCMOS technology of the Pentium microprocessor enables 2.5 times the number of transistors and twice the clock frequency of the original i486 CPU, which was implemented in 1.0-µm CMOS.

Table 1. Technology for microprocessor development.

Microprocessor   Year   Technology                         No. of transistors   Frequency (MHz)
i386 CPU         1986   1.5-µm CMOS, two-layer metal       275K                 16
i486 CPU         1989   1.0-µm CMOS, two-layer metal       1.2M                 33
Pentium CPU      1993   0.8-µm BiCMOS, three-layer metal   3.1M                 66

Compatibility

Since introduction of the 8086 microprocessor in 1978, the X86 architecture has evolved through several generations of substantial functional enhancements and technology improvements, including the 80286 and i386 CPUs. Each of these CPUs was supported by a corresponding floating-point unit. The i486 CPU, introduced in 1989, integrates the complete functionality of an integer processor, floating-point unit, and cache memory into a single circuit.

The X86 architecture greatly appealed to software developers because of its widespread application as the central processor of IBM-compatible personal computers. The success of the architecture in PCs has in turn made the X86 popular for commercial server applications as well. Figure 1 shows some of the well-known software environments that are hosted on the architecture.

[Figure 1. Software environments, grouped into a 16-bit generation and a 32-bit generation along a timeline from the 1980s through 1991 to 199x. Environments shown include DOS, MS Windows, OS/2 1.x, Unix SVR4, SCO, NetWare 3.11, OSF/1, NextStep, 32-bit OS/2, Solaris, Windows NT, Taligent, and Univel.]

(All figures, tables, and photographs published in this article are the property of Intel Corporation.)

The common software environments allow the X86 architecture to exercise several operating modes. Applications developed for DOS use 16-bit real mode or virtual 8086 mode; applications for MS Windows and early versions of OS/2 use 16-bit protected mode; and applications for other popular environments use 32-bit flat (unsegmented) mode. The Pentium microprocessor employs general techniques for improving performance in all operating modes, as well as certain techniques for improving performance in specific operating modes. We focus on the 32-bit flat mode here, since this is the most appropriate mode for comparison with the other high-performance microprocessors described at the Hot Chips IV Conference.

The X86 architecture supports the IEEE 754 standard for floating-point arithmetic. In addition to required operations on single-precision and double-precision formats, the X86 floating-point architecture includes operations on the 80-bit, extended-precision format and a set of basic transcendental functions.

Pentium CPU designers found numerous exciting technical challenges in developing a microarchitecture that maintained compatibility with such a diverse software base. Later in this article we present examples of techniques for supporting self-modifying code and the stack-oriented, floating-point register file.

Performance

A microprocessor's performance is a complex function of many parameters that vary between applications, compilers, and hardware systems. In developing the Pentium microprocessor, the design team addressed these aspects for each of the popular software environments. As a result, Pentium CPU systems feature tuned compilers and cache memory. We focus on the performance of SPEC benchmarks for both the Pentium microprocessor and i486 CPU in systems with well-tuned compilers and cache memory. More specifically, the Pentium CPU achieves roughly two times the speedup on integer code and up to five times the speedup on floating-point vector code when compared with an i486 CPU of identical clock frequency.

Organization

Figure 2 shows the overall organization of the Pentium microprocessor. The core execution units are two integer pipelines and a floating-point pipeline with dedicated adder, multiplier, and divider. Separate on-chip instruction code and data caches supply the memory demands of the execution units, with a branch target buffer augmenting the instruction cache for dynamic branch prediction. The external interface includes separate address and 64-bit data buses.

[Figure 2. Pentium processor block diagram. Labeled elements include the 64-bit bus interface, prefetch buffers, U-pipe and V-pipe integer units, integer register set, data cache, and the pipelined floating-point unit with its adder, multiplier, and divider.]

Integer pipeline

The Pentium processor's integer pipeline is similar to that of the i486 CPU [3]. The pipeline has five stages (see Figure 3) with the following functions:

• Prefetch. During the PF stage the CPU prefetches code from the instruction cache and aligns the code to the initial byte of the next instruction to be decoded. Because instructions are of variable length, this stage includes buffers to hold both the line containing the instruction being decoded and the next consecutive line.
• First decode. In the D1 stage the CPU decodes the instruction to generate a control word. A single control word executes instructions directly; more complex instructions require microcoded control sequencing in D1.
• Second decode. In the D2 stage the CPU decodes the control word from D1 for use in the E stage. In addition, the CPU generates addresses for data memory references.
• Execute. In the E stage the CPU either accesses the data cache or calculates results in the ALU (arithmetic logic unit), barrel shifter, or other functional units in the data path.
• Write back. In the WB stage the CPU updates the registers and flags with the instruction's results. All exceptional conditions must be resolved before an instruction can advance to WB.

[Figure 3. Integer pipeline. PF: fetch and align instruction. D1: decode instruction, generate control word. D2: decode control word, generate memory address. E: access data cache or calculate ALU result. WB: write result.]

Compared to the integer pipeline of the i486 CPU, the Pentium microprocessor integrates additional hardware in several stages to speed instruction execution. For example, the i486 CPU requires two clocks to decode several instruction formats, but the Pentium CPU takes one clock and executes shift and multiply instructions faster. More significantly, the Pentium processor adds substantial enhancements in superscalar execution, branch prediction, and cache organization.

Superscalar execution. The Pentium CPU has a superscalar organization that enables two instructions to execute in parallel. Figure 4 shows that the resources for address generation and ALU functions have been replicated in independent integer pipelines, called U and V. (The pipeline names were selected because U and V were the first two consecutive letters of the alphabet neither of which was the initial of a functional unit in the design partitioning.) In the PF and D1 stages the CPU can fetch and decode two simple instructions in parallel and issue them to the U and V pipelines. Additionally, for complex instructions the CPU in D1 can generate microcode sequences that control both U and V pipelines.

[Figure 4. Superscalar execution. PF and D1 are shared; the D2, E, and WB stages are replicated in the U pipe and the V pipe.]

Several techniques are used to resolve dependencies between instructions that might be executed in parallel. Most of the logic is contained in the instruction issue algorithm (see Figure 5) of D1.

Figure 5. Instruction issue algorithm.

    Decode two consecutive instructions: I1 and I2
    If the following are all true
        I1 is a simple instruction
        I2 is a simple instruction
        I1 is not a jump instruction
        Destination of I1 ≠ source of I2
        Destination of I1 ≠ destination of I2
    Then issue I1 to U pipe and I2 to V pipe
    Else issue I1 to U pipe

Resource dependencies. A resource dependency occurs when two instructions require a single functional unit or data path. During the D1 stage, the CPU only issues two instructions for parallel execution if both are from a class of simple instructions, thereby eliminating most resource dependencies. The instructions must be directly executed, that is, not require microcode sequencing. The instruction being issued to the V pipe can be an ALU operation, memory reference, or jump. The instruction being issued to the U pipe can be from the same categories or from an additional set that uses a functional unit available only in the U pipe, such as the barrel shifter. Although the set of instructions identified as simple might seem restrictive, more than 90 percent of instructions executed in the Integer SPEC benchmark suite are simple.

Data dependencies. A data dependency occurs when one instruction writes a result that is read or written by another instruction. Logic in D1 ensures that the source and destination registers of the instruction issued to the V pipe differ from the destination register of the instruction issued to the U pipe. This arrangement eliminates read-after-write (RAW) and write-after-write (WAW) dependencies. Write-after-read (WAR) dependencies need not be checked because reads occur in an earlier stage of the pipelines than writes.

The design includes logic that enables instructions with certain special types of data dependency to be executed in parallel. For example, a conditional branch instruction that tests the flag results can be executed in parallel with a compare instruction that sets the flags.

Control dependencies. A control dependency occurs when the result of one instruction determines whether another instruction will be executed. When a jump instruction is issued to the U pipe, the CPU in D1 never issues an instruction to the V pipe, thereby eliminating control dependencies.

Note that resource dependencies and data dependencies between memory references are not resolved in D1. Dependent memory references can be issued to the two pipelines; we explain their resolution in the description of the data cache.

[Figure 6. Branch target buffer. Each entry holds a branch instruction address, history, and branch destination address.]

Branch prediction. The i486 CPU has a simple technique for handling branches. When a branch instruction is executed, the pipeline continues to fetch and decode instructions along the sequential
path until the branch reaches the E stage. In E, the CPU fetches the branch destination, and the pipeline resolves whether or not a conditional branch is taken. If the branch is not taken, the CPU discards the fetched destination, and execution proceeds along the sequential path with no delay. If the branch is taken, the fetched destination is used to begin decoding along the target path with two clocks of delay. Taken branches are found to be 15 percent to 20 percent of instructions executed, representing an obvious area for improvement by the Pentium processor.

The Pentium CPU employs a branch target buffer (BTB), which is an associative memory used to improve performance of taken branch instructions (see Figure 6). When a branch instruction is first taken, the CPU allocates an entry in the branch target buffer to associate the branch instruction's address with its destination address and to initialize the history used in the prediction algorithm. As instructions are decoded, the CPU searches the branch target buffer to determine whether it holds an entry for a corresponding branch instruction. When there is a hit, the CPU uses the history to determine whether the branch should be taken. If it should, the microprocessor uses the target address to begin fetching and decoding instructions from the target path. The branch is resolved early in the WB stage, and if the prediction was incorrect, the CPU flushes the pipeline and resumes fetching along the correct path. The CPU updates the dual-ported history in the WB stage. The branch target buffer holds entries for predicting 256 branches in a four-way associative organization.

Using these techniques, the Pentium CPU executes correctly predicted branches with no delay. In addition, conditional branches can be executed in the V pipe paired with a compare or other instruction that sets the flags in the U pipe. Branching executes with full compatibility and no modification to existing software. (We explain aspects of interactions between branch prediction and self-modifying code later.)

Cache organization. The i486 CPU employs a single on-chip cache that is unified for code and data. The single-ported cache is multiplexed on a demand basis between sequential code prefetches of complete lines and data references to individual locations. As just explained, branch targets are prefetched in the E stage, effectively using the same hardware as data memory references. There are potential advantages for such an organization over one that separates code and data.

1) For a given size of cache memory, a unified cache has a higher hit rate than separate caches because it balances the total allocation of code and data lines automatically.
2) Only one cache needs to be designed.
3) Handling self-modifying code can be simpler.

Despite these potential advantages of a unified cache, all of which apply to the i486 CPU, the Pentium microprocessor uses separate code and data caches. The reason is that the superscalar design and branch prediction demand more bandwidth than a unified cache similar to that of the i486 CPU can provide. First, efficient branch prediction requires that the destination of a branch be accessed simultaneously with data references of previous instructions executing in the pipeline. Second, the parallel execution of data memory references requires simultaneous accesses for loads and stores. Third, in the context of the overall Pentium microprocessor design, handling self-modifying code for separate code and data caches is only marginally more complex than for a unified cache.

The instruction cache and data cache are each 8-Kbyte, two-way associative designs with 32-byte lines.

Programs executing on the i486 CPU typically generate more data memory references than when executing on RISC microprocessors. Measurements on Integer SPEC benchmarks show 0.5 to 0.6 data references per instruction for the i486 CPU [4] and only 0.17 to 0.33 for the Mips processor [5]. This difference results directly from the limited number (eight) of registers for the X86 architecture, as well as procedure-calling conventions that require passing all parameters in memory. A small data cache is adequate to capture the locality of the additional references. (After all, the additional references have sufficient locality to fit in the register file of the RISC microprocessors.) The Pentium microprocessor implements a data cache that supports dual accesses by the U pipe and V pipe to provide additional bandwidth and simplify compiler instruction-scheduling algorithms.

Figure 7 shows that the address path to the translation look-aside buffer and data cache tags is a fully dual-ported structure. The data path, however, is single ported with eight-way interleaving of 32-bit-wide banks. When a bank conflict occurs, the U pipe assumes priority, and the V pipe stalls for a clock cycle. The bank conflict logic also serves to eliminate data dependencies between parallel memory references to a single location. For memory references to double-precision floating-point data, the CPU accesses consecutive banks in parallel, forming a single 64-bit path.

[Figure 7. Dual-access data cache. Dual-ported TLB, dual-ported cache tags, bank conflict detection, and single-ported, interleaved cache data.]

The design team considered a fully dual-ported structure for the data cache, but feasibility studies and performance simulations showed the interleaved structure to be more effective. The dual-ported structure eliminated bank conflicts, but the SRAM cell would have been larger than the cell used in the interleaved scheme, resulting in a smaller cache and lower hit ratio for the allocated area. Additionally, the handling of data dependencies would have been more complex.

With a write-through cache-consistency protocol and 32-bit data bus, the i486 DX2 CPU uses buses 80 percent of the time; 85 percent of all bus cycles are writes. (The i486 DX2 CPU has a core pipeline that operates at twice the bus clock's frequency.) For the Pentium microprocessor, with its higher performance core pipelines and 64-bit data bus, using a write-back protocol for cache consistency was an obvious enhancement. The write-back protocol uses four states: modified, exclusive, shared, and invalid (MESI).

Self-modifying code. One challenging aspect of the Pentium microprocessor's design was supporting self-modifying code compatibly. Compatibility requires that when an instruction is modified followed by execution of a taken branch instruction, subsequent executions of the modified instruction must use the updated value. This is a special form of dependency between data stores and instruction fetches.

The interaction between branch predictions and self-modifying code requires the most attention. The Pentium CPU fetches the target of a taken branch before previous instructions have completed stores, so dedicated logic checks for such conditions in the pipeline and flushes incorrectly fetched instructions when necessary. The CPU thoroughly verifies predicted branches to handle cases in which an instruction entered in the branch target buffer might be modified. The same mechanisms used for consistency with external memory maintain consistency between the code cache and data cache.

Floating-point pipeline

The i486 CPU integrated the floating-point unit (FPU) on chip, thus eliminating overhead of the communication protocol that resulted from using a coprocessor. Bringing the FPU on chip substantially boosted performance in the i486 CPU. Nevertheless, due to limited devices available for the FPU, its microarchitecture was based on a partial multiplier array and a shift-and-add data path controlled by microcode. Floating-point operations could not be pipelined with any other floating-point operations; that is, once a floating-point instruction is invoked, all other floating-point instructions stall until its completion.

The larger transistor budget available for the Pentium microprocessor permits a completely new approach in the design of the floating-point microarchitecture. The aggressive performance goals for the FPU presented an exciting challenge for the designers, even with more silicon resources available. Furthermore, maintaining full compatibility with previous products and with the IEEE standard for floating-point arithmetic was an uncompromising requirement.

Floating-point pipeline stages. Pentium's floating-point pipeline consists of eight stages. The first two stages are processed by the common integer pipeline resources for prefetch and decode. In the third stage the floating-point hardware begins activating logic for instruction execution. All of the first five stages are matched with their counterpart integer pipeline stages for pipeline sequencing and synchronization (see Figure 8).

[Figure 8. Floating-point pipeline, shown alongside the integer pipe.]

• Prefetch. The PF stage is the same as in the integer pipeline.
• First decode. The D1 stage is the same as in the integer pipeline.
• Second decode. The D2 stage is the same as in the integer pipeline.
• Operand fetch. In this E stage the FPU accesses both the data cache and the floating-point register file to fetch the operands necessary for the operation. When floating-point data is to be written to the data cache, the FPU converts internal data format into the appropriate memory representation. This stage matches the E stage of the integer pipeline.
• First execute. In the X1 stage the FPU executes the first steps of the floating-point computation. When floating-point data is read from the data cache, the FPU writes the incoming data into the floating-point register file.
• Second execute. In the X2 stage the FPU continues to execute the floating-point computation.
• Write float. In the WF stage the FPU completes the execution of the floating-point computation and writes the result into the floating-point register file.
• Error reporting. In the ER stage the FPU reports internal special situations that might require additional processing to complete execution and updates the floating-point status word.

The eight-stage pipeline in the FPU allows a single-cycle throughput for most of the basic floating-point instructions such as floating-point add, subtract, multiply, and compare. This means that a sequence of basic floating-point instructions free from data dependencies would execute at a rate of one instruction per cycle, assuming instruction cache and data cache hits.

Data dependencies exist between floating-point instructions when a subsequent instruction uses the result of a preceding instruction. Since the actual computation of floating-point results takes place during X1, X2, and WF stages, special paths in the hardware allow other stages to be bypassed and present the result to the subsequent instruction upon generation. Consequently, the latency of the basic floating-point instructions is three cycles.

The X86 floating-point architecture supports single-precision (32-bit), double-precision (64-bit), and extended-precision (80-bit) floating-point operations. We chose to support all computation for the three precisions directly, by extending the data path width to support extended precision. Although this entailed using more devices for the implementation, it greatly simplified the microarchitecture while improving the performance. If smaller data paths were designed, special rerouting of the data within the FPU and several state machines or microcode sequencing would have been required for calculating the higher precision data.

Floating-point instructions execute in the U pipe and generally cannot be paired with any other integer or floating-point instructions (the one exception will be explained later). The design was tuned for instructions that use one 64-bit operand in memory with the other operand residing in the floating-point register file. Thus, these operations may execute at the maximum throughput rate, since a full stage (E stage) in the pipeline is dedicated to operand fetching.

Although floating-point instructions use the U pipe during the E stage, the two ports to the data cache (which are used by the U pipe and the V pipe for integer operations) are used to bring 64-bit data to the FPU. Consequently, during intensive floating-point computation programs, the data cache access ports of the U pipe and V pipe operate concurrently with the floating-point computation. This behavior is similar to superscalar load-store RISC designs where load instructions execute in parallel with floating-point operations, and therefore deliver equivalent throughput of floating-point operations per cycle.

Microarchitecture overview. The floating-point unit of the Pentium microprocessor consists of six functional sections (see Figure 9).

[Figure 9. Floating-point unit block diagram. Sections shown: FIRC, FEXP, FMUL, FADD, FDIV, and FRND, with mantissa-result and exponent-result paths and connections to and from the integer unit and data cache.]

The floating-point interface, register file, and control (FIRC) section is the only interface between the FPU and the rest of the CPU. Since the function of floating-point operations is usually self-contained within the floating-point computation core, concentrating all the interface logic in one section helped to create a modular design of the other sections. The FIRC section also contains most of the common floating-point resources: register file, centralized control logic, and safe instruction recognition logic (described later). FIRC can complete execution of instructions that do not need arithmetic computation. It dispatches the instructions requiring arithmetic computation to the arithmetic sections.

The floating-point exponent section (FEXP) calculates the exponent and the sign results for all the floating-point arithmetic operations. It interfaces with all the other arithmetic sections for all the necessary adjustments between the mantissa and the sign and exponent fields in the computation of floating-point results.

The floating-point multiplier section (FMUL) includes a full multiplier array to support single-precision (24-bit mantissa), double-precision (53-bit mantissa), and extended-precision (64-bit mantissa) multiplication and rounding within three cycles. FMUL executes all the floating-point multiplication operations. It is also used for integer multiplication, which is implemented through microcode control.

The floating-point adder section (FADD) executes all the "add" floating-point instructions, such as floating-point add, subtract, and compare. FADD also executes a large set of micro-operations that are used by microcode sequences in the calculation of complex instructions, such as binary coded decimal (BCD) operations, format conversions, and transcendental functions. The FADD section operates during the X1 and X2 stages of the floating-point pipeline and employs several wide adders and shifters to support high-speed arithmetic algorithms while maintaining maximum performance for all data precisions. The CPU achieves a latency of three cycles with a throughput of one cycle for all the operations directly executed by the FADD section for single-precision, double-precision, and extended-precision data.

The floating-point divider (FDIV) section executes the floating-point divide, remainder, and square-root instructions. It operates during the X1 and X2 pipeline stages and calculates two bits of the divide quotient every cycle. The overall instruction latency depends on the precision of the operation. FDIV uses its own sequencer for iterative computation during the X1 stage. The results are fully accurate in accordance with IEEE standard 754 and ready for rounding at the end of the X2 stage.

The floating-point rounder (FRND) section rounds the results delivered from the FADD and FDIV sections. It operates during the WF stage of the floating-point pipeline and delivers a rounded result according to the precision control and the rounding control, which are specified in the floating-point control word.

Safe instruction recognition. Floating-point computation requires longer execution times than integer computation. Pentium's floating-point pipeline uses eight stages, while the integer pipeline uses only five stages. Compatibility requires in-order instruction execution as well as precise exception reporting. To meet these requirements in the Pentium processor, floating-point instructions should not proceed beyond the X1 stage, that is, allow subsequent instructions to proceed beyond the E stage, unless the floating-point instruction is guaranteed to complete without causing an exception. Otherwise, an instruction may change the state of the CPU, while an earlier floating-point instruction (which has not yet completed) might cause an exception that requires a trap to a software exception handler.

To avoid a substantial performance loss due to stalling instructions until the exception status of a previous floating-point instruction is known, Pentium's floating-point unit employs a mechanism called safe instruction recognition (SIR). This logic determines whether a floating-point instruction is guaranteed to complete without creating an exception and therefore is considered safe. If an instruction is safe, there is no need to stall the pipeline, and the maximum throughput can be obtained. If, however, the instruction is not safe, the pipeline stalls for three cycles until the unsafe instruction reaches the ER stage and a final determination of the exception status is made.

Six possible exceptions can occur on the Pentium microprocessor's floating-point operations: invalid operation, divide by zero, denormal operand, overflow, underflow, and inexact. The SIR logic needs to determine early in the floating-point pipeline (in the X1 stage), before any computation takes place, whether the instruction is guaranteed to be exception-free (safe) or not (unsafe). The first three of the six exceptions can be detected without any floating-point calculation. From the latter three exceptions, the inexact exception is usually masked by the operating system or the software application using the precision mask, or PM, bit in the floating-point control word. Otherwise, a trap will occur whenever rounding of the result is necessary. When the precision (inexact) exception is masked, the pipeline delivers the correctly rounded result directly.

For overflow and underflow exceptions SIR logic uses an algorithm that monitors the exponent fields of the input operands to conclude the exception status (safe or unsafe). In the X86 architecture the CPU stores floating-point operands in the floating-point register file with an extended-precision exponent, regardless of the precision control in the floating-point control word. The extended-precision exponent supports much greater range than the double-precision format. Overflow and underflow exceptions caused by converting the data into double-precision or single-precision formats occur only when storing the data into external memory.

These characteristics of the X86 floating-point architecture give a unique advantage to the effectiveness of the SIR mechanism in the Pentium CPU, since the SIR algorithm can use the internal extended-precision exponent range. Thus, the occurrence of unsafe operations is extremely rare. Our evaluation of the SIR algorithm for the FPU design found no unsafe instructions in simulated execution of the SPEC89 floating-point benchmarks.

Register stack manipulation. The X86 floating-point instruction set uses the register file as a stack of eight registers in which the top of stack (TOS) acts as an accumulator of the results. Therefore, the top of the stack is used for the majority of the instructions as one of the source operands and, usually, as the destination register.

To improve the floating-point pipeline performance by optimizing the use of the floating-point register file, Pentium's FPU can execute the FXCH instruction in parallel with any basic floating-point operation. The FXCH instruction swaps the contents of the TOS register with another register in the floating-point register file. All the basic floating-point instructions may be paired with FXCH in the V pipe. The pair execute in parallel, even when data dependency between the two instructions in the pair exists. The use of parallel FXCH redirects the result of a floating-point operation to any selected register in the register file, while bringing a new operand to the top of the stack for immediate use by the next floating-point operation.

The example shown in Figure 10 illustrates the use of parallel FXCH. The code in the example generates the results of two independent floating-point calculations. The floating-point register file contains initial values prior to code execution: register ST0 (TOS) contains the value A, register ST1 contains value B, register ST2 contains value C, and so on. The two operations are 1) floating-point addition of value A with the 64-bit floating-point operand addressed by the general register EAX, and 2) floating-point multiplication of value C by the 64-bit floating-point operand addressed by the general register EBX.

[Figure 10. FXCH code example. Cycle 1: FADD QWORD PTR [EAX] paired with FXCH ST2. Cycle 2: FMUL QWORD PTR [EBX] paired with FXCH ST3. The figure also shows the contents of ST0 through ST7 after each cycle.]

When the floating-point pipeline is fully loaded and these two operations are part of the code sequence, the parallel FXCH allows the calculation to maintain the maximum throughput of one cycle per operation. Within one cycle the Pentium CPU writes the result of the addition to ST2, while the operand for the next operation moves to the top of the stack. On the next cycle, the processor writes the result of the multiplication to ST3, while the top of the stack contains value D, which may be used for a subsequent operation.

Transcendental instructions. The CPU supports all eight transcendental instructions that are defined in the instruction set through direct execution of microcode sequences. The transcendental instructions are 1) FSIN, 2) FCOS, 3) FSINCOS, 4) FPTAN, 5) FPATAN, 6) F2XM1, 7) FYL2X, and 8) FYL2XP1 (sine, cosine, sine and cosine, tangent, arctangent, 2^X - 1, Y*log2(X), and Y*log2(X + 1)).

We developed new, table-driven algorithms for the transcendental functions using polynomial approximation techniques. These algorithms substantially improved performance and accuracy over the i486 CPU implementation, which used the more traditional Cordic algorithms. The approximation tables reside in an on-chip ROM along with the other special constants that are used for floating-point computation.

The performance improvement of the transcendental instructions on the Pentium processor ranges from two to three times over the same instructions on the i486 CPU at the same frequency. The worst case error for all the transcendental instructions is less than 1 ulp (unit in the last place) when rounding to nearest even and less than 1.5 ulps when rounding in other modes. The functions are guaranteed to be monotonic, with respect to the input operands, throughout the domain supported by the instruction.

Development process

Developing a highly integrated microprocessor involves collaboration between numerous teams having diverse technical specialties and working under the discipline of well-defined methodologies. A small team of architects and VLSI designers developed the initial concepts of the design. This group conducted feasibility studies of parallel instruction decoding and options for branch prediction techniques. Simultaneously, it evaluated performance by hand for short benchmarks and compiler optimizations.

As initial directions were established, additional engineers participated, and subteams focused on the following areas: 1) behavioral modeling of the microarchitecture; 2) circuit feasibility design for caches, decoding PLAs (programmable logic arrays), floating-point data path, and other critical functions; 3) a flexible, trace-driven simulator of instruction timing for performance evaluation; 4) a prototype compiler; and 5) enhancements to existing instruction-tracing tools.

Throughout the design we refined the Pentium microprocessor using both top-down and bottom-up methods. Top-down refinement was accomplished through comprehensive characterization of executing benchmark workloads on the i486 CPU [4] and trace-driven experiments concerning alternative machine organizations conducted by architects using the performance simulator. VLSI design engineers evaluating features critical to the targeted area and frequency refined the design from the bottom up.

On two occasions in the design the accumulation of changes from bottom-up refinement caused the need for substantial restructuring of the microprocessor's global chip plan, or "die diets." On those occasions, interdisciplinary teams of specialists collaborated to brainstorm and evaluate ideas that could satisfy the global or local design constraints. In one instance, we found it necessary to refine the set of instructions that could be executed in parallel. Constraints had been assigned to the area and speed of the decoder PLAs. The VLSI designers identified combinations of instruction formats that would feasibly decode in parallel, and the compiler writers determined the optimal selection. In the end, the measured performance of the Pentium microprocessor in production systems is within 2 percent of that predicted before the design was completed.

Sidebar: Naming the Pentium processor

In naming the fifth generation of its compatible microprocessor line the Pentium processor, Intel departed from tradition. Pentium breaks a string of CPU products dating back to the late 1970s that used numerics (8086, 286, 386, 486).

"The natural course would be to call this chip the 586," said Andrew S. Grove, president and chief executive officer. "Unfortunately, we cannot trademark those numbers, which means that any company might call any chip a 586, even if it doesn't measure up to the real thing."

Pentium uses the Greek word for five, pente, as its root to associate with the fifth-generation product and adds -ium, a common ending from the periodic table of elements. Thus, the Pentium microprocessor is the fifth generation, a key "element" for future computing.

(End of sidebar.)

The logic validation of the Pentium processor design presented a major challenge to the design team. A comprehensive test base from the validation of previous X86 microprocessors was available. However, the Pentium processor microarchitecture introduced several new fundamental techniques, such as superscalar, write-back cache, and floating-point algorithms, that required a more rigorous verification methodology.

We used different validation approaches in presilicon testing of the Pentium microprocessor:

1) Architecture verification looked at the "black box" functionality from the programmer's point of view. We designed comprehensive tests to cover all possible aspects of the programming model and all the Pentium processor user-visible features.
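As an illustration of the instruction issue algorithm described under superscalar execution (Figure 5 of the article), the following is a minimal Python sketch, not Intel's implementation. The names Instr, ins, can_pair, and issue are hypothetical, and the model makes simplifying assumptions: it tracks only named registers, it does not model flags (so a compare followed by a conditional branch pairs, which matches the special case the article mentions), and it ignores U-pipe-only instructions, memory bank conflicts, and microcoded instructions beyond marking them as not simple.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical, simplified view of a decoded instruction."""
    name: str
    dests: FrozenSet[str] = frozenset()  # registers written
    srcs: FrozenSet[str] = frozenset()   # registers read
    simple: bool = True                  # directly executed, no microcode
    is_jump: bool = False


def ins(name, dests=(), srcs=(), simple=True, is_jump=False) -> Instr:
    """Convenience constructor that accepts any iterable of register names."""
    return Instr(name, frozenset(dests), frozenset(srcs), simple, is_jump)


def can_pair(i1: Instr, i2: Instr) -> bool:
    """Apply the five conditions of the issue algorithm to I1 (U) and I2 (V)."""
    return (
        i1.simple
        and i2.simple
        and not i1.is_jump                # control dependency
        and not (i1.dests & i2.srcs)      # read-after-write
        and not (i1.dests & i2.dests)     # write-after-write
    )


def issue(stream: List[Instr]) -> List[Tuple[str, Optional[str]]]:
    """Return one (U pipe, V pipe) tuple per issue slot; V is None if unused."""
    slots = []
    k = 0
    while k < len(stream):
        if k + 1 < len(stream) and can_pair(stream[k], stream[k + 1]):
            slots.append((stream[k].name, stream[k + 1].name))
            k += 2
        else:
            slots.append((stream[k].name, None))
            k += 1
    return slots


program = [
    ins("mov eax, [esi]", dests=["eax"], srcs=["esi"]),
    ins("add ebx, ecx", dests=["ebx"], srcs=["ebx", "ecx"]),
    ins("add eax, 1", dests=["eax"], srcs=["eax"]),
    ins("mov edx, eax", dests=["edx"], srcs=["eax"]),  # reads eax written by previous
    ins("jmp target", is_jump=True),
    ins("inc edi", dests=["edi"], srcs=["edi"]),
]

schedule = issue(program)
for u, v in schedule:
    print(f"U: {u:<16} V: {v}")
```

In this sketch the first two instructions pair, the third issues alone because the fourth reads its result, and the jump may occupy the V pipe but never leads a pair from the U pipe. A fuller model would also need the compare-and-branch flag exception made explicit and the restriction that some instructions can issue only to the U pipe.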
