会员注册 | 登录 | 微信快捷登录 支付宝快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

   首页 人人文库网 > 资源分类 > PDF文档下载

20-HPS, A New Microarchitecture- Introduction and Rationale.pdf

  • 资源星级:
  • 资源大小:570.39KB   全文页数:6页
  • 资源格式: PDF        下载权限:注册会员/VIP会员
您还没有登陆,请先登录。登陆后即可下载此文档。
  合作网站登录: 微信快捷登录 支付宝快捷登录   QQ登录   微博登录
友情提示
2:本站资源不支持迅雷下载,请使用浏览器直接下载(不支持QQ浏览器)
3:本站资源下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰   

20-HPS, A New Microarchitecture- Introduction and Rationale.pdf

HPS,ANEWMICROARCHITECTURERATIONALEANDINTRODUCTIONYaleN.Patt,WenmeiHwu,andMichaelShebanowComputerScienceDivisionUniversityofCalifornia,BerkeleyBerkeley,CA94720ABSTRACTHPSHighPerformanceSubstrateisanewmicroarchitecturetargetedforimplementingveryhighperformancecomputingengines.Ourmodelofexecutionisarestrictiononfinegranularitydataflow.Thispaperintroducesthemodel,providestherationaleforitsselection,anddescribesthedatapathandflowofinstructionsthroughthemicroengine.1.IntroductionAcomputersystemisamultilevelstructure,algorithmsatthetop,gatesandwiresatthebottom.Toachievehighperformance,onemustoptimizeatalllevelsofthisstructure.Atmostlevels,theconventionalwisdomsuggestsexploitingconcurrency.Severalproposalshavebeenputforwardastohowtodothis.Wealsoargueforexploitingconcurrency,focusinginparticularonthemicroarchitecturelevel.1.1.RestrictedDataFlow.WearecallingourengineHPS,whichstandsforHighPerformanceSubstrate,toreflectthenotionthatwhatweareproposingshouldbeusefulforimplementingverydissimilarISParchitectures.Ourmodelofthemicroenginei.e.,arestrictiononclassicalfinegranularitydataflowisnotunlikethatofDennis3,ArvindZl,andothers,butwithsomeveryimportantdifferences.Thesedifferenceswillbediscussedndetailinsection3.Forthemoment,itisimportanttounderstandthatunlikeclassicaldatafiowmachines,onlyasmallsubsetoftheentireprogramisintheHPSmicroengineatanyonetime.WedefinetheactivewindowasthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlypartofthedataflowgraphwhichisresidentinthemicroengine.AstheactivewindowmovesthroughPermissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarcnotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.thedynamicinstructionstream,HPSexecutestheentireprogram.1.2.PotentialLimitationsofOtherApproaches.Webelievethatanessentialingredientofhighperformancecomputingistheeffectiveutilizationofalotofconcurrency.Thusweseeapotentiallimitationinmicroenginesthatarelimitedtooneoperationpercycle.Similarly,weseeapotentiallimitationinamicroenginethatunderutilizesitsbandwidthtoeitherinstructionmemoryordatamemory.Finally,althoughweappreciatetheadvantagesofstaticscheduling,weseeapotentiallimitationinamicroenginethatpurportstoexecuteasubstantialnumberofoperationseachcycle,butmustrelyonanonruntimeschedulerfordeterminingwhattodonext.1.3.Outlineofthispaper.Thispaperisorganizedinfoursections.Section2delineatesthefundamentalreasonswhichledustothisnewmicroarchitecture.Section3describesthebasicoperationofHPS.Section4offerssomeconcludingremarks,anddescribeswhereourresearchinHPSisheading.2.Rationale.2.1.TheThreeTierModel.Webelievethatirregularparallelisminaprogramexistsbothlocallyandglobally.Ourmechanismexploitsthelocalparallelism,butdisregardsglobalparallelism.Ourbeliefisthattheexecutionofanalgorithmshouldbehandledinthreetiers.Atthetop,whereglobalparallelismcanbebestidentified,theexecutionmodel..shouldutilizelargegranularitydataflow.muchliketheproposaloftheCEDARproject41.Inthemiddle,wherefortyyearsofcollectedexperienceincomputerprocessingcanbeexploitedprobablywithoutharm,classicalsequentialcontrolflowshouldbethemodel.Atthebottom,wherewewanttoexploitlocalparallelism,finegranularitydataflowisrecommended.Ourthreetiermodelreflectsourconceptionthatthetoplevelshouldbealgorithmoriented,themiddlelevelsequentialcontrolflowISParchitectureoriented,andthebottomlevelmicroengineoriented.Q1985ACMO897911725/85/0012/010300.752.2.LocalParallelism.Wefeelobligedtoreemphasizetheimportanceoflocalparallelismtoourchoicea4executionmodel.indeed,wechosethisrestrictedformofdataflowspecificallybecauseourstudieshaveshownthattheparallelismavailablefrom.themiddlecontrolflowtieri.e.,thesequentialcontrolflowarchitectureishighlylocalized.Wearguethat,byrestrictingtheactiveinstructionwindow,wecanexploi,talmostalloftheinherentparallelismintheprogramwhileincurringverylittleofthesynchronizationcostswhichwouldbeneededtokeeptheentireprogramaroundasatotaldataflowgraph.2.3,Stalls,Bandwidth,andConcurrency.Webelievethatahighperformancecomputingengineshouldexhibitanumberofcharacteristics.First,allitacomponentsmustbekeptbusy.Theremustbefewstalls,bothintheflowofinformationi.e.,thepathtomemory,loadingofregisters,etc.andintheprocessingofinformationi.e.,thefunctionalunits.Second,theremustbeahighdegreeofconcurrermyavailable,suchasmultiplepathstomemory,multipleprocessingelements,andsomeformofpipelining,forexample.Inourview,therestricteddataflowmodel,withitsoutoforderexecutioncapability,bestenablestheabovetworequirements,asfollowsThecenterofourmodelisthesetofnodetables,whereoperationsawaittheiroperands.Instructionmemoryfeedsthemicroengineataconstantratewithfewstalls.DatamemoryandI/Osupplyandextractdataatconstantrateswithfewstalls.Functionalunitsarekeptbusybynodesthatcanfire.Somewhereinthissystem,therehastobeslack.Theslackisinthenodeswaitinginthenodetables.Sincenodescanexecuteoutoforder,thereisnoblockingduetounavailabledata.Decodedinstructionsaddnodestothenodetablesandexecutednodesremovethem.Thenodetablestendtogrowinthepresenceofdatadependencies,andshrinkasthesedependenciesbecomefewer.Meanwhile,ourpreliminarymeasurementssupport,themultiplecomponentsofthemicroenginearekeptbusy.3.TheHPSModelofExecution.3.1.Overview.AnabstractviewofHPSisshowninfigure1.Instructionsarefetchedanddecodedfromadynamicinstructionstream,shownatthetopofthefigure.ThefigureimpliesthattheinstructionstreamistakenfromasequentialcontrolflowISParchitecture.WeneedtoemphasizethatthisisnotanecessarypartoftheHPSspecification.Indeed,weareinvestigatinghavingHPSdirectlyprocessmultinodewordsi.e.,thenodesofadirectedgraphwhichwouldbeproducedasthetargetcodeofaforexampleCcompiler.Whatisnecessaryisthat,foreachinstruction,theoutputofthedecoderwhichispresentedtotheMergerforhandlingbyHPSisadataflowgraph.AveryimportantpartofthespecificationofHPSisthenotionoftheactiveinstructionwindow.Unlikeclassicaldataflowmachines,itisnotthecasethatthedataflowgraphfortheentireprogramisinthemachineatonetime.WedefinetheactivewindowasthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlybeingworkedoninthedataflowmicroengine.Astheinstructionwindowmovesthroughthedynamicinstructionstream,HPSexecutestheentireinstructionstream.Parallelismwhichexistswithinthewindowisfullyexploitedbythemicroengine.Thisparallelismislimitedinscopeergo,thetermrestricteddataflow.TheMergertakesthedataflowgraphcorrespondingtoeachISPinstructionand,usingageneralizedTomasuloalgorithmtoresolveanyexistingdatadependencies,mergesitintotheentiredataflowgraphfortheactivewindow.Eachnodeofthedataflowgraphisshippedtooneofthenodetableswhereitremainsuntilitisreadytofire.Whenalloperandsforadataflownodeareready,thedataflownodefiresbvtransmittinethenodetotheappropriatefunctionalunit.ThefunctionalunitanALU,memory,orI/Odeviceexecutesthenodeanddistributestheresult,ifany,tothoselocationswhereitisneededforsubsequentprocessingthenodetables,theMergerforresolvingsubsequentdependenciesandtheFetchControlUnitforbringingnewinstructionsintotheactivewindow.Whenallthedataflownodesforaparticularinstructionhavebeenexecuted,theinstructionissaidtohaveexecuted.Aninstructionisretiredfromtheactivewindowwhenithasexecuted104andalltheinstructionsbeforeithaveretired.Allsideeffectstomemoryaretakencareofwhenaninstructionretiresfromtheactivewindow.Thisisessentialforthecorrecthandlingofpreciseinterruptsll.Theinstructionfetchinganddecodingunitsmaintainthedegreeofparallelisminthenodetablesbybringingnewinstructionsintotheactivewindow,whichresultsinnewdataflownodesbeingmergedintothedataflownodetables.3.2.InstructionFlowFigure2showstheglobaldatapathofHPS.InstructionsenterthedatapathasinputtotheMerger.Thisinputisintheformofadateflowgraph,oneperinstruction.Thedataflowgraphcanbetheresultofdecodinganinstructioninaclassicalsequentialinstructionstream,oritcanbetheoutputofanonconventionalcompiler.Ineithercase,theMergerseesasetofdataflownodesanddatadependencies,oneforeachoperationthatmustbeperformedintheexecutionofthatinstruction,operationsare,forexample,reads,writes,addresscomputationsandALUfunctions,Intheexampleoffigure3,thedataflowgraphcorrespondingtotheVAXinstructionADDWlOOO,A,Bconsistsofthreenodesamemoryread,memorywrite,andanALUoperation.Figure3alsoshowsthestructureofthethreenodesandthefivevaluebufferentriesrequiredfortheinstruction.TheMerger,usingtheRegisterAliasTabletoresolvedatadependenciesnotexplicitintheindividualinstruction,formsthesetofdataflownodeswhicharenecessarytoexecutetheinstruction,Nodesarethentransmittedtotheappropriatenodetables.Nodetables,asweshallsee,arecontentaddressiblememories,andthusshouldbekeptsmall.ThesizeofeachnodetableisafunctionofthesizeoftheactivewindowandthedecodingrateoftheVonNeumanninstructionstream.InourexperimentswiththeVAXarchitecture,forexample,anactivewindowof16instructions,coupledwithadecodingrateofeightnodespercycle,requiredatmosta35entrynodetable.Foreachnode,aslotisreservedintheglobalmultiportvaluebufferforstoringtheresultoftheoperationofthatnode.Theindexofeachslotisdesignatedasatagforthecorrespondingnode,andiscarriedalongwiththenodeuntilitcompletesitsexecution.Valuebufferslotsareassignedinacircularqueue,thesizeofthebufferbeinglargeenoughtoguarranteeretirementofaninstructionbeforeitsvaluebufferslotisagainneeded.InthecaseofoursimulatedimplementationoftheVAXarchitecture,anactivewindowof16instructions,havingapproximatelyfournodesperinstruction,meansthatavaluebufferof136entriesismorethanadequate.Anoderemainsinitsnodetableuntilallofitsoperandsareavailable,atwhichpointitisreadytofirei.e.,itisexecutable.Anodeisfiredbytransmittingitsoperator,tag,andsetofoperandstooneofthefunctionalunitaassociatedwiththatnodetable.Whenexecutioncompletes,theresultanditstagaredistributedtoeachpotiofthevaluebuffer.Inthecaseofaresultdestinedforageneralpurposeregister,thecorrespondingtagisalsotransmittedtotheRegisterAliasTabletoupdateinformationstoredthere.Thecorrespondingtagisalsotransmittedtothenodetablesforthepurposeofsettingthereadybitsinthosenodesawaitingthisresult.nodes.Thenumberofresultsthatcanbedistributedinasinglecycleisafunctionofthebusstructureandtheorganizationofthenodetables.Theintentisthatineachcycle,multiplenodeswillbeineachstageoftheprocess.DO3IIOOO,A,83.3.DataDependenciesandtheirResolution.ftclm3.Memoryreadandwritenodespresentadditionalcomplications.Althoughthesewillbediscussedingreaterdetailin71,afewobservationshereareinorder.FirstisthefactthatatthetimememoryaccessnodesareissuedbytheMergerdependingofcourseontheaddressingstructureofthetargetarchitecture,theaddressofthememoryaccessmaybeunknown,andtheaddressesofothermemoryaccesseswhichcouldblockthenodebeingissuedmayalsobeunknown.AMemoryAliasTableandaReadStagingUnitareprovidedtohandletheseproblems,Secondisthefactthatwritescanoccuroutofordercoupledwithourrequirementthatexceptionhandlingmustallowthemachinestatetoberecoveredprecisely.AWriteBufferandanalgorithmforretiringinstructionsareprovidedforhandlingthisproblem.Onefinalobservationabouttheprocessingofnodesmustbemade.Thestagesthatanodegoesthroughi.e.,merging,waitingforoperands,firing,executing,anddistributingitsresultsisindependentoftheothernodesinthenodetables.Thatis,forexample,thenumberofnodesfirableinagivencycleislimitedbytheabilitytodetectthatmultiplenodesarefirableandthenumberoffunctionaluniteavailableforconcurrentprocessingofFundamentaltothecorrect,fast,outoforderexecutionofoperationsinHPSisthehandlingofdatadependenciesand,aswewillsee,theabsenceofblockinginthosecaseswhereblockingisunnecessary.Sinceourlocallyconcurrentimplementationmodelhastoconformtothetargetarchitecture,thelocalconcurrencyexploitedmustnotcauseincorrectexecutionresults.3.3.1.Data,Anti,andOutputDependencies.AmicrooperationBdependsonanothermicrooperationAifBhastobeexecutedafterAinordertoproducethecorrectresult.Therearethreewaysinwhichamicrooperationcandependonanothermicrooperationthroughregisterusagedata,anti,andoutputdependencies.AdatareadafterwritedependencyoccurswhenAisgoingtowritetotheregisterfromwhichBisgoingtoread.Inthiscase,AsuppliesinformationessentialtotheexecutionofB.AnantiwriteafterreaddependencyoccurswhenAisgoingtoreadfromtheregistertowhichBisgoingtowrite.AnoutputwriteafterwritedependencyoccurswhenAandBaregoingtowritetothesameregister.Inthelasttwocases,theexecutionofAdoesnotsupplyanyinformationnecessaryfortheexecutionofB.TheonlyreasonBdependsonAisthataregisterhasbeenallocatedtotwodifferenttemporaryvariablesduetoashortageofregisters.Infact,ifwehadanunlimitednumberofregisters,differenttemporaryvariableswouldneverhavetobeallocatedtothesameregisterandthesecondandthethirddependencieswouldneveroccur.So,aproperrenamingmechanismandextrabufferregisterswouldremoveantianddatadependencies.Then,theonlytypeofdependencythatcoulddelaymicreoperationexecutionwouldbeadatadependency.Inotherwords,amicrooperationcouldbeexecutedassoonasitsinputoperandsareproperlygenerated.Thisisexactlythedescriptionofadatatlowexecutionmodel.3.3.2.OurModifiedTomasuloAlgorithm.OuralgorithmforenforcingdatadependenciesandremovingantiandoutputdependenciesissimilartotheTomasuloalgorithmwhichwasusedintheFloatingPointUnitoftheIBM360191161.Duringexecution,thealgorithmmanagestwomajordatastructuresaregisteraliastableandasetofnodetables.Eachentryintheregisteraliastablekeepstrackofthedynamicinformationforaregisternecessaryeithertosupplyaninputoperandvalueortoestablishdependencyarcs.Therearetwofieldsineachregisteraliastableentry.Thefirstisareadybit.Thisbit,ifcleared,indicatesthatthereisanactivemicrooperationwhichisgoingtosupplytheregistervalue.Thesecondfieldisthetagfieldwhichprovidesanindexintoaresultbuffer.Thisindicateswheretheregistervaluecanbefoundifthereadybitisset.106Eachentryinanodetablecorrespondstoamicrooperationandhasanoperationfield,aresulttugfield,andtwooperandrecords.Theoperationfieldspecifiestheactionthatwillbeperformedontheinputoperands,Theresulttugfieldprovidesthelocationintheresultbufferthattheresultvaluewillbeshippedtoaftertheexecutionofthemicrooperation,Eachoperandrecordconsistsoftwofields.Thefirstisareadybit.Thisbitissetwhentheinputoperandhasbeenproperlyproduced.Thesecondfieldisthetagfieldwhichcontainsanindexintotheoperandbuffer.Thisindicateswheretheoperandcanbefoundifthereadybitisset.AdatapathdesignedforourmodifiedTomasuloalgorithmispresentedinfigure2.Therearetwophasesineachmachinecyclemerging/schedulinganddistribution,Initiallyallregisteraliastableentriesarereadyandtheinitialregistervaluesareinresultbufferentrieswhoseindexisinthetagfieldsofthecorrespondingregisteraliastableentries.Merging/SchedulingAnewmicrooperationisassignedanentryinanodetableandisgivenauniqueresulttag,First,thecontentsofbothfieldsintheregisteraliastableforeachinputoperandarecopiedtothecorrespondingfieldsintheoperandrecordsofthenewnodetableentry.Second,thereadybitoftheregistertobewrittenbythemicrooperationisresetandtheresulttagforthemicrooperationiswrittenintothetagfieldofthealiastableentry.Ifbothoperandsofanodearemarkedready,thisnodecanfire.Thetagsintheoperandrecordsareusedtoindexintotheresultbufferandobtaintheoperandvalues.Theoperationandtheoperandvaluesaresenttothefunctionunitforexecution.Theresulttag,whichwillbeusedtodistributetheresult,isalsosenttothefunctionunit.DistributionWhenafunctionunitfinishesexecutingamicrooperation,theresulttagofthatmicrooperationisusedtoselectaresultbufferentryandtheresultvalueisstoredintotheentry.Theresulttagisalsodistributedtotheregisteraliastableandthenodetable.Boththeregisteraliastableandthenodetablearecontentaddressablememories.Entriesinthesetablesareaddressedbythevalueoftheresulttag.Alloftheregisteraliastableentriesandalloftheoperandrecordsinthenodetableentriessettheirreadybitifthedistributedresulttugmatchestheirtagfieldcontents.Afterexecution,alltheregisteraliastableentriesarereadyandthecorrespondingregistervaluesareintheresultbufferentrieswhoseindicesareinthetagfields.Thealgorithmdescribedaboveisawinonatleasttwocounts.First,itremovesantiandoutputdependencieswithoutproducingincorrectresults.Infact,itcanheshownthatreservationschemeswithoutrenamingcannotremoveantiandoutputdependencieswithoutproducingincorrectresults.Second,unlikeScoreboardingforexample,theissuingprocessneverhastostallduetodependencies.Itcanalsobeshownthatanyreservationschemewithoutrenamingwillhavetostallforsomedependencies.4.ConcludingRemarks.ThepurposeofthispaperhasbeentointroducetheHPSmicroarchitecture.CurrentresearchatBerkeleyistakingHPSalongfourtracks.First,weareattemptingtodesign,athighperformance,threeverydissimilararchitecturesthemicroVAX,aCmachine,andaPrologprocessor.Equallyimportant,weareinvestigatingthelimitsofthismicroarchitecture,bothfromthestandpointofaminimalimplementationandfromthestandpointofacadilIacversion.Asistobeexpected,thereareissuestoberesolvedbeforeaneffectiveHPSimplementationcanbe,achieved.Forexample,ifHPSistoimplementasequentialcontrolbasedISParchitecture,thentherearedecodingissues,includingthequestionofanodecache,whichneedtobedecided.Second,HPSrequiresadatapaththat1hashighbandwidthand2allowstheprocessingofveryirregularparalleldata.Third,HPSneedsaschedulerwhichcandetermine,inrealtime,whichnodesarefirableandwhicharenot.Fourth,theoutoforderexecutionofnodesrequiresadditionalattentiontothedesignofthememorysystem,theinstructionretirementandrepairmechanisms,andtheI/Osystem.Theseissuesarethesubjectofacompanionpaper171intheseProceedings.Acknowledgement.TheauthorswishtoacknowledgefirsttheDigitalEquipmentCorporationforsupportingverygenerouslyourresearchinanumberofpositivewaysLindaWright,formerlyHeadofDigitalsEasternResearchLabinHudsonMassachusettsforprovidinganenvironmentduringthesummerof1984whereourideascouldflourish,BillKania,formerlywithDigitalsLaboratoryDataProductsGroup,forprovidingmajorcapitalequipmentgrantsthathavegreatlysupportedourabilitytodoresearchDigitalsExternalResearchGrantsProgram,alsoforprovidingmajorcapitalequipmenttoenhanceourabilitytodoresearchandFernandoColonOsorio,headofAdvancedDevelopmentwithDigitalsHighPerformanceSystems/ClustersGroup,forprovidingfundingofpartofthisworkandfirstratetechnicalinteractionwithhisgrouponthetoughproblems.WealsoacknowledgetheothermembersoftheHPSgroup,SteveMelvin,ChienChen,andJiajuinWeifortheircontributionstotheHPSmodelaswellastothispaper,Finally,wewishtoacknowledgeourcolleaguesintheAquariusResearchGroupatBerkeley,AlDespain,presiding,forthestimulatinginteractionwhichcharacterizesourdailyactivityatBerkeley.5.References.1.Anderson,D.W.,Sparacio,F.J.,Tomasulo,R.M.,TheIBMSysted360Model91MachinePhilosophyandInstructionHandling,IBMJournalofResearchandDevelopment,Vol.11,No.1,1967,pp.824.1072.ArvindandGoostelow,K.P.,ANewInterpreterforDataflowandItsImplicationsforComputerArchitecture,DepartmentofInformationandComputerScience,UniversityofCalifornia,Irvine,Tech.Report72,October1975.3.Dennis,J.B.,andMisunas,D.I.,APreliminaryArchitectureforaBasicDataFlowProcessor,ProceedingsoftheSecondInternationalSymposiumonComputerArchitecture,1975,pp126132.4.Gajski,D.,Kuck,D.,Lawrie,D.,Sameh,A.,CEDARALargeScaleMultiprocessor,ComputerArchitectureNews,March1983.5.Keller,R.M.,LookAheadProcessors,ComputingSurveys,vol.7,no.4,Dec.1975.6.Tomasulo,R.M.,AnEfficientAlgorithmforExploitingMultipleArithmeticUnits,IBMJournalofResearchandDevelopment,vol.11,1967,pp2533.PrinciplesandExamples,McGrawHill,1982.7.Patt,Y.N.,Melvin,SW.,Hwu,W.,andShebanow,MC.,CriticalIssuesRegardingHPS,aHighPerformanceMicroarchitecture,Proceedingsofthe18thInternationalMicroprogrammingWorkshop,Asilomar,CA,December,1985.108

注意事项

本文(20-HPS, A New Microarchitecture- Introduction and Rationale.pdf)为本站会员(baixue100)主动上传,人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知人人文库网([email protected]),我们立即给予删除!

温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。

copyright@ 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5