20-HPS, A New Microarchitecture- Introduction and Rationale.pdf_第1页
20-HPS, A New Microarchitecture- Introduction and Rationale.pdf_第2页
20-HPS, A New Microarchitecture- Introduction and Rationale.pdf_第3页
20-HPS, A New Microarchitecture- Introduction and Rationale.pdf_第4页
20-HPS, A New Microarchitecture- Introduction and Rationale.pdf_第5页
已阅读5页,还剩1页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

HPS,ANEWMICROARCHITECTURE:RATIONALEANDINTRODUCTIONYaleN.Patt,Wen-meiHwu,andMichaelShebanowComputerScienceDivisionUniversityofCalifornia,BerkeleyBerkeley,CA94720ABSTRACTHPS(HighPerformanceSubstrate)isanewmicroarchitecturetargetedforimplementingveryhighperformancecomputingengines.Ourmodelofexecutionisarestrictiononfinegranularitydataflow.Thispaperintroducesthemodel,providestherationaleforitsselection,anddescribesthedatapathandflowofinstructionsthroughthemicroengine.1.IntroductionAcomputersystemisamultilevelstructure,algorithmsatthetop,gatesandwiresatthebottom.Toachievehighperformance,onemustoptimizeatalllevelsofthisstructure.Atmostlevels,theconventionalwisdomsuggestsexploitingconcurrency.Severalproposalshavebeenputforwardastohowtodothis.Wealsoargueforexploitingconcurrency,focusinginparticularonthemicroarchitecturelevel.1.1.RestrictedDataFlow.WearecallingourengineHPS,whichstandsforHighPerformanceSubstrate,toreflectthenotionthatwhatweareproposingshouldbeusefulforimplementingverydissimilarISParchitectures.Ourmodelofthemicroengine(i.e.,arestrictiononclassicalfinegranularitydataflow)isnotunlikethatofDennis3,ArvindZl,andothers,butwithsomeveryimportantdifferences.Thesedifferenceswillbediscussedndetailinsection3.Forthemoment,itisimportanttounderstandthatunlikeclassicaldatafiowmachines,onlyasmallsubsetoftheentireprogramisintheHPSmicroengineatanyonetime.Wedefinethe“activewindow”asthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlypartofthedataflowgraphwhichisresidentinthemicroengine.AstheactivewindowmovesthroughPermissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarcnotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.thedynamicinstructionstream,HPSexecutestheentireprogram.1.2.PotentialLimitationsofOtherApproaches.Webelievethatanessentialingredientofhighperformancecomputingistheeffectiveutilizationofalotofconcurrency.Thusweseeapotentiallimitationinmicroenginesthatarelimitedtooneoperationpercycle.Similarly,weseeapotentiallimitationinamicroenginethatunderutilizesitsbandwidthtoeitherinstructionmemoryordatamemory.Finally,althoughweappreciatetheadvantagesofstaticscheduling,weseeapotentiallimitationinamicroenginethatpurportstoexecuteasubstantialnumberofoperationseachcycle,butmustrelyonanon-run-timeschedulerfordeterminingwhattodonext.1.3.Outlineofthispaper.Thispaperisorganizedinfoursections.Section2delineatesthefundamentalreasonswhichledustothisnewmicroarchitecture.Section3describesthebasicoperationofHPS.Section4offerssomeconcludingremarks,anddescribeswhereourresearchinHPSisheading.2.Rationale.2.1.TheThreeTierModel.Webelievethatirregularparallelisminaprogramexistsbothlocallyandglobally.Ourmechanismexploitsthelocalparallelism,butdisregardsglobalparallelism.Ourbeliefisthattheexecutionofanalgorithmshouldbehandledinthreetiers.Atthetop,whereglobalparallelismcanbebestidentified,theexecutionmodel-.shouldutilizelargegranularitydataflow.muchliketheproposaloftheCEDARproject41.Inthemiddle,wherefortyyearsofcollectedexperienceincomputerprocessingcanbeexploitedprobablywithoutharm,classicalsequentialcontrolflowshouldbethemodel.Atthebottom,wherewewanttoexploitlocalparallelism,finegranularitydataflowisrecommended.Ourthreetiermodelreflectsourconceptionthatthetoplevelshouldbealgorithmoriented,themiddlelevelsequentialcontrolflowISParchitectureoriented,andthebottomlevelmicroengineoriented.Q1985ACMO-89791-1725/85/0012/0103$00.752.2.LocalParallelism.Wefeelobligedtore-emphasizetheimportanceoflocalparallelismtoourchoicea4executionmodel.indeed,wechosethisrestrictedformofdataflowspecificallybecauseourstudieshaveshownthattheparallelismavailablefrom.themiddlecontrolflowtier(i.e.,thesequentialcontrolflowarchitecture)ishighlylocalized.Wearguethat,byrestrictingtheactiveinstructionwindow,wecanexploi,talmostalloftheinherentparallelismintheprogramwhileincurringverylittleofthesynchronizationcostswhichwouldbeneededtokeeptheentireprogramaroundasatotaldataflowgraph.2.3,Stalls,Bandwidth,andConcurrency.Webelievethatahighperformancecomputingengineshouldexhibitanumberofcharacteristics.First,allitacomponentsmustbekeptbusy.Theremustbefewstalls,bothintheflowofinformation(i.e.,thepathtomemory,loadingofregisters,etc.)andintheprocessingofinformation(i.e.,thefunctionalunits).Second,theremustbeahighdegreeofconcurrermyavailable,suchasmultiplepathstomemory,multipleprocessingelements,andsomeformofpipelining,forexample.Inourview,therestricteddataflowmodel,withitsout-of-orderexecutioncapability,bestenablestheabovetworequirements,asfollows:Thecenterofourmodelisthesetofnodetables,whereoperationsawaittheiroperands.Instructionmemoryfeedsthemicroengineataconstantratewithfewstalls.DatamemoryandI/Osupplyandextractdataatconstantrateswithfewstalls.Functionalunitsarekeptbusybynodesthatcanfire.Somewhereinthissystem,therehastobe“slack.”Theslackisinthenodeswaitinginthenodetables.Sincenodescanexecuteout-of-order,thereisnoblockingduetounavailabledata.Decodedinstructionsaddnodestothenodetablesandexecutednodesremovethem.Thenodetablestendtogrowinthepresenceofdatadependencies,andshrinkasthesedependenciesbecomefewer.Meanwhile,ourpreliminarymeasurementssupport,themultiplecomponentsofthemicroenginearekeptbusy.3.TheHPSModelofExecution.3.1.Overview.AnabstractviewofHPSisshowninfigure1.Instructionsarefetchedanddecodedfromadynamicinstructionstream,shownatthetopofthefigure.ThefigureimpliesthattheinstructionstreamistakenfromasequentialcontrolflowISParchitecture.WeneedtoemphasizethatthisisnotanecessarypartoftheHPSspecification.Indeed,weareinvestigatinghavingHPSdirectlyprocessmultinodewords(i.e.,thenodesofadirectedgraph)whichwouldbeproducedasthetargetcodeofa(forexample)Ccompiler.Whatisnecessaryisthat,foreachinstruction,theoutputofthedecoderwhichispresentedtotheMergerforhandlingbyHPSisadataflowgraph.AveryimportantpartofthespecificationofHPSisthenotionoftheactiveinstructionwindow.Unlikeclassicaldataflowmachines,itisnotthecasethatthedataflowgraphfortheentireprogramisinthemachineatonetime.WedefinetheactivewindowasthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlybeingworkedoninthedataflowmicroengine.Astheinstructionwindowmovesthroughthedynamicinstructionstream,HPSexecutestheentireinstructionstream.Parallelismwhichexistswithinthewindowisfullyexploitedbythemicroengine.Thisparallelismislimitedinscope;ergo,theterm“restricteddataflow.”TheMergertakesthedataflowgraphcorrespondingtoeachISPinstructionand,usingageneralizedTomasuloalgorithmtoresolveanyexistingdatadependencies,mergesitintotheentiredataflowgraphfortheactivewindow.Eachnodeofthedataflowgraphisshippedtooneofthenodetableswhereitremainsuntilitisreadytofire.Whenalloperandsforadataflownodeareready,thedataflownodefiresbvtransmittinethenodetotheappropriatefunctionalunit.Thefunctionalunit(anALU,memory,orI/Odevice)executesthenodeanddistributestheresult,ifany,tothoselocationswhereitisneededforsubsequentprocessing:thenodetables,theMerger(forresolvingsubsequentdependencies)andtheFetchControlUnit(forbringingnewinstructionsintotheactivewindow).Whenallthedataflownodesforaparticularinstructionhavebeenexecuted,theinstructionissaidtohaveexecuted.Aninstructionisretiredfromtheactivewindowwhenithasexecuted104andalltheinstructionsbeforeithaveretired.Allsideeffectstomemoryaretakencareofwhenaninstructionretiresfromtheactivewindow.Thisisessentialforthecorrecthandlingofpreciseinterruptsll.Theinstructionfetchinganddecodingunitsmaintainthedegreeofparallelisminthenodetablesbybringingnewinstructionsintotheactivewindow,whichresultsinnewdataflownodesbeingmergedintothedataflownodetables.3.2.InstructionFlowFigure2showstheglobaldatapathofHPS.InstructionsenterthedatapathasinputtotheMerger.Thisinputisintheformofadateflowgraph,oneperinstruction.Thedataflowgraphcanbetheresultofdecodinganinstructioninaclassicalsequentialinstructionstream,oritcanbetheoutputofanon-conventionalcompiler.Ineithercase,theMergerseesasetofdataflownodes(anddatadependencies),oneforeachoperationthatmustbeperformedintheexecutionofthatinstruction,operationsare,forexample,reads,writes,addresscomputationsandALUfunctions,Intheexampleoffigure3,thedataflowgraphcorrespondingtotheVAXinstructionADDW#lOOO,A,Bconsistsofthreenodes:amemoryread,memorywrite,andanALUoperation.Figure3alsoshowsthestructureofthethreenodesandthefivevaluebufferentriesrequiredfortheinstruction.TheMerger,usingtheRegisterAliasTabletoresolvedatadependenciesnotexplicitintheindividualinstruction,formsthesetofdataflownodeswhicharenecessarytoexecutetheinstruction,Nodesarethentransmittedtotheappropriatenodetables.Nodetables,asweshallsee,arecontentaddressiblememories,andthusshouldbekeptsmall.ThesizeofeachnodetableisafunctionofthesizeoftheactivewindowandthedecodingrateoftheVonNeumanninstructionstream.InourexperimentswiththeVAXarchitecture,forexample,anactivewindowof16instructions,coupledwithadecodingrateofeightnodespercycle,requiredatmosta35entrynodetable.Foreachnode,aslotisreservedintheglobalmulti-portvaluebufferforstoringtheresultoftheoperationofthatnode.Theindexofeachslotisdesignatedasatagforthecorrespondingnode,andiscarriedalongwiththenodeuntilitcompletesitsexecution.Valuebufferslotsareassignedinacircularqueue,thesizeofthebufferbeinglargeenoughtoguarranteeretirementofaninstructionbeforeitsvaluebufferslotisagainneeded.(InthecaseofoursimulatedimplementationoftheVAXarchitecture,anactivewindowof16instructions,havingapproximatelyfournodesperinstruction,meansthatavaluebufferof136entriesismorethanadequate.)Anoderemainsinitsnodetableuntilallofitsoperandsareavailable,atwhichpointitisreadytofire(i.e.,itisexecutable).Anodeisfiredbytransmittingitsoperator,tag,andsetofoperandstooneofthefunctionalunitaassociatedwiththatnodetable.Whenexecutioncompletes,theresultanditstagaredistributedtoeachpotiofthevaluebuffer.Inthecaseofaresultdestinedforageneralpurposeregister,thecorrespondingtagisalsotransmittedtotheRegisterAliasTabletoupdateinformationstoredthere.Thecorrespondingtagisalsotransmittedtothenodetablesforthepurposeofsettingthereadybitsinthosenodesawaitingthisresult.nodes.Thenumberofresultsthatcanbedistributedinasinglecycleisafunctionofthebusstructureandtheorganizationofthenodetables.Theintentisthatineachcycle,multiplenodeswillbeineachstageoftheprocess.#DO&3IIOOO,A,83.3.DataDependenciesandtheirResolution.ftclm3.Memoryreadandwritenodespresentadditionalcomplications.Althoughthesewillbediscussedingreaterdetailin71,afewobservationshereareinorder.FirstisthefactthatatthetimememoryaccessnodesareissuedbytheMerger(dependingofcourseontheaddressingstructureofthetargetarchitecture),theaddressofthememoryaccessmaybeunknown,andtheaddressesofothermemoryaccesseswhichcouldblockthenodebeingissuedmayalsobeunknown.AMemoryAliasTableandaReadStagingUnitareprovidedtohandletheseproblems,Secondisthefactthatwritescanoccuroutofordercoupledwithourrequirementthatexceptionhandlingmustallowthemachinestatetoberecovered“precisely.”AWriteBufferandanalgorithmforretiringinstructionsareprovidedforhandlingthisproblem.Onefinalobservationabouttheprocessingofnodesmustbemade.Thestagesthatanodegoesthrough(i.e.,merging,waitingforoperands,firing,executing,anddistributingitsresults)isindependentoftheothernodesinthenodetables.Thatis,forexample,thenumberofnodesfirableinagivencycleislimitedbytheabilitytodetectthatmultiplenodesarefirableandthenumberoffunctionaluniteavailableforconcurrentprocessingofFundamentaltothecorrect,fast,out-of-orderexecutionofoperationsinHPSisthehandlingofdatadependenciesand,aswewillsee,theabsenceofblockinginthosecaseswhereblockingisunnecessary.Sinceourlocallyconcurrentimplementationmodelhastoconformtothetargetarchitecture,thelocalconcurrencyexploitedmustnotcauseincorrectexecutionresults.3.3.1.Data,Anti,andOutputDependencies.Amicro-operationBdependsonanothermicro-operationAifBhastobeexecutedafterAinordertoproducethecorrectresult.Therearethreewaysinwhichamicro-operationcandependonanothermicro-operationthroughregisterusage:data,anti,andoutputdependencies.Adata(read-after-write)dependencyoccurswhenAisgoingtowritetotheregisterfromwhichBisgoingtoread.Inthiscase,AsuppliesinformationessentialtotheexecutionofB.Ananti(write-after-read)dependencyoccurswhenAisgoingtoreadfromtheregistertowhichBisgoingtowrite.Anoutput(write-after-write)dependencyoccurswhenAandBaregoingtowritetothesameregister.Inthelasttwocases,theexecutionofAdoesnotsupplyanyinformationnecessaryfortheexecutionofB.TheonlyreasonBdependsonAisthataregisterhasbeenallocatedtotwodifferenttemporaryvariablesduetoashortageofregisters.Infact,ifwehadanunlimitednumberofregisters,differenttemporaryvariableswouldneverhavetobeallocatedtothesameregisterandthesecondandthethirddependencieswouldneveroccur.So,aproperrenamingmechanismandextrabufferregisterswouldremoveantianddatadependencies.Then,theonlytypeofdependencythatcoulddelaymicreoperationexecutionwouldbeadatadependency.Inotherwords,amicro-operationcouldbeexecutedassoonasitsinputoperandsareproperlygenerated.Thisisexactlythedescriptionofadatatlowexecutionmodel.3.3.2.OurModifiedTomasuloAlgorithm.OuralgorithmforenforcingdatadependenciesandremovingantiandoutputdependenciesissimilartotheTomasuloalgorithmwhichwasusedintheFloatingPointUnitoftheIBM360191161.Duringexecution,thealgorithmmanagestwomajordatastructures:aregisteraliastableandasetofnodetables.Eachentryintheregisteraliastablekeepstrackofthedynamicinformationforaregisternecessaryeithertosupplyaninputoperandvalueortoestablishdependencyarcs.Therearetwofieldsineachregisteraliastableentry.Thefirstisareadybit.Thisbit,ifcleared,indicatesthatthereisanactivemicro-operationwhichisgoingtosupplytheregistervalue.Thesecondfieldisthetagfieldwhichprovidesanindexintoaresultbuffer.Thisindicateswheretheregistervaluecanbefoundifthereadybitisset.106Eachentryinanodetablecorrespondstoamicro-operationandhasanoperationfield,aresulttugfield,andtwooperandrecords.Theoperationfieldspecifiestheactionthatwillbeperformedontheinputoperands,Theresulttugfieldprovidesthelocationintheresultbufferthattheresultvaluewillbeshippedtoaftertheexecutionofthemicro-operation,Eachoperandrecordconsistsoftwofields.Thefirstisareadybit.Thisbitissetwhentheinputoperandhasbeenproperlyproduced.Thesecondfieldisthetagfieldwhichcontainsanindexintotheoperandbuffer.Thisindicateswheretheoperandcanbefoundifthereadybitisset.AdatapathdesignedforourmodifiedTomasuloalgorithmispresentedinfigure2.Therearetwophasesineachmachinecycle:merging/schedulinganddistribution,Initiallyallregisteraliastableentriesarereadyandtheinitialregistervaluesareinresultbufferentrieswhoseindexisinthetagfieldsofthecorrespondingregisteraliastableentries.Merging/SchedulingAnewmicro-operationisassignedanentryinanodetableandisgivenauniqueresulttag,First,thecontentsofbothfieldsintheregisteraliastableforeachinputoperandarecopiedtothecorrespondingfieldsintheoperandrecordsofthenewnodetableentry.Second,thereadybitoftheregistertobewrittenbythemicro-operationisresetandtheresulttagforthemicro-operationiswrittenintothetagfieldofthealiastableentry.Ifbothoperandsofanodearemarkedready,thisnodecanfire.Thetagsintheoperandrecordsareusedtoindexintotheresultbufferandobtaintheoperandvalues.Theoperationandtheoperandvaluesaresenttothefunctionunitforexecution.Theresulttag,whichwillbeusedtodistributetheresult,isalsosenttothefunctionunit.DistributionWhenafunctionunitfinishesexecutingamicro-operation,theresulttagofthatmicro-operationisusedtoselectaresultbufferentryandtheresultvalueisstoredintotheentry.Theresulttagisalsodistributedtotheregisteraliastableandthenodetable.Boththeregisteraliastableandthenodetablearecontentaddressablememories.Entriesinthesetablesareaddressedbythevalueoftheresulttag.Alloftheregisteraliastableentriesandalloftheoperandrecordsinthenodetableentriessettheirreadybitifthedistributedresulttugmatchestheirtagfieldcontents.Afterexecution,alltheregisteraliastableentriesarereadyandthecorrespondingregistervaluesareintheresultbufferentrieswhoseindicesareinthetagfields.Thealgorithmdescribedaboveisawinonatleasttwocounts.First,itremovesantiandoutputdependencieswithoutproducingincorrectresults.Infact,itcanheshownthatreservationschemeswithoutrenamingcannotremoveantiandoutputdependencieswithoutproducingincorrectresults.Second,unlikeScoreboarding(forexample),theissuingprocessneverhastostallduetodependencies.Itcanalsobeshownthatanyreservationschemewithoutrenamingwillhavetostallforsomedependencies.4.ConcludingRemarks.ThepurposeofthispaperhasbeentointroducetheHPSmicroarchitecture.CurrentresearchatBerkeleyistakingHPSalongfourtracks.First,weareattemptingtodesign,athighperformance,threeverydissimilararchitectures:themicroVAX,aCmachine,andaPrologprocessor.Equallyimportant,weareinvestigatingthelimitsofthismicroarchitecture,bothfromthestandpointofaminimalimplementationandfromthestandpointofacadilIacversion.Asistobeexpected,thereareissuestoberesolvedbeforeaneffectiveHPSimplementationcanbe,achieved.Forexample,ifHPSistoimplementasequentialcontrolbasedISParchitecture,thentherearedecodingissues,includingthequestionofanodecache,whichneedtobedecided.Second,HPSrequiresadatapaththat(1)hashighbandwidthand(2)allowstheprocessingofveryirregularparalleldata.Third,HPSneedsaschedulerwhichcandetermine,inreal-time,whichnodesarefirableandwhicharenot.Fourth,the-out-of-orderexecutionofnodesrequiresadditionalattentiontothedesignofthememorysystem,theinstructionretirementandrepairmechanisms,andtheI/Osystem.Theseissuesarethesubjectofacompanionpaper171intheseProceedings.Acknowledgement.TheauthorswishtoacknowledgefirsttheDigitalEquipmentCorporationforsupportingverygenerouslyourresearchinanumberofpositiveways:LindaWright,formerlyHeadofDigit

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论