已阅读5页,还剩1页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
HPS,ANEWMICROARCHITECTURE:RATIONALEANDINTRODUCTIONYaleN.Patt,Wen-meiHwu,andMichaelShebanowComputerScienceDivisionUniversityofCalifornia,BerkeleyBerkeley,CA94720ABSTRACTHPS(HighPerformanceSubstrate)isanewmicroarchitecturetargetedforimplementingveryhighperformancecomputingengines.Ourmodelofexecutionisarestrictiononfinegranularitydataflow.Thispaperintroducesthemodel,providestherationaleforitsselection,anddescribesthedatapathandflowofinstructionsthroughthemicroengine.1.IntroductionAcomputersystemisamultilevelstructure,algorithmsatthetop,gatesandwiresatthebottom.Toachievehighperformance,onemustoptimizeatalllevelsofthisstructure.Atmostlevels,theconventionalwisdomsuggestsexploitingconcurrency.Severalproposalshavebeenputforwardastohowtodothis.Wealsoargueforexploitingconcurrency,focusinginparticularonthemicroarchitecturelevel.1.1.RestrictedDataFlow.WearecallingourengineHPS,whichstandsforHighPerformanceSubstrate,toreflectthenotionthatwhatweareproposingshouldbeusefulforimplementingverydissimilarISParchitectures.Ourmodelofthemicroengine(i.e.,arestrictiononclassicalfinegranularitydataflow)isnotunlikethatofDennis3,ArvindZl,andothers,butwithsomeveryimportantdifferences.Thesedifferenceswillbediscussedndetailinsection3.Forthemoment,itisimportanttounderstandthatunlikeclassicaldatafiowmachines,onlyasmallsubsetoftheentireprogramisintheHPSmicroengineatanyonetime.Wedefinethe“activewindow”asthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlypartofthedataflowgraphwhichisresidentinthemicroengine.AstheactivewindowmovesthroughPermissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarcnotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.thedynamicinstructionstream,HPSexecutestheentireprogram.1.2.PotentialLimitationsofOtherApproaches.Webelievethatanessentialingredientofhighperformancecomputingistheeffectiveutilizationofalotofconcurrency.Thusweseeapotentiallimitationinmicroenginesthatarelimitedtooneoperationpercycle.Similarly,weseeapotentiallimitationinamicroenginethatunderutilizesitsbandwidthtoeitherinstructionmemoryordatamemory.Finally,althoughweappreciatetheadvantagesofstaticscheduling,weseeapotentiallimitationinamicroenginethatpurportstoexecuteasubstantialnumberofoperationseachcycle,butmustrelyonanon-run-timeschedulerfordeterminingwhattodonext.1.3.Outlineofthispaper.Thispaperisorganizedinfoursections.Section2delineatesthefundamentalreasonswhichledustothisnewmicroarchitecture.Section3describesthebasicoperationofHPS.Section4offerssomeconcludingremarks,anddescribeswhereourresearchinHPSisheading.2.Rationale.2.1.TheThreeTierModel.Webelievethatirregularparallelisminaprogramexistsbothlocallyandglobally.Ourmechanismexploitsthelocalparallelism,butdisregardsglobalparallelism.Ourbeliefisthattheexecutionofanalgorithmshouldbehandledinthreetiers.Atthetop,whereglobalparallelismcanbebestidentified,theexecutionmodel-.shouldutilizelargegranularitydataflow.muchliketheproposaloftheCEDARproject41.Inthemiddle,wherefortyyearsofcollectedexperienceincomputerprocessingcanbeexploitedprobablywithoutharm,classicalsequentialcontrolflowshouldbethemodel.Atthebottom,wherewewanttoexploitlocalparallelism,finegranularitydataflowisrecommended.Ourthreetiermodelreflectsourconceptionthatthetoplevelshouldbealgorithmoriented,themiddlelevelsequentialcontrolflowISParchitectureoriented,andthebottomlevelmicroengineoriented.Q1985ACMO-89791-1725/85/0012/0103$00.752.2.LocalParallelism.Wefeelobligedtore-emphasizetheimportanceoflocalparallelismtoourchoicea4executionmodel.indeed,wechosethisrestrictedformofdataflowspecificallybecauseourstudieshaveshownthattheparallelismavailablefrom.themiddlecontrolflowtier(i.e.,thesequentialcontrolflowarchitecture)ishighlylocalized.Wearguethat,byrestrictingtheactiveinstructionwindow,wecanexploi,talmostalloftheinherentparallelismintheprogramwhileincurringverylittleofthesynchronizationcostswhichwouldbeneededtokeeptheentireprogramaroundasatotaldataflowgraph.2.3,Stalls,Bandwidth,andConcurrency.Webelievethatahighperformancecomputingengineshouldexhibitanumberofcharacteristics.First,allitacomponentsmustbekeptbusy.Theremustbefewstalls,bothintheflowofinformation(i.e.,thepathtomemory,loadingofregisters,etc.)andintheprocessingofinformation(i.e.,thefunctionalunits).Second,theremustbeahighdegreeofconcurrermyavailable,suchasmultiplepathstomemory,multipleprocessingelements,andsomeformofpipelining,forexample.Inourview,therestricteddataflowmodel,withitsout-of-orderexecutioncapability,bestenablestheabovetworequirements,asfollows:Thecenterofourmodelisthesetofnodetables,whereoperationsawaittheiroperands.Instructionmemoryfeedsthemicroengineataconstantratewithfewstalls.DatamemoryandI/Osupplyandextractdataatconstantrateswithfewstalls.Functionalunitsarekeptbusybynodesthatcanfire.Somewhereinthissystem,therehastobe“slack.”Theslackisinthenodeswaitinginthenodetables.Sincenodescanexecuteout-of-order,thereisnoblockingduetounavailabledata.Decodedinstructionsaddnodestothenodetablesandexecutednodesremovethem.Thenodetablestendtogrowinthepresenceofdatadependencies,andshrinkasthesedependenciesbecomefewer.Meanwhile,ourpreliminarymeasurementssupport,themultiplecomponentsofthemicroenginearekeptbusy.3.TheHPSModelofExecution.3.1.Overview.AnabstractviewofHPSisshowninfigure1.Instructionsarefetchedanddecodedfromadynamicinstructionstream,shownatthetopofthefigure.ThefigureimpliesthattheinstructionstreamistakenfromasequentialcontrolflowISParchitecture.WeneedtoemphasizethatthisisnotanecessarypartoftheHPSspecification.Indeed,weareinvestigatinghavingHPSdirectlyprocessmultinodewords(i.e.,thenodesofadirectedgraph)whichwouldbeproducedasthetargetcodeofa(forexample)Ccompiler.Whatisnecessaryisthat,foreachinstruction,theoutputofthedecoderwhichispresentedtotheMergerforhandlingbyHPSisadataflowgraph.AveryimportantpartofthespecificationofHPSisthenotionoftheactiveinstructionwindow.Unlikeclassicaldataflowmachines,itisnotthecasethatthedataflowgraphfortheentireprogramisinthemachineatonetime.WedefinetheactivewindowasthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlybeingworkedoninthedataflowmicroengine.Astheinstructionwindowmovesthroughthedynamicinstructionstream,HPSexecutestheentireinstructionstream.Parallelismwhichexistswithinthewindowisfullyexploitedbythemicroengine.Thisparallelismislimitedinscope;ergo,theterm“restricteddataflow.”TheMergertakesthedataflowgraphcorrespondingtoeachISPinstructionand,usingageneralizedTomasuloalgorithmtoresolveanyexistingdatadependencies,mergesitintotheentiredataflowgraphfortheactivewindow.Eachnodeofthedataflowgraphisshippedtooneofthenodetableswhereitremainsuntilitisreadytofire.Whenalloperandsforadataflownodeareready,thedataflownodefiresbvtransmittinethenodetotheappropriatefunctionalunit.Thefunctionalunit(anALU,memory,orI/Odevice)executesthenodeanddistributestheresult,ifany,tothoselocationswhereitisneededforsubsequentprocessing:thenodetables,theMerger(forresolvingsubsequentdependencies)andtheFetchControlUnit(forbringingnewinstructionsintotheactivewindow).Whenallthedataflownodesforaparticularinstructionhavebeenexecuted,theinstructionissaidtohaveexecuted.Aninstructionisretiredfromtheactivewindowwhenithasexecuted104andalltheinstructionsbeforeithaveretired.Allsideeffectstomemoryaretakencareofwhenaninstructionretiresfromtheactivewindow.Thisisessentialforthecorrecthandlingofpreciseinterruptsll.Theinstructionfetchinganddecodingunitsmaintainthedegreeofparallelisminthenodetablesbybringingnewinstructionsintotheactivewindow,whichresultsinnewdataflownodesbeingmergedintothedataflownodetables.3.2.InstructionFlowFigure2showstheglobaldatapathofHPS.InstructionsenterthedatapathasinputtotheMerger.Thisinputisintheformofadateflowgraph,oneperinstruction.Thedataflowgraphcanbetheresultofdecodinganinstructioninaclassicalsequentialinstructionstream,oritcanbetheoutputofanon-conventionalcompiler.Ineithercase,theMergerseesasetofdataflownodes(anddatadependencies),oneforeachoperationthatmustbeperformedintheexecutionofthatinstruction,operationsare,forexample,reads,writes,addresscomputationsandALUfunctions,Intheexampleoffigure3,thedataflowgraphcorrespondingtotheVAXinstructionADDW#lOOO,A,Bconsistsofthreenodes:amemoryread,memorywrite,andanALUoperation.Figure3alsoshowsthestructureofthethreenodesandthefivevaluebufferentriesrequiredfortheinstruction.TheMerger,usingtheRegisterAliasTabletoresolvedatadependenciesnotexplicitintheindividualinstruction,formsthesetofdataflownodeswhicharenecessarytoexecutetheinstruction,Nodesarethentransmittedtotheappropriatenodetables.Nodetables,asweshallsee,arecontentaddressiblememories,andthusshouldbekeptsmall.ThesizeofeachnodetableisafunctionofthesizeoftheactivewindowandthedecodingrateoftheVonNeumanninstructionstream.InourexperimentswiththeVAXarchitecture,forexample,anactivewindowof16instructions,coupledwithadecodingrateofeightnodespercycle,requiredatmosta35entrynodetable.Foreachnode,aslotisreservedintheglobalmulti-portvaluebufferforstoringtheresultoftheoperationofthatnode.Theindexofeachslotisdesignatedasatagforthecorrespondingnode,andiscarriedalongwiththenodeuntilitcompletesitsexecution.Valuebufferslotsareassignedinacircularqueue,thesizeofthebufferbeinglargeenoughtoguarranteeretirementofaninstructionbeforeitsvaluebufferslotisagainneeded.(InthecaseofoursimulatedimplementationoftheVAXarchitecture,anactivewindowof16instructions,havingapproximatelyfournodesperinstruction,meansthatavaluebufferof136entriesismorethanadequate.)Anoderemainsinitsnodetableuntilallofitsoperandsareavailable,atwhichpointitisreadytofire(i.e.,itisexecutable).Anodeisfiredbytransmittingitsoperator,tag,andsetofoperandstooneofthefunctionalunitaassociatedwiththatnodetable.Whenexecutioncompletes,theresultanditstagaredistributedtoeachpotiofthevaluebuffer.Inthecaseofaresultdestinedforageneralpurposeregister,thecorrespondingtagisalsotransmittedtotheRegisterAliasTabletoupdateinformationstoredthere.Thecorrespondingtagisalsotransmittedtothenodetablesforthepurposeofsettingthereadybitsinthosenodesawaitingthisresult.nodes.Thenumberofresultsthatcanbedistributedinasinglecycleisafunctionofthebusstructureandtheorganizationofthenodetables.Theintentisthatineachcycle,multiplenodeswillbeineachstageoftheprocess.#DO&3IIOOO,A,83.3.DataDependenciesandtheirResolution.ftclm3.Memoryreadandwritenodespresentadditionalcomplications.Althoughthesewillbediscussedingreaterdetailin71,afewobservationshereareinorder.FirstisthefactthatatthetimememoryaccessnodesareissuedbytheMerger(dependingofcourseontheaddressingstructureofthetargetarchitecture),theaddressofthememoryaccessmaybeunknown,andtheaddressesofothermemoryaccesseswhichcouldblockthenodebeingissuedmayalsobeunknown.AMemoryAliasTableandaReadStagingUnitareprovidedtohandletheseproblems,Secondisthefactthatwritescanoccuroutofordercoupledwithourrequirementthatexceptionhandlingmustallowthemachinestatetoberecovered“precisely.”AWriteBufferandanalgorithmforretiringinstructionsareprovidedforhandlingthisproblem.Onefinalobservationabouttheprocessingofnodesmustbemade.Thestagesthatanodegoesthrough(i.e.,merging,waitingforoperands,firing,executing,anddistributingitsresults)isindependentoftheothernodesinthenodetables.Thatis,forexample,thenumberofnodesfirableinagivencycleislimitedbytheabilitytodetectthatmultiplenodesarefirableandthenumberoffunctionaluniteavailableforconcurrentprocessingofFundamentaltothecorrect,fast,out-of-orderexecutionofoperationsinHPSisthehandlingofdatadependenciesand,aswewillsee,theabsenceofblockinginthosecaseswhereblockingisunnecessary.Sinceourlocallyconcurrentimplementationmodelhastoconformtothetargetarchitecture,thelocalconcurrencyexploitedmustnotcauseincorrectexecutionresults.3.3.1.Data,Anti,andOutputDependencies.Amicro-operationBdependsonanothermicro-operationAifBhastobeexecutedafterAinordertoproducethecorrectresult.Therearethreewaysinwhichamicro-operationcandependonanothermicro-operationthroughregisterusage:data,anti,andoutputdependencies.Adata(read-after-write)dependencyoccurswhenAisgoingtowritetotheregisterfromwhichBisgoingtoread.Inthiscase,AsuppliesinformationessentialtotheexecutionofB.Ananti(write-after-read)dependencyoccurswhenAisgoingtoreadfromtheregistertowhichBisgoingtowrite.Anoutput(write-after-write)dependencyoccurswhenAandBaregoingtowritetothesameregister.Inthelasttwocases,theexecutionofAdoesnotsupplyanyinformationnecessaryfortheexecutionofB.TheonlyreasonBdependsonAisthataregisterhasbeenallocatedtotwodifferenttemporaryvariablesduetoashortageofregisters.Infact,ifwehadanunlimitednumberofregisters,differenttemporaryvariableswouldneverhavetobeallocatedtothesameregisterandthesecondandthethirddependencieswouldneveroccur.So,aproperrenamingmechanismandextrabufferregisterswouldremoveantianddatadependencies.Then,theonlytypeofdependencythatcoulddelaymicreoperationexecutionwouldbeadatadependency.Inotherwords,amicro-operationcouldbeexecutedassoonasitsinputoperandsareproperlygenerated.Thisisexactlythedescriptionofadatatlowexecutionmodel.3.3.2.OurModifiedTomasuloAlgorithm.OuralgorithmforenforcingdatadependenciesandremovingantiandoutputdependenciesissimilartotheTomasuloalgorithmwhichwasusedintheFloatingPointUnitoftheIBM360191161.Duringexecution,thealgorithmmanagestwomajordatastructures:aregisteraliastableandasetofnodetables.Eachentryintheregisteraliastablekeepstrackofthedynamicinformationforaregisternecessaryeithertosupplyaninputoperandvalueortoestablishdependencyarcs.Therearetwofieldsineachregisteraliastableentry.Thefirstisareadybit.Thisbit,ifcleared,indicatesthatthereisanactivemicro-operationwhichisgoingtosupplytheregistervalue.Thesecondfieldisthetagfieldwhichprovidesanindexintoaresultbuffer.Thisindicateswheretheregistervaluecanbefoundifthereadybitisset.106Eachentryinanodetablecorrespondstoamicro-operationandhasanoperationfield,aresulttugfield,andtwooperandrecords.Theoperationfieldspecifiestheactionthatwillbeperformedontheinputoperands,Theresulttugfieldprovidesthelocationintheresultbufferthattheresultvaluewillbeshippedtoaftertheexecutionofthemicro-operation,Eachoperandrecordconsistsoftwofields.Thefirstisareadybit.Thisbitissetwhentheinputoperandhasbeenproperlyproduced.Thesecondfieldisthetagfieldwhichcontainsanindexintotheoperandbuffer.Thisindicateswheretheoperandcanbefoundifthereadybitisset.AdatapathdesignedforourmodifiedTomasuloalgorithmispresentedinfigure2.Therearetwophasesineachmachinecycle:merging/schedulinganddistribution,Initiallyallregisteraliastableentriesarereadyandtheinitialregistervaluesareinresultbufferentrieswhoseindexisinthetagfieldsofthecorrespondingregisteraliastableentries.Merging/SchedulingAnewmicro-operationisassignedanentryinanodetableandisgivenauniqueresulttag,First,thecontentsofbothfieldsintheregisteraliastableforeachinputoperandarecopiedtothecorrespondingfieldsintheoperandrecordsofthenewnodetableentry.Second,thereadybitoftheregistertobewrittenbythemicro-operationisresetandtheresulttagforthemicro-operationiswrittenintothetagfieldofthealiastableentry.Ifbothoperandsofanodearemarkedready,thisnodecanfire.Thetagsintheoperandrecordsareusedtoindexintotheresultbufferandobtaintheoperandvalues.Theoperationandtheoperandvaluesaresenttothefunctionunitforexecution.Theresulttag,whichwillbeusedtodistributetheresult,isalsosenttothefunctionunit.DistributionWhenafunctionunitfinishesexecutingamicro-operation,theresulttagofthatmicro-operationisusedtoselectaresultbufferentryandtheresultvalueisstoredintotheentry.Theresulttagisalsodistributedtotheregisteraliastableandthenodetable.Boththeregisteraliastableandthenodetablearecontentaddressablememories.Entriesinthesetablesareaddressedbythevalueoftheresulttag.Alloftheregisteraliastableentriesandalloftheoperandrecordsinthenodetableentriessettheirreadybitifthedistributedresulttugmatchestheirtagfieldcontents.Afterexecution,alltheregisteraliastableentriesarereadyandthecorrespondingregistervaluesareintheresultbufferentrieswhoseindicesareinthetagfields.Thealgorithmdescribedaboveisawinonatleasttwocounts.First,itremovesantiandoutputdependencieswithoutproducingincorrectresults.Infact,itcanheshownthatreservationschemeswithoutrenamingcannotremoveantiandoutputdependencieswithoutproducingincorrectresults.Second,unlikeScoreboarding(forexample),theissuingprocessneverhastostallduetodependencies.Itcanalsobeshownthatanyreservationschemewithoutrenamingwillhavetostallforsomedependencies.4.ConcludingRemarks.ThepurposeofthispaperhasbeentointroducetheHPSmicroarchitecture.CurrentresearchatBerkeleyistakingHPSalongfourtracks.First,weareattemptingtodesign,athighperformance,threeverydissimilararchitectures:themicroVAX,aCmachine,andaPrologprocessor.Equallyimportant,weareinvestigatingthelimitsofthismicroarchitecture,bothfromthestandpointofaminimalimplementationandfromthestandpointofacadilIacversion.Asistobeexpected,thereareissuestoberesolvedbeforeaneffectiveHPSimplementationcanbe,achieved.Forexample,ifHPSistoimplementasequentialcontrolbasedISParchitecture,thentherearedecodingissues,includingthequestionofanodecache,whichneedtobedecided.Second,HPSrequiresadatapaththat(1)hashighbandwidthand(2)allowstheprocessingofveryirregularparalleldata.Third,HPSneedsaschedulerwhichcandetermine,inreal-time,whichnodesarefirableandwhicharenot.Fourth,the-out-of-orderexecutionofnodesrequiresadditionalattentiontothedesignofthememorysystem,theinstructionretirementandrepairmechanisms,andtheI/Osystem.Theseissuesarethesubjectofacompanionpaper171intheseProceedings.Acknowledgement.TheauthorswishtoacknowledgefirsttheDigitalEquipmentCorporationforsupportingverygenerouslyourresearchinanumberofpositiveways:LindaWright,formerlyHeadofDigit
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024年聚氨酯泡沫稳定剂投资申请报告
- 2024届江苏省苏锡常镇高三下学期教学情况调研(一)数学试卷(解析版)
- 2022-2023学年江西省吉安市吉州区部分学校高二下学期期末联考数学试题(解析版)
- 安全风险管理方案
- 2023年四川省行测笔试试题及答案
- 集中开展道路交通安全大检查工作方案
- 劳资沟通与民主管理-职工代表大会的组织召开(劳动法课件)
- 分散系及其分类详细教案
- Unit+1+Festivals+and+Celebrations+Reading+for+writing+传统节日应用文写作+课件-2023-2024学年人教版高中英语必修第三册
- 地产开业庆典兴城人居-兴邻里紫荟开街营销推广策划方案
- 《20211国标给排水专业图集资料》05SS907-6 砖砌排水检查井及跌水井
- 工学一体化教学评价表
- 苏教四年级下册多位数的认识复习PPT课件
- 乱了头绪的经理人[42页]
- 日立中央空调电气故障(课堂PPT)
- 无公害农产品种植业生产记录表农药肥料投入品田
- 鲁滨逊漂流记书迷记录卡
- [其它考试]输变电可靠性题库
- 在全市综合治理执行难工作会议上的讲话
- 养猪场污水UASB+SBR工艺处理工程设计
- 电力变压器冷却系统毕业设计
评论
0/150
提交评论