




已阅读5页,还剩1页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
HPS,ANEWMICROARCHITECTURE:RATIONALEANDINTRODUCTIONYaleN.Patt,Wen-meiHwu,andMichaelShebanowComputerScienceDivisionUniversityofCalifornia,BerkeleyBerkeley,CA94720ABSTRACTHPS(HighPerformanceSubstrate)isanewmicroarchitecturetargetedforimplementingveryhighperformancecomputingengines.Ourmodelofexecutionisarestrictiononfinegranularitydataflow.Thispaperintroducesthemodel,providestherationaleforitsselection,anddescribesthedatapathandflowofinstructionsthroughthemicroengine.1.IntroductionAcomputersystemisamultilevelstructure,algorithmsatthetop,gatesandwiresatthebottom.Toachievehighperformance,onemustoptimizeatalllevelsofthisstructure.Atmostlevels,theconventionalwisdomsuggestsexploitingconcurrency.Severalproposalshavebeenputforwardastohowtodothis.Wealsoargueforexploitingconcurrency,focusinginparticularonthemicroarchitecturelevel.1.1.RestrictedDataFlow.WearecallingourengineHPS,whichstandsforHighPerformanceSubstrate,toreflectthenotionthatwhatweareproposingshouldbeusefulforimplementingverydissimilarISParchitectures.Ourmodelofthemicroengine(i.e.,arestrictiononclassicalfinegranularitydataflow)isnotunlikethatofDennis3,ArvindZl,andothers,butwithsomeveryimportantdifferences.Thesedifferenceswillbediscussedndetailinsection3.Forthemoment,itisimportanttounderstandthatunlikeclassicaldatafiowmachines,onlyasmallsubsetoftheentireprogramisintheHPSmicroengineatanyonetime.Wedefinethe“activewindow”asthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlypartofthedataflowgraphwhichisresidentinthemicroengine.AstheactivewindowmovesthroughPermissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarcnotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.thedynamicinstructionstream,HPSexecutestheentireprogram.1.2.PotentialLimitationsofOtherApproaches.Webelievethatanessentialingredientofhighperformancecomputingistheeffectiveutilizationofalotofconcurrency.Thusweseeapotentiallimitationinmicroenginesthatarelimitedtooneoperationpercycle.Similarly,weseeapotentiallimitationinamicroenginethatunderutilizesitsbandwidthtoeitherinstructionmemoryordatamemory.Finally,althoughweappreciatetheadvantagesofstaticscheduling,weseeapotentiallimitationinamicroenginethatpurportstoexecuteasubstantialnumberofoperationseachcycle,butmustrelyonanon-run-timeschedulerfordeterminingwhattodonext.1.3.Outlineofthispaper.Thispaperisorganizedinfoursections.Section2delineatesthefundamentalreasonswhichledustothisnewmicroarchitecture.Section3describesthebasicoperationofHPS.Section4offerssomeconcludingremarks,anddescribeswhereourresearchinHPSisheading.2.Rationale.2.1.TheThreeTierModel.Webelievethatirregularparallelisminaprogramexistsbothlocallyandglobally.Ourmechanismexploitsthelocalparallelism,butdisregardsglobalparallelism.Ourbeliefisthattheexecutionofanalgorithmshouldbehandledinthreetiers.Atthetop,whereglobalparallelismcanbebestidentified,theexecutionmodel-.shouldutilizelargegranularitydataflow.muchliketheproposaloftheCEDARproject41.Inthemiddle,wherefortyyearsofcollectedexperienceincomputerprocessingcanbeexploitedprobablywithoutharm,classicalsequentialcontrolflowshouldbethemodel.Atthebottom,wherewewanttoexploitlocalparallelism,finegranularitydataflowisrecommended.Ourthreetiermodelreflectsourconceptionthatthetoplevelshouldbealgorithmoriented,themiddlelevelsequentialcontrolflowISParchitectureoriented,andthebottomlevelmicroengineoriented.Q1985ACMO-89791-1725/85/0012/0103$00.752.2.LocalParallelism.Wefeelobligedtore-emphasizetheimportanceoflocalparallelismtoourchoicea4executionmodel.indeed,wechosethisrestrictedformofdataflowspecificallybecauseourstudieshaveshownthattheparallelismavailablefrom.themiddlecontrolflowtier(i.e.,thesequentialcontrolflowarchitecture)ishighlylocalized.Wearguethat,byrestrictingtheactiveinstructionwindow,wecanexploi,talmostalloftheinherentparallelismintheprogramwhileincurringverylittleofthesynchronizationcostswhichwouldbeneededtokeeptheentireprogramaroundasatotaldataflowgraph.2.3,Stalls,Bandwidth,andConcurrency.Webelievethatahighperformancecomputingengineshouldexhibitanumberofcharacteristics.First,allitacomponentsmustbekeptbusy.Theremustbefewstalls,bothintheflowofinformation(i.e.,thepathtomemory,loadingofregisters,etc.)andintheprocessingofinformation(i.e.,thefunctionalunits).Second,theremustbeahighdegreeofconcurrermyavailable,suchasmultiplepathstomemory,multipleprocessingelements,andsomeformofpipelining,forexample.Inourview,therestricteddataflowmodel,withitsout-of-orderexecutioncapability,bestenablestheabovetworequirements,asfollows:Thecenterofourmodelisthesetofnodetables,whereoperationsawaittheiroperands.Instructionmemoryfeedsthemicroengineataconstantratewithfewstalls.DatamemoryandI/Osupplyandextractdataatconstantrateswithfewstalls.Functionalunitsarekeptbusybynodesthatcanfire.Somewhereinthissystem,therehastobe“slack.”Theslackisinthenodeswaitinginthenodetables.Sincenodescanexecuteout-of-order,thereisnoblockingduetounavailabledata.Decodedinstructionsaddnodestothenodetablesandexecutednodesremovethem.Thenodetablestendtogrowinthepresenceofdatadependencies,andshrinkasthesedependenciesbecomefewer.Meanwhile,ourpreliminarymeasurementssupport,themultiplecomponentsofthemicroenginearekeptbusy.3.TheHPSModelofExecution.3.1.Overview.AnabstractviewofHPSisshowninfigure1.Instructionsarefetchedanddecodedfromadynamicinstructionstream,shownatthetopofthefigure.ThefigureimpliesthattheinstructionstreamistakenfromasequentialcontrolflowISParchitecture.WeneedtoemphasizethatthisisnotanecessarypartoftheHPSspecification.Indeed,weareinvestigatinghavingHPSdirectlyprocessmultinodewords(i.e.,thenodesofadirectedgraph)whichwouldbeproducedasthetargetcodeofa(forexample)Ccompiler.Whatisnecessaryisthat,foreachinstruction,theoutputofthedecoderwhichispresentedtotheMergerforhandlingbyHPSisadataflowgraph.AveryimportantpartofthespecificationofHPSisthenotionoftheactiveinstructionwindow.Unlikeclassicaldataflowmachines,itisnotthecasethatthedataflowgraphfortheentireprogramisinthemachineatonetime.WedefinetheactivewindowasthesetofISPinstructionswhosecorrespondingdataflownodesarecurrentlybeingworkedoninthedataflowmicroengine.Astheinstructionwindowmovesthroughthedynamicinstructionstream,HPSexecutestheentireinstructionstream.Parallelismwhichexistswithinthewindowisfullyexploitedbythemicroengine.Thisparallelismislimitedinscope;ergo,theterm“restricteddataflow.”TheMergertakesthedataflowgraphcorrespondingtoeachISPinstructionand,usingageneralizedTomasuloalgorithmtoresolveanyexistingdatadependencies,mergesitintotheentiredataflowgraphfortheactivewindow.Eachnodeofthedataflowgraphisshippedtooneofthenodetableswhereitremainsuntilitisreadytofire.Whenalloperandsforadataflownodeareready,thedataflownodefiresbvtransmittinethenodetotheappropriatefunctionalunit.Thefunctionalunit(anALU,memory,orI/Odevice)executesthenodeanddistributestheresult,ifany,tothoselocationswhereitisneededforsubsequentprocessing:thenodetables,theMerger(forresolvingsubsequentdependencies)andtheFetchControlUnit(forbringingnewinstructionsintotheactivewindow).Whenallthedataflownodesforaparticularinstructionhavebeenexecuted,theinstructionissaidtohaveexecuted.Aninstructionisretiredfromtheactivewindowwhenithasexecuted104andalltheinstructionsbeforeithaveretired.Allsideeffectstomemoryaretakencareofwhenaninstructionretiresfromtheactivewindow.Thisisessentialforthecorrecthandlingofpreciseinterruptsll.Theinstructionfetchinganddecodingunitsmaintainthedegreeofparallelisminthenodetablesbybringingnewinstructionsintotheactivewindow,whichresultsinnewdataflownodesbeingmergedintothedataflownodetables.3.2.InstructionFlowFigure2showstheglobaldatapathofHPS.InstructionsenterthedatapathasinputtotheMerger.Thisinputisintheformofadateflowgraph,oneperinstruction.Thedataflowgraphcanbetheresultofdecodinganinstructioninaclassicalsequentialinstructionstream,oritcanbetheoutputofanon-conventionalcompiler.Ineithercase,theMergerseesasetofdataflownodes(anddatadependencies),oneforeachoperationthatmustbeperformedintheexecutionofthatinstruction,operationsare,forexample,reads,writes,addresscomputationsandALUfunctions,Intheexampleoffigure3,thedataflowgraphcorrespondingtotheVAXinstructionADDW#lOOO,A,Bconsistsofthreenodes:amemoryread,memorywrite,andanALUoperation.Figure3alsoshowsthestructureofthethreenodesandthefivevaluebufferentriesrequiredfortheinstruction.TheMerger,usingtheRegisterAliasTabletoresolvedatadependenciesnotexplicitintheindividualinstruction,formsthesetofdataflownodeswhicharenecessarytoexecutetheinstruction,Nodesarethentransmittedtotheappropriatenodetables.Nodetables,asweshallsee,arecontentaddressiblememories,andthusshouldbekeptsmall.ThesizeofeachnodetableisafunctionofthesizeoftheactivewindowandthedecodingrateoftheVonNeumanninstructionstream.InourexperimentswiththeVAXarchitecture,forexample,anactivewindowof16instructions,coupledwithadecodingrateofeightnodespercycle,requiredatmosta35entrynodetable.Foreachnode,aslotisreservedintheglobalmulti-portvaluebufferforstoringtheresultoftheoperationofthatnode.Theindexofeachslotisdesignatedasatagforthecorrespondingnode,andiscarriedalongwiththenodeuntilitcompletesitsexecution.Valuebufferslotsareassignedinacircularqueue,thesizeofthebufferbeinglargeenoughtoguarranteeretirementofaninstructionbeforeitsvaluebufferslotisagainneeded.(InthecaseofoursimulatedimplementationoftheVAXarchitecture,anactivewindowof16instructions,havingapproximatelyfournodesperinstruction,meansthatavaluebufferof136entriesismorethanadequate.)Anoderemainsinitsnodetableuntilallofitsoperandsareavailable,atwhichpointitisreadytofire(i.e.,itisexecutable).Anodeisfiredbytransmittingitsoperator,tag,andsetofoperandstooneofthefunctionalunitaassociatedwiththatnodetable.Whenexecutioncompletes,theresultanditstagaredistributedtoeachpotiofthevaluebuffer.Inthecaseofaresultdestinedforageneralpurposeregister,thecorrespondingtagisalsotransmittedtotheRegisterAliasTabletoupdateinformationstoredthere.Thecorrespondingtagisalsotransmittedtothenodetablesforthepurposeofsettingthereadybitsinthosenodesawaitingthisresult.nodes.Thenumberofresultsthatcanbedistributedinasinglecycleisafunctionofthebusstructureandtheorganizationofthenodetables.Theintentisthatineachcycle,multiplenodeswillbeineachstageoftheprocess.#DO&3IIOOO,A,83.3.DataDependenciesandtheirResolution.ftclm3.Memoryreadandwritenodespresentadditionalcomplications.Althoughthesewillbediscussedingreaterdetailin71,afewobservationshereareinorder.FirstisthefactthatatthetimememoryaccessnodesareissuedbytheMerger(dependingofcourseontheaddressingstructureofthetargetarchitecture),theaddressofthememoryaccessmaybeunknown,andtheaddressesofothermemoryaccesseswhichcouldblockthenodebeingissuedmayalsobeunknown.AMemoryAliasTableandaReadStagingUnitareprovidedtohandletheseproblems,Secondisthefactthatwritescanoccuroutofordercoupledwithourrequirementthatexceptionhandlingmustallowthemachinestatetoberecovered“precisely.”AWriteBufferandanalgorithmforretiringinstructionsareprovidedforhandlingthisproblem.Onefinalobservationabouttheprocessingofnodesmustbemade.Thestagesthatanodegoesthrough(i.e.,merging,waitingforoperands,firing,executing,anddistributingitsresults)isindependentoftheothernodesinthenodetables.Thatis,forexample,thenumberofnodesfirableinagivencycleislimitedbytheabilitytodetectthatmultiplenodesarefirableandthenumberoffunctionaluniteavailableforconcurrentprocessingofFundamentaltothecorrect,fast,out-of-orderexecutionofoperationsinHPSisthehandlingofdatadependenciesand,aswewillsee,theabsenceofblockinginthosecaseswhereblockingisunnecessary.Sinceourlocallyconcurrentimplementationmodelhastoconformtothetargetarchitecture,thelocalconcurrencyexploitedmustnotcauseincorrectexecutionresults.3.3.1.Data,Anti,andOutputDependencies.Amicro-operationBdependsonanothermicro-operationAifBhastobeexecutedafterAinordertoproducethecorrectresult.Therearethreewaysinwhichamicro-operationcandependonanothermicro-operationthroughregisterusage:data,anti,andoutputdependencies.Adata(read-after-write)dependencyoccurswhenAisgoingtowritetotheregisterfromwhichBisgoingtoread.Inthiscase,AsuppliesinformationessentialtotheexecutionofB.Ananti(write-after-read)dependencyoccurswhenAisgoingtoreadfromtheregistertowhichBisgoingtowrite.Anoutput(write-after-write)dependencyoccurswhenAandBaregoingtowritetothesameregister.Inthelasttwocases,theexecutionofAdoesnotsupplyanyinformationnecessaryfortheexecutionofB.TheonlyreasonBdependsonAisthataregisterhasbeenallocatedtotwodifferenttemporaryvariablesduetoashortageofregisters.Infact,ifwehadanunlimitednumberofregisters,differenttemporaryvariableswouldneverhavetobeallocatedtothesameregisterandthesecondandthethirddependencieswouldneveroccur.So,aproperrenamingmechanismandextrabufferregisterswouldremoveantianddatadependencies.Then,theonlytypeofdependencythatcoulddelaymicreoperationexecutionwouldbeadatadependency.Inotherwords,amicro-operationcouldbeexecutedassoonasitsinputoperandsareproperlygenerated.Thisisexactlythedescriptionofadatatlowexecutionmodel.3.3.2.OurModifiedTomasuloAlgorithm.OuralgorithmforenforcingdatadependenciesandremovingantiandoutputdependenciesissimilartotheTomasuloalgorithmwhichwasusedintheFloatingPointUnitoftheIBM360191161.Duringexecution,thealgorithmmanagestwomajordatastructures:aregisteraliastableandasetofnodetables.Eachentryintheregisteraliastablekeepstrackofthedynamicinformationforaregisternecessaryeithertosupplyaninputoperandvalueortoestablishdependencyarcs.Therearetwofieldsineachregisteraliastableentry.Thefirstisareadybit.Thisbit,ifcleared,indicatesthatthereisanactivemicro-operationwhichisgoingtosupplytheregistervalue.Thesecondfieldisthetagfieldwhichprovidesanindexintoaresultbuffer.Thisindicateswheretheregistervaluecanbefoundifthereadybitisset.106Eachentryinanodetablecorrespondstoamicro-operationandhasanoperationfield,aresulttugfield,andtwooperandrecords.Theoperationfieldspecifiestheactionthatwillbeperformedontheinputoperands,Theresulttugfieldprovidesthelocationintheresultbufferthattheresultvaluewillbeshippedtoaftertheexecutionofthemicro-operation,Eachoperandrecordconsistsoftwofields.Thefirstisareadybit.Thisbitissetwhentheinputoperandhasbeenproperlyproduced.Thesecondfieldisthetagfieldwhichcontainsanindexintotheoperandbuffer.Thisindicateswheretheoperandcanbefoundifthereadybitisset.AdatapathdesignedforourmodifiedTomasuloalgorithmispresentedinfigure2.Therearetwophasesineachmachinecycle:merging/schedulinganddistribution,Initiallyallregisteraliastableentriesarereadyandtheinitialregistervaluesareinresultbufferentrieswhoseindexisinthetagfieldsofthecorrespondingregisteraliastableentries.Merging/SchedulingAnewmicro-operationisassignedanentryinanodetableandisgivenauniqueresulttag,First,thecontentsofbothfieldsintheregisteraliastableforeachinputoperandarecopiedtothecorrespondingfieldsintheoperandrecordsofthenewnodetableentry.Second,thereadybitoftheregistertobewrittenbythemicro-operationisresetandtheresulttagforthemicro-operationiswrittenintothetagfieldofthealiastableentry.Ifbothoperandsofanodearemarkedready,thisnodecanfire.Thetagsintheoperandrecordsareusedtoindexintotheresultbufferandobtaintheoperandvalues.Theoperationandtheoperandvaluesaresenttothefunctionunitforexecution.Theresulttag,whichwillbeusedtodistributetheresult,isalsosenttothefunctionunit.DistributionWhenafunctionunitfinishesexecutingamicro-operation,theresulttagofthatmicro-operationisusedtoselectaresultbufferentryandtheresultvalueisstoredintotheentry.Theresulttagisalsodistributedtotheregisteraliastableandthenodetable.Boththeregisteraliastableandthenodetablearecontentaddressablememories.Entriesinthesetablesareaddressedbythevalueoftheresulttag.Alloftheregisteraliastableentriesandalloftheoperandrecordsinthenodetableentriessettheirreadybitifthedistributedresulttugmatchestheirtagfieldcontents.Afterexecution,alltheregisteraliastableentriesarereadyandthecorrespondingregistervaluesareintheresultbufferentrieswhoseindicesareinthetagfields.Thealgorithmdescribedaboveisawinonatleasttwocounts.First,itremovesantiandoutputdependencieswithoutproducingincorrectresults.Infact,itcanheshownthatreservationschemeswithoutrenamingcannotremoveantiandoutputdependencieswithoutproducingincorrectresults.Second,unlikeScoreboarding(forexample),theissuingprocessneverhastostallduetodependencies.Itcanalsobeshownthatanyreservationschemewithoutrenamingwillhavetostallforsomedependencies.4.ConcludingRemarks.ThepurposeofthispaperhasbeentointroducetheHPSmicroarchitecture.CurrentresearchatBerkeleyistakingHPSalongfourtracks.First,weareattemptingtodesign,athighperformance,threeverydissimilararchitectures:themicroVAX,aCmachine,andaPrologprocessor.Equallyimportant,weareinvestigatingthelimitsofthismicroarchitecture,bothfromthestandpointofaminimalimplementationandfromthestandpointofacadilIacversion.Asistobeexpected,thereareissuestoberesolvedbeforeaneffectiveHPSimplementationcanbe,achieved.Forexample,ifHPSistoimplementasequentialcontrolbasedISParchitecture,thentherearedecodingissues,includingthequestionofanodecache,whichneedtobedecided.Second,HPSrequiresadatapaththat(1)hashighbandwidthand(2)allowstheprocessingofveryirregularparalleldata.Third,HPSneedsaschedulerwhichcandetermine,inreal-time,whichnodesarefirableandwhicharenot.Fourth,the-out-of-orderexecutionofnodesrequiresadditionalattentiontothedesignofthememorysystem,theinstructionretirementandrepairmechanisms,andtheI/Osystem.Theseissuesarethesubjectofacompanionpaper171intheseProceedings.Acknowledgement.TheauthorswishtoacknowledgefirsttheDigitalEquipmentCorporationforsupportingverygenerouslyourresearchinanumberofpositiveways:LindaWright,formerlyHeadofDigit
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025版柴油产品进出口合同范本共
- 二零二五年度汽车租赁公司车辆保险理赔合同范本
- 2025版滑雪教练员培训与赛事管理合同
- 2025年度财务担保合同示范文本
- 2025版家庭养老照护床位租赁与智慧家居系统集成合同
- 2025版电梯设备销售与智能化升级合同
- 2025年度二手车专业评估与买卖合同
- 二零二五版高性能配电箱销售合同书
- 2025年未签订合同的赔偿标准是多少
- 土地流转社区农业开发协议
- 招标业务合作协议书范本
- 集成电路工程师笔试试题及答案
- 贵州贵州省建设投资集团有限公司招聘笔试真题2024
- 广西钦州市2024-2025学年高二下学期期末检测英语试题【含答案解析】
- 【课件】三角形的中线、角平分线、高课件2025-2026学年人教版数学八年级上册
- 2025年温州市交通发展集团招聘考试试题(含答案)
- 2025年新修订《治安管理处罚法》
- 采购面料知识培训课件
- 2025年新疆中考语文真题(原卷版)
- 海上试验活动方案
- 电厂安全培训课件
评论
0/150
提交评论