The Stanford Dash Multiprocessor

Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam
Stanford University
Computer, March 1992

Directory-based cache coherence gives Dash the ease of use of shared-memory architectures while maintaining the scalability of message-passing machines.

The Computer Systems Laboratory at Stanford University is developing a shared-memory multiprocessor called Dash (an abbreviation for Directory Architecture for Shared Memory). The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches.

The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, thereby significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol provides cache coherence without compromising scalability.

The Dash prototype system is the first operational machine to include a scalable cache-coherence mechanism. The prototype incorporates up to 64 high-performance RISC microprocessors to yield performance up to 1.6 billion instructions per second and 600 million scalar floating-point operations per second. The design of the prototype has provided deeper insight into the architectural and implementation challenges that arise in a large-scale machine with a single address space. The prototype will also serve as a platform for studying real applications and software on a large parallel system.

This article begins by describing the overall goals for Dash, the major features of the architecture, and the methods for achieving scalability. Next, we describe the directory-based coherence protocol in detail. We then provide an overview of the prototype machine and the corresponding software support, followed by some preliminary performance numbers. The article concludes with a discussion of related work and the current status of the Dash hardware and software.
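As a rough check on the peak figures quoted above, the per-processor rates that the article gives later for one R3000/R3010 pair (25 VAX MIPS and 10 Mflops) can be multiplied out for 64 processors. The sketch below assumes ideal linear scaling, which is an idealization, and the constant and function names are illustrative rather than taken from the article.

```python
# Per-processor peak rates quoted later in the article for one R3000/R3010 pair.
PEAK_MIPS_PER_CPU = 25    # VAX MIPS
PEAK_MFLOPS_PER_CPU = 10  # Mflops


def aggregate_peak(num_cpus: int) -> tuple[int, int]:
    """Return (MIPS, Mflops) for num_cpus processors, assuming ideal linear scaling."""
    return num_cpus * PEAK_MIPS_PER_CPU, num_cpus * PEAK_MFLOPS_PER_CPU


mips, mflops = aggregate_peak(64)
print(f"64 processors: {mips} MIPS, {mflops} Mflops")
```

The instruction rate matches the quoted 1.6 billion instructions per second. The floating-point product comes to 640 Mflops, slightly above the 600 million the article quotes; the article does not explain the difference.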
Sidebar: The Dash team

Many graduate students and faculty members contributed to the Dash project. The PhD students are Daniel Lenoski and James Laudon (Dash architecture and hardware design); Kourosh Gharachorloo (Dash architecture and consistency models); Wolf-Dietrich Weber (Dash simulator and scalable directories); Truman Joe (Dash hardware and protocol verification tools); Luis Stevens (operating system); Helen Davis and Stephen Goldschmidt (trace generation tools, synchronization patterns, locality studies); Todd Mowry (evaluation of prefetch operations); Aaron Goldberg and Margaret Martonosi (performance debugging tools); Tom Chanak (mesh routing chip design); Richard Simoni (synthetic load generator and directory studies); Josep Torrellas (sharing patterns in applications); and Edward Rothberg, Jaswinder Pal Singh, and Larry Soule (applications and algorithm development). Staff research engineer David Nakahira contributed to the hardware design. The faculty associated with the project are Anoop Gupta, John Hennessy, Mark Horowitz, and Monica Lam.

Dash project overview

The overall goal of the Dash project is to investigate highly parallel architectures. For these architectures to achieve widespread use, they must run a variety of applications efficiently without imposing excessive programming difficulty. To achieve both high performance and wide applicability, we believe a parallel architecture must provide scalability to support hundreds to thousands of processors, high-performance individual processors, and a single shared address space.

The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing. This clearly points to using microprocessors as the compute engines in a multiprocessor. The challenge lies in building a machine that can scale up its performance while maintaining the initial price/performance advantage of the individual processors.

Scalability allows a parallel architecture to leverage commodity microprocessors and small-scale multiprocessors to build larger scale machines. These larger machines offer substantially higher performance, which provides the impetus for programmers to port their sequential applications to parallel architectures instead of waiting for the next higher performance uniprocessor.

High-performance processors are important to achieve both high total system performance and general applicability. Using the fastest microprocessors reduces the impact of limited or uneven parallelism inherent in some applications. It also allows a wider set of applications to exhibit acceptable performance with less effort from the programmer.

A single address space enhances the programmability of a parallel machine by reducing the problems of data partitioning and dynamic load distribution, two of the toughest problems in programming parallel machines. The shared address space also improves support for automatically parallelizing compilers, standard operating systems, multiprogramming, and incremental tuning of parallel applications: features that make a single-address-space machine much easier to use than a message-passing machine.

Caching of memory, including shared writable data, allows multiprocessors with a single address space to achieve high performance through reduced memory latency. Unfortunately, caching shared data introduces the problem of cache coherence (see the sidebar and accompanying figure).

While hardware support for cache coherence has its costs, it also offers many benefits. Without hardware support, the responsibility for coherence falls to the user or the compiler. Exposing the issue of coherence to the user would lead to a complex programming model, where users might well avoid caching to ease the programming burden. Handling the coherence problem in the compiler is attractive, but currently cannot be done in a way that is competitive with hardware. With hardware-supported cache coherence, the compiler can aggressively optimize programs to reduce latency without having to rely purely on a conservative static dependence analysis.

The major problem with existing cache-coherent shared-address machines is that they have not demonstrated the ability to scale effectively beyond a few high-performance processors. To date, only message-passing machines have shown this ability. We believe that using a directory-based coherence mechanism will permit single-address-space machines to scale as well as message-passing machines, while providing a more flexible and general programming model.

Dash system organization

Most existing multiprocessors with cache coherence rely on snooping to maintain coherence. Unfortunately, snooping schemes distribute the information about which processors are caching which data items among the caches. Thus, straightforward snooping schemes require that all caches see every memory request from every processor. This inherently limits the scalability of these machines because the common bus and the individual processor caches eventually saturate. With today's high-performance RISC processors this saturation can occur with just a few processors.

Directory structures avoid the scalability problems of snoopy schemes by removing the need to broadcast every memory request to all processor caches. The directory maintains pointers to the processor caches holding a copy of each memory block. Only the caches with copies can be affected by an access to the memory block, and only those caches need be notified of the access. Thus, the processor caches and interconnect will not saturate due to coherence requests. Furthermore, directory-based coherence is not dependent on any specific interconnection network like the bus used by most snooping schemes. The same scalable, low-latency networks (such as Omega networks or k-ary n-cubes) used by non-cache-coherent and message-passing machines can be employed.

Sidebar: Cache coherence

Cache coherence problems can arise in shared-memory multiprocessors when more than one processor cache holds a copy of a data item (a). Upon a write, these copies must be updated or invalidated (b). Most systems use invalidation since this allows the writing processor to gain exclusive access to the cache line and complete further writes into the cache line without generating external traffic (c). This further complicates coherence since this dirty cache must respond instead of memory on subsequent accesses by other processors (d).

Small-scale multiprocessors frequently use a snoopy cache-coherence protocol, which relies on all caches monitoring the common bus that connects the processors to memory. This monitoring allows caches to independently determine when to invalidate cache lines (b), and when to intervene because they contain the most up-to-date copy of a given location (d). Snoopy schemes do not scale to a large number of processors because the common bus or individual processor caches eventually saturate, since they must process every memory request from every processor.

The directory relieves the processor caches from snooping on memory requests by keeping track of which caches hold each memory block. A simple directory structure first proposed by Censier and Feautrier has one directory entry per block of memory (e). Each entry contains one presence bit per processor cache. In addition, a state bit indicates whether the block is uncached, shared in multiple caches, or held exclusively by one cache (that is, whether the block is dirty). Using the state and presence bits, the memory can tell which caches need to be invalidated when a location is written (b). Likewise, the directory indicates whether memory's copy of the block is up to date or which cache holds the most recent copy (d). If the memory and directory are partitioned into independent units and connected to the processors by a scalable interconnect, the memory system can provide scalable memory bandwidth.

[Sidebar figure, panels (a) through (e): cached copies of a data item, a store and a load, and a directory entry holding data, a state bit, and presence bits.]

References
1. J. Archibald and J.-L. Baer, "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Trans. Computer Systems, Vol. 4, No. 4, Nov. 1986, pp. 273-298.
2. L. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems," IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1,112-1,118.

(End of sidebar.)

The concept of directory-based cache coherence is not new. It was first proposed in the late 1970s. However, the original directory structures were not scalable because they used a centralized directory that quickly became a bottleneck. The Dash architecture overcomes this limitation by partitioning and distributing the directory and main memory, and by using a new coherence protocol that can suitably exploit distributed directories. In addition, Dash provides several other mechanisms to reduce and hide the latency of memory operations.

Figure 1 shows Dash's high-level organization. The architecture consists of a number of processing nodes connected through directory controllers to a low-latency interconnection network. Each processing node, or cluster, consists of a small number of high-performance processors and a portion of the shared memory interconnected by a bus. Multiprocessing within the cluster can be viewed either as increasing the power of each processing node or as reducing the cost of the directory and network interface by amortizing it over a larger number of processors.

Figure 1. The Dash architecture consists of a set of clusters connected by a general interconnection network. Directory memory contains pointers to the clusters currently caching each memory line.

Distributing memory with the processors is essential because it allows the system to exploit locality. All private data and code references, along with some of the shared references, can be made local to the cluster. These references avoid the longer latency of remote references and reduce the bandwidth demands on the global interconnect. Except for the directory memory, the resulting system architecture is similar to many scalable message-passing machines. While not optimized to do so, Dash could emulate such machines with reasonable efficiency.

Scalability of the Dash approach

We have outlined why we believe a single-address-space machine with cache coherence holds the most promise for delivering scalable performance to a wide range of applications. Here, we address the more detailed issues in scaling such a directory-based system. The three primary issues are ensuring that the system provides scalable memory bandwidth, that the costs scale reasonably, and that mechanisms are provided to deal with large memory latencies.

Scalability in a multiprocessor requires the total memory bandwidth to scale linearly with the number of processors. Dash provides scalable bandwidth to data objects residing in local memory by distributing the physical memory among the clusters. For data accesses that must be serviced remotely, Dash uses a scalable interconnection network. Support of coherent caches could potentially compromise the scalability of the network by requiring frequent broadcast messages. The use of directories, however, removes the need for such broadcasts, and the coherence traffic consists only of point-to-point messages to clusters that are caching that location. Since these clusters must have originally fetched the data, the coherence traffic will be within some small constant factor of the original data traffic. In fact, since each cached block is usually referenced several times before being invalidated, caching normally reduces overall global traffic significantly.

Figure 2. Cache invalidation patterns for MP3D (a) and PThor (b). MP3D uses a particle-based simulation technique to determine the structure of shock waves caused by objects flying at high speed in the upper atmosphere. PThor is a parallel logic simulator based on the Chandy-Misra algorithm. [Two histograms with invalidations on the horizontal axis; the panels are labeled with averages of 0.71 and 0.39 invalidations per shared write.]

This discussion of scalability assumes that the accesses are uniformly distributed across the machine. Unfortunately, the uniform access assumption does not always hold for highly contended synchronization objects and for heavily shared data objects. The resulting hot spots (concentrated accesses to data from the memory of a single cluster over a short duration of time) can significantly reduce the memory and network throughput. The reduction occurs because the distribution of resources is not exploited as it is under uniform access patterns.

To address hot spots, Dash relies on a combination of hardware and software techniques. For example, Dash provides special extensions to the directory-based protocol to handle synchronization references such as queue-based locks (discussed further in the section "Support for synchronization"). Furthermore, since Dash allows caching of shared writable data, it avoids many of the data hot spots that occur in other parallel machines that do not permit such caching. For hot spots that cannot be mitigated by caching, some can be removed by the coherence protocol extensions discussed in the section "Update and deliver operations," while others can only be removed by restructuring at the software level. For example, when using a primitive such as a barrier, it is possible for software to avoid hot spots by gathering and releasing processors through a tree of memory locations.

Regarding system costs, a major scalability concern unique to Dash-like machines is the amount of directory memory required. If the physical memory in the machine grows proportionally with the number of processing nodes, then using a bit vector to keep track of all clusters caching a memory block does not scale well. The total amount of directory memory needed is $P^2 \times M / L$ megabits, where $P$ is the number of clusters, $M$ is the megabits of memory per cluster, and $L$ is the cache-line size in bits. Thus, the fraction of memory devoted to keeping directory information grows as $P / L$. Depending on the machine size, this growth may or may not be tolerable. For example, consider a machine that contains up to 32 clusters of eight processors each and has a cache (memory) line size of 32 bytes. For this machine, the overhead for directory memory is only 12.5 percent of physical memory as the system scales from eight to 256 processors. This is comparable with the overhead of supporting an error-correcting code on memory.

For larger machines, where the overhead would become intolerable, several alternatives exist. First, we can take advantage of the fact that at any given time a memory block is usually cached by a very small number of processors. For example, Figure 2 shows the number of invalidations generated by two applications run on a simulated 32-processor machine. These graphs show that most writes cause invalidations to only a few caches. We have obtained similar results for a large number of applications. Consequently, it is possible to replace the complete directory bit vector by a small number of pointers and to use a limited broadcast of invalidations in the unusual case when the number of pointers is too small. Second, we can take advantage of the fact that most main memory blocks will not be present in any processor's cache, and thus there is no need to provide a dedicated directory entry for every memory block. Studies have shown that a small directory cache performs almost as well as a full directory. These two techniques can be combined to support machines with thousands of processors without undue overhead from directory memory.

The issue of memory access latency also becomes more prominent as an architecture is scaled to a larger number of nodes. There are two complementary approaches for managing latency: methods that reduce latency and mechanisms that help tolerate it. Dash uses both approaches, though our main focus has been to reduce latency as much as possible. Although latency-tolerating techniques are important, they often require additional application parallelism to be effective.

Hardware-coherent caches provide the primary latency reduction mechanism in Dash. Caching shared data significantly reduces the average latency for remote accesses because of the spatial and temporal locality of memory accesses. For references not satisfied by the cache, the coherence protocol attempts to minimize latency, as shown in the next section. Furthermore, as previously mentioned, we can reduce latency by allocating data to memory close to the processors that use it. While average memory latency is reduced, references that correspond to interprocessor communication cannot avoid the inherent latencies of a large machine. In Dash, the latency for these accesses is addressed by a variety of latency-hiding mechanisms. These mechanisms range from support of a relaxed memory consistency model to support of nonblocking prefetch operations. These operations are detailed in the sections on "Memory consistency" and "Prefetch operations."

We also expect software to play a critical role in achieving good performance on a highly parallel machine. Obviously, applications need to exhibit good parallelism to exploit the rich computational resources of a large machine. In addition, applications, compilers, and operating systems need to exploit cache and memory locality together with latency-hiding techniques to achieve high processor utilization. Applications still benefit from the single address space, however, because only performance-critical code needs to be tuned to the system. Other code can assume a simple uniform memory model.

The Dash cache coherence protocol

Within the Dash system organization, there is still a great deal of freedom in selecting the specific cache coherence protocol. This section explains the basic coherence protocol that Dash uses for normal read and write operations, then outlines the resulting memory consistency model visible to the programmer and compiler. Finally, it details extensions to the protocol that support latency hiding and efficient synchronization.

Memory hierarchy. Dash implements an invalidation-based cache coherence protocol. A memory location may be in one of three states: uncached (not cached by any cluster), shared (in an unmodified state in the caches of one or more clusters), or dirty (modified in a single cache of some cluster). The directory keeps the summary information for each memory block, specifying its state and the clusters that are caching it.

The Dash memory system can be logically broken into four levels of hierarchy, as illustrated in Figure 3. The first level is the processor's cache. This cache is designed to match the processor speed and support snooping from the bus. A request that cannot be serviced by the processor's cache is sent to the second level in the hierarchy, the local cluster. This level includes the other processors' caches within the requesting processor's cluster. If the data is locally cached, the request can be serviced within the cluster. Otherwise, the request is sent to the home cluster level. The home level consists of the cluster that contains the directory and physical memory for a given memory address. For many accesses (for example, most private data references), the local and home cluster are the same, and the hierarchy collapses to three levels. In general, however, a request will travel through the interconnection network to the home cluster. The home cluster can usually satisfy the request immediately, but if the directory entry is in a dirty state, or in shared state when the requesting processor requests exclusive access, the fourth level must also be accessed. The remote cluster level for a memory block consists of the clusters marked by the directory as holding a copy of the block.

Figure 3. Memory hierarchy of Dash. [Levels shown: processor level (processor cache); local cluster level (other processor caches within local cluster); home cluster level (directory and main memory associated with a given address); remote cluster level (processor caches in remote clusters).]

To illustrate the directory protocol, first consider how a processor read traverses the memory hierarchy:

Processor level: If the requested location is present in the processor's cache, the cache simply supplies the data. Otherwise, the request goes to the local cluster level.

Local cluster level: If the data resides within one of the other caches within the local cluster, the data is supplied by that cache and no state change is required at the directory level. If the request must be sent beyond the local cluster level, it goes first to the home cluster corresponding to that address.

Home cluster level: The home cluster examines the directory state of the memory location while simultaneously fetching the block from main memory. If the block is clean, the data is sent to the requester and the directory is updated to show sharing by the requester. If the location is dirty, the request is forwarded to the remote cluster indicated by the directory.

Remote cluster level: The dirty cluster replies with a shared copy of the data, which is sent directly to the requester. In addition, a sharing writeback message is sent to the home level to update main memory and change the directory state to indicate that the requesting and remote cluster now have shared copies of the data. Having the dirty cluster respond directly to the requester, as opposed to routing it through the home, reduces the latency seen by the requesting processor.

Now consider the sequence of operations that occurs when a location is written:

Processor level: If the location is dirty in the writing processor's cache, the write can complete immediately. Otherwise, a read-exclusive request is issued on the local cluster's bus to obtain exclusive ownership of the line and retrieve the remaining portion of the cache line.

Local cluster level: If one of the caches within the cluster already owns the cache line, then the read-exclusive request is serviced at the local level by a cache-to-cache transfer. This allows processors within a cluster to alternately modify the same memory block without any intercluster interaction. If no local cache owns the block, then a read-exclusive request is sent to the home cluster.

Home cluster level: The home cluster can immediately satisfy an ownership request for a location that is in the uncached or shared state. In addition, if a block is in the shared state, then all cached copies must be invalidated. The directory indicates the clusters that have the block cached. Invalidation requests are sent to these clusters while the home concurrently sends an exclusive data reply to the requesting cluster. If the directory indicates that the block is dirty, then the read-exclusive request must be forwarded to the dirty cluster, as in the case of a read.

Remote cluster level: If the directory had indicated that the memory block was shared, then the remote clusters receive an invalidation request to eliminate their shared copy. Upon receiving the invalidation, the remote clusters send an acknowledgment to the requesting cluster. If the directory had indicated a dirty state, then the dirty cluster receives a read-exclusive request. As in the case of the read, the remote cluster responds directly to the requesting cluster and sends a dirty-transfer message to the home indicating that the requesting cluster now holds the block exclusively.

When the writing cluster receives all the invalidation acknowledgments or the reply from the home or dirty cluster, it is guaranteed that all copies of the old data have been purged from the system. If the processor delays completing the write until all acknowledgments are received, then the new write value will become available to all other processors at the same time. However, invalidations involve round-trip messages to multiple clusters, resulting in potentially long delays. Higher processor utilization can be obtained by allowing the write to proceed immediately after the ownership reply is received from the home. Unfortunately, this may lead to inconsistencies with the memory model assumed by the programmer. The next section describes how Dash relaxes the constraints on memory request ordering, while still providing a reasonable programming model to the user.

Memory consistency. The memory consistency model supported by an architecture directly affects the amount of buffering and pipelining that can take place among memory requests. In addition, it has a direct effect on the complexity of the programming model presented to the user. The goal in Dash is to provide substantial freedom in the ordering among memory requests, while still providing a reasonable programming model to the user.

At one end of the consistency spectrum is the sequential consistency model, which requires execution of the parallel program to appear as an interleaving of the execution of the parallel processes on a sequential machine. Sequential consistency can be guaranteed by requiring a processor to complete one memory request before it issues the next request [4]. Sequential consistency, while conceptually appealing, imposes a large performance penalty on memory accesses. For many applications, such a model is too strict, and one can make do with a weaker notion of consistency.

As an example, consider the case of a processor updating a data structure within a critical section. If updating the structure requires several writes, each write in a sequentially consistent system will stall the processor until all other cached copies of that location have been invalidated. But these stalls are unnecessary as the programmer has already made sure that no other process can rely on the consistency of that data structure until the critical section is exited. If the synchronization points can be identified, then the memory need only be consistent at those points. In particular, Dash supports the use of the release consistency model, which only requires the operations to have completed before a critical section is released (that is, a lock is unlocked).

Such a scheme has two advantages. First, it provides the user with a reasonable programming model, since the programmer is assured that when the critical section is exited, all other processors will have a consistent view of the modified data structure. Second, it permits reads to bypass writes and the invalidations of different write operations to overlap, resulting in lower latencies for accesses and higher overall performance. Detailed simulation studies for processors with blocking reads have shown that release consistency provides a 10 to 40 percent increase in performance over sequential consistency. The disadvantage of the model is that the programmer or compiler must identify all synchronization accesses.

The Dash prototype supports the release consistency model in hardware. Since we use commercial microprocessors, the processor stalls on read operations until the read data is returned from the cache or lower levels of the memory hierarchy. Write operations, however, are nonblocking. There is a write buffer between the first- and second-level caches. The write buffer queues up the write requests and issues them in order. Furthermore, the servicing of write requests is overlapped. As soon as the cache receives the ownership and data for the requested cache line, the write data is removed from the write buffer and written into the cache line. The next write request can be serviced while the invalidation acknowledgments for the previous write operations filter in. Thus, parallelism exists at two levels: the processor executes other instructions and accesses its first-level cache while write operations are pending, and invalidations of multiple write operations are overlapped.

The Dash prototype also provides fence operations that stall the processor or write buffer until previous operations complete. These fence operations allow software to emulate more stringent consistency models.

Memory access optimizations. The use of release consistency helps hide the latency of write operations. However, since the processor stalls on read operations, it sees the entire duration of all read accesses. For applications that exhibit poor cache behavior or extensive read/write sharing, this can lead to significant delays while the processor waits for remote cache misses to be filled. To help with these problems, Dash provides a variety of prefetch and pipelining operations.

Prefetch operations. A prefetch operation is an explicit nonblocking request to fetch data before the actual memory operation is issued. Hopefully, by the time the process needs the data, its value has been brought closer to the processor, hiding the latency of the regular blocking read. In addition, nonblocking prefetch allows the pipelining of read misses when multiple cache blocks are prefetched. As a simple example of its use, a process wanting to access a row of a matrix stored in another cluster's memory can do so efficiently by first issuing prefetch reads for all cache blocks corresponding to that row.

Dash's prefetch operations are nonbinding and software controlled. The processor issues explicit prefetch operations that bring a shared or exclusive copy of the memory block into the processor's cache. Not binding the value at the time of the prefetch is important in that issuing the prefetch does not affect the consistency model or force the compiler to do a conservative static dependency analysis. The coherence protocol keeps the prefetched cache line coherent. If another processor happens to write to the location before the prefetching processor accesses the data, the data will simply be invalidated. The prefetch will be rendered ineffective, but the program will execute correctly. Support for an exclusive prefetch operation aids cases where the block is first read and then updated. By first issuing the exclusive prefetch, the processor avoids first obtaining a shared copy and then having to rerequest an exclusive copy of the block. Studies have shown that, for certain applications, the addition of a small number of prefetch instructions can increase processor utilization by more than a factor of two.

Update and deliver operations. In some applications, it may not be possible for the consumer process to issue a prefetch early enough to effectively hide the latency of memory. Likewise, if multiple consumers need the same item of data, the communication traffic can be reduced if data is multicast to all the consumers simultaneously. Therefore, Dash provides operations that allow the producer to send data directly to consumers. There are two ways for the producing processor to specify the consuming processors. The update-write operation sends the new data directly to all processors that have cached the data, while the deliver operation sends the data to specified clusters.

The update-write primitive updates the value of all existing copies of a data word. Using this primitive, a processor does not need to first acquire an exclusive copy of the cache line, which would result in invalidating all other copies. Rather, data is directly written into the home memory and all other caches holding a copy of the line. These semantics are particularly useful for event synchronization, such as the release event for a barrier.

The deliver instruction explicitly specifies the destination clusters of the transfer. To use this primitive, the producer first writes into its cache using normal, invalidating write operations. The producer then issues a deliver instruction, giving the destination clusters as a bit vector. A copy of the cache line is then sent to the specified clusters, and the directory is updated to indicate that the various clusters now share the data. This operation is useful in cases when the producer makes multiple writes to a block before the consumers will want it or when the consumers are unlikely to be caching the item at the time of the write.

Support for synchronization. The access patterns to locations used for synchronization are often different from those for other shared data. For example, whenever a highly contended lock is released, waiting nodes rush to grab the lock. In the case of barriers, many processors must be synchronized and then released. Such activity often causes hot spots in the memory system. Consequently, synchronization variables often warrant special treatment. In addition to update writes, Dash provides two extensions to the coherence protocol that directly support synchronization objects. The first is queue-based locks, and the second is fetch-and-increment operations.

Most cache-coherent architectures handle locks by providing an atomic test-and-set instruction and a cached test-and-test-and-set scheme
for spin waiting. Ideally, these spin locks should meet the following criteria: minimum amount of traffic generated while waiting, low-latency release of a waiting processor, and low-latency acquisition of a free lock.

Cached test-and-set schemes are moderately successful in satisfying these criteria for low-contention locks, but fail for high-contention locks. For example, assume there are N processors spinning on a lock value in their caches. When the lock is released, all N cache values are invalidated, and N reads are generated to the memory system. Depending on the timing, it is possible that all N processors come back to do the test-and-set on the location once they realize the lock is free, resulting in further invalidations and rereads. Such a scenario produces unnecessary traffic and increases the latency in acquiring and releasing a lock.

The queue-based locks in Dash address this problem by using the directory to indicate which processors are spinning on the lock. When the lock is released, one of the waiting clusters is chosen at random and is granted the lock. The grant request invalidates only that cluster's caches and allows one processor within that cluster to acquire the lock with a local operation. This scheme lowers both the traffic and the latency involved in releasing a processor waiting on a lock. Informing only one cluster of the release also eliminates unnecessary traffic and latency that would be incurred if all waiting processors were allowed to contend. A time-out mechanism on the lock grant allows the grant to be sent to another cluster if the spinning process has been swapped out or migrated. The queued-on-lock-bit primitive described in Goodman et al. is similar to Dash's queue-based locks, but uses pointers in the processor caches to maintain the list of the waiting processors.

The fetch-and-increment and fetch-and-decrement primitives provide atomic increment and decrement operations on uncached memory locations. The value returned by the operations is the value before the increment or decrement. These operations have low serialization and are useful for implementing several synchronization primitives such as barriers, distributed loops, and work queues. The serialization of these operations is small because they are done directly at the memory site. The low serialization provided by the fetch-and-increment operation is especially important when many processors want to increment a location, as happens when getting the next index in a distributed loop. The benefits of the proposed operations become apparent when contrasted with the alternative of using a normal variable protected by a lock to achieve the atomic increment and decrement. The alternative results in significantly more traffic, longer latency, and increased serialization.

The Dash implementation

A hardware prototype of the Dash architecture is currently under construction. While we have developed a detailed software simulator of the system, we feel that a hardware implementation is needed to fully understand the issues in the design of scalable cache-coherent machines, to verify the feasibility of such designs, and to provide a platform for studying real applications and software running on a large ensemble of processors.

To focus our effort on the novel aspects of the design and to speed the completion of a usable system, the base cluster hardware used in the prototype is a commercially available bus-based multiprocessor. While there are some constraints imposed by the given hardware, the prototype satisfies our primary goals of scalable memory bandwidth and high performance. The prototype includes most of Dash's architectural features since many of them can only be fully evaluated on the actual hardware. The system also includes dedicated performance monitoring logic to aid in the evaluation.

Figure 4. Block diagram of a 2 x 2 Dash system. [Recoverable labels: processor, first-level I and D caches, second-level cache, interface, memory, reply mesh; I = instruction, D = data.]

Dash prototype cluster. The prototype system uses a Silicon Graphics Power Station 4D/340 as the base cluster. The 4D/340 system consists of four Mips R3000 processors and R3010 floating-point coprocessors running at 33 megahertz. Each R3000/R3010 combination can reach execution rates up to 25 VAX MIPS and 10 Mflops. Each CPU contains a 64-kilobyte instruction cache and a 64-Kbyte write-through data cache. The 64-Kbyte data cache interfaces to a 256-Kbyte second-level write-back cache. The interface consists of a read buffer and a four-word-deep write buffer. Both the first- and second-level caches are direct-mapped and support 16-byte lines. The first-level caches run synchronously to their associated 33-MHz proces
sorswhilethesecondlevelcachesrunsynchronoustothe16MHzmemorybus.Thesecondlevelprocessorcachesareresponsibleforbussnoopingandmaintainingcoherenceamongthecachesinthecluster.CoherenceismaintainedusinganIllinois,orMESImodified,exclusive,shared,invalid,protocol.ThemainadvantageofusingtheIllinoisprotocolinDashisthecachetocachetransfersspecifiedinit.Whiletheydolittletoreducethelatencyformissesscrvicedbylocalmemory.localcachetocachetransferscangreatlyreducethepenaltyforremotememorymisses.Thesetofprocessorcachesactsasaclustercacheforremotememory.ThememorybusMPbusofthe4D1340isasynchronousbusandconsistsofseparate32bitaddressand64bitdatabuses.TheMPbusispipelinedandsupportsmemorytocacheandcachetocachetransfersof16byteseveryfourbusclockswithalatencyofsixbusclocks.Thisresultsinamaximumbandwidthof64Mbytespersecond.WhiletheMPbusispipelined,itisnotasplittransactionbus.Tousethe4D1340inDash,wehavehadtomakeminormodificationstotheexistingsystemboardsanddesignapairofnewboardstosupportthedirectorymemoryandinterclusterinterface.Themainmodificationtotheexistingboardsistoaddabusretrysignalthatisusedwhenarequestrequiresservicefromaremotecluster.Thecentralbusarbiterhasalsobeenmodifiedtoacceptamaskfromthedirectory.Themaskholdsoffaprocessorsretryuntiltheremoterequesthasbeenserviced.Thiseffectivelycreatesasplittransactionbusprotocolforrequestsrequiringremoteservice.Thenewdirectorycontrollerboardscontainthedirectorymemory,theinterclustercoherencestatemachinesandbuffers,andalocalsectionoftheglobalinterconnectionnetwork.Theinterconnectionnetworkconsistsofapairofwormholeroutedmeshes,eachwith16bitwidechannels.Onemeshisdedicatedtotherequestmessageswhiletheotherhandlesreplies.Figure4showsablockdiagramoffourclustersconnectedtoforma2x2Dashsystem.SuchasystemcouldscaletosupporthundredsMarch199271
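
As a quick check on the MPbus figures quoted in this section, the peak bandwidth follows directly from the bus clock and the transfer size. The sketch below is illustrative arithmetic only: the function and constant names are hypothetical and not part of any Dash hardware or software, and it assumes back-to-back pipelined transfers with no arbitration or retry overhead.

```python
# Illustrative arithmetic only; names are hypothetical, values come from the text.
BUS_CLOCK_HZ = 16_000_000   # 16-MHz memory bus (MPbus)
TRANSFER_BYTES = 16         # bytes moved per transfer (one 16-byte cache line)
CLOCKS_PER_TRANSFER = 4     # pipelined: one transfer every four bus clocks
LATENCY_CLOCKS = 6          # latency of a single transfer, in bus clocks


def peak_bandwidth_bytes_per_s(clock_hz, transfer_bytes, clocks_per_transfer):
    """Peak bandwidth in bytes per second, assuming back-to-back transfers."""
    return clock_hz * transfer_bytes // clocks_per_transfer


def latency_ns(clock_hz, latency_clocks):
    """Time for one transfer to complete, in nanoseconds."""
    return latency_clocks * 1_000_000_000 / clock_hz


bandwidth = peak_bandwidth_bytes_per_s(
    BUS_CLOCK_HZ, TRANSFER_BYTES, CLOCKS_PER_TRANSFER
)
print(bandwidth // 1_000_000, "Mbytes/s")
print(latency_ns(BUS_CLOCK_HZ, LATENCY_CLOCKS), "ns")
```

Under these assumptions the calculation reproduces the 64 Mbytes per second stated for the MPbus, and the six-clock latency corresponds to 375 ns at 16 MHz.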