49-The Stanford DASH Multiprocessor.pdf

上传人：*** IP属地：浙江上传时间：2014-01-05 格式：PDF 页数：17 大小：2.35MB 积分：6 举报 版权申诉

49-The Stanford DASH Multiprocessor.pdf_第2页

49-The Stanford DASH Multiprocessor.pdf_第3页

49-The Stanford DASH Multiprocessor.pdf_第4页

49-The Stanford DASH Multiprocessor.pdf_第5页

已阅读5页，还剩12页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

TheStanfordDashMultiprocessorDanielLenoski,JamesLaudon,KouroshGharachorloo,Wolf-DietrichWeber,AnoopGupta,JohnHennessy,MarkHorowitz,andMonicaS.LamStanfordUniversityDirectory-basedcachecoherencegivesDashtheease-of-useofshared-memoryarchitectureswhilemaintainingthescalabilityofmessage-passingmachines.heComputerSystemsLaboratoryatStanfordUniversityisdevelopingashared-memorymultiprocessorcalledDash(anabbreviationforDirec-toryArchitectureforSharedMemory).Thefundamentalpremisebehindthearchitectureisthatitispossibletobuildascalablehigh-performancemachinewithasingleaddressspaceandcoherentcaches.TheDasharchitectureisscalableinthatitachieveslinearornear-linearperformancegrowthasthenumberofprocessorsincreasesfromafewtoafewthousand.Thisperformanceresultsfromdistributingthememoryamongprocess-ingnodesandusinganetworkwithscalablebandwidthtoconnectthenodes.Thearchitectureallowsshareddatatobecached,therebysignificantlyreducingthelatencyofmemoryaccessesandyieldinghigherprocessorutilizationandhigheroverallperformance.Adistributeddirectory-basedprotocolprovidescacheco-herencewithoutcompromisingscalability.TheDashprototypesystemisthefirstoperationalmachinetoincludeascalablecache-coherencemechanism.Theprototypeincorporatesupto64high-perfor-manceRISCmicroprocessorstoyieldperformanceupto1.6billioninstructionspersecondand600millionscalarfloatingpointoperationspersecond.Thedesignoftheprototypehasprovideddeeperinsightintothearchitecturalandimplemen-tationchallengesthatariseinalarge-scalemachinewithasingleaddressspace.Theprototypewillalsoserveasaplatformforstudyingrealapplicationsandsoftwareonalargeparallelsystem.ThisarticlebeginsbydescribingtheoverallgoalsforDash,themajorfeaturesofthearchitecture,andthemethodsforachievingscalability.Next,wedescribethedirectory-basedcoherenceprotocolindetail.Wethenprovideanoverviewoftheprototypemachineandthecorrespondingsoftwaresupport,followedbysomeMarch1992preliminaryperformancenumbers.ThearticleconcludeswithadiscussionofrelatedworkandthecurrentstatusoftheDashhardwareandsoftware.TheDashteamManygraduatestudentsandfac-ultvmemberscontributedtotheDashproject.ThePhDstudentsareDanielLenoskiandJamesLau-DashprojectOverviewTheoverallgoaloftheDashprojectistoinvestigatehighlyparallelarchitec-tures.Forthesearchitecturestoachievewidespreaduse,theymustrunavarietyofapplicationsefficientlywithoutim-posingexcessiveprogrammingdifficul-ty.Toachievebothhighperformanceandwideapplicability,webelieveapar-allelarchitecturemustprovidescalabil-itytosupporthundredstothousandsofprocessors.high-performanceindivid-ualprocessors,andasinglesharedad-dressspace.Thegapbetweenthecomputingpow-erofmicroprocessorsandthatofthelargestsupercomputersisshrinking.whilethepriceiperformanceadvantageofmicroprocessorsisincreasing.Thisclearlypointstousingmicroprocessorsasthecomputeenginesinamultipro-cessor.Thechallengeliesinbuildingamachinethatcanscaleupitsperfor-mancewhilemaintainingtheinitialprice/performanceadvantageoftheindividu-alprocessors.Scalabilityallowsaparal-lelarchitecturetoleveragecommoditymicroprocessorsandsmall-scalemulti-processorstobuildlargerscalemachines.Theselargermachinesoffersubstan-tiallyhigherperformance,whichpro-videstheimpetusforprogrammerstoporttheirsequentialapplicationstopar-allelarchitecturesinsteadofwaitingforthenexthigherperformanceuniproces-sor.High-performanceprocessorsareimportanttoachievebothhightotalsystemperformanceandgeneralappli-cability.Usingthefastestmicroproces-sorsreducestheimpactoflimitedorunevenparallelisminherentinsomeapplications.Italsoallowsawidersetofapplicationstoexhibitacceptableper-formancewithlesseffortfromthepro-grammer.Asingleaddressspaceenhancestheprogrammabilityofaparallelmachinebyreducingtheproblemsofdataparti-tioninganddynamicloaddistribution,twoofthetoughestproblemsinpro-grammingparallelmachines.Thesharedaddressspacealsoimprovessupportforautomaticallyparallelizingcompilers,standardoperatingsystems,multipro-don(Dasharchitectureandhard-waredesign)；KouroshGharachor-loo(Dasharchitectureandconsistencymodels)；Wolf-DietrichWeber(Dashsimulatorandscal-abledirectories)；TrumanJoe(Dashhardwareandprotocolverifi-cationtools)；LuisStevens(operat-ingsystem)；HelenDavisandSte-phenGoldschmidt(tracegenerationtools,synchronizationpatterns,localitystudies)；ToddMowry(evaluationofprefetchoper-ations)；AaronGoldbergandMarg-aretMartonosi(performancede-buggingtools)；TomChanak(meshroutingchipdesign)；RichardSimo-ni(syntheticloadgeneratoranddi-rectorystudies)；JosepTorrellas(sharingpatternsinapplications)；EdwardRothberg,JaswinderPalSingh,andLarrySoule(applica-tionsandalgorithmdevelopment).StaffresearchengineerDavidNa-kahiracontributedtothehardwaredesign.ThefacultyassociatedwiththeprojectareAnoopGupta,JohnHennessy,MarkHorowitz,andMonicaLam.gramming,andincrementaltuningofparallelapplications-featuresthatmakeasingle-address-spacemachinemucheasiertousethanamessage-pass-ingmachine.Cachingofmemory,includingsharedwritabledata,allowsmultiprocessorswithasingleaddressspacetoachievehighperformancethroughreducedmemorylatency.Unfortunately,cach-ingshareddataintroducestheproblemofcachecoherence(seethesidebarandaccompanyingfigure).Whilehardwaresupportforcachecoherencehasitscosts,italsooffersmanybenefits.Withouthardwaresup-port,theresponsibilityforcoherencefallstotheuserorthecompiler.Expos-ingtheissueofcoherencetotheuserwouldleadtoacomplexprogrammingmodel,whereusersmightwellavoidcachingtoeasetheprogrammingbur-den.Handlingthecoherenceprobleminthecompilerisattractive.butcur-rentlycannotbedoneinawaythatiscompetitivewithhardware.Withhard-ware-supportedcachecoherence,thecompilercanaggressivelyoptimizepro-gramstoreducelatencywithouthavingtorelypurelyonaconservativestaticdependenceanalysis.Themajorproblemwithexistingcache-coherentshared-addressma-chinesisthattheyhavenotdemonstrat-edtheabilitytoscaleeffectivelybe-yondafewhigh-performanceprocessors.Todate,onlymessage-passingmachineshaveshownthisability.Webelievethatusingadirectory-basedcoherencemech-anismwillpermitsingle-address-spacemachinestoscaleaswellasmessage-passingmachines,whileprovidingamoreflexibleandgeneralprogrammingmodel.DashsystemorganizationMostexistingmultiprocessorswithcachecoherencerelyonsnoopingtomaintaincoherence.Unfortunately,snoopingschemesdistributetheinfor-mationaboutwhichprocessorsarecach-ingwhichdataitemsamongthecaches.Thus,straightforwardsnoopingschemesrequirethatallcachesseeeverymemo-ryrequestfromeveryprocessor.Thisinherentlylimitsthescalabilityofthesemachinesbecausethecommonbusandtheindividualprocessorcacheseventu-allysaturate.Withtodayshigh-perfor-manceRISCprocessorsthissaturationcanoccurwithjustafewprocessors.Directorystructuresavoidthescal-abilityproblemsofsnoopyschemesbyremovingtheneedtobroadcasteverymemoryrequesttoallprocessorcaches.Thedirectorymaintainspointerstotheprocessorcachesholdingacopyofeachmemoryblock.Onlythecacheswithcopiescanbeaffectedbyanaccesstothememoryblock,andonlythosecach-esneedbenotifiedoftheaccess.Thus,theprocessorcachesandinterconnectwillnotsaturateduetocoherencere-quests.Furthermore.directory-basedco-herenceisnotdependentonanyspecif-icinterconnectionnetworklikethebususedbymostsnoopingschemes.Thesamescalable,low-latencynetworkssuchasOmeganetworksork-naryn-cubesusedbynon-cache-coherentand64COMPUTERCachecoherenceCache-coherenceproblemscanariseinshared-memorymultiprocessorswhenmorethanoneprocessorcacheholdsacopyofadataitem(a).Uponawrite,thesecopiesmustbeupdatedorinvalidated(b).Mostsystemsuseinvalidationsincethisallowsthewritingprocessortogainexclusiveac-cesstothecachelineandcompletefurtherwritesintothecachelinewithoutgeneratingexternaltraffic(c).Thisfufthercomplicatescoherencesincethisdirtycachemustrespondinsteadofmemoryonsubsequentaccessesbyotherproces-sors(d).Small-scalemultiprocessorsfrequentlyuseasnoopycache-coherenceprotocol,whichreliesonallcachesmoni-toringthecommonbusthatconnectstheprocessorstomemory.Thismonitoringallowscachestoindependentlyde-terminewhentoinvalidatecachelines(b),andwhentoin-tervenebecausetheycontainthemostup-to-datecopyofagivenlocation(d).Snoopyschemesdonotscaletoalargenumberofprocessorsbecausethecommonbusorindividualprocessorcacheseventuallysaturate,sincetheymustpro-cesseverymemoryrequestfromeveryprocessor.onmemoryrequestsbykeepingtrackofwhichcachesholdeachmemoryblock.Asimpledirectorystructurefirstpro-posedbyCensierandFeautrier*hasonedirectoryentryperblockofmemory(e).Eachentrycontainsonepresencebitperprocessorcache.Inaddition,astatebitindicateswheth-ertheblockisuncached,sharedinmultiplecaches,orheldexclusivelybyonecache(thatIS,whethertheblockisdirty).Usingthestateandpresencebits,thememorycantellwhichcachesneedtobeinvalidatedwhenalocationiswritten(b).Likewise,thedirectoryindicateswhethermemoryscopyoftheblockisuptodateorwhichcacheholdsthemostrecentcopy(d).Ifthememoryanddirectoryarepartitionedintoin-dependentunitsandconnectedtotheprocessorsbyascal-ableinterconnect,thememorysystemcanprovidescalablememorybandwidth.ThedirectoryrelievestheprocessorcachesfromsnoopingReferences1.J.ArchibaldandJ.-L.Baer,CacheCoherenceProtocols:Evalu-ationUsingaMultiprocessorSimulationModel,ACMTrans.ComputerSystems,Vol.4,No.4,Nov.1986,pp.273-298.2.L.CensierandP.Feautrier,ANewSolutiontoCoherenceProblemsinMulticacheSystems,/ETrans.Computers,Vol.C-27,No.12,Dec.1978,pp.1,112-1,118.Store#3,ACacheCache(4LoadAer(d)DataStatebitPresencebits(e)message-passingmachinescanbeem-ployed.Theconceptofdirectory-basedcachecoherenceisnotnew.Itwasfirstpro-posedinthelate1970s.However,theoriginaldirectorystructureswerenotscalablebecausetheyusedacentral-izeddirectorythatquicklybecameabottleneck.TheDasharchitectureover-comesthislimitationbypartitioninganddistributingthedirectoryandmainmemory,andbyusinganewcoherenceprotocolthatcansuitablyexploitdis-tributeddirectories.Inaddition,DashprovidesseveralothermechanismstoMarch199265reduceandhidethelatencyofmemoryoperations.Figure1showsDashshigh-levelorganization.Thearchitectureconsistsofanumberofprocessingnodesconnectedthroughdirecto-rycontrollerstoalow-laten-cyinterconnectionnetwork.Eachprocessingnode,orcluster,consistsofasmallnumberofhigh-performanceprocessorsandaportionofthesharedmemoryintercon-nectedbyabus.Multipro-cessingwithintheclustercanbeviewedeitherasincreas-ingthepowerofeachpro-cessingnodeorasreducingthecostofthedirectoryandnetworkinterfacebyamor-tizingitoveralargernum-berofprocessors.Distributingmemorywiththeprocessorsisessentialbe-causeitallowsthesystemtoexploitlocality.Allprivatedataandcodereferences,alongwithsomeofthesharedreferences,canbemadelo-o00Figure1.TheDasharchitectureconsistsofasetofclus-tersconnectedbyageneralinterconnectionnetwork.Di-rectorymemorycontainspointerstotheclusterscurrentlycachingeachmemoryline.ScalabilityoftheDashapproachWehaveoutlinedwhywebelieveasingle-address-spacemachinewithcachecoherenceholdsthemostpromisefordeliveringscal-ableperformancetoawiderangeofapplications.Here,weaddressthemorede-tailedissuesinscalingsuchadirectory-basedsystem.Thethreeprimaryissuesareensuringthatthesystempro-videsscalablememorybandwidth,thatthecostsscalereasonably,andthatmechanismsareprovidedtodealwithlargememoryla-tencies.Scalabilityinamultipro-cessorrequiresthetotalmemorybandwidthtoscalelinearlywiththenumberofprocessors.Dashprovidesscalablebandwidthtodatacaltothecluster.Thesereferencesavoidarchitectureissimilartomanyscalableobjectsresidinginlocalmemorybydis-thelongerlatencyofremotereferencesmessage-passingmachines.Whilenottributingthephysicalmemoryamongandreducethebandwidthdemandsonoptimizedtodoso,Dashcouldemulatetheclusters.Fordataaccessesthatmusttheglobalinterconnect.Exceptforthesuchmachineswithreasonableeffi-beservicedremotely,Dashusesascal-directorymemory,theresultingsystemciency.ableinterconnectionnetwork.Support100goAverageinvalidationspersharedwrite:0.71Averageinvalidationspersharedwrite:0.39looM80-79270-E260-550-840-530-16.-E60260-v)v)550-cLmc..1.1.O.O..1.1.o.o.401234567891010012345678910210InvalidationsInvalidationsFigure2.CacheinvalidationpatternsforMP3D(a)andPThor(b).MP3Dusesaparticle-basedsimulationtechniquetodeterminethestructureofshockwavescausedbyobjectsflyingathighspeedintheupperatmosphere.PThorisaparal-lellogicsimulatorbasedontheChandy-Misraalgorithm.66COMPUTERofcoherentcachescouldpotentiallycompromisethescalabilityofthenet-workbyrequiringfrequentbroadcastmessages.Theuseofdirectories,how-ever,removestheneedforsuchbroad-castsandthecoherencetrafficconsistsonlyofpoint-to-pointmessagestoclus-tersthatarecachingthatlocation.Sincetheseclustersmusthaveoriginallyfetchedthedata,thecoherencetrafficwillbewithinsomesmallconstantfac-toroftheoriginaldatatraffic.Infact,sinceeachcachedblockisusuallyref-erencedseveraltimesbeforebeingin-validated,cachingnormallyreducesoverallglobaltrafficsignificantly.Thisdiscussionofscalabilityassumesthattheaccessesareuniformlydistrib-utedacrossthemachine.Unfortunate-ly,theuniformaccessassumptiondoesnotalwaysholdforhighlycontendedsynchronizationobjectsandforheavilyshareddataobjects.Theresultinghotspots-concentratedaccessestodatafromthememoryofasingleclusteroverashortdurationoftime-cansignificantlyreducethememoryandnetworkthroughput.Thereductionoc-cursbecausethedistributionofresourc-esisnotexploitedasitisunderuniformaccesspatterns.Toaddresshotspots,Dashreliesonacombinationofhardwareandsoft-waretechniques.Forexample,Dashprovidesspecialextensionstothedi-rectory-basedprotocoltohandlesyn-chronizationreferencessuchasqueue-basedlocks(discussedfurtherinthesection,“Supportforsynchronization”).Furthermore,sinceDashallowscach-ingofsharedwritabledata,itavoidsmanyofthedatahotspotsthatoccurinotherparallelmachinesthatdonotper-mitsuchcaching.Forhotspotsthatcannotbemitigatedbycaching,somecanberemovedbythecoherencepro-tocolextensionsdiscussedinthesec-tion,“Updateanddeliveroperations,”whileotherscanonlyberemovedbyrestructuringatthesoftwarelevel.Forexample,whenusingaprimitivesuchasabarrier,itispossibleforsoftwaretoavoidhotspotsbygatheringandreleas-ingprocessorsthroughatreeofmemo-rylocations.Regardingsystemcosts,amajorscal-abilityconcernuniquetoDash-likema-chinesistheamountofdirectorymem-oryrequired.Ifthephysicalmemoryinthemachinegrowsproportionallywiththenumberofprocessingnodes,thenusingabit-vectortokeeptrackofallclusterscachingamemoryblockdoesnotscalewell.Thetotalamountofdi-rectorymemoryneededisPxMILmegabits,wherePisthenumberofclusters,Misthemegabitsofmemorypercluster,andI,isthecache-linesizeinbits.Thus,thefractionofmemorydevotedtokeepingdirectoryinforma-tiongrowsasPIL.Dependingonthemachinesize,thisgrowthmayormaynotbetolerable.Forexample,consideramachinethatcontainsupto32clus-tersofeightprocessorseachandhasacache(memory)linesizeof32bytes.Forthismachine,theoverheadfordi-rectorymemoryisonly12.5percentofphysicalmemoryasthesystemscalesfromeightto256processors.Thisiscomparablewiththeoverheadofsup-portinganerror-correctingcodeonmemory.Forlargermachines.wheretheover-headwouldbecomeintolerable,sever-alalternativesexist.First,wecantakeadvantageofthefactthatatanygiventimeamemoryblockisusuallycachedbyaverysmallnumberofprocessors.Forexample,Figure2showsthenum-berofinvalidationsgeneratedbytwoapplicationsrunonasimulated32-pro-cessormachine.Thesegraphsshowthatmostwritescauseinvalidationstoonlyafewcaches.(Wehaveobtainedsimilarresultsforalargenumberofapplica-tions.)Consequently,itispossibletoreplacethecompletedirectorybit-vec-torbyasmallnumberofpointersandtousealimitedbroadcastofinvalidationsintheunusualcasewhenthenumberofpointersistoosmall.Second,wecantakeadvantageofthefactthatmostmainmemoryblockswillnotbepresentinanyprocessorscache,andthusthereisnoneedtoprovideadedicateddirec-toryentryforeverymemoryblock.Stud-ieshaveshownthatasmalldirectorycacheperformsalmostaswellasafulldirectory.Thesetwotechniquescanbecombinedtosupportmachineswiththousandsofprocessorswithoutundueoverheadfromdirectorymemory.Theissueofmemoryaccesslatencyalsobecomesmoreprominentasanarchitectureisscaledtoalargernum-berofnodes.Therearetwocomple-mentaryapproachesformanagingla-tency:methodsthatreducelatencyandmechanismsthathelptolerateit.Dashusesbothapproaches,thoughourmainfocushasbeentoreducelatencyasmuchaspossible.Althoughlatencytoleratingtechniquesareimportant.theyoftenrequireadditionalapplicationparallel-ismtobeeffective.Hardware-coherentcachesprovidetheprimarylatencyreductionmechanisminDash.Cachingshareddatasignifi-cantlyreducestheaveragelatencyforremoteaccessesbecauseofthespatialandtemporallocalityofmemoryac-cesses.Forreferencesnotsatisfiedbythecache,thecoherenceprotocolat-temptstominimizelatency,asshowninthenextsection.Furthermore,asprevi-ouslymentioned,wecanreducelatencybyallocatingdatatomemoryclosetotheprocessorsthatuseit.Whileaveragememorylatencyisreduced,referencesthatcorrespondtointerprocessorcom-municationcannotavoidtheinherentlatenciesofalargemachine.InDash,thelatencyfortheseaccessesisaddressedbyavarietyoflatencyhidingmecha-nisms.Thesemechanismsrangefromsupportofarelaxedmemoryconsisten-cymodeltosupportofnonblockingprefetchoperations.Theseoperationsaredetailedinthesectionson“Mem-oryconsistency“and“Prefetchopera-tions.”Wealsoexpectsoftwaretoplayacriticalroleinachievinggoodperfor-manceonahighlyparallelmachine.Ob-viously,applicationsneedtoexhibitgoodparallelismtoexploittherichcomputa-tionalresourcesofalargemachine.Inaddition,applications,compilers,andoperatingsystemsneedtoexploitcacheandmemorylocalitytogetherwithla-tencyhidingtechniquestoachievehighprocessorutilization.Applicationsstillbenefitfromthesingleaddressspace,however,becauseonlyperformance-crit-icalcodeneedstobetunedtothesys-tem.Othercodecanassumeasimpleuniformmemorymodel.TheDashcache-coherenceprotocolWithintheDashsystemorganization,thereisstillagreatdealoffreedominselectingthespecificcache-coherenceprotocol.ThissectionexplainsthebasiccoherenceprotocolthatDashusesfornormalreadandwriteoperations,thenoutlinestheresultingmemorycon-sistencymodelvisibletotheprogram-merandcompiler.Finally,itdetailsex-tensionstotheprotocolthatsupportlatencyhidingandefficientsynchroni-zation.March199261Memoryhierarchy.Dashimplementsaninvalidation-basedcache-coherenceprotocol.Amemorylocationmaybeinoneofthreestates:uncached-notcachedbyanyclus-ter；*shared-inanunmodifiedstateinthecachesofoneormoreclusters；ordirty-modifiedinasinglecacheofsomecluster.Thedirectorykeepsthesummaryinfor-mationforeachmemoryblock,specify-ingitsstateandtheclustersthatarecachingit.TheDashmemorysystemcanbelog-icallybrokenintofourlevelsofhierar-chy,asillustratedinFigure3.Thefirstlevelistheprocessorscache.Thiscacheisdesignedtomatchtheprocessorspeedandsupportsnoopingfromthebus.Arequestthatcannotbeservicedbytheprocessorscacheissenttothesecondlevelinthehierarchy.thelocalcluster.Thislevelincludestheotherproces-sorscacheswithintherequestingpro-cessorscluster.Ifthedataislocallycached,therequestcanbeservicedwith-inthecluster.Otherwise,therequestissenttothehomeclusterlevel.Thehomelevelconsistsoftheclusterthatcon-tainsthedirectoryandphysicalmemo-ryforagivenmemoryaddress.Formanyaccesses(forexample,mostprivatedatareferences).thelocalandhomeclusterarethesame,andthehierarchycollaps-estothreelevels.Ingeneral,however,arequestwilltravelthroughtheinter-connectionnetworktothehomeclus-ter.Thehomeclustercanusuallysatisfytherequestimmediately,butifthedi-rectoryentryisinadirtystate,orinsharedstatewhentherequestingpro-cessorrequestsexclusiveaccess,thefourthlevelmustalsobeaccessed.Theremoteclusterlevelforamemoryblockconsistsoftheclustersmarkedbythedirectoryasholdingacopyoftheblock.Toillustratethedirectoryprotocol,firstconsiderhowaprocessorreadtraversesthememoryhierarchy:Processorlevel-Iftherequestedlocationispresentintheprocessorscache,thecachesimplysuppliesthedata.Otherwise,therequestgoestothelocalclusterlevel.*Localclusterlevel-Ifthedataresideswithinoneoftheothercacheswithinthelocalcluster,thedataissup-ProcessorlevelProcessorcacheLocalclusterlevelOtherprocessorcacheswithinlocalclusterHomeclusterlevelIIDirectoryandmainmemoryassociatedwithagivenaddressIRemoteclusterlevelProcessorcachesin1remoteclustersFigure3.MemoryhierarchyofDash.pliedbythatcacheandnostatechangeisrequiredatthedirectorylevel.Iftherequestmustbesentbeyondthelocalclusterlevel,itgoesfirsttothehomeclustercorrespondingtothataddress.Homeclusterlevel-Thehomeclus-terexaminesthedirectorystateofthememorylocationwhilesimultaneouslyfetchingtheblockfrommainmemory.Iftheblockisclean,thedataissenttotherequesterandthedirectoryisup-datedtoshowsharingbytherequester.Ifthelocationisdirty,therequestisforwardedtotheremoteclusterindi-catedbythedirectory.Remoteclusterlevel-Thedirtyclusterreplieswithasharedcopyofthedata,whichissentdirectlytothere-quester.Inaddition,asharingwrite-backmessageissenttothehomeleveltoupdatemainmemoryandchangethedirectorystatetoindicatethatthere-questingandremoteclusternowhavesharedcopiesofthedata.Havingthedirtyclusterresponddirectlytothere-quester,asopposedtoroutingitthroughthehome.reducesthelatencyseenbytherequestingprocessor.Nowconsiderthesequenceofopera-tionsthatoccurswhenalocationiswrit-ten:Processorlevel-Ifthelocationisdirtyinthewritingprocessorscache,thewritecancompleteimmediately.Otherwise,aread-exclusiverequestisissuedonthelocalclustersbustoob-tainexclusiveownershipofthelineandretrievetheremainingportionofthecacheline.Localclusterlevel-Ifoneofthecacheswithintheclusteralreadyownsthecacheline,thentheread-exclusiverequestisservicedatthelocallevelbyacache-to-cachetransfer.Thisallowspro-cessorswithinaclustertoalternatelymodifythesamememoryblockwithoutanyinterclusterinteraction.Ifnolocalcacheownstheblock,thenaread-ex-clusiverequestissenttothehomeclus-ter.Homeclusterlevel-Thehomeclus-tercanimmediatelysatisfyanowner-shiprequestforalocationthatisintheuncachedorsharedstate.Inaddition,ifablockisinthesharedstate,thenallcachedcopiesmustbeinvalidated.Thedirectoryindicatestheclustersthathavetheblockcached.Invalidationrequestsaresenttotheseclusterswhilethehomeconcurrentlysendsanexclusivedatareplytotherequestingcluster.Ifthedirectoryindicatesthattheblockisdirty,thentheread-exclusiverequestmustbeforwardedtothedirtycluster,asinthecaseofaread.Remoteclusterlevel-Ifthedirec-toryhadindicatedthatthememoryblockwasshared,thentheremoteclustersreceiveaninvalidationrequesttoelim-inatetheirsharedcopy.Uponreceivingtheinvalidation,theremoteclusterssendanacknowledgmenttotherequestingcluster.Ifthedirectoryhadindicatedadirtystate,thenthedirtyclusterre-ceivesaread-exclusiverequest.Asinthecaseoftheread,theremoteclusterrespondsdirectlytotherequestingclus-terandsendsadirty-transfermessagetothehomeindicatingthattherequest-ingclusternowholdstheblockexclu-sively.Whenthewritingclusterreceivesalltheinvalidationacknowledgmentsorthereplyfromthehomeordirtycluster,itisguaranteedthatallcopiesoftheolddatahavebeenpurgedfromthesystem.Iftheprocessordelayscompletingthewriteuntilallacknowledgmentsarere-ceived,thenthenewwritevaluewillbecomeavailabletoallotherproces-sorsatthesametime.However,invali-dationsinvolveround-tripmessagestomultipleclusters,resultinginpotential-lylongdelays.Higherprocessorutiliza-tioncanbeobtainedbyallowingthewritetoproceedimmediatelyafterthe68COMPUTERownershipreplyisreceivedfromthehome.Unfortunately,thismayleadtoReleaseconsistencyprovidesa10-to40-inconsistencieswiththememorymodelassumedbytheprogrammer.ThenextsectiondescribeshowDashrelaxestheconstraintsonmemoryrequestorder-percentincreaseining,grammingmodeltotheuser.Memoryconsistency.Thememoryconsistencymodelsupportedbyanar-chitecturedirectlyaffectstheamountofbufferingandpipeliningthatcantakeplaceamongmemoryrequests.Inaddi-tion,ithasadirecteffectonthecom-plexityoftheprogrammingmodelpre-sentedtotheuser.ThegoalinDashistoprovidesubstantialfreedomintheor-deringamongmemoryrequests,whilestillprovidingareasonableprogram-mingmodeltotheuser.Atoneendoftheconsistencyspec-trumisthesequentialconsistencymod-e1,whichrequiresexecutionofthepar-allelprogramtoappearasaninterleavingoftheexecutionoftheparallelprocess-esonasequentialmachine.Sequentialconsistencycanbeguaranteedbyre-quiringaprocessortocompleteonememoryrequestbeforeitissuesthenextrequest.4Sequentialconsistency,whileconceptuallyappealing,imposesalargeperformancepenaltyonmemoryac-cesses.Formanyapplications,suchamodelistoostrict,andonecanmakedowithaweakernotionofconsistency.Asanexample,considerthecaseofaprocessorupdatingadatastructurewith-inacriticalsection.Ifupdatingthestruc-turerequiresseveralwrites,eachwriteinasequentiallyconsistentsystemwillstalltheprocessoruntilallothercachedcopiesofthatlocationhavebeeninval-idated.Butthesestallsareunnecessaryastheprogrammerhasalreadymadesurethatnootherprocesscanrelyontheconsistencyofthatdatastructureuntilthecriticalsectionisexited.Ifthesynchronizationpointscanbeidenti-fied,thenthememoryneedonlybeconsistentatthosepoints.Inparticular,Dashsupportstheuseofthereleaseconsistencymodel,whichonlyrequirestheoperationstohavecompletedbe-foreacriticalsectionisreleased(thatis,alockisunlocked).Suchaschemehastwoadvantages.First,itprovidestheuserwithareason-ableprogrammingmodel,sincethepro-grammerisassuredthatwhenthecriti-calsectionisexited,allotherprocessorswillhaveaconsistentviewofthemod-ifieddatastructure.Second,itpermitsreadstobypasswritesandtheinvalida-tionsofdifferentwriteoperationstooverlap,resultinginlowerlatenciesforaccessesandhigheroverallperformance.Detailedsimulationstudiesforproces-sorswithblockingreadshaveshownthatreleaseconsistencyprovidesa10-to40-percentincreaseinperformanceoversequentialconsistency.Thedis-advantageofthemodelisthatthepro-grammerorcompilermustidentifyallsynchronizationaccesses.TheDashprototypesupportsthere-leaseconsistencymodelinhardware.Sinceweusecommercialmicroproces-sors,theprocessorstallsonreadopera-tionsuntilthereaddataisreturnedfromthecacheorlowerlevelsofthememoryhierarchy.Writeoperations,however,arenonblocking.Thereisawritebufferbetweenthefirst-andsec-ond-levelcaches.Thewritebufferqueuesupthewriterequestsandissuestheminorder.Furthermore,theservic-ingofwriterequestsisoverlapped.Assoonasthecachereceivestheowner-shipanddatafortherequestedcacheline,thewritedataisremovedfromthewritebufferandwrittenintothecacheline.Thenextwriterequestcanbeser-vicedwhiletheinvalidationacknowl-edgmentsforthepreviouswriteopera-tionsfilterin.Thus,parallelismexistsattwolevels:theprocessorexecutesotherinstructionsandaccessesitsfirst-levelcachewhilewriteoperationsarepend-ing,andinvalidationsofmultiplewriteoperationsareoverlapped.TheDashprototypealsoprovidesfenceoperationsthatstalltheprocessororwrite-bufferuntilpreviousopera-tionscomplete.Thesefenceoperationsallowsoftwaretoemulatemorestrin-gentconsistencymodels.Memoryaccessoptimizations.Theuseofreleaseconsistencyhelpshidethelatencyofwriteoperations.However,sincetheprocessorstallsonreadoper-ations,itseestheentiredurationofallreadaccesses.Forapplicationsthatex-hibitpoorcachebehaviororextensiveread/writesharing,thiscanleadtosig-nificantdelayswhiletheprocessorwaitsforremotecachemissestobefilled.TohelpwiththeseproblemsDashprovidesavarietyofprefetchandpipeliningop-erations.Prefetchoperations.Aprefetchoper-ationisanexplicitnonblockingrequesttofetchdatabeforetheactualmemoryoperationisissued.Hopefully,bythetimetheprocessneedsthedata,itsval-uehasbeenbroughtclosertothepro-cessor,hidingthelatencyoftheregularblockingread.Inaddition,nonblockingprefetchallowsthepipeliningofreadmisseswhenmultiplecacheblocksareprefetched.Asasimpleexampleofitsuse,aprocesswantingtoaccessarowofamatrixstoredinanotherclustersmem-orycandosoefficientlybyfirstissuingprefetchreadsforallcacheblockscor-respondingtothatrow.Dashsprefetchoperationsarenon-bindingandsoftwarecontrolled.Theprocessorissuesexplicitprefetchoper-ationsthatbringasharedorexclusivecopyofthememoryblockintothepro-cessorscache.Notbindingthevalueatthetimeoftheprefetchisimportantinthatissuingtheprefetchdoesnotaffecttheconsistencymodelorforcethecom-pilertodoaconservativestaticdepen-dencyanalysis.Thecoherenceprotocolkeepstheprefetchedcachelinecoher-ent.Ifanotherprocessorhappenstowritetothelocationbeforetheprefetch-ingprocessoraccessesthedata,thedatawillsimplybeinvalidated.Theprefetchwillberenderedineffective,butthepro-gramwillexecutecorrectly.Supportforanexclusiveprefetchoperationaidscaseswheretheblockisfirstreadandthenupdated.Byfirstissuingtheexclu-siveprefetch,theprocessoravoidsfirstobtainingasharedcopyandthenhav-ingtorerequestanexclusivecopyoftheblock.Studieshaveshownthat,forcer-tainapplications,theadditionofasmallnumberofprefetchinstructionscanin-creaseprocessorutilizationbymorethanafactoroftwo.Updateanddeliveroperations.Insomeapplications,itmaynotbepossiblefortheconsumerprocesstoissueaprefetchearlyenoughtoeffectivelyhidethela-tencyofmemory.Likewise,ifmultipleMarch199269consumersneedthesameitemofdata,thecommunicationtrafficcanbere-ducedifdataismulticasttoallthecon-sumerssimultaneously.Therefore,Dashprovidesoperationsthatallowthepro-ducertosenddatadirectlytoconsum-ers.Therearetwowaysfortheproduc-ingprocessortospecifytheconsumingprocessors.Theupdate-writeoperationsendsthenewdatadirectlytoallproces-sorsthathavecachedthedata,whilethedeliveroperationsendsthedatatospec-ifiedclusters.Theupdate-writeprimitiveupdatesthevalueofallexistingcopiesofadataword.Usingthisprimitive,aprocessordoesnotneedtofirstacquireanexclu-sivecopyofthecacheline,whichwouldresultininvalidatingallothercopies.Rather,dataisdirectlywrittenintothehomememoryandallothercacheshold-ingacopyoftheline.Thesesemanticsareparticularlyusefulforeventsynchro-nization,suchasthereleaseeventforabarrier.Thedeliverinstructionexplicitlyspec-ifiesthedestinationclustersofthetrans-fer.Tousethisprimitive,theproducerfirstwritesintoitscacheusingnormal,invalidatingwriteoperations.Thepro-ducerthenissuesadeliverinstruction,givingthedestinationclustersasabitvector.Acopyofthecachelineisthensenttothespecifiedclusters,andthedirectoryisupdatedtoindicatethatthevariousclustersnowsharethedata.Thisoperationisusefulincaseswhentheproducermakesmultiplewritestoablockbeforetheconsumerswillwantitorwhentheconsumersareunlikelytobecachingtheitematthetimeofthewrite.Supportforsynchronization.Theac-cesspatternstolocationsusedforsyn-chronizationareoftendifferentfromthoseforothershareddata.Forexam-ple,wheneverahighlycontendedlockisreleased,waitingnodesrushtograbthelock.Inthecaseofbarriers,manypro-cessorsmustbesynchronizedandthenreleased.Suchactivityoftencauseshotspotsinthememorysystem.Consequent-ly,synchronizationvariablesoftenwar-rantspecialtreatment.Inadditiontoupdatewrites,Dashprovidestwoexten-sionstothecoherenceprotocolthatdi-rectlysupportsynchronizationobjects.Thefirstisqueue-basedlocks,andthesecondisfetch-and-incrementopera-tions.Mostcache-coherentarchitectureshandlelocksbyprovidinganatomictest&setinstructionandacachedtest-and-test&setschemeforspinwaiting.Ideally,thesespinlocksshouldmeetthefollowingcriteria:minimumamountoftrafficgener-lowlatencyreleaseofawaitingpro-lowlatencyacquisitionofafreelock.atedwhilewaiting,cessor,andCachedtest&setschemesaremoder-atelysuccessfulinsatisfyingthesecrite-riaforlow-contentionlocks,butfailforhigh-contentionlocks.Forexample,assumethereareNprocessorsspinningonalockvalueintheircaches.Whenthelockisreleased,allNcachevaluesareinvalidated,andNreadsaregener-atedtothememorysystem.Dependingonthetiming,itispossiblethatallNprocessorscomebacktodothetest&setonthelocationoncetheyrealizethelockisfree,resultinginfurtherinvali-dationsandrereads.Suchascenarioproducesunnecessarytrafficandincreas-esthelatencyinacquiringandreleasingalock.Thequeue-basedlocksinDashad-dressthisproblembyusingthedirecto-rytoindicatewhichprocessorsarespin-ningonthelock.Whenthelockisreleased,oneofthewaitingclustersischosenatrandomandisgrantedthelock.Thegrantrequestinvalidatesonlythatclusterscachesandallowsonepro-cessorwithinthatclustertoacquirethelockwithalocaloperation.Thisschemelowersboththetrafficandthelatencyinvolvedinreleasingaprocessorwait-ingonalock.Informingonlyoneclus-terofthereleasealsoeliminatesunnec-essarytrafficandlatencythatwouldbeincurredifallwaitingprocessorswereallowedtocontend.Atime-outmecha-nismonthelockgrantallowsthegranttobesenttoanotherclusterifthespin-ningprocesshasbeenswappedoutormigrated.Thequeued-on-lock-bitprimitivedescribedinGoodmanetal.issimilartoDashsqueue-basedlocks,butusespointersintheprocessorcach-estomaintainthelistofthewaitingprocessors.Thefetch-and-incrementandfetch-und-decrementprimitivesprovideatomicincrementanddecrementoperationsonuncachedmemorylocations.Thevaluereturnedbytheoperationsisthevaluebeforetheincrementordecrement.Theseoperationshavelowserializationandareusefulforimplementingseveralsynchronizationprimitivessuchasbar-riers,distributedloops,andworkqueues.Theserializationoftheseoperationsissmallbecausetheyaredonedirectlyatthememorysite.Thelowserializationprovidedbythefetch-and-incrementoperationisespeciallyimportantwhenmanyprocessorswanttoincrementalocation,ashappenswhengettingthenextindexinadistributedloop.Thebenefitsoftheproposedoperationsbecomeapparentwhencontrastedwiththealternativeofusinganormalvari-abl

人人文库> 全部分类> 专业文献 > 工程机械

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

49-The Stanford DASH Multiprocessor.pdf

文档简介

温馨提示

最新文档

评论

49-The Stanford DASH Multiprocessor.pdf

文档简介

温馨提示

最新文档

评论

相关文档