50-DDM--A Cache-Only Memory Architecture.pdf_第1页
50-DDM--A Cache-Only Memory Architecture.pdf_第2页
50-DDM--A Cache-Only Memory Architecture.pdf_第3页
50-DDM--A Cache-Only Memory Architecture.pdf_第4页
50-DDM--A Cache-Only Memory Architecture.pdf_第5页
已阅读5页,还剩6页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

DDM-ACache-OnlvMemoryArchitectureErikHagersten,AndersLandin,andSeifHaridiSwedishInstituteofComputerScienceultiprocessorsprovidingasharedmemoryviewtotheprogrammeraretypicallyimplementedassuch-withasharedmemory.WeintroduceManarchitecturewithlargecachestoreducelatencyandnetworkload.Becauseallsystemmemoryresidesinthecaches,aminimumnumberofnetworkaccessesareneeded.Still,itpresentsashared-memoryviewtotheprogrammer.Singlebus.Shared-memorysystemsbasedonasinglebushavesometensofprocessors,eachonewithalocalcache,andtypicallysufferfrombussaturation.Acache-coherenceprotocolineachcachesnoopsthetrafficonthecommonbusandpreventsinconsistenciesincachecontents.ComputersmanufacturedbySequentandEncoreusethiskindofarchitecture.Becauseitprovidesauniformaccesstimetothewholesharedmemory,itiscalledauniformmemoryarchitecture(UMA).ThecontentionforthecommonmemoryandthecommonbuslimitsthescalabilityofUMAs.Distributed.ComputerssuchastheBBNButterflyandtheIBMRP3useanarchitecturewithdistributedsharedmemory,knownasanonuniformmemoryarchitecture(NUMA).Eachprocessornodecontainsaportionofthesharedmemory,soaccesstimestodifferentpartsofthesharedaddressspacecanvary.NUMAsoftenhavenetworksotherthanasinglebus,andthenetworkdelaycanvarytodifferentnodes.TheearlierNUMAsdidnothavecoherentcachesandlefttheproblemofmaintainingcoherencetotheprogrammer.Today,researchersarestrivingtowardcoherentNUMAswithdirectory-basedcache-coherenceproto-col.Bystaticallypartitioningtheworkanddata,programmerscanoptimizeprogramsforNUMAs.ApartitioningthatenablesprocessorstomakemostoftheiraccessestotheirpartofthesharedmemoryachievesabetterscalabilitythanispossibleinUMA.Anewarchitecturehastheprogrammingparadigmofshared-memoryarchitecturesCache-only.Inacache-onlymemoryarchitecture(COMA),thememoryorga-nizationissimilartothatofaNUMAinthateachprocessorholdsaportionofthebutnophysicallysharedmemory.Cachesaddressspace.However,thepartitioningofdataamongthememoriesdoesnothavetobestatic,sincealldistributedmemoriesareorganizedlikelarge(second-level)caches.Thetaskofsuchamemoryistwofold.Besidesbeingalargecachefortheprocessor,itmayalsocontainsomedatafromthesharedaddressspacethattheprocessorneverhasaccessed-inotherwords,itisacacheandavirtualpartofattachedtotheprocessorscontainalltheSystemmemory.440018-9162/92/0900-0044$01.0001992ICOMPUTERthesharedmemory.Wecallthisinter-mediateformofmemoryattractionmem-ory.Acoherenceprotocolattractsthedatausedbyaprocessortoitsattractionmemory.Comparabletoacacheline,thecoherenceunitmovedaroundbytheprotocoliscalledanitem.Onamemoryreference,avirtualaddressistranslatedintoanitemidentifier.Theitemidentifierspaceislogicallythesameasthephysicaladdressspaceoftypicalmachines,butthereisnopermanentmappingbetweenanitemidentifierandaphysicalmemorylocation.Instead,anitemidentifiercorrespondstoaloca-tioninanattractionmemory,whosetagmatchestheitemidentifier.Actually,therearecaseswheremultipleloca-tionsofdifferentattractionmemoriescouldmatch.ACOMAprovidesaprogrammingmodelidenticaltothatofshared-mem-oryarchitectures,butitdoesnotre-quirestaticdistributionofexecutionandmemoryusagetorunefficiently.RunninganoptimizedNUMAprogramonaCOMAresultsinaNUMA-likebehavior,sincetheworkspacesofthedifferentprocessorsmigratetotheirattractionmemories.However,aUMAversionofthesameprogramwouldhaveasimilarbehavior,becausethedataareattractedtotheusingprocessorregard-lessoftheaddress.ACOMAalsoadaptstoandperformswellforprogramswithamoredynamicorsemidynamicsched-uling.Theworkspacemigratesaccord-ingtoitsusagethroughoutthecompu-tation.ProgramscanbeoptimizedforaCOMAtotakethispropertyintoac-counttoachieveabetterlocality.ACOMAallowsfordynamicdatausewithoutduplicatingmuchmemory,comparedwithanarchitectureinwhichacacheddatumalsooccupiesspaceinthesharedmemory.Toavoidincreas-ingthememorycost,theattractionmemoriesshouldbeimplementedwithordinarymemorycomponents.There-fore,weviewtheCOMAapproachasasecond-level,orhigherlevel,cachetech-nique.Theaccessingtimetotheattrac-tionmemoryofaCOMAiscompar-abletothattothememoryofacache-coherentNUMA.Figure1com-paresCOMAstoothershared-memoryarchitectures.AnewCOMA.ThisarticledescribesthebasicideasbehindanewCOMA.Thearchitecture,calledtheDataDiffu-sionMachine(DDM),reliesonahier-NetworkNetwork1IIProcessorProcessor(a)(b)(4Figure1.Shared-memoryarchitecturescomparedwithCOMAs:(a)uniformmemoryarchitecture(UMA),(b)nonuniformmemoryarchitecture(NUMA),and(c)cache-onlymemoryarchitecture(COMA).readNotation:in-transaction/out-transactionP=processortransactionN=networktransactionD=DirtyI=InvalidR=ResewedV=ValidPwrite,1Pread0UPreadFigure2.Anexampleofaprotocolsimilartothewrite-onceprotocol.archicalnetworkstructure.Weintro-ducethekeyideasbehindDDMbydescribingasmallmachineanditspro-tocol.Wealsodescribealargemachinewithhundredsofprocessors,overviewtheongoingprototypeproject,andpro-videsimulatedperformancefigures.Cache-coherencestrategiesTheproblemofmaintainingcoher-enceamongread-writedatasharedbydifferentcacheshasbeenstudiedex-tensively.Eithersoftwareorhardwarecanmaintaincoherence.WebelievehardwarecoherenceisneededinaCOMAforefficiency,sincetheitemmustbesmalltopreventperformancedegradationbyfalsesharing.(Infalsesharing,twoprocessorsaccessingdif-ferentpartsofthesameitemconflictwitheachother,eventhoughtheydonotshareanydata.)Wemeasuredaspeedupof50percentwhenfalseshar-ingwasremovedfromthewindtunnelapplication,MP3D-Diff,reportedinthe“Simulatedperformance”section.Hard-ware-basedschemesmaintaincoherencewithoutinvolvingsoftwareandcanbeimplementedmoreefficiently.Exam-plesofhardware-basedprotocolsaresnooping-cacheprotocolsanddirecto-ry-basedprotocols.Snooping-cacheprotocolshaveadis-tributedimplementation.Eachcacheisresponsibleforsnoopingtrafficonthebusandtakingactionstoavoidaninco-herence.Anexampleofsuchaprotocolisthewrite-onceprotocolintroducedbyGoodmananddiscussedbySten-strom.AsFigure2shows,inthatproto-col,eachcachelinecanbeinoneoffourstates:Invalid,Valid,Reserved,orDirty.ManycachescanhavethesamecachelineinthestateValidatthesametime,andmayreaditlocally.WhenwritingtoacachelineinValid,thelinechangesSeptember199245statetoReserved,andawriteissentonthecommonbustothecommonmemory.AllothercacheswithlinesinValidsnoopthewriteandinvalidatetheircopies.Atthispointthereisonlyonecachedcopyofthecachelinecon-tainingthenewlywrittenval-ue.Thecommonmemorynowalsocontainsthenewvalue.IfacachealreadyhasthecachelineinReserved,itcanperformawritelocallywithoutanytransactionsonthecommonbus.Itsvaluenowdiffersfromthatinthememory.anditsstateisthere-forechangedtoDirty.Anyreadrequestsfromothercachestothatcachelinemustnowbeinterceptedtopro-videthenewvalue(markedby“intercept”inFigure2).selectionIIIIIIIIIIIProcessorIFigure3.Thearchitectureofasingle-busDDM.Belowtheattractionmemoriesaretheprocessors.Ontopofthebusarearbitrationandselection.Snoopingcachesrelyonbroadcast-ingandarenotsuitedforgeneralinter-connectionnetworks:Unrestrictedbroadcastingwoulddrasticallyreducetheavailablebandwidth,therebyobvi-atingtheadvantageofgeneralnetworks.Instead.directory-basedschemessendmessagesdirectlybetweennodes.Areadrequestissenttomainmemory,withoutanysnooping.Themainmemo-ryknowsifthecachelineiscached-andinwhichcacheorcaches-andwhetherithasbeenmodified.Ifthelinehasbeenmodified,thereadrequestispassedontothecachewithacopy,whichprovidesacopyfortherequest-ingcache.Onawritetoasharedcacheline,awriterequestsenttothemainmemorycausesinvalidationmessagestoallcacheswithcopiestobesent.Thecachesrespondwithacknowledgemes-sages.Toachievesequentialconsisten-cy,allacknowledgmentsmustbere-ceivedbeforethewriteisperformed.Thecache-coherenceprotocolforaCiMAcanadopttechniquesusedinothercache-coherenceprotocolsandextendthemwiththefunctionalityforfindingadatumonacachereadmissandforhandlingreplacement.Adirec-tory-basedprotocolcouldhaveapartofthedirectoryinformation,thedirectoryhome,staticallydistributedinaNUMAfashion,whilethedatawouldbeal-lowedtomovefreely.Retrievingthedataonareadmisswouldthenrequireoneextraindirectaccesstothedirecto-ryhometofindwheretheitemcurrent-46lyresides.Theaccesstime,includingthisextraindirection,wouldbeidenti-caltothatrequiredforreadingadirtycachelinenotinaNUMAshomenode.Thedirectoryhomecanalsomakesurethatthelastcopyofanitemisnotlost.Insteadoftheabovestrategy,DDMisbasedonahierarchicalsnoopingbusarchitectureandusesahierarchicalsearchalgorithmforfindinganitem.ThedirectoryinformationinDDMisdynamicallydistributedinthehierar-chy.AminimalCOMAWeintroduceDDMbylookingatthesmallestinstanceofthearchitecture,whichcouldbeaCOMAonitsownorasubsystemofalargerCOMA.AsinglebusconnectstheattractionmemoriesoftheminimalDDM.Thedistributionandcoherenceofdataamongtheat-tractionmemoriesarecontrolledbythesnoopingprotocolmemoryabove,andtheinterfacebetweentheprocessorandtheattractionmemoryisdefinedbytheprotocolmemorybelow.Theprotocolviewsacachelineofanattractionmem-ory,herecalledanitem,asoneunit.Theattractionmemorystoresonesmallstatefieldperitem.Figure3showsthenodearchitectureinthesingle-busDDM.DDMusesanasynchronoussplit-transactionbus:Thebusisreleasedbe-tweenarequestingtransactionanditsreply,forexample,betweenareadre-questanditsdatareply.Thedelaybetweentherequestanditsreplycanbeofarbi-trarylength,andtheremightbealargenumberofout-standingrequests.Thereplytransactionwilleventuallyappearonthebusasadiffer-enttransaction.Unlikeoth-erbuses,theDDMbushasaselectionmechanismtomakesurethatatmostonenodeisselectedtoservicearequest.Thisguaranteesthateachtransactiononthebusdoesnotproducemorethanonenewtransactionforthebus,arequirementnecessaryfordeadlockavoidance.Single-busDDMprotocol.Wedevelopedanewproto-col,similarinmanywaystothesnooping-cacheprotocol,limitingbroadcastrequirementstoasmallersubsystemandaddingsupportforre-plaement.Thewritecoherencepartoftheprotocolisthewrite-invalidatetype:Tokeepdatacoherent,allcopiesoftheitemexcepttheonetobeupdatedareerasedonawrite.InaCOMAwithasmallitemsize,thealternativeap-proach,writeupdate,couldalsobeat-tractive:Onawrite,thenewvalueismulticasttoall“caches”withasharedcopyoftheitem.Theprotocolalsohandlestheattrac-tionofdata(read)andreplacementwhenasetinanattractionmemorygetsfull.Thesnoopingprotocoldefinesanewstateandanewtransactiontosendasafunctionofthetransactionappearingonthebus,andthepresentstateoftheitemintheattractionmemory:Protocol:oldstatextransaction-+newstatexnewtransactionAnitemcanbeinoneofsevenstates(thesubsystemistheattractionmemo-ry):*Invalid.Thissubsystemdoesnotcontaintheitem.Exclusive.Thissubsystemandnoothercontainstheitem.Shared.Thissubsystemandpossi-blyothersubsystemscontaintheitem.Reading.Thissubsystemiswaitingforadatavalueafterhavingissuedaread.COMPUTERI.Waiting.ThissubsystemiswaitingtobecomeExclusiveafterhavingissuedanerase.Reading-and-Waiting.Thissub-systemiswaitingforadatavalue,latertobecomeExclusive.Answering.Thissubsystemhaspromisedtoanswerareadrequest.NdatdNeraseNexclusiveIThefirstthreestates-Invalid,Ex-clusive,andShared-correspondtothestatesInvalid,Reserved,andValidinGoodmanswrite-onceprotocol.ThestateDirtyinthatprotocol-withthemeaningthatthisistheonlycachedcopyanditsvaluediffersfromthatinthememory-hasnocorrespondenceinaCOMA.NewstatesintheprotocolarethetransientstatesReading,Wait-ing,Reading-and-Waiting,andAnswer-ing.Transientstatesarerequiredbe-causeofthesplit-transactionbusandtheneedtorememberoutstandingre-quests.Thebuscarriesthefollowingtransac-tions:1(Nerase/NreadErase.Eraseallcopiesofthisitem.Exclusive.AcknowledgeaneraseRead.Readacopyoftheitem.Data.Carrythedatainreplytoanearlierreadrequest.Inject.Carrytheonlycopyofanitemandlookforasubsystemtomoveinto-causedbyareplace-ment.Out.Carrytheitemonitswayoutofthesubsystem-causedbyare-placement.Itwillterminatewhenanothercopyoftheitemisfound.request.Nexclusive/Nread0PwriteAprocessorwritinganiteminExclu-sivestateorreadinganiteminExclu-siveorSharedstateproceedswithoutinterruption.AsFigure4shows,areadattemptofaniteminInvalidwillresultinaReadrequestandanewstate,Read-ing.Thebusselectionmechanismwillselectoneattractionmemorytoservicetherequest,eventuallyputtingaDatatransactiononthebus.Therequestingattractionmemory,nowinReading,willgrabtheDatatransaction,changetoShared,andcontinue.ProcessorsareallowedtowriteonlytoitemsinExclusivestate.IftheitemisinShared,allothercopieshavetobeerasedandanacknowledgmentreceivedbeforethewritingisallowed.Theat-tractionmemorysendsanErasetrans-actionandwaitsfortheExclusiveac-PreadLaNeraselNreadANread/NdataPread/NreadNdataNotation:in-transaction/out-transactionNerasePwrite/NreadP=fromprocessorE=ExclusiveI=invalidNreadlNdataN=fromnetworkR=ReadingRW=Reading-S=Sharedand-Waiting($PreadW=WaitingFigure4.Asimplifiedrepresentationoftheattractionmemoryprotocolnotin-cludingreplacement.knowledgmentinthenewstate,Wait-ing.Manysimultaneousattemptstowritethesameitemwillresultinmanyattrac-tionmemoriesinWaiting,allwithanoutstandingErasetransactionintheiroutputbuffers.ThefirstErasetoreachthebusisthewinnerofthewriterace.Allothertransactionsboundforthesameitemareremovedfromthesmalloutputbuffers.Therefore,thebuffersalsohavetosnooptransactions.Theoutputbufferscanbelimitedtoadepthofthree,anddeadlockcanstillbeavoid-edwithaspecialarbitrationalgorithm.ThelosingattractionmemoriesinWait-ingchangestatetoReading-and-Wait-ing,whileoneofthemputsareadre-questinitsoutputbuffer.EventuallythetopprotocolofthebusreplieswithanExclusiveacknowledgment,tellingtheonlyattractionmemoryleftinWait-ingthatitmaynowproceed.WritingtoanitemintheInvalidstateresultsinaReadrequestandanewstate,Reading-and-Waiting.UpontheDatareply,thestatechangestoWaitingandanEraserequestissent.Replacement.Likeordinarycaches,theattractionmemorywillrunoutofspace,forcingsomeitemstomakeroomformorerecentlyaccessedones.Ifthesetwhereanitemissupposedtoresideisfull,oneiteminthesetisselectedtobereplaced.Forexample.theoldestiteminShared.ofwhichtheremightbeothercopies,maybeselected.Replac-inganiteminSharedgeneratesanOuttransaction.Thespaceusedbytheitemcannowbereclaimed.IfanOuttrans-actionseesanattractionmemoryinShared,Reading,Waiting,orReading-and-Waiting,itdoesnothing;otherwiseitisconvertedtoanInjecttransactionbythetopprotocol.AnInjecttransac-tioncanalsobeproducedbyreplacinganiteminExclusive.Theinjecttransac-tionisthelastcopyofanitemtryingtofindanewhomeinanewattractionmemory.Inthesingle-busimplementa-tion,itwilldosofirstbychoosinganemptyspace(Invalidstate),andsecondbyreplacinganiteminSharedstate-inotherwords,itwilldecreasetheamountofsharing.Iftheitemidentifierspace,whichcorrespondstothephysi-caladdressspaceofconventionalarchi-tectures,isnotmadelargerthanthesumoftheattractionmemorysizes,itispossibletodeviseasimpleschemethatguaranteesaphysicallocationforeachitem.Oftenaprogramusesonlyaportionofacomputersphysicaladdressspace.Thisisespeciallytrueofoperatingsys-temswithafacilityforeagerreclaimingofunusedworkspace.InDDM,theunuseditemspacecanbeusedtoin-creasethedegreeofsharingbypurgingtheunuseditems.Theoperatingsystemmightevenchangethedegreeofshar-ingdynamically.ThehierarchicalDDMSofar,wehavepresentedacache-coherentsingle-busmultiprocessorwith-outphysicallysharedmemory.Instead,theresourcesformhugesecond-levelcachescalledattractionmemories,min-imizingthenumberofaccessestotheSeptember199247II-GIGIElD=DirectoryP=ProcessorAM=AttractionmemoryIDirectoryiControllerJIDDMbusFigure5.AhierarchicalDDMwiththreelevels.Figure6.Thearchitectureofadirectory.onlysharedresourceleft:thesharedbus.Datacanresideinanyormanyoftheattractionmemories.Dataareauto-maticallymovedwhereneeded.Tomakethesingle-busDDMasub-systemofalargehierarchicalDDM,wereplacethetopwithadirectory,whichinterfacesbetweenthebusandahigherlevelbusofthesametype.Figure5showsthehierarchy.Thedirectoryisaset-associativestatememorythatkeepsinformationforalltheitemsintheattractionmemoriesbelowit,butcontainsnodata.Thedi-rectorycananswerthesequestions:“Isthisitembelowme?”and“Doesthisitemexistoutsidemysubsystem?”Fromthebusabove,thedirectoryssnoopingprotocoldirectoryabovebehavesverymuchlikethememoryaboveprotocol.Fromthebusbelow,itsdirectorybelowprotocolbehaveslikethetopprotocolforitemsintheExclusivestate.Thismakesoperationsonitemslocaltoabusidenticaltothoseofthesingle-busDDM.Thedirectorypassesthroughonlytrans-actionsfrombelowthatcannotbecom-pletedinsideitssubsystemortransac-tionsfromabovethatneedtobeservicedbyitssubsystem.Inthatsense,thedi-rectoryactsasafilter.AsFigure6shows,thedirectoryhasasmalloutputbufferaboveittostoretransactionswaitingtobesentonthehigherbus.Transactionsforthelowerbusarestoredinanotheroutputbufferbelow,andtransactionsfromthelowerbusarestoredinaninputbuffer.Adirectoryreadsfromtheinputbufferwhenithasthetimeandspacetodoalookupinitsstatusmemory.Thisisnotpartoftheatomicsnoopingactionofthebus.ThehierarchicalDDManditsproto-colhaveseveralsimilaritieswitharchi-tecturesproposedbyWilson5andGood-manandWoest.6DDMis,however,differentinitsuseoftransientstatesintheprotocol,itslackofphysicallysharedmemory,anditsnetwork(higherlevelcaches)thatstoresonlystateinforma-tionandnodata.Multilevelread.Ifthesubsystemsconnectedtothebuscannotsatisfyareadrequest,thenexthigherdirectoryretransmitstherequestonthenexthigh-erbus.ThedirectoryalsochangestheitemsstatetoReading,markingtheoutstandingrequest.Eventually,therequestreachesalevelinthehierarchywhereadirectorycontainingacopyoftheitemisselectedtoanswerthere-quest.TheselecteddirectorychangestheitemsstatetoAnswering,markinganoutstandingrequestfromabove,andretransmitstheReadrequestonitslow-erbus.AsFigure7shows,thetransientstatesReadingandAnsweringinthedirectoriesmarktherequestspaththroughthehierarchy,likeanunwoundredreadthreadthatshowsthewaythroughamaze,appearinginredinFigure7.Aflow-controlmechanisminthepro-tocolpreventsdeadlockiftoomanyprocessorstrytounwindareadthreadtothesamesetinadirectory.Whentherequestfinallyreachesanattractionmemorywithacopyoftheitem,itsdatareplysimplyfollowsthereadthreadbacktotherequestingnode,changingallthestatesalongthepathtoShared.CombinedreadsandbroadcastsaresimpletoimplementinDDM.IfaReadrequestfindsthereadthreadunwoundfortherequesteditem(ReadingorAnsweringstate),itsimplyterminatesandwaitsfortheDatareplythateven-tuallywillfollowthatpathonitswayback.Multilevelwrite.AnErasefrombe-lowtoadirectorywiththeiteminEx-clusivestateresultsinanExclusiveac-knowledgmentbeingsentbelow.AnErasethatcannotgetitsacknowledg-mentfromthedirectorywillworkitswayupthehierarchy,changingthedi-rectoriesstatestoWaitingtomarktheoutstandingrequest.AllsubsystemsofabuscarryinganErasetransactionwillgettheircopieserased.Thepropaga-tionoftheEraseendswhenitreachesadirectoryinExclusive(orthetop),andtheacknowledgmentissentbackalongthepathmarkedWaiting,changingthestatestoExclusive.Awriteracebetweenanytwoproces-sorsinthehierarchicalDDMhasasolu-tionsimilartothatofasingle-busDDM.ThetwoEraserequestsarepropagatedupthehierarchy.ThefirstErasetrans-actiontoreachthelowestbuscommontobothprocessorsisthewinner,asshowninFigure8.Thelosingattractionmem-ory(inReading-and-Waiting)willre-startanewwriteactionautomaticallyuponreceiptoftheerase.ReplacementinthehierarchicalDDM.ReplacementofaSharediteminthehierarchicalDDMresultsinanOuttransactionpropagatingupthehierar-chyandterminatingwhenitfindsasub-systeminanyofthefollowingstates:Shared,Reading,Waiting,orAnswer-ing.IfthelastcopyofanitemmarkedSharedisreplaced,anOuttransaction48COMPUTERsiveacknowledge(loser).Figure7.AreadrequestfromprocessorP,hasfounditswaytoacopyoftheitemintheattractionmemoryofpro-cessorp,.ItspathismarkedwithstatesReading(R)andAnswering(A),whichwillguidethedatareplybacktoP,.(IindicatesprocessorsintheInvalidstate,SprocessorsintheSharedstate.)Figure8.AwriteracebetweentwoprocessorsP,andP,isresolvedwhentherequestoriginatingfromP,reachesthetopbus(thelowestbuscommontobothprocessors).Thetopcannowsendtheacknowledgment,Exclusive,whichfollowsthepathmarkedwithWs(processorsintheWait-ingstate)backtothewinningprocessorP,.TheWaitingstatesarechangedtoExclusivebytheacknowledgment.TheErasetransactionwillerasethedatainP,andP,forc-ingP,toredoitswriteattempt.thatfailstoterminatewillreachadirec-toryinExclusiveandturnintoanInjecttransaction.ReplacinganiteminEx-clusivegeneratesanInjecttransactionthattriestofindanemptyspaceinaneighboringattractionmemory.InjecttransactionsfirsttrytofindanemptyspaceintheattractionmemoriesofthelocalDDMbus,asinthesingle-busDDM.Unlikeinasingle-busDDM,anInjectfailingtofindanemptyspaceonthelocalDDMbuswillturntoaspecialbus,itshomebus,determinedbytheitemidentifier.Onthehomebus,theInjectwillforceitselfintoanattractionmemory,possiblybythrowingoutafor-eigneroraShareditem.Theitemhomespaceisequallydividedbetweenthebottommostbuses,andthereforespaceisguaranteedonthehomebus.ThepreferredlocationinDDMisdifferentfrommemorylocationinNUMAsinthatanitemseeksahomeonlyatreplacementafterfailingtofindspaceelsewhere.Whentheitemisnotinitshomeplace,otheritemscanuseitsplace.ThehomealsodiffersfromtheNUMAapproachinbeingabus:Anyattractionmemoryonthatbuswilldo.Thedetailsofthedirectoryprotocolsareavailableelsewhere.4Replacementinadirectory.BaerandWangstudiedthemultilevelinclusionproperty,whichhasthefollowingim-plicationsforoursystem:Adirectoryatleveli+1hastobeasupersetofthedirectories,orattractionmemories,atleveli.Inotherwords,thesizeofadirectoryanditsassociativity(numberofways)mustbeB,timesthatoftheunderlyingleveli,whereB,isthebranchfactoroftheunderlyingleveli,andsizemeansthenumberofitems:Size:Dit-,+,=B,4:Dir,Associativity:Dir,+,=B,*:Dir,Evenifimplementable,higherlevelmemorieswouldbecomeexpensiveandslowifthosepropertieswerefulfilledforalargehierarchicalsystem.Howev-er,theeffectsofthemultilevelinclu-sionpropertyarelimitedinDDM.Itstoresonlystateinformationinitsdi-rectoriesanddoesnotreplicatedatainthehigherlevels.Yetanotherwaytolimittheeffectistouse“imperfectdi-rectories”withsmallersets(lowernum-berofways)thanwhatisrequiredformultilevelinclusionandtogivethedi-rectoriestheabilitytoperformreplace-ment,thatis,tomoveallcopiesofanitemoutoftheirsubsystem.Wecankeeptheprobabilityofreplacementatareasonablelevelbyincreasingtheasso-ciativitymoderatelyhigherupinthehierarchy.Ahigherdegreeofsharingalsohelpstokeepthatprobabilitylow.Ashareditemoccupiesspaceinmanyattractionmemories,butonlyonespaceinthedirectoriesabovethem.Theim-plementationofdirectoryreplacementrequiresoneextrastateandtwoextratransactions.4Otherprotocols.Ourprotocolgivestheprogrammerasequentiallyconsis-tentsystem.Itfulfillsthestrongestmem-oryaccessmodel,butperformanceisdegradedbecausetheprocessorhastowaitfortheacknowledgmentbeforeitcanperformthewrite.However,theacknowledgmentissentbythetopmostnodeofthesubsysteminwhichallcop-iesoftheitemreside,insteadofbyeachindividualattractionmemorywithacopy.Thisnotonlyreducestheremotedelay,italsocutsdownthenumberofsystemtransactions.Thewritermightactuallyreceivetheacknowledgmentbeforeallcopiesareerased.Neverthe-less,sequentialconsistencycanbeguar-anteed.xThehierarchicalstructurecanalsoefficientlysupportlooserformsofconsistencyprovidinghigherperfor-mance.Wehavedesignedaprocessor-consistentprotocolXandaprotocolcom-biningprocessorconsistencywithanadaptivewriteupdatestrategy.IncreasingthebandwidthAlthoughmostmemoryaccessestendtobelocalizedinthemachine,thehier-archyshigherlevelsmayneverthelessdemandahigherbandwidththanthelowersystems,creatingabottleneck.Totaketheloadoffthehigherlevels,wecanuseasmallerbranchfactorattheSeptember199249Figure9.Increas-ingtheband-widthofabusbysplittingbuses.AOddIEvenAI+-F-F.&=-$DirectoryDirectoryItopofthehierarchythanlowerdown.Thissolution,however,increasesthelevelsinthehierarchy,resultinginalongerremoteaccessdelayandanin-creasedmemoryoverhead.Instead,wecanwidenthehigherlevelsofthehier-archytoproduceafattree.9Wesplitadirectoryintotwodirectorieshalftheoriginaldirectoryssize.Thetwodirec-toriesdealwithdifferentaddressdo-mains(evenandodd).Thecommunica-tionwithotherdirectoriesisalsosplit,whichdoublesthebandwidth.Wecanperformasplitanynumberoftimesandatanylevelofthehierarchy.Figure9showsthatregardlessofthenumberofsplits,thearchitectureisstillhierarchi-caltoeachspecificaddress.Yetanothersolutionisaheteroge-neousnetwork:WeusethehierarchywithitsadvantagesasfaraspossibleandtieseveralhierarchiestogetheratRelatedactivitiesAttheSwedishInstituteofComputerScience,wearedevelopinganoperat-ingsystemfortheDDMprototype.ThisworkisbasedontheMachoperatingsystemfromCarnegieMellonUniversity,whichwemodifiedtosupportDDMefficiently.Relatedactivitiesinvolveahardwareprefetchingschemethatdy-namicallyprefetchesitemstotheattractionmemory;thisisespeciallyusefulwhenaprocessisstartedormigrated.Wearealsoexperimentingwithalter-nativeprotocols.TheemulatorrunsontheMeikotransputerplatformandmodelsanarchitec-turewithatree-shapedlink-basedstructure,withtransputersasdirectories.Thetransputersfourlinkspermitabranchfactorofthreeateachlevel.Thetransputersattheleavesexecutetheapplication.AllreferencestoglobaldataareinterceptedandhandledinaDDMmannerbysoftware.TheemulatorsDDMprotocolhasadifferentrepresentationsuitedforalink-basedarchitec-turestructuredlikeatree,ratherthanforabus-basedarchitecture.Theim-plementationhascertainsimilaritiestodirectory-basedsystems.ADDMemulatoriscurrentlyunderdevelopmentattheUniversityofBristol.2References1.E.Hagersten.TowardsaScalableCache-OnlyMemoryArchitecture,PhDthesis,SICSDissertationSeries08,SwedishInstituteofComputerScience,Kista,Sweden,1992.2.S.RainaandD.H.D.Warren,“TrafficPatternsinaScalableMultiprocessorThroughTransputerEmulation,”Proc.Hawaiilnt7Conf.SystemSciences,Vol.I,IEEE-CSPress,LosAlamitos,Calif.,OrderNo.2420,1992,pp.267-276.theirtopsbyageneralnetworkwithadirectory-basedprotocol.Thisschemerequiressomechangesintheprotocoltoachievethesameconsistencymodel.TheDDMprototypeprojectAprototypeDDMdesignisnearcom-pletionattheSwedishInstituteofCom-puterScience.Thehardwareimplemen-tationoftheprocessorandattractionmemoryisbasedonthesystemTP881VbyTadpoleTechnology,UK.EachsuchsystemhasuptofourMotorolaMC8810020-MHzprocessors,eachonewithtwoMC8820016-Kbytecachesandmemorymanagementunits;8or32MbytesofDRAM;andinterfacesfortheSCSIbus,Ethernet,andterminals,allcon-nectedbytheMotorolaMbusasshowninFigure10.WearedevelopingaDDMnodecon-trollerboardtohostasingle-portedstatememory.AsFigure10shows,itwillinterfacetheTP881Vnodewiththefirst-levelDDMbus.Thenodecontrol-lersnoopsaccessesbetweentheproces-sorcachesandthememoryoftheTP881Vaccordingtothememory-belowprotocol,andalsosnoopstheDDMbusaccordingtothememory-aboveprotocol.Wehaveintegratedthecopy-backprotocolofmultipleproces-sorcachesintotheprotocolmechanisms.Thenodecontrollerthuschangesthememorysbehaviorintothatofanat-tractionmemory.Readaccessestotheattractionmemorytakeeightcyclespercacheline,whichisonemorethanintheoriginalTP881Vsystem.Writeaccessestotheattractionmemorytake12cyclescomparedwith10cyclesfortheoriginalsystem.Aread/writemixof3/1totheattractionmemoryresultsinanaccesstimetotheattractionmemoryontheaverage16percentslowerthanthattotheoriginalTP881Vmemory.AsTable1shows,aremotereadtoanodeonthesameDDMbustakes55cyclesatbest,mostofwhicharespentmakingMbustransactions(atotaloffouraccesses).Readaccessesclimbingonestepupanddownthehierarchyaddabout45extracycles.Writeaccessestosharedstatetakeatbest40cyclesforoneleveland50cyclesfortwolevels.TheDDMbusispipelinedinfourphases:transactioncode,snoop,selec-tion,anddata.Wedesignedourinitial50c-TCOMPUTERbusconservatively,sincepushingthebusspeedisnotaprimarygoalofthisresearch.TheprototypeDDMbusop-eratesat20MHz,witha32-bitdatabusanda32-bitaddressbus.Itprovidesamoderatebandwidthofabout80Mbytespersecond,whichisenoughforcon-nectinguptoeightnodes-thatis,32processors.Still,thebandwidthhasnotbeenthelimitingfactorinoursimula-tionstudies.Wecanincreasebusband-widthmanytimesbyusingotherstruc-tures.TheslottedringbusproposedbyBarossoandDuboisOhasabandwidthoneorderofmagnitudehigher.Fortranslationstoitemidentifiers,DDMusesthenormalproceduresfortranslatingvirtualaddressestophysicaladdresses,asimplementedinstandardmemorymanagementunits.Thismeansthatanoperatingsystemhasknowledgeofphysicalpages.Anyattractionmemorynodecanhaveaconnecteddisk.Uponapage-in,thenodefirstattractsallthedataofanitempageasbeingtemporarilylockedtoitsattractionmemory.Iftheitemsofthatpagewerenotpresentinthemachineearlier,theyare“born”atthistimethroughtheprotocol.Thenthenodecopies(bydirectmemoryaccess)thepagefromthedisktotheattractionmemory,unlockingthedataatthesametime.Page-outreversestheprocess,copyingadirtypagebacktothedisk.Theoperatingsystemcanpurgetheitemsofunusedpagesformoresharing.itemspace.Becauseeachsetintheattrac-tionmemoryisdivid-edtwowaYs,16itemscanresideinthesameset.Inadditiontothefourbitsneed-edtotellitemsapart,eachitemneedsfourMemoryoverheadStateinDelay,Delay,CPUAttractionOneLevelTwoLevelsAccessMemory(cycles)(cycles)ReadInvalid55100WriteShared4050WriteInvalid80130Itmightseemthatanimplementa-tionofDDMwouldrequirefarmorememorythanalternativearchitectures.Extramemoryisrequiredforstoringstatebitsandaddresskeysfortheset-associativeattractionmemories,aswellasforthedirectories.Wehavecalculatedtheextrabitsneededifallitemsresideinonlyonecopy(worstcase).Weassumeanitemsizeof16bytes-thecachelinesizeoftheMo-torolaMC88200.A32-processorDDM-thatis,aone-levelDDMwithamaximumofeighttwo-wayset-associativeattractionmem-ories-needsfourbitsofaddresstagperitem,regardlessoftheattractionmemorysize.Aswesaidearlier,theitemspaceisnotlargerthanthesumofthesizesoftheattractionmemories,sothesizeofeachattractionmemoryiscontrollermemory1metTP881ViFigure10.AnodeoftheDDMprototypeconsistingoffourprocessorssharingoneattractionmemory.bitsofstate.Thus,anitemsizeof128bitsgivesanover-headof(4+4)/128=6percent.Addinganotherlayerwitheighteight-wayset-associativedirectoriesbringsthemaximumnumberofprocessorsto256.Thesizeofthedirectoriesisthesumofthesizesoftheattractionmem-oriesintheirsubsystems.Adirectoryentryconsistsofsixbitsfortheaddresstagandfourbitsofstateperitem,usingacalculationsimilartotheoneabove.Theoverheadintheattractionmemo-riesislargerthaninthepreviousexam-plebecauseofthelargeritemspace:sevenbitsofaddresstagandfourbitsofstate.Thetotaloverheadperitemis(6+4+7+4)/128=16percent.Alargeritemsizewould,ofcourse,decreasetheseoverheads.Tominimizethememoryoverhead,wecanuseadifferentinterpretationoftheimplicitstatefordifferentpartsoftheitemspace.Inourinitialimplemen-tationofDDM,theabsenceofanentryinadirectoryisinterpretedasInvalid.Thereplacementalgorithmintroducesahomebusforanitem.Ifanitemismostoftenfoundinitshomebusandnowhereelse,theabsenceofanentryinadirectorycouldinsteadbeinterpretedasExclusiveforitemsinitshomesub-system,andasInvalidforitemsfromoutside.Thiswouldd

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论