会员注册 | 登录 | 微信快捷登录 支付宝快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

   首页 人人文库网 > 资源分类 > PDF文档下载

50-DDM--A Cache-Only Memory Architecture.pdf

  • 资源星级:
  • 资源大小:1.07MB   全文页数:11页
  • 资源格式: PDF        下载权限:注册会员/VIP会员
您还没有登陆,请先登录。登陆后即可下载此文档。
  合作网站登录: 微信快捷登录 支付宝快捷登录   QQ登录   微博登录
友情提示
2:本站资源不支持迅雷下载,请使用浏览器直接下载(不支持QQ浏览器)
3:本站资源下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰   

50-DDM--A Cache-Only Memory Architecture.pdf

DDMACacheOnlvMemoryArchitectureErikHagersten,AndersLandin,andSeifHaridiSwedishInstituteofComputerScienceultiprocessorsprovidingasharedmemoryviewtotheprogrammeraretypicallyimplementedassuchwithasharedmemory.WeintroduceManarchitecturewithlargecachestoreducelatencyandnetworkload.Becauseallsystemmemoryresidesinthecaches,aminimumnumberofnetworkaccessesareneeded.Still,itpresentsasharedmemoryviewtotheprogrammer.Singlebus.Sharedmemorysystemsbasedonasinglebushavesometensofprocessors,eachonewithalocalcache,andtypicallysufferfrombussaturation.Acachecoherenceprotocolineachcachesnoopsthetrafficonthecommonbusandpreventsinconsistenciesincachecontents.ComputersmanufacturedbySequentandEncoreusethiskindofarchitecture.Becauseitprovidesauniformaccesstimetothewholesharedmemory,itiscalledauniformmemoryarchitectureUMA.ThecontentionforthecommonmemoryandthecommonbuslimitsthescalabilityofUMAs.Distributed.ComputerssuchastheBBNButterflyandtheIBMRP3useanarchitecturewithdistributedsharedmemory,knownasanonuniformmemoryarchitectureNUMA.Eachprocessornodecontainsaportionofthesharedmemory,soaccesstimestodifferentpartsofthesharedaddressspacecanvary.NUMAsoftenhavenetworksotherthanasinglebus,andthenetworkdelaycanvarytodifferentnodes.TheearlierNUMAsdidnothavecoherentcachesandlefttheproblemofmaintainingcoherencetotheprogrammer.Today,researchersarestrivingtowardcoherentNUMAswithdirectorybasedcachecoherenceprotocol.Bystaticallypartitioningtheworkanddata,programmerscanoptimizeprogramsforNUMAs.ApartitioningthatenablesprocessorstomakemostoftheiraccessestotheirpartofthesharedmemoryachievesabetterscalabilitythanispossibleinUMA.AnewarchitecturehastheprogrammingparadigmofsharedmemoryarchitecturesCacheonly.InacacheonlymemoryarchitectureCOMA,thememoryorganizationissimilartothatofaNUMAinthateachprocessorholdsaportionofthebutnophysicallysharedmemory.Cachesaddressspace.However,thepartitioningofdataamongthememoriesdoesnothavetobestatic,sincealldistributedmemoriesareorganizedlikelargesecondlevelcaches.Thetaskofsuchamemoryistwofold.Besidesbeingalargecachefortheprocessor,itmayalsocontainsomedatafromthesharedaddressspacethattheprocessorneverhasaccessedinotherwords,itisacacheandavirtualpartofattachedtotheprocessorscontainalltheSystemmemory.4400189162/92/0900004401.0001992I€€€COMPUTERthesharedmemory.Wecallthisintermediateformofmemoryattractionmemory.Acoherenceprotocolattractsthedatausedbyaprocessortoitsattractionmemory.Comparabletoacacheline,thecoherenceunitmovedaroundbytheprotocoliscalledanitem.Onamemoryreference,avirtualaddressistranslatedintoanitemidentifier.Theitemidentifierspaceislogicallythesameasthephysicaladdressspaceoftypicalmachines,butthereisnopermanentmappingbetweenanitemidentifierandaphysicalmemorylocation.Instead,anitemidentifiercorrespondstoalocationinanattractionmemory,whosetagmatchestheitemidentifier.Actually,therearecaseswheremultiplelocationsofdifferentattractionmemoriescouldmatch.ACOMAprovidesaprogrammingmodelidenticaltothatofsharedmemoryarchitectures,butitdoesnotrequirestaticdistributionofexecutionandmemoryusagetorunefficiently.RunninganoptimizedNUMAprogramonaCOMAresultsinaNUMAlikebehavior,sincetheworkspacesofthedifferentprocessorsmigratetotheirattractionmemories.However,aUMAversionofthesameprogramwouldhaveasimilarbehavior,becausethedataareattractedtotheusingprocessorregardlessoftheaddress.ACOMAalsoadaptstoandperformswellforprogramswithamoredynamicorsemidynamicscheduling.Theworkspacemigratesaccordingtoitsusagethroughoutthecomputation.ProgramscanbeoptimizedforaCOMAtotakethispropertyintoaccounttoachieveabetterlocality.ACOMAallowsfordynamicdatausewithoutduplicatingmuchmemory,comparedwithanarchitectureinwhichacacheddatumalsooccupiesspaceinthesharedmemory.Toavoidincreasingthememorycost,theattractionmemoriesshouldbeimplementedwithordinarymemorycomponents.Therefore,weviewtheCOMAapproachasasecondlevel,orhigherlevel,cachetechnique.TheaccessingtimetotheattractionmemoryofaCOMAiscomparabletothattothememoryofacachecoherentNUMA.Figure1comparesCOMAstoothersharedmemoryarchitectures.AnewCOMA.ThisarticledescribesthebasicideasbehindanewCOMA.Thearchitecture,calledtheDataDiffusionMachineDDM,reliesonahierNetworkNetwork1IIProcessorProcessorab4Figure1.SharedmemoryarchitecturescomparedwithCOMAsauniformmemoryarchitectureUMA,bnonuniformmemoryarchitectureNUMA,andccacheonlymemoryarchitectureCOMA.readNotationintransaction/outtransactionPprocessortransactionNnetworktransactionDDirtyIInvalidRResewedVValidPwrite,1Pread0UPreadFigure2.Anexampleofaprotocolsimilartothewriteonceprotocol.archicalnetworkstructure.WeintroducethekeyideasbehindDDMbydescribingasmallmachineanditsprotocol.Wealsodescribealargemachinewithhundredsofprocessors,overviewtheongoingprototypeproject,andprovidesimulatedperformancefigures.CachecoherencestrategiesTheproblemofmaintainingcoherenceamongreadwritedatasharedbydifferentcacheshasbeenstudiedextensively.Eithersoftwareorhardwarecanmaintaincoherence.WebelievehardwarecoherenceisneededinaCOMAforefficiency,sincetheitemmustbesmalltopreventperformancedegradationbyfalsesharing.Infalsesharing,twoprocessorsaccessingdifferentpartsofthesameitemconflictwitheachother,eventhoughtheydonotshareanydata.Wemeasuredaspeedupof50percentwhenfalsesharingwasremovedfromthewindtunnelapplication,MP3DDiff,reportedintheSimulatedperformancesection.Hardwarebasedschemesmaintaincoherencewithoutinvolvingsoftwareandcanbeimplementedmoreefficiently.Examplesofhardwarebasedprotocolsaresnoopingcacheprotocolsanddirectorybasedprotocols.Snoopingcacheprotocolshaveadistributedimplementation.Eachcacheisresponsibleforsnoopingtrafficonthebusandtakingactionstoavoidanincoherence.AnexampleofsuchaprotocolisthewriteonceprotocolintroducedbyGoodmananddiscussedbyStenstrom.AsFigure2shows,inthatprotocol,eachcachelinecanbeinoneoffourstatesInvalid,Valid,Reserved,orDirty.ManycachescanhavethesamecachelineinthestateValidatthesametime,andmayreaditlocally.WhenwritingtoacachelineinValid,thelinechangesSeptember199245statetoReserved,andawriteissentonthecommonbustothecommonmemory.AllothercacheswithlinesinValidsnoopthewriteandinvalidatetheircopies.Atthispointthereisonlyonecachedcopyofthecachelinecontainingthenewlywrittenvalue.Thecommonmemorynowalsocontainsthenewvalue.IfacachealreadyhasthecachelineinReserved,itcanperformawritelocallywithoutanytransactionsonthecommonbus.Itsvaluenowdiffersfromthatinthememory.anditsstateisthereforechangedtoDirty.AnyreadrequestsfromothercachestothatcachelinemustnowbeinterceptedtoprovidethenewvaluemarkedbyinterceptinFigure2.selectionIIIIIIIIIIIProcessorIFigure3.ThearchitectureofasinglebusDDM.Belowtheattractionmemoriesaretheprocessors.Ontopofthebusarearbitrationandselection.SnoopingcachesrelyonbroadcastingandarenotsuitedforgeneralinterconnectionnetworksUnrestrictedbroadcastingwoulddrasticallyreducetheavailablebandwidth,therebyobviatingtheadvantageofgeneralnetworks.Instead.directorybasedschemessendmessagesdirectlybetweennodes.Areadrequestissenttomainmemory,withoutanysnooping.Themainmemoryknowsifthecachelineiscachedandinwhichcacheorcachesandwhetherithasbeenmodified.Ifthelinehasbeenmodified,thereadrequestispassedontothecachewithacopy,whichprovidesacopyfortherequestingcache.Onawritetoasharedcacheline,awriterequestsenttothemainmemorycausesinvalidationmessagestoallcacheswithcopiestobesent.Thecachesrespondwithacknowledgemessages.Toachievesequentialconsistency,allacknowledgmentsmustbereceivedbeforethewriteisperformed.ThecachecoherenceprotocolforaCiMAcanadopttechniquesusedinothercachecoherenceprotocolsandextendthemwiththefunctionalityforfindingadatumonacachereadmissandforhandlingreplacement.Adirectorybasedprotocolcouldhaveapartofthedirectoryinformation,thedirectoryhome,staticallydistributedinaNUMAfashion,whilethedatawouldbeallowedtomovefreely.Retrievingthedataonareadmisswouldthenrequireoneextraindirectaccesstothedirectoryhometofindwheretheitemcurrent46lyresides.Theaccesstime,includingthisextraindirection,wouldbeidenticaltothatrequiredforreadingadirtycachelinenotinaNUMAshomenode.Thedirectoryhomecanalsomakesurethatthelastcopyofanitemisnotlost.Insteadoftheabovestrategy,DDMisbasedonahierarchicalsnoopingbusarchitectureandusesahierarchicalsearchalgorithmforfindinganitem.ThedirectoryinformationinDDMisdynamicallydistributedinthehierarchy.AminimalCOMAWeintroduceDDMbylookingatthesmallestinstanceofthearchitecture,whichcouldbeaCOMAonitsownorasubsystemofalargerCOMA.AsinglebusconnectstheattractionmemoriesoftheminimalDDM.Thedistributionandcoherenceofdataamongtheattractionmemoriesarecontrolledbythesnoopingprotocolmemoryabove,andtheinterfacebetweentheprocessorandtheattractionmemoryisdefinedbytheprotocolmemorybelow.Theprotocolviewsacachelineofanattractionmemory,herecalledanitem,asoneunit.Theattractionmemorystoresonesmallstatefieldperitem.Figure3showsthenodearchitectureinthesinglebusDDM.DDMusesanasynchronoussplittransactionbusThebusisreleasedbetweenarequestingtransactionanditsreply,forexample,betweenareadrequestanditsdatareply.Thedelaybetweentherequestanditsreplycanbeofarbitrarylength,andtheremightbealargenumberofoutstandingrequests.Thereplytransactionwilleventuallyappearonthebusasadifferenttransaction.Unlikeotherbuses,theDDMbushasaselectionmechanismtomakesurethatatmostonenodeisselectedtoservicearequest.Thisguaranteesthateachtransactiononthebusdoesnotproducemorethanonenewtransactionforthebus,arequirementnecessaryfordeadlockavoidance.SinglebusDDMprotocol.Wedevelopedanewprotocol,similarinmanywaystothesnoopingcacheprotocol,limitingbroadcastrequirementstoasmallersubsystemandaddingsupportforreplaement.ThewritecoherencepartoftheprotocolisthewriteinvalidatetypeTokeepdatacoherent,allcopiesoftheitemexcepttheonetobeupdatedareerasedonawrite.InaCOMAwithasmallitemsize,thealternativeapproach,writeupdate,couldalsobeattractiveOnawrite,thenewvalueismulticasttoallcacheswithasharedcopyoftheitem.Theprotocolalsohandlestheattractionofdatareadandreplacementwhenasetinanattractionmemorygetsfull.Thesnoopingprotocoldefinesanewstateandanewtransactiontosendasafunctionofthetransactionappearingonthebus,andthepresentstateoftheitemintheattractionmemoryProtocololdstatextransactionnewstatexnewtransactionAnitemcanbeinoneofsevenstatesthesubsystemistheattractionmemoryInvalid.Thissubsystemdoesnotcontaintheitem.Exclusive.Thissubsystemandnoothercontainstheitem.Shared.Thissubsystemandpossiblyothersubsystemscontaintheitem.Reading.Thissubsystemiswaitingforadatavalueafterhavingissuedaread.COMPUTERI.Waiting.ThissubsystemiswaitingtobecomeExclusiveafterhavingissuedanerase.ReadingandWaiting.Thissubsystemiswaitingforadatavalue,latertobecomeExclusive.Answering.Thissubsystemhaspromisedtoanswerareadrequest.NdatdNeraseNexclusiveIThefirstthreestatesInvalid,Exclusive,andSharedcorrespondtothestatesInvalid,Reserved,andValidinGoodmanswriteonceprotocol.ThestateDirtyinthatprotocolwiththemeaningthatthisistheonlycachedcopyanditsvaluediffersfromthatinthememoryhasnocorrespondenceinaCOMA.NewstatesintheprotocolarethetransientstatesReading,Waiting,ReadingandWaiting,andAnswering.Transientstatesarerequiredbecauseofthesplittransactionbusandtheneedtorememberoutstandingrequests.Thebuscarriesthefollowingtransactions1Nerase/NreadErase.Eraseallcopiesofthisitem.Exclusive.AcknowledgeaneraseRead.Readacopyoftheitem.Data.Carrythedatainreplytoanearlierreadrequest.Inject.Carrytheonlycopyofanitemandlookforasubsystemtomoveintocausedbyareplacement.Out.Carrytheitemonitswayoutofthesubsystemcausedbyareplacement.Itwillterminatewhenanothercopyoftheitemisfound.request.Nexclusive/Nread0PwriteAprocessorwritinganiteminExclusivestateorreadinganiteminExclusiveorSharedstateproceedswithoutinterruption.AsFigure4shows,areadattemptofaniteminInvalidwillresultinaReadrequestandanewstate,Reading.Thebusselectionmechanismwillselectoneattractionmemorytoservicetherequest,eventuallyputtingaDatatransactiononthebus.Therequestingattractionmemory,nowinReading,willgrabtheDatatransaction,changetoShared,andcontinue.ProcessorsareallowedtowriteonlytoitemsinExclusivestate.IftheitemisinShared,allothercopieshavetobeerasedandanacknowledgmentreceivedbeforethewritingisallowed.TheattractionmemorysendsanErasetransactionandwaitsfortheExclusiveacPreadLaNeraselNreadANread/NdataPread/NreadNdataNotationintransaction/outtransactionNerasePwrite/NreadPfromprocessorEExclusiveIinvalidNreadlNdataNfromnetworkRReadingRWReadingSSharedandWaitingPreadWWaitingFigure4.Asimplifiedrepresentationoftheattractionmemoryprotocolnotincludingreplacement.knowledgmentinthenewstate,Waiting.ManysimultaneousattemptstowritethesameitemwillresultinmanyattractionmemoriesinWaiting,allwithanoutstandingErasetransactionintheiroutputbuffers.ThefirstErasetoreachthebusisthewinnerofthewriterace.Allothertransactionsboundforthesameitemareremovedfromthesmalloutputbuffers.Therefore,thebuffersalsohavetosnooptransactions.Theoutputbufferscanbelimitedtoadepthofthree,anddeadlockcanstillbeavoidedwithaspecialarbitrationalgorithm.ThelosingattractionmemoriesinWaitingchangestatetoReadingandWaiting,whileoneofthemputsareadrequestinitsoutputbuffer.EventuallythetopprotocolofthebusreplieswithanExclusiveacknowledgment,tellingtheonlyattractionmemoryleftinWaitingthatitmaynowproceed.WritingtoanitemintheInvalidstateresultsinaReadrequestandanewstate,ReadingandWaiting.UpontheDatareply,thestatechangestoWaitingandanEraserequestissent.Replacement.Likeordinarycaches,theattractionmemorywillrunoutofspace,forcingsomeitemstomakeroomformorerecentlyaccessedones.Ifthesetwhereanitemissupposedtoresideisfull,oneiteminthesetisselectedtobereplaced.Forexample.theoldestiteminShared.ofwhichtheremightbeothercopies,maybeselected.ReplacinganiteminSharedgeneratesanOuttransaction.Thespaceusedbytheitemcannowbereclaimed.IfanOuttransactionseesanattractionmemoryinShared,Reading,Waiting,orReadingandWaiting,itdoesnothingotherwiseitisconvertedtoanInjecttransactionbythetopprotocol.AnInjecttransactioncanalsobeproducedbyreplacinganiteminExclusive.Theinjecttransactionisthelastcopyofanitemtryingtofindanewhomeinanewattractionmemory.Inthesinglebusimplementation,itwilldosofirstbychoosinganemptyspaceInvalidstate,andsecondbyreplacinganiteminSharedstateinotherwords,itwilldecreasetheamountofsharing.Iftheitemidentifierspace,whichcorrespondstothephysicaladdressspaceofconventionalarchitectures,isnotmadelargerthanthesumoftheattractionmemorysizes,itispossibletodeviseasimpleschemethatguaranteesaphysicallocationforeachitem.Oftenaprogramusesonlyaportionofacomputersphysicaladdressspace.Thisisespeciallytrueofoperatingsystemswithafacilityforeagerreclaimingofunusedworkspace.InDDM,theunuseditemspacecanbeusedtoincreasethedegreeofsharingbypurgingtheunuseditems.Theoperatingsystemmightevenchangethedegreeofsharingdynamically.ThehierarchicalDDMSofar,wehavepresentedacachecoherentsinglebusmultiprocessorwithoutphysicallysharedmemory.Instead,theresourcesformhugesecondlevelcachescalledattractionmemories,minimizingthenumberofaccessestotheSeptember199247IIGIGIElDDirectoryPProcessorAMAttractionmemoryIDirectoryiControllerJIDDMbusFigure5.AhierarchicalDDMwiththreelevels.Figure6.Thearchitectureofadirectory.onlysharedresourceleftthesharedbus.Datacanresideinanyormanyoftheattractionmemories.Dataareautomaticallymovedwhereneeded.TomakethesinglebusDDMasubsystemofalargehierarchicalDDM,wereplacethetopwithadirectory,whichinterfacesbetweenthebusandahigherlevelbusofthesametype.Figure5showsthehierarchy.Thedirectoryisasetassociativestatememorythatkeepsinformationforalltheitemsintheattractionmemoriesbelowit,butcontainsnodata.ThedirectorycananswerthesequestionsIsthisitembelowmeandDoesthisitemexistoutsidemysubsystemFromthebusabove,thedirectoryssnoopingprotocoldirectoryabovebehavesverymuchlikethememoryaboveprotocol.Fromthebusbelow,itsdirectorybelowprotocolbehaveslikethetopprotocolforitemsintheExclusivestate.ThismakesoperationsonitemslocaltoabusidenticaltothoseofthesinglebusDDM.Thedirectorypassesthroughonlytransactionsfrombelowthatcannotbecompletedinsideitssubsystemortransactionsfromabovethatneedtobeservicedbyitssubsystem.Inthatsense,thedirectoryactsasafilter.AsFigure6shows,thedirectoryhasasmalloutputbufferaboveittostoretransactionswaitingtobesentonthehigherbus.Transactionsforthelowerbusarestoredinanotheroutputbufferbelow,andtransactionsfromthelowerbusarestoredinaninputbuffer.Adirectoryreadsfromtheinputbufferwhenithasthetimeandspacetodoalookupinitsstatusmemory.Thisisnotpartoftheatomicsnoopingactionofthebus.ThehierarchicalDDManditsprotocolhaveseveralsimilaritieswitharchitecturesproposedbyWilson5andGoodmanandWoest.6DDMis,however,differentinitsuseoftransientstatesintheprotocol,itslackofphysicallysharedmemory,anditsnetworkhigherlevelcachesthatstoresonlystateinformationandnodata.Multilevelread.Ifthesubsystemsconnectedtothebuscannotsatisfyareadrequest,thenexthigherdirectoryretransmitstherequestonthenexthigherbus.ThedirectoryalsochangestheitemsstatetoReading,markingtheoutstandingrequest.Eventually,therequestreachesalevelinthehierarchywhereadirectorycontainingacopyoftheitemisselectedtoanswertherequest.TheselecteddirectorychangestheitemsstatetoAnswering,markinganoutstandingrequestfromabove,andretransmitstheReadrequestonitslowerbus.AsFigure7shows,thetransientstatesReadingandAnsweringinthedirectoriesmarktherequestspaththroughthehierarchy,likeanunwoundredreadthreadthatshowsthewaythroughamaze,appearinginredinFigure7.Aflowcontrolmechanismintheprotocolpreventsdeadlockiftoomanyprocessorstrytounwindareadthreadtothesamesetinadirectory.Whentherequestfinallyreachesanattractionmemorywithacopyoftheitem,itsdatareplysimplyfollowsthereadthreadbacktotherequestingnode,changingallthestatesalongthepathtoShared.CombinedreadsandbroadcastsaresimpletoimplementinDDM.IfaReadrequestfindsthereadthreadunwoundfortherequesteditemReadingorAnsweringstate,itsimplyterminatesandwaitsfortheDatareplythateventuallywillfollowthatpathonitswayback.Multilevelwrite.AnErasefrombelowtoadirectorywiththeiteminExclusivestateresultsinanExclusiveacknowledgmentbeingsentbelow.AnErasethatcannotgetitsacknowledgmentfromthedirectorywillworkitswayupthehierarchy,changingthedirectoriesstatestoWaitingtomarktheoutstandingrequest.AllsubsystemsofabuscarryinganErasetransactionwillgettheircopieserased.ThepropagationoftheEraseendswhenitreachesadirectoryinExclusiveorthetop,andtheacknowledgmentissentbackalongthepathmarkedWaiting,changingthestatestoExclusive.AwriteracebetweenanytwoprocessorsinthehierarchicalDDMhasasolutionsimilartothatofasinglebusDDM.ThetwoEraserequestsarepropagatedupthehierarchy.ThefirstErasetransactiontoreachthelowestbuscommontobothprocessorsisthewinner,asshowninFigure8.ThelosingattractionmemoryinReadingandWaitingwillrestartanewwriteactionautomaticallyuponreceiptoftheerase.ReplacementinthehierarchicalDDM.ReplacementofaSharediteminthehierarchicalDDMresultsinanOuttransactionpropagatingupthehierarchyandterminatingwhenitfindsasubsysteminanyofthefollowingstatesShared,Reading,Waiting,orAnswering.IfthelastcopyofanitemmarkedSharedisreplaced,anOuttransaction48COMPUTERsiveacknowledgeloser.....Figure7.AreadrequestfromprocessorP,hasfounditswaytoacopyoftheitemintheattractionmemoryofprocessorp,.ItspathismarkedwithstatesReadingRandAnsweringA,whichwillguidethedatareplybacktoP,.IindicatesprocessorsintheInvalidstate,SprocessorsintheSharedstate.Figure8.AwriteracebetweentwoprocessorsP,andP,isresolvedwhentherequestoriginatingfromP,reachesthetopbusthelowestbuscommontobothprocessors.Thetopcannowsendtheacknowledgment,Exclusive,whichfollowsthepathmarkedwithWsprocessorsintheWaitingstatebacktothewinningprocessorP,.TheWaitingstatesarechangedtoExclusivebytheacknowledgment.TheErasetransactionwillerasethedatainP,andP,,forcingP,toredoitswriteattempt.thatfailstoterminatewillreachadirectoryinExclusiveandturnintoanInjecttransaction.ReplacinganiteminExclusivegeneratesanInjecttransactionthattriestofindanemptyspaceinaneighboringattractionmemory.InjecttransactionsfirsttrytofindanemptyspaceintheattractionmemoriesofthelocalDDMbus,asinthesinglebusDDM.UnlikeinasinglebusDDM,anInjectfailingtofindanemptyspaceonthelocalDDMbuswillturntoaspecialbus,itshomebus,determinedbytheitemidentifier.Onthehomebus,theInjectwillforceitselfintoanattractionmemory,possiblybythrowingoutaforeigneroraShareditem.Theitemhomespaceisequallydividedbetweenthebottommostbuses,andthereforespaceisguaranteedonthehomebus.ThepreferredlocationinDDMisdifferentfrommemorylocationinNUMAsinthatanitemseeksahomeonlyatreplacementafterfailingtofindspaceelsewhere.Whentheitemisnotinitshomeplace,otheritemscanuseitsplace.ThehomealsodiffersfromtheNUMAapproachinbeingabusAnyattractionmemoryonthatbuswilldo.Thedetailsofthedirectoryprotocolsareavailableelsewhere.4Replacementinadirectory.BaerandWangstudiedthemultilevelinclusionproperty,whichhasthefollowingimplicationsforoursystemAdirectoryatleveli1hastobeasupersetofthedirectories,orattractionmemories,atleveli.Inotherwords,thesizeofadirectoryanditsassociativitynumberofwaysmustbeB,timesthatoftheunderlyingleveli,whereB,isthebranchfactoroftheunderlyingleveli,andsizemeansthenumberofitemsSizeDit,,B,4Dir,AssociativityDir,,B,Dir,Evenifimplementable,higherlevelmemorieswouldbecomeexpensiveandslowifthosepropertieswerefulfilledforalargehierarchicalsystem.However,theeffectsofthemultilevelinclusionpropertyarelimitedinDDM.Itstoresonlystateinformationinitsdirectoriesanddoesnotreplicatedatainthehigherlevels.Yetanotherwaytolimittheeffectistouseimperfectdirectorieswithsmallersetslowernumberofwaysthanwhatisrequiredformultilevelinclusionandtogivethedirectoriestheabilitytoperformreplacement,thatis,tomoveallcopiesofanitemoutoftheirsubsystem.Wecankeeptheprobabilityofreplacementatareasonablelevelbyincreasingtheassociativitymoderatelyhigherupinthehierarchy.Ahigherdegreeofsharingalsohelpstokeepthatprobabilitylow.Ashareditemoccupiesspaceinmanyattractionmemories,butonlyonespaceinthedirectoriesabovethem.Theimplementationofdirectoryreplacementrequiresoneextrastateandtwoextratransactions.4Otherprotocols.Ourprotocolgivestheprogrammerasequentiallyconsistentsystem.Itfulfillsthestrongestmemoryaccessmodel,butperformanceisdegradedbecausetheprocessorhastowaitfortheacknowledgmentbeforeitcanperformthewrite.However,theacknowledgmentissentbythetopmostnodeofthesubsysteminwhichallcopiesoftheitemreside,insteadofbyeachindividualattractionmemorywithacopy.Thisnotonlyreducestheremotedelay,italsocutsdownthenumberofsystemtransactions.Thewritermightactuallyreceivetheacknowledgmentbeforeallcopiesareerased.Nevertheless,sequentialconsistencycanbeguaranteed.xThehierarchicalstructurecanalsoefficientlysupportlooserformsofconsistencyprovidinghigherperformance.WehavedesignedaprocessorconsistentprotocolXandaprotocolcombiningprocessorconsistencywithanadaptivewriteupdatestrategy.IncreasingthebandwidthAlthoughmostmemoryaccessestendtobelocalizedinthemachine,thehierarchyshigherlevelsmayneverthelessdemandahigherbandwidththanthelowersystems,creatingabottleneck.Totaketheloadoffthehigherlevels,wecanuseasmallerbranchfactorattheSeptember199249Figure9.Increasingthebandwidthofabusbysplittingbuses.AOddIEvenAIFF...DirectoryDirectoryItopofthehierarchythanlowerdown.Thissolution,however,increasesthelevelsinthehierarchy,resultinginalongerremoteaccessdelayandanincreasedmemoryoverhead.Instead,wecanwidenthehigherlevelsofthehierarchytoproduceafattree.9Wesplitadirectoryintotwodirectorieshalftheoriginaldirectoryssize.Thetwodirectoriesdealwithdifferentaddressdomainsevenandodd.Thecommunicationwithotherdirectoriesisalsosplit,whichdoublesthebandwidth.Wecanperformasplitanynumberoftimesandatanylevelofthehierarchy.Figure9showsthatregardlessofthenumberofsplits,thearchitectureisstillhierarchicaltoeachspecificaddress.YetanothersolutionisaheterogeneousnetworkWeusethehierarchywithitsadvantagesasfaraspossibleandtieseveralhierarchiestogetheratRelatedactivitiesAttheSwedishInstituteofComputerScience,wearedevelopinganoperatingsystemfortheDDMprototype.ThisworkisbasedontheMachoperatingsystemfromCarnegieMellonUniversity,whichwemodifiedtosupportDDMefficiently.Relatedactivitiesinvolveahardwareprefetchingschemethatdynamicallyprefetchesitemstotheattractionmemorythisisespeciallyusefulwhenaprocessisstartedormigrated.Wearealsoexperimentingwithalternativeprotocols.TheemulatorrunsontheMeikotransputerplatformandmodelsanarchitecturewithatreeshapedlinkbasedstructure,withtransputersasdirectories.Thetransputersfourlinkspermitabranchfactorofthreeateachlevel.Thetransputersattheleavesexecutetheapplication.AllreferencestoglobaldataareinterceptedandhandledinaDDMmannerbysoftware.TheemulatorsDDMprotocolhasadifferentrepresentationsuitedforalinkbasedarchitecturestructuredlikeatree,ratherthanforabusbasedarchitecture.Theimplementationhascertainsimilaritiestodirectorybasedsystems.ADDMemulatoriscurrentlyunderdevelopmentattheUniversityofBristol.2References1.E.Hagersten.TowardsaScalableCacheOnlyMemoryArchitecture,PhDthesis,SICSDissertationSeries08,SwedishInstituteofComputerScience,Kista,Sweden,1992.2.S.RainaandD.H.D.Warren,TrafficPatternsinaScalableMultiprocessorThroughTransputerEmulation,Proc.Hawaiilnt7Conf.SystemSciences,Vol.I,IEEECSPress,LosAlamitos,Calif.,OrderNo.2420,1992,pp.267276.theirtopsbyageneralnetworkwithadirectorybasedprotocol.Thisschemerequiressomechangesintheprotocoltoachievethesameconsistencymodel.TheDDMprototypeprojectAprototypeDDMdesignisnearcompletionattheSwedishInstituteofComputerScience.ThehardwareimplementationoftheprocessorandattractionmemoryisbasedonthesystemTP881VbyTadpoleTechnology,UK.EachsuchsystemhasuptofourMotorolaMC8810020MHzprocessors,eachonewithtwoMC8820016Kbytecachesandmemorymanagementunits8or32MbytesofDRAMandinterfacesfortheSCSIbus,Ethernet,andterminals,allconnectedbytheMotorolaMbusasshowninFigure10.WearedevelopingaDDMnodecontrollerboardtohostasingleportedstatememory.AsFigure10shows,itwillinterfacetheTP881VnodewiththefirstlevelDDMbus.ThenodecontrollersnoopsaccessesbetweentheprocessorcachesandthememoryoftheTP881Vaccordingtothememorybelowprotocol,andalsosnoopstheDDMbusaccordingtothememoryaboveprotocol.Wehaveintegratedthecopybackprotocolofmultipleprocessorcachesintotheprotocolmechanisms.Thenodecontrollerthuschangesthememorysbehaviorintothatofanattractionmemory.Readaccessestotheattractionmemorytakeeightcyclespercacheline,whichisonemorethanintheoriginalTP881Vsystem.Writeaccessestotheattractionmemorytake12cyclescomparedwith10cyclesfortheoriginalsystem.Aread/writemixof3/1totheattractionmemoryresultsinanaccesstimetotheattractionmemoryontheaverage16percentslowerthanthattotheoriginalTP881Vmemory.AsTable1shows,aremotereadtoanodeonthesameDDMbustakes55cyclesatbest,mostofwhicharespentmakingMbustransactionsatotaloffouraccesses.Readaccessesclimbingonestepupanddownthehierarchyaddabout45extracycles.Writeaccessestosharedstatetakeatbest40cyclesforoneleveland50cyclesfortwolevels.TheDDMbusispipelinedinfourphasestransactioncode,snoop,selection,anddata.Wedesignedourinitial50cTCOMPUTERbusconservatively,sincepushingthebusspeedisnotaprimarygoalofthisresearch.TheprototypeDDMbusoperatesat20MHz,witha32bitdatabusanda32bitaddressbus.Itprovidesamoderatebandwidthofabout80Mbytespersecond,whichisenoughforconnectinguptoeightnodesthatis,32processors.Still,thebandwidthhasnotbeenthelimitingfactorinoursimulationstudies.Wecanincreasebusbandwidthmanytimesbyusingotherstructures.TheslottedringbusproposedbyBarossoandDuboisOhasabandwidthoneorderofmagnitudehigher.Fortranslationstoitemidentifiers,DDMusesthenormalproceduresfortranslatingvirtualaddressestophysicaladdresses,asimplementedinstandardmemorymanagementunits.Thismeansthatanoperatingsystemhasknowledgeofphysicalpages.Anyattractionmemorynodecanhaveaconnecteddisk.Uponapagein,thenodefirstattractsallthedataofanitempageasbeingtemporarilylockedtoitsattractionmemory.Iftheitemsofthatpagewerenotpresentinthemachineearlier,theyarebornatthistimethroughtheprotocol.Thenthenodecopiesbydirectmemoryaccessthepagefromthedisktotheattractionmemory,unlockingthedataatthesametime.Pageoutreversestheprocess,copyingadirtypagebacktothedisk.Theoperatingsystemcanpurgetheitemsofunusedpagesformoresharing.itemspace.BecauseeachsetintheattractionmemoryisdividedtwowaYs,16itemscanresideinthesameset.Inadditiontothefourbitsneededtotellitemsapart,eachitemneedsfourMemoryoverheadStateinDelay,Delay,CPUAttractionOneLevelTwoLevelsAccessMemorycyclescyclesReadInvalid55100WriteShared4050WriteInvalid80130ItmightseemthatanimplementationofDDMwouldrequirefarmorememorythanalternativearchitectures.Extramemoryisrequiredforstoringstatebitsandaddresskeysforthesetassociativeattractionmemories,aswellasforthedirectories.Wehavecalculatedtheextrabitsneededifallitemsresideinonlyonecopyworstcase.Weassumeanitemsizeof16bytesthecachelinesizeoftheMotorolaMC88200.A32processorDDMthatis,aonelevelDDMwithamaximumofeighttwowaysetassociativeattractionmemoriesneedsfourbitsofaddresstagperitem,regardlessoftheattractionmemorysize.Aswesaidearlier,theitemspaceisnotlargerthanthesumofthesizesoftheattractionmemories,sothesizeofeachattractionmemoryiscontrollermemory1metTP881ViFigure10.AnodeoftheDDMprototypeconsistingoffourprocessorssharingoneattractionmemory.bitsofstate.Thus,anitemsizeof128bitsgivesanoverheadof44/1286percent.Addinganotherlayerwitheighteightwaysetassociativedirectoriesbringsthemaximumnumberofprocessorsto256.Thesizeofthedirectoriesisthesumofthesizesoftheattractionmemoriesintheirsubsystems.Adirectoryentryconsistsofsixbitsfortheaddresstagandfourbitsofstateperitem,usingacalculationsimilartotheoneabove.Theoverheadintheattractionmemoriesislargerthaninthepreviousexamplebecauseofthelargeritemspacesevenbitsofaddresstagandfourbitsofstate.Thetotaloverheadperitemis6474/12816percent.Alargeritemsizewould,ofcourse,decreasetheseoverheads.Tominimizethememoryoverhead,wecanuseadifferentinterpretationoftheimplicitstatefordifferentpartsoftheitemspace.InourinitialimplementationofDDM,theabsenceofanentryinadirectoryisinterpretedasInvalid.Thereplacementalgorithmintroducesahomebusforanitem.Ifanitemismostoftenfoundinitshomebusandnowhereelse,theabsenceofanentryinadirectorycouldinsteadbeinterpretedasExclusiveforitemsinitshomesubsystem,andasInvalidforitemsfromoutside.Thiswoulddrasticallyreduceadirectoryssize.Thetechniquewouldbepracticalonlytoalimitedextent.Toosmalldirectoriesrestrictthenumberofitemsmovingoutoftheirsubsystemsandthuslimitsharingandmigration,resultingindrawbackssimilartothoseofNUMAs.Itemspaceisslightlysmallerthanthesumoftheattractionmemoriesbecauseofsharinginthesystem.Thisintroducesamemoryoverheadnottakenintoaccountintheabovecalculations.However,inaCOMAacacheditemOCCUpiesonlyonespace,whileinothersharedmemoryarchitecturesitrequirestwospacesoneinthecacheandoneinthesharedmemory.SimulatedperformanceWeusedanexecutiondrivensimulationenvironmentthatletsusstudylargeprogramsrunningonmanyprocessorsinareasonableamountoftime.WeparameterizedtheDDMsimulationSeptember199251modelwithdatafromourongoingprototypeproject.ThemodelaccuratelydescribesDDMbehavior,includingthecompromisesintroducedbytakinganexistingcommercialproductasastartingpoint.Themodelalsodescribespartsofthevirtualmemoryhandlingsystem.Weusedtwoway1Mbyteattractionmemoriesandaprotocolsimilartotheonedescribedhere,providingsequentialconsistency.Forarepresentationofapplicationsfromengineeringandsymboliccomputing,westudiedparallelexecutionoftheStanfordParallelApplicationsforSharedMemorySplash,theORparalle1PrologsystemMuse,andamatrixmultiplicationprogram.AllprogramswereoriginallywrittenforUMAarchitecturesSequentSym10080Q6oUa,QI40200/Lineartrix500x500020406080100ProcessorsFigure11.Speedupcurvesforsomeofthereportedprograms.TheSplashWaterprogramsimulatesthemovementsofwatermolecules.ItsexecutiontimeisOrnZ,wherernisthenumberofmolecules.Therefore,itisoftensimulatedwithasmallworkingset.Weused192moleculesandaworkingsetof320Kbytes.Eachofthe96processorsinFigure11handlesonlytwomolecules.Mostofthelocalityinthesmallworkingsetcanbeexploitedontheprocessorcache,andonlyabout44percentofthetransactionsreachingtheattractionmemorywillhit.Arealsizeworkingsetwouldstillhavethesamegoodlocalityandwouldbenefitmorefromthelargeattractionmemoriestomaintainthespeedup.Wetestedthishypothesiswithasinglerunwith384molecules,asshowninTable2.metryorEncoreMultimaxcomputersandusestaticordynamicscheduleralgorithms.TheyadaptwelltoaCOMAwithoutanychanges.AllprogramstakeontheorderofoneCPUminutetorunsequentially,withoutanysimulations,memory.onaSunSparcstation.ThespeedupsreportedinFigure11andTable2arerelativetotheexecutionofasingleDDMnodewithoneprocessor,assuminga100percenthitrateintheattractionTheSplashMP3Dprogramisawindtunnelsimulatorwithwhichagoodspeedupishardertoachievebecauseofahighinvalidationfrequencyresultinginapoorhitrate.TheprogramisoftenrunwiththememoryfilledwithdataTable2.StatisticsfromDDMsimulations.Hitratestatisticsarefordataonly,exceptwithMuse,whereweusedaunifiedIDcache.Theremoteaccessrateisthepercentageofthedataaccessesissuedbyaprocessorthatcreateremotecoherencetraffic.AnincreasedworkingsetresultsinlessloadonthebusesforWaterandCholesky.WaterMP3DMP3DDiffCholeskyMatrixMuseInputdataColdstartincludedDDMtopologyHitratedatapercentDcacheAttractionmemoryRemoteaccessrateBusutilizationpercentMbusLowerDDMbusTopDDMbusSpeeduppernumberofprocessors192384moleculesmolecules999944650.60.431263930252052/6453/6475,00075,000particlesparticlesnono869240888.41.o8654882466136/3219/32m14smallyes2x829663.870807010132m15largeYes2X8X289742.860664917132500x500PunditYesno8x44x49298.59891O.160.2029/32I1652COMPUTERr

注意事项

本文(50-DDM--A Cache-Only Memory Architecture.pdf)为本站会员(baixue100)主动上传,人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知人人文库网([email protected]),我们立即给予删除!

温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。

copyright@ 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5