会员注册 | 登录 | 微信快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

50-DDM--A Cache-Only Memory Architecture.pdf50-DDM--A Cache-Only Memory Architecture.pdf -- 5 元

宽屏显示 收藏 分享

页面加载中... ... 广告 0 秒后退出

资源预览需要最新版本的Flash Player支持。
您尚未安装或版本过低,建议您

DDMACacheOnlvMemoryArchitectureErikHagersten,AndersLandin,andSeifHaridiSwedishInstituteofComputerScienceultiprocessorsprovidingasharedmemoryviewtotheprogrammeraretypicallyimplementedassuchwithasharedmemory.WeintroduceManarchitecturewithlargecachestoreducelatencyandnetworkload.Becauseallsystemmemoryresidesinthecaches,aminimumnumberofnetworkaccessesareneeded.Still,itpresentsasharedmemoryviewtotheprogrammer.Singlebus.Sharedmemorysystemsbasedonasinglebushavesometensofprocessors,eachonewithalocalcache,andtypicallysufferfrombussaturation.Acachecoherenceprotocolineachcachesnoopsthetrafficonthecommonbusandpreventsinconsistenciesincachecontents.ComputersmanufacturedbySequentandEncoreusethiskindofarchitecture.Becauseitprovidesauniformaccesstimetothewholesharedmemory,itiscalledauniformmemoryarchitectureUMA.ThecontentionforthecommonmemoryandthecommonbuslimitsthescalabilityofUMAs.Distributed.ComputerssuchastheBBNButterflyandtheIBMRP3useanarchitecturewithdistributedsharedmemory,knownasanonuniformmemoryarchitectureNUMA.Eachprocessornodecontainsaportionofthesharedmemory,soaccesstimestodifferentpartsofthesharedaddressspacecanvary.NUMAsoftenhavenetworksotherthanasinglebus,andthenetworkdelaycanvarytodifferentnodes.TheearlierNUMAsdidnothavecoherentcachesandlefttheproblemofmaintainingcoherencetotheprogrammer.Today,researchersarestrivingtowardcoherentNUMAswithdirectorybasedcachecoherenceprotocol.Bystaticallypartitioningtheworkanddata,programmerscanoptimizeprogramsforNUMAs.ApartitioningthatenablesprocessorstomakemostoftheiraccessestotheirpartofthesharedmemoryachievesabetterscalabilitythanispossibleinUMA.AnewarchitecturehastheprogrammingparadigmofsharedmemoryarchitecturesCacheonly.InacacheonlymemoryarchitectureCOMA,thememoryorganizationissimilartothatofaNUMAinthateachprocessorholdsaportionofthebutnophysicallysharedmemory.Cachesaddressspace.However,thepartitioningofdataamongthememoriesdoesnothavetobestatic,sincealldistributedmemoriesareorganizedlikelargesecondlevelcaches.Thetaskofsuchamemoryistwofold.Besidesbeingalargecachefortheprocessor,itmayalsocontainsomedatafromthesharedaddressspacethattheprocessorneverhasaccessedinotherwords,itisacacheandavirtualpartofattachedtotheprocessorscontainalltheSystemmemory.4400189162/92/0900004401.0001992I€€€COMPUTERthesharedmemory.Wecallthisintermediateformofmemoryattractionmemory.Acoherenceprotocolattractsthedatausedbyaprocessortoitsattractionmemory.Comparabletoacacheline,thecoherenceunitmovedaroundbytheprotocoliscalledanitem.Onamemoryreference,avirtualaddressistranslatedintoanitemidentifier.Theitemidentifierspaceislogicallythesameasthephysicaladdressspaceoftypicalmachines,butthereisnopermanentmappingbetweenanitemidentifierandaphysicalmemorylocation.Instead,anitemidentifiercorrespondstoalocationinanattractionmemory,whosetagmatchestheitemidentifier.Actually,therearecaseswheremultiplelocationsofdifferentattractionmemoriescouldmatch.ACOMAprovidesaprogrammingmodelidenticaltothatofsharedmemoryarchitectures,butitdoesnotrequirestaticdistributionofexecutionandmemoryusagetorunefficiently.RunninganoptimizedNUMAprogramonaCOMAresultsinaNUMAlikebehavior,sincetheworkspacesofthedifferentprocessorsmigratetotheirattractionmemories.However,aUMAversionofthesameprogramwouldhaveasimilarbehavior,becausethedataareattractedtotheusingprocessorregardlessoftheaddress.ACOMAalsoadaptstoandperformswellforprogramswithamoredynamicorsemidynamicscheduling.Theworkspacemigratesaccordingtoitsusagethroughoutthecomputation.ProgramscanbeoptimizedforaCOMAtotakethispropertyintoaccounttoachieveabetterlocality.ACOMAallowsfordynamicdatausewithoutduplicatingmuchmemory,comparedwithanarchitectureinwhichacacheddatumalsooccupiesspaceinthesharedmemory.Toavoidincreasingthememorycost,theattractionmemoriesshouldbeimplementedwithordinarymemorycomponents.Therefore,weviewtheCOMAapproachasasecondlevel,orhigherlevel,cachetechnique.TheaccessingtimetotheattractionmemoryofaCOMAiscomparabletothattothememoryofacachecoherentNUMA.Figure1comparesCOMAstoothersharedmemoryarchitectures.AnewCOMA.ThisarticledescribesthebasicideasbehindanewCOMA.Thearchitecture,calledtheDataDiffusionMachineDDM,reliesonahierNetworkNetwork1IIProcessorProcessorab4Figure1.SharedmemoryarchitecturescomparedwithCOMAsauniformmemoryarchitectureUMA,bnonuniformmemoryarchitectureNUMA,andccacheonlymemoryarchitectureCOMA.readNotationintransaction/outtransactionPprocessortransactionNnetworktransactionDDirtyIInvalidRResewedVValidPwrite,1Pread0UPreadFigure2.Anexampleofaprotocolsimilartothewriteonceprotocol.archicalnetworkstructure.WeintroducethekeyideasbehindDDMbydescribingasmallmachineanditsprotocol.Wealsodescribealargemachinewithhundredsofprocessors,overviewtheongoingprototypeproject,andprovidesimulatedperformancefigures.CachecoherencestrategiesTheproblemofmaintainingcoherenceamongreadwritedatasharedbydifferentcacheshasbeenstudiedextensively.Eithersoftwareorhardwarecanmaintaincoherence.WebelievehardwarecoherenceisneededinaCOMAforefficiency,sincetheitemmustbesmalltopreventperformancedegradationbyfalsesharing.Infalsesharing,twoprocessorsaccessingdifferentpartsofthesameitemconflictwitheachother,eventhoughtheydonotshareanydata.Wemeasuredaspeedupof50percentwhenfalsesharingwasremovedfromthewindtunnelapplication,MP3DDiff,reportedintheSimulatedperformancesection.Hardwarebasedschemesmaintaincoherencewithoutinvolvingsoftwareandcanbeimplementedmoreefficiently.Examplesofhardwarebasedprotocolsaresnoopingcacheprotocolsanddirectorybasedprotocols.Snoopingcacheprotocolshaveadistributedimplementation.Eachcacheisresponsibleforsnoopingtrafficonthebusandtakingactionstoavoidanincoherence.AnexampleofsuchaprotocolisthewriteonceprotocolintroducedbyGoodmananddiscussedbyStenstrom.AsFigure2shows,inthatprotocol,eachcachelinecanbeinoneoffourstatesInvalid,Valid,Reserved,orDirty.ManycachescanhavethesamecachelineinthestateValidatthesametime,andmayreaditlocally.WhenwritingtoacachelineinValid,thelinechangesSeptember199245statetoReserved,andawriteissentonthecommonbustothecommonmemory.AllothercacheswithlinesinValidsnoopthewriteandinvalidatetheircopies.Atthispointthereisonlyonecachedcopyofthecachelinecontainingthenewlywrittenvalue.Thecommonmemorynowalsocontainsthenewvalue.IfacachealreadyhasthecachelineinReserved,itcanperformawritelocallywithoutanytransactionsonthecommonbus.Itsvaluenowdiffersfromthatinthememory.anditsstateisthereforechangedtoDirty.AnyreadrequestsfromothercachestothatcachelinemustnowbeinterceptedtoprovidethenewvaluemarkedbyinterceptinFigure2.selectionIIIIIIIIIIIProcessorIFigure3.ThearchitectureofasinglebusDDM.Belowtheattractionmemoriesaretheprocessors.Ontopofthebusarearbitrationandselection.SnoopingcachesrelyonbroadcastingandarenotsuitedforgeneralinterconnectionnetworksUnrestrictedbroadcastingwoulddrasticallyreducetheavailablebandwidth,therebyobviatingtheadvantageofgeneralnetworks.Instead.directorybasedschemessendmessagesdirectlybetweennodes.Areadrequestissenttomainmemory,withoutanysnooping.Themainmemoryknowsifthecachelineiscachedandinwhichcacheorcachesandwhetherithasbeenmodified.Ifthelinehasbeenmodified,thereadrequestispassedontothecachewithacopy,whichprovidesacopyfortherequestingcache.Onawritetoasharedcacheline,awriterequestsenttothemainmemorycausesinvalidationmessagestoallcacheswithcopiestobesent.Thecachesrespondwithacknowledgemessages.Toachievesequentialconsistency,allacknowledgmentsmustbereceivedbeforethewriteisperformed.ThecachecoherenceprotocolforaCiMAcanadopttechniquesusedinothercachecoherenceprotocolsandextendthemwiththefunctionalityforfindingadatumonacachereadmissandforhandlingreplacement.Adirectorybasedprotocolcouldhaveapartofthedirectoryinformation,thedirectoryhome,staticallydistributedinaNUMAfashion,whilethedatawouldbeallowedtomovefreely.Retrievingthedataonareadmisswouldthenrequireoneextraindirectaccesstothedirectoryhometofindwheretheitemcurrent46lyresides.Theaccesstime,includingthisextraindirection,wouldbeidenticaltothatrequiredforreadingadirtycachelinenotinaNUMAshomenode.Thedirectoryhomecanalsomakesurethatthelastcopyofanitemisnotlost.Insteadoftheabovestrategy,DDMisbasedonahierarchicalsnoopingbusarchitectureandusesahierarchicalsearchalgorithmforfindinganitem.ThedirectoryinformationinDDMisdynamicallydistributedinthehierarchy.AminimalCOMAWeintroduceDDMbylookingatthesmallestinstanceofthearchitecture,whichcouldbeaCOMAonitsownorasubsystemofalargerCOMA.AsinglebusconnectstheattractionmemoriesoftheminimalDDM.Thedistributionandcoherenceofdataamongtheattractionmemoriesarecontrolledbythesnoopingprotocolmemoryabove,andtheinterfacebetweentheprocessorandtheattractionmemoryisdefinedbytheprotocolmemorybelow.Theprotocolviewsacachelineofanattractionmemory,herecalledanitem,asoneunit.Theattractionmemorystoresonesmallstatefieldperitem.Figure3showsthenodearchitectureinthesinglebusDDM.DDMusesanasynchronoussplittransactionbusThebusisreleasedbetweenarequestingtransactionanditsreply,forexample,betweenareadrequestanditsdatareply.Thedelaybetweentherequestanditsreplycanbeofarbitrarylength,andtheremightbealargenumberofoutstandingrequests.Thereplytransactionwilleventuallyappearonthebusasadifferenttransaction.Unlikeotherbuses,theDDMbushasaselectionmechanismtomakesurethatatmostonenodeisselectedtoservicearequest.Thisguaranteesthateachtransactiononthebusdoesnotproducemorethanonenewtransactionforthebus,arequirementnecessaryfordeadlockavoidance.SinglebusDDMprotocol.Wedevelopedanewprotocol,similarinmanywaystothesnoopingcacheprotocol,limitingbroadcastrequirementstoasmallersubsystemandaddingsupportforreplaement.ThewritecoherencepartoftheprotocolisthewriteinvalidatetypeTokeepdatacoherent,allcopiesoftheitemexcepttheonetobeupdatedareerasedonawrite.InaCOMAwithasmallitemsize,thealternativeapproach,writeupdate,couldalsobeattractiveOnawrite,thenewvalueismulticasttoallcacheswithasharedcopyoftheitem.Theprotocolalsohandlestheattractionofdatareadandreplacementwhenasetinanattractionmemorygetsfull.Thesnoopingprotocoldefinesanewstateandanewtransactiontosendasafunctionofthetransactionappearingonthebus,andthepresentstateoftheitemintheattractionmemoryProtocololdstatextransactionnewstatexnewtransactionAnitemcanbeinoneofsevenstatesthesubsystemistheattractionmemoryInvalid.Thissubsystemdoesnotcontaintheitem.Exclusive.Thissubsystemandnoothercontainstheitem.Shared.Thissubsystemandpossiblyothersubsystemscontaintheitem.Reading.Thissubsystemiswaitingforadatavalueafterhavingissuedaread.COMPUTERI.Waiting.ThissubsystemiswaitingtobecomeExclusiveafterhavingissuedanerase.ReadingandWaiting.Thissubsystemiswaitingforadatavalue,latertobecomeExclusive.Answering.Thissubsystemhaspromisedtoanswerareadrequest.NdatdNeraseNexclusiveIThefirstthreestatesInvalid,Exclusive,andSharedcorrespondtothestatesInvalid,Reserved,andValidinGoodmanswriteonceprotocol.ThestateDirtyinthatprotocolwiththemeaningthatthisistheonlycachedcopyanditsvaluediffersfromthatinthememoryhasnocorrespondenceinaCOMA.NewstatesintheprotocolarethetransientstatesReading,Waiting,ReadingandWaiting,andAnswering.Transientstatesarerequiredbecauseofthesplittransactionbusandtheneedtorememberoutstandingrequests.Thebuscarriesthefollowingtransactions1Nerase/NreadErase.Eraseallcopiesofthisitem.Exclusive.AcknowledgeaneraseRead.Readacopyoftheitem.Data.Carrythedatainreplytoanearlierreadrequest.Inject.Carrytheonlycopyofanitemandlookforasubsystemtomoveintocausedbyareplacement.Out.Carrytheitemonitswayoutofthesubsystemcausedbyareplacement.Itwillterminatewhenanothercopyoftheitemisfound.request.Nexclusive/Nread0PwriteAprocessorwritinganiteminExclusivestateorreadinganiteminExclusiveorSharedstateproceedswithoutinterruption.AsFigure4shows,areadattemptofaniteminInvalidwillresultinaReadrequestandanewstate,Reading.Thebusselectionmechanismwillselectoneattractionmemorytoservicetherequest,eventuallyputtingaDatatransactiononthebus.Therequestingattractionmemory,nowinReading,willgrabtheDatatransaction,changetoShared,andcontinue.ProcessorsareallowedtowriteonlytoitemsinExclusivestate.IftheitemisinShared,allothercopieshavetobeerasedandanacknowledgmentreceivedbeforethewritingisallowed.TheattractionmemorysendsanErasetransactionandwaitsfortheExclusiveacPreadLaNeraselNreadANread/NdataPread/NreadNdataNotationintransaction/outtransactionNerasePwrite/NreadPfromprocessorEExclusiveIinvalidNreadlNdataNfromnetworkRReadingRWReadingSSharedandWaitingPreadWWaitingFigure4.Asimplifiedrepresentationoftheattractionmemoryprotocolnotincludingreplacement.knowledgmentinthenewstate,Waiting.ManysimultaneousattemptstowritethesameitemwillresultinmanyattractionmemoriesinWaiting,allwithanoutstandingErasetransactionintheiroutputbuffers.ThefirstErasetoreachthebusisthewinnerofthewriterace.Allothertransactionsboundforthesameitemareremovedfromthesmalloutputbuffers.Therefore,thebuffersalsohavetosnooptransactions.Theoutputbufferscanbelimitedtoadepthofthree,anddeadlockcanstillbeavoidedwithaspecialarbitrationalgorithm.ThelosingattractionmemoriesinWaitingchangestatetoReadingandWaiting,whileoneofthemputsareadrequestinitsoutputbuffer.EventuallythetopprotocolofthebusreplieswithanExclusiveacknowledgment,tellingtheonlyattractionmemoryleftinWaitingthatitmaynowproceed.WritingtoanitemintheInvalidstateresultsinaReadrequestandanewstate,ReadingandWaiting.UpontheDatareply,thestatechangestoWaitingandanEraserequestissent.Replacement.Likeordinarycaches,theattractionmemorywillrunoutofspace,forcingsomeitemstomakeroomformorerecentlyaccessedones.Ifthesetwhereanitemissupposedtoresideisfull,oneiteminthesetisselectedtobereplaced.Forexample.theoldestiteminShared.ofwhichtheremightbeothercopies,maybeselected.ReplacinganiteminSharedgeneratesanOuttransaction.Thespaceusedbytheitemcannowbereclaimed.IfanOuttransactionseesanattractionmemoryinShared,Reading,Waiting,orReadingandWaiting,itdoesnothingotherwiseitisconvertedtoanInjecttransactionbythetopprotocol.AnInjecttransactioncanalsobeproducedbyreplacinganiteminExclusive.Theinjecttransactionisthelastcopyofanitemtryingtofindanewhomeinanewattractionmemory.Inthesinglebusimplementation,itwilldosofirstbychoosinganemptyspaceInvalidstate,andsecondbyreplacinganiteminSharedstateinotherwords,itwilldecreasetheamountofsharing.Iftheitemidentifierspace,whichcorrespondstothephysicaladdressspaceofconventionalarchitectures,isnotmadelargerthanthesumoftheattractionmemorysizes,itispossibletodeviseasimpleschemethatguaranteesaphysicallocationforeachitem.Oftenaprogramusesonlyaportionofacomputersphysicaladdressspace.Thisisespeciallytrueofoperatingsystemswithafacilityforeagerreclaimingofunusedworkspace.InDDM,theunuseditemspacecanbeusedtoincreasethedegreeofsharingbypurgingtheunuseditems.Theoperatingsystemmightevenchangethedegreeofsharingdynamically.ThehierarchicalDDMSofar,wehavepresentedacachecoherentsinglebusmultiprocessorwithoutphysicallysharedmemory.Instead,theresourcesformhugesecondlevelcachescalledattractionmemories,minimizingthenumberofaccessestotheSeptember199247IIGIGIElDDirectoryPProcessorAMAttractionmemoryIDirectoryiControllerJIDDMbusFigure5.AhierarchicalDDMwiththreelevels.Figure6.Thearchitectureofadirectory.onlysharedresourceleftthesharedbus.Datacanresideinanyormanyoftheattractionmemories.Dataareautomaticallymovedwhereneeded.TomakethesinglebusDDMasubsystemofalargehierarchicalDDM,wereplacethetopwithadirectory,whichinterfacesbetweenthebusandahigherlevelbusofthesametype.Figure5showsthehierarchy.Thedirectoryisasetassociativestatememorythatkeepsinformationforalltheitemsintheattractionmemoriesbelowit,butcontainsnodata.ThedirectorycananswerthesequestionsIsthisitembelowmeandDoesthisitemexistoutsidemysubsystemFromthebusabove,thedirectoryssnoopingprotocoldirectoryabovebehavesverymuchlikethememoryaboveprotocol.Fromthebusbelow,itsdirectorybelowprotocolbehaveslikethetopprotocolforitemsintheExclusivestate.ThismakesoperationsonitemslocaltoabusidenticaltothoseofthesinglebusDDM.Thedirectorypassesthroughonlytransactionsfrombelowthatcannotbecompletedinsideitssubsystemortransactionsfromabovethatneedtobeservicedbyitssubsystem.Inthatsense,thedirectoryactsasafilter.AsFigure6shows,thedirectoryhasasmalloutputbufferaboveittostoretransactionswaitingtobesentonthehigherbus.Transactionsforthelowerbusarestoredinanotheroutputbufferbelow,andtransactionsfromthelowerbusarestoredinaninputbuffer.Adirectoryreadsfromtheinputbufferwhenithasthetimeandspacetodoalookupinitsstatusmemory.Thisisnotpartoftheatomicsnoopingactionofthebus.ThehierarchicalDDManditsprotocolhaveseveralsimilaritieswitharchitecturesproposedbyWilson5andGoodmanandWoest.6DDMis,however,differentinitsuseoftransientstatesintheprotocol,itslackofphysicallysharedmemory,anditsnetworkhigherlevelcachesthatstoresonlystateinformationandnodata.Multilevelread.Ifthesubsystemsconnectedtothebuscannotsatisfyareadrequest,thenexthigherdirectoryretransmitstherequestonthenexthigherbus.ThedirectoryalsochangestheitemsstatetoReading,markingtheoutstandingrequest.Eventually,therequestreachesalevelinthehierarchywhereadirectorycontainingacopyoftheitemisselectedtoanswertherequest.TheselecteddirectorychangestheitemsstatetoAnswering,markinganoutstandingrequestfromabove,andretransmitstheReadrequestonitslowerbus.AsFigure7shows,thetransientstatesReadingandAnsweringinthedirectoriesmarktherequestspaththroughthehierarchy,likeanunwoundredreadthreadthatshowsthewaythroughamaze,appearinginredinFigure7.Aflowcontrolmechanismintheprotocolpreventsdeadlockiftoomanyprocessorstrytounwindareadthreadtothesamesetinadirectory.Whentherequestfinallyreachesanattractionmemorywithacopyoftheitem,itsdatareplysimplyfollowsthereadthreadbacktotherequestingnode,changingallthestatesalongthepathtoShared.CombinedreadsandbroadcastsaresimpletoimplementinDDM.IfaReadrequestfindsthereadthreadunwoundfortherequesteditemReadingorAnsweringstate,itsimplyterminatesandwaitsfortheDatareplythateventuallywillfollowthatpathonitswayback.Multilevelwrite.AnErasefrombelowtoadirectorywiththeiteminExclusivestateresultsinanExclusiveacknowledgmentbeingsentbelow.AnErasethatcannotgetitsacknowledgmentfromthedirectorywillworkitswayupthehierarchy,changingthedirectoriesstatestoWaitingtomarktheoutstandingrequest.AllsubsystemsofabuscarryinganErasetransactionwillgettheircopieserased.ThepropagationoftheEraseendswhenitreachesadirectoryinExclusiveorthetop,andtheacknowledgmentissentbackalongthepathmarkedWaiting,changingthestatestoExclusive.AwriteracebetweenanytwoprocessorsinthehierarchicalDDMhasasolutionsimilartothatofasinglebusDDM.ThetwoEraserequestsarepropagatedupthehierarchy.ThefirstErasetransactiontoreachthelowestbuscommontobothprocessorsisthewinner,asshowninFigure8.ThelosingattractionmemoryinReadingandWaitingwillrestartanewwriteactionautomaticallyuponreceiptoftheerase.ReplacementinthehierarchicalDDM.ReplacementofaSharediteminthehierarchicalDDMresultsinanOuttransactionpropagatingupthehierarchyandterminatingwhenitfindsasubsysteminanyofthefollowingstatesShared,Reading,Waiting,orAnswering.IfthelastcopyofanitemmarkedSharedisreplaced,anOuttransaction48COMPUTERsiveacknowledgeloser.....Figure7.AreadrequestfromprocessorP,hasfounditswaytoacopyoftheitemintheattractionmemoryofprocessorp,.ItspathismarkedwithstatesReadingRandAnsweringA,whichwillguidethedatareplybacktoP,.IindicatesprocessorsintheInvalidstate,SprocessorsintheSharedstate.Figure8.AwriteracebetweentwoprocessorsP,andP,isresolvedwhentherequestoriginatingfromP,reachesthetopbusthelowestbuscommontobothprocessors.Thetopcannowsendtheacknowledgment,Exclusive,whichfollowsthepathmarkedwithWsprocessorsintheWaitingstatebacktothewinningprocessorP,.TheWaitingstatesarechangedtoExclusivebytheacknowledgment.TheErasetransactionwillerasethedatainP,andP,,forcingP,toredoitswriteattempt.thatfailstoterminatewillreachadirectoryinExclusiveandturnintoanInjecttransaction.ReplacinganiteminExclusivegeneratesanInjecttransactionthattriestofindanemptyspaceinaneighboringattractionmemory.InjecttransactionsfirsttrytofindanemptyspaceintheattractionmemoriesofthelocalDDMbus,asinthesinglebusDDM.UnlikeinasinglebusDDM,anInjectfailingtofindanemptyspaceonthelocalDDMbuswillturntoaspecialbus,itshomebus,determinedbytheitemidentifier.Onthehomebus,theInjectwillforceitselfintoanattractionmemory,possiblybythrowingoutaforeigneroraShareditem.Theitemhomespaceisequallydividedbetweenthebottommostbuses,andthereforespaceisguaranteedonthehomebus.ThepreferredlocationinDDMisdifferentfrommemorylocationinNUMAsinthatanitemseeksahomeonlyatreplacementafterfailingtofindspaceelsewhere.Whentheitemisnotinitshomeplace,otheritemscanuseitsplace.ThehomealsodiffersfromtheNUMAapproachinbeingabusAnyattractionmemoryonthatbuswilldo.Thedetailsofthedirectoryprotocolsareavailableelsewhere.4Replacementinadirectory.BaerandWangstudiedthemultilevelinclusionproperty,whichhasthefollowingimplicationsforoursystemAdirectoryatleveli1hastobeasupersetofthedirectories,orattractionmemories,atleveli.Inotherwords,thesizeofadirectoryanditsassociativitynumberofwaysmustbeB,timesthatoftheunderlyingleveli,whereB,isthebranchfactoroftheunderlyingleveli,andsizemeansthenumberofitemsSizeDit,,B,4Dir,AssociativityDir,,B,Dir,Evenifimplementable,higherlevelmemorieswouldbecomeexpensiveandslowifthosepropertieswerefulfilledforalargehierarchicalsystem.However,theeffectsofthemultilevelinclusionpropertyarelimitedinDDM.Itstoresonlystateinformationinitsdirectoriesanddoesnotreplicatedatainthehigherlevels.Yetanotherwaytolimittheeffectistouseimperfectdirectorieswithsmallersetslowernumberofwaysthanwhatisrequiredformultilevelinclusionandtogivethedirectoriestheabilitytoperformreplacement,thatis,tomoveallcopiesofanitemoutoftheirsubsystem.Wecankeeptheprobabilityofreplacementatareasonablelevelbyincreasingtheassociativitymoderatelyhigherupinthehierarchy.Ahigherdegreeofsharingalsohelpstokeepthatprobabilitylow.Ashareditemoccupiesspaceinmanyattractionmemories,butonlyonespaceinthedirectoriesabovethem.Theimplementationofdirectoryreplacementrequiresoneextrastateandtwoextratransactions.4Otherprotocols.Ourprotocolgivestheprogrammerasequentiallyconsistentsystem.Itfulfillsthestrongestmemoryaccessmodel,butperformanceisdegradedbecausetheprocessorhastowaitfortheacknowledgmentbeforeitcanperformthewrite.However,theacknowledgmentissentbythetopmostnodeofthesubsysteminwhichallcopiesoftheitemreside,insteadofbyeachindividualattractionmemorywithacopy.Thisnotonlyreducestheremotedelay,italsocutsdownthenumberofsystemtransactions.Thewritermightactuallyreceivetheacknowledgmentbeforeallcopiesareerased.Nevertheless,sequentialconsistencycanbeguaranteed.xThehierarchicalstructurecanalsoefficientlysupportlooserformsofconsistencyprovidinghigherperformance.WehavedesignedaprocessorconsistentprotocolXandaprotocolcombiningprocessorconsistencywithanadaptivewriteupdatestrategy.IncreasingthebandwidthAlthoughmostmemoryaccessestendtobelocalizedinthemachine,thehierarchyshigherlevelsmayneverthelessdemandahigherbandwidththanthelowersystems,creatingabottleneck.Totaketheloadoffthehigherlevels,wecanuseasmallerbranchfactorattheSeptember199249Figure9.Increasingthebandwidthofabusbysplittingbuses.AOddIEvenAIFF...DirectoryDirectoryItopofthehierarchythanlowerdown.Thissolution,however,increasesthelevelsinthehierarchy,resultinginalongerremoteaccessdelayandanincreasedmemoryoverhead.Instead,wecanwidenthehigherlevelsofthehierarchytoproduceafattree.9Wesplitadirectoryintotwodirectorieshalftheoriginaldirectoryssize.Thetwodirectoriesdealwithdifferentaddressdomainsevenandodd.Thecommunicationwithotherdirectoriesisalsosplit,whichdoublesthebandwidth.Wecanperformasplitanynumberoftimesandatanylevelofthehierarchy.Figure9showsthatregardlessofthenumberofsplits,thearchitectureisstillhierarchicaltoeachspecificaddress.YetanothersolutionisaheterogeneousnetworkWeusethehierarchywithitsadvantagesasfaraspossibleandtieseveralhierarchiestogetheratRelatedactivitiesAttheSwedishInstituteofComputerScience,wearedevelopinganoperatingsystemfortheDDMprototype.ThisworkisbasedontheMachoperatingsystemfromCarnegieMellonUniversity,whichwemodifiedtosupportDDMefficiently.Relatedactivitiesinvolveahardwareprefetchingschemethatdynamicallyprefetchesitemstotheattractionmemorythisisespeciallyusefulwhenaprocessisstartedormigrated.Wearealsoexperimentingwithalternativeprotocols.TheemulatorrunsontheMeikotransputerplatformandmodelsanarchitecturewithatreeshapedlinkbasedstructure,withtransputersasdirectories.Thetransputersfourlinkspermitabranchfactorofthreeateachlevel.Thetransputersattheleavesexecutetheapplication.AllreferencestoglobaldataareinterceptedandhandledinaDDMmannerbysoftware.TheemulatorsDDMprotocolhasadifferentrepresentationsuitedforalinkbasedarchitecturestructuredlikeatree,ratherthanforabusbasedarchitecture.Theimplementationhascertainsimilaritiestodirectorybasedsystems.ADDMemulatoriscurrentlyunderdevelopmentattheUniversityofBristol.2References1.E.Hagersten.TowardsaScalableCacheOnlyMemoryArchitecture,PhDthesis,SICSDissertationSeries08,SwedishInstituteofComputerScience,Kista,Sweden,1992.2.S.RainaandD.H.D.Warren,TrafficPatternsinaScalableMultiprocessorThroughTransputerEmulation,Proc.Hawaiilnt7Conf.SystemSciences,Vol.I,IEEECSPress,LosAlamitos,Calif.,OrderNo.2420,1992,pp.267276.theirtopsbyageneralnetworkwithadirectorybasedprotocol.Thisschemerequiressomechangesintheprotocoltoachievethesameconsistencymodel.TheDDMprototypeprojectAprototypeDDMdesignisnearcompletionattheSwedishInstituteofComputerScience.ThehardwareimplementationoftheprocessorandattractionmemoryisbasedonthesystemTP881VbyTadpoleTechnology,UK.EachsuchsystemhasuptofourMotorolaMC8810020MHzprocessors,eachonewithtwoMC8820016Kbytecachesandmemorymanagementunits8or32MbytesofDRAMandinterfacesfortheSCSIbus,Ethernet,andterminals,allconnectedbytheMotorolaMbusasshowninFigure10.WearedevelopingaDDMnodecontrollerboardtohostasingleportedstatememory.AsFigure10shows,itwillinterfacetheTP881VnodewiththefirstlevelDDMbus.ThenodecontrollersnoopsaccessesbetweentheprocessorcachesandthememoryoftheTP881Vaccordingtothememorybelowprotocol,andalsosnoopstheDDMbusaccordingtothememoryaboveprotocol.Wehaveintegratedthecopybackprotocolofmultipleprocessorcachesintotheprotocolmechanisms.Thenodecontrollerthuschangesthememorysbehaviorintothatofanattractionmemory.Readaccessestotheattractionmemorytakeeightcyclespercacheline,whichisonemorethanintheoriginalTP881Vsystem.Writeaccessestotheattractionmemorytake12cyclescomparedwith10cyclesfortheoriginalsystem.Aread/writemixof3/1totheattractionmemoryresultsinanaccesstimetotheattractionmemoryontheaverage16percentslowerthanthattotheoriginalTP881Vmemory.AsTable1shows,aremotereadtoanodeonthesameDDMbustakes55cyclesatbest,mostofwhicharespentmakingMbustransactionsatotaloffouraccesses.Readaccessesclimbingonestepupanddownthehierarchyaddabout45extracycles.Writeaccessestosharedstatetakeatbest40cyclesforoneleveland50cyclesfortwolevels.TheDDMbusispipelinedinfourphasestransactioncode,snoop,selection,anddata.Wedesignedourinitial50cTCOMPUTERbusconservatively,sincepushingthebusspeedisnotaprimarygoalofthisresearch.TheprototypeDDMbusoperatesat20MHz,witha32bitdatabusanda32bitaddressbus.Itprovidesamoderatebandwidthofabout80Mbytespersecond,whichisenoughforconnectinguptoeightnodesthatis,32processors.Still,thebandwidthhasnotbeenthelimitingfactorinoursimulationstudies.Wecanincreasebusbandwidthmanytimesbyusingotherstructures.TheslottedringbusproposedbyBarossoandDuboisOhasabandwidthoneorderofmagnitudehigher.Fortranslationstoitemidentifiers,DDMusesthenormalproceduresfortranslatingvirtualaddressestophysicaladdresses,asimplementedinstandardmemorymanagementunits.Thismeansthatanoperatingsystemhasknowledgeofphysicalpages.Anyattractionmemorynodecanhaveaconnecteddisk.Uponapagein,thenodefirstattractsallthedataofanitempageasbeingtemporarilylockedtoitsattractionmemory.Iftheitemsofthatpagewerenotpresentinthemachineearlier,theyarebornatthistimethroughtheprotocol.Thenthenodecopiesbydirectmemoryaccessthepagefromthedisktotheattractionmemory,unlockingthedataatthesametime.Pageoutreversestheprocess,copyingadirtypagebacktothedisk.Theoperatingsystemcanpurgetheitemsofunusedpagesformoresharing.itemspace.BecauseeachsetintheattractionmemoryisdividedtwowaYs,16itemscanresideinthesameset.Inadditiontothefourbitsneededtotellitemsapart,eachitemneedsfourMemoryoverheadStateinDelay,Delay,CPUAttractionOneLevelTwoLevelsAccessMemorycyclescyclesReadInvalid55100WriteShared4050WriteInvalid80130ItmightseemthatanimplementationofDDMwouldrequirefarmorememorythanalternativearchitectures.Extramemoryisrequiredforstoringstatebitsandaddresskeysforthesetassociativeattractionmemories,aswellasforthedirectories.Wehavecalculatedtheextrabitsneededifallitemsresideinonlyonecopyworstcase.Weassumeanitemsizeof16bytesthecachelinesizeoftheMotorolaMC88200.A32processorDDMthatis,aonelevelDDMwithamaximumofeighttwowaysetassociativeattractionmemoriesneedsfourbitsofaddresstagperitem,regardlessoftheattractionmemorysize.Aswesaidearlier,theitemspaceisnotlargerthanthesumofthesizesoftheattractionmemories,sothesizeofeachattractionmemoryiscontrollermemory1metTP881ViFigure10.AnodeoftheDDMprototypeconsistingoffourprocessorssharingoneattractionmemory.bitsofstate.Thus,anitemsizeof128bitsgivesanoverheadof44/1286percent.Addinganotherlayerwitheighteightwaysetassociativedirectoriesbringsthemaximumnumberofprocessorsto256.Thesizeofthedirectoriesisthesumofthesizesoftheattractionmemoriesintheirsubsystems.Adirectoryentryconsistsofsixbitsfortheaddresstagandfourbitsofstateperitem,usingacalculationsimilartotheoneabove.Theoverheadintheattractionmemoriesislargerthaninthepreviousexamplebecauseofthelargeritemspacesevenbitsofaddresstagandfourbitsofstate.Thetotaloverheadperitemis6474/12816percent.Alargeritemsizewould,ofcourse,decreasetheseoverheads.Tominimizethememoryoverhead,wecanuseadifferentinterpretationoftheimplicitstatefordifferentpartsoftheitemspace.InourinitialimplementationofDDM,theabsenceofanentryinadirectoryisinterpretedasInvalid.Thereplacementalgorithmintroducesahomebusforanitem.Ifanitemismostoftenfoundinitshomebusandnowhereelse,theabsenceofanentryinadirectorycouldinsteadbeinterpretedasExclusiveforitemsinitshomesubsystem,andasInvalidforitemsfromoutside.Thiswoulddrasticallyreduceadirectoryssize.Thetechniquewouldbepracticalonlytoalimitedextent.Toosmalldirectoriesrestrictthenumberofitemsmovingoutoftheirsubsystemsandthuslimitsharingandmigration,resultingindrawbackssimilartothoseofNUMAs.Itemspaceisslightlysmallerthanthesumoftheattractionmemoriesbecauseofsharinginthesystem.Thisintroducesamemoryoverheadnottakenintoaccountintheabovecalculations.However,inaCOMAacacheditemOCCUpiesonlyonespace,whileinothersharedmemoryarchitecturesitrequirestwospacesoneinthecacheandoneinthesharedmemory.SimulatedperformanceWeusedanexecutiondrivensimulationenvironmentthatletsusstudylargeprogramsrunningonmanyprocessorsinareasonableamountoftime.WeparameterizedtheDDMsimulationSeptember199251modelwithdatafromourongoingprototypeproject.ThemodelaccuratelydescribesDDMbehavior,includingthecompromisesintroducedbytakinganexistingcommercialproductasastartingpoint.Themodelalsodescribespartsofthevirtualmemoryhandlingsystem.Weusedtwoway1Mbyteattractionmemoriesandaprotocolsimilartotheonedescribedhere,providingsequentialconsistency.Forarepresentationofapplicationsfromengineeringandsymboliccomputing,westudiedparallelexecutionoftheStanfordParallelApplicationsforSharedMemorySplash,theORparalle1PrologsystemMuse,andamatrixmultiplicationprogram.AllprogramswereoriginallywrittenforUMAarchitecturesSequentSym10080Q6oUa,QI40200/Lineartrix500x500020406080100ProcessorsFigure11.Speedupcurvesforsomeofthereportedprograms.TheSplashWaterprogramsimulatesthemovementsofwatermolecules.ItsexecutiontimeisOrnZ,wherernisthenumberofmolecules.Therefore,itisoftensimulatedwithasmallworkingset.Weused192moleculesandaworkingsetof320Kbytes.Eachofthe96processorsinFigure11handlesonlytwomolecules.Mostofthelocalityinthesmallworkingsetcanbeexploitedontheprocessorcache,andonlyabout44percentofthetransactionsreachingtheattractionmemorywillhit.Arealsizeworkingsetwouldstillhavethesamegoodlocalityandwouldbenefitmorefromthelargeattractionmemoriestomaintainthespeedup.Wetestedthishypothesiswithasinglerunwith384molecules,asshowninTable2.metryorEncoreMultimaxcomputersandusestaticordynamicscheduleralgorithms.TheyadaptwelltoaCOMAwithoutanychanges.AllprogramstakeontheorderofoneCPUminutetorunsequentially,withoutanysimulations,memory.onaSunSparcstation.ThespeedupsreportedinFigure11andTable2arerelativetotheexecutionofasingleDDMnodewithoneprocessor,assuminga100percenthitrateintheattractionTheSplashMP3Dprogramisawindtunnelsimulatorwithwhichagoodspeedupishardertoachievebecauseofahighinvalidationfrequencyresultinginapoorhitrate.TheprogramisoftenrunwiththememoryfilledwithdataTable2.StatisticsfromDDMsimulations.Hitratestatisticsarefordataonly,exceptwithMuse,whereweusedaunifiedIDcache.Theremoteaccessrateisthepercentageofthedataaccessesissuedbyaprocessorthatcreateremotecoherencetraffic.AnincreasedworkingsetresultsinlessloadonthebusesforWaterandCholesky.WaterMP3DMP3DDiffCholeskyMatrixMuseInputdataColdstartincludedDDMtopologyHitratedatapercentDcacheAttractionmemoryRemoteaccessrateBusutilizationpercentMbusLowerDDMbusTopDDMbusSpeeduppernumberofprocessors192384moleculesmolecules999944650.60.431263930252052/6453/6475,00075,000particlesparticlesnono869240888.41.o8654882466136/3219/32m14smallyes2x829663.870807010132m15largeYes2X8X289742.860664917132500x500PunditYesno8x44x49298.59891O.160.2029/32I1652COMPUTERr
编号:201401051948406822    大小:1.07MB    格式:PDF    上传时间:2014-01-05
  【编辑】
5
关 键 词:
工业、机械、能源、设计、建模、模具、工学
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供交流平台,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
  人人文库网所有资源均是用户自行上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作他用。
0条评论

还可以输入200字符

暂无评论,赶快抢占沙发吧。

当前资源信息

4.0
 
(2人评价)
浏览:7次
baixue100上传于2014-01-05

官方联系方式

客服手机:17625900360   
2:不支持迅雷下载,请使用浏览器下载   
3:不支持QQ浏览器下载,请用其他浏览器   
4:下载后的文档和图纸-无水印   
5:文档经过压缩,下载后原文更清晰   

相关资源

相关资源

相关搜索

工业、机械、能源、设计、建模、模具、工学  
关于我们 - 网站声明 - 网站地图 - 友情链接 - 网站客服客服 - 联系我们
[email protected] 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5