会员注册 | 登录 | 微信快捷登录 支付宝快捷登录 QQ登录 微博登录 | 帮助中心 人人文库renrendoc.com美如初恋!
站内搜索 百度文库

热门搜索: 直缝焊接机 矿井提升机 循环球式转向器图纸 机器人手爪发展史 管道机器人dwg 动平衡试验台设计

   首页 人人文库网 > 资源分类 > PDF文档下载

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf

  • 资源星级:
  • 资源大小:950.29KB   全文页数:8页
  • 资源格式: PDF        下载权限:注册会员/VIP会员
您还没有登陆,请先登录。登陆后即可下载此文档。
  合作网站登录: 微信快捷登录 支付宝快捷登录   QQ登录   微博登录
友情提示
2:本站资源不支持迅雷下载,请使用浏览器直接下载(不支持QQ浏览器)
3:本站资源下载后的文档和图纸-无水印,预览文档经过压缩,下载后原文更清晰   

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf

USINGCACHEMEMORYTOREDUCEPROCESSORMEMORYTRAFFICJamesR.GoodmanDepartmentofComputerSciencesUniversityofWisconsinMadisonMadison,WI58706ABSTRACTTheimportanceofreducingprocessormemorybandwidthisrecognizedintwodistinctsituationssingleboardcomputersystemsandmicroprocessorsofthefuture.Cachememoryisinvestigatedasawaytoreducethememoryprocessortraffic.Weshowthattraditionalcacheswhichdependheavilyonspatiallocalitylookaheadfortheirperformanceareinappropriateintheseenvironmentsbecausetheygeneratelargeburstsofbustraffic.Acacheexploitingprimarilytemporallocalitylookbehindisthenproposedanddemonstratedtobeeffectiveinanenvironmentwhereprocessswitchesareinfrequent.Wearguethatsuchanenvironmentispossibleifthetraffictobackingstoreissmallenoughthatmanyprocessorscanshareacommonmemoryandifthecachedataconsistencyproblemissolved.Wedemonstratethatsuchacachecanindeedreducetraffictomemorygreatly,andintroducee.relegantsolutiontothecachecoherencyproblem.1.IntroductionBecausetherearestraightforwardwaystoconstructpowerful,costeffectivesystemsusingrandomaccessmemoriesandsinglechipmicroprocessors,semiconductortechnologyhas,untilnow,hadthegreatestimpactthroughthesecomponents.Highperformanceprocessors,however,arestillbeyondthecapabilityofasinglechipimplementationandarenoteasilypartitionedinawaywhichcaneffectivelyexploitthetechnologyandeconomiesofVLSI.Aninterestingphenomenonhasoccurredinthepreviousdecadeasaresultofthisdisparity.Memorycostshavedroppedradicallyandconsistentlyforcomputersystemsofallsizes.WhilethecomponentcostofaCPUsinglechipimplementationsexcludedhasdeclinedsignificantlyoverthesameperiod,thereductionhasbeenlessdramatic.Aresultisthattheamountofmemorythoughttoheappropriateforagivenspeedprocessorhasgrowndramaticallyinrecentyears.Todaysmallminicomputershavememoryaslargeasthatofthemostexpensivemachinesofadecadeago.Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.TheimpactofVLSIhasbeenverydifferentinmicroprocessorapplications.Herememoryisstillregardedasanexpensivecomponentinthesystem,andthosefamiliarprimarilywithaminicomputerormainframeenvironmentareoftenscornfulofthetroubletowhichmicroprocessorusersgotoconservememory.Thereason,ofcourse,isthateventhesmallmemoryinamicroprocessorisamuchlargerportionofthetotalsystemcostthanthemuchlargermemoryonatypicalmainframesystem.Thisresultsfromthefactthatmemoryandprocessorsareimplementedinthesametechnology.1.1.ASuperCPUWiththeadvancestoVLSIoccurringnowandcontinuingoverthenextfewyears,itwillbecomepossibletofabricatecircuitsthatareonetotwoordersofmagnitudemorecomplexthancurrentlyavailablemicroprocessors.ItwillsoonbepossibletofabricateanextremelyhighperformanceCPUonasinglechip,IftheentirechipisdevotedtotheCPU,however,itisnotagoodidea.Extrapolatinghistoricaltrendstopredictfuturecomponentdensities,wemightexpectthatwithinafewyearsweshouldbeabletopurchaseasinglechipprocessorcontainingatleasttentimesasmanytransistorsasoccurin,say,theMC68000.FortheempiricalruleknownasGroschslaw\Grosch53\,PkCg,wherePissomemeasureofperformance,Cisthecost,andkandgareconstants,Knight\Knight66\concludedthatgisatleast2,andSolomon\Solomon66\hassuggestedthatg1.47.FortheIBMSystem/B70family,Siewiorekdeterminedthatgal.8\Siewiorek82\.WhileGroschslawbreaksdowninthecomparisonofprocessorsusingdifferenttechnologyorarchitectures,itisrealisticforpredictingimprovementswithinasingletechnology.Siewiorekinfactsuggeststhatitholdsbydefinition.Assumingg1.5andusingprocessormemorybandwidthasourmeasureofperformance,Groschslawpredictsthataprocessorcontainingi0timesasmanytransistorsasacurrentmicroprocessorwouldrequire30timesthememorybandwidth.TheMotorolaMC68000,runningat10MHz,accessesdatafrommemoryatamaximumrateof5millionbytespersecond,usingmorethanhalfitspinstoachievethisrate.Althoughpackagingtechnologyisrapidlyincreasingthepinsavailabletoachip,itisunlikelythattheincreasewillhe30foldthe68000has84pins.Wewouldsuggestafactoroftwoisrealistic.Althoughsometechniquesareclearlypossibletoincreasethetransferrateintoandoutofthe68000,supplyingsuchaprocessorwithdataasfastasneededisasevereconstraint.Oneofthedesignersofthe88000,hasstatedthatallmodernmicroprocessorsthe680001Thisisaconservativeestimate,infact,becauseitignorespredictabledecreasesinatedelays.©1983ACM01497111/83/0600/0124501.00124includedarealreadybuslimited\TredennickB2\.1.2.On.chipMemoryOnealternativeforincreasedperformancewithoutproportionatelyincreasingprocessormemorybandwidthistointroducememoryonthesamechipwiththeCPU.Withtheabilitytofabricatechipscontainingonetotwomilliontransistors,itshouldbepossibleusingonlyaportionofthechiptobuildaprocessorsignificantlymorepowerfulthananycurrentlyavailablesinglechipCPU.WhiledevotingtheentirechiptotheCPUcouldresultinastillmorepowerfulprocessor,introducingonchipmemoryoffersareductioninmemoryaccesstimeduetotheinherentlysmallerdelaysascomparedtointerchipdatatransfers.Ifmostaccesseswereonchip,itmightactuallyperformasfastasthemorepowerfulprocessor.Ideally,thechipshouldcontainasmuchmemoryastheprocessorneedsformainstorage.Conventionalwisdomtodaysaysthataprocessorofthespeedofcurrentmicroprocessorsneedsatleast1/¢megabytesofmemory\LindsayB1\.Thisiscertainlymorethanisfeasibleonchip,thoughahighperformanceprocessorcouldprobablyusesubstantiallymorethanthat.ClearlyalltheprimarymemoryfortheprocessorcannotbeplacedonthesamechipwithapowerfulCPU.Whatisneededisthetopelementofamemoryhierarchy,1.3.CacheMemoryTheuseofcachememory,however,hasoftenaggravatedthebandwidthproblemratherthanreduceit.Smith\SmithB2\saysthatoptimizingthedesignhasfourgeneralaspects1maximizingthehitratio,2minimizingtheaccesstimetodatainthecache,3minimizingthedelayduetoamiss,and4minimizingtheoverheadsofupdatingmainmemory,maintainingmulticacheconsistency,etc.Theresultisoftenalargerburstbandwidthrequirementfrommainstoragetothecachethanwouldbenecessarywithoutacache.Forexample,thecacheontheIBMSystem/370model16B,iscapableofreceivingdatafrommainmemoryatarateof100megabytespersecond\IBM76\,ItsuppliesdatatotheCPUatlessthan1/3thatrate.Thereasonisthattoexploitthespatiallocalityinmemoryreferences,thedatatransferredfrombackingstoreintothecacheisfetchedinlargeblocks,resultinginrequirementsofveryhighbandwidthburstsofdata.WehavemeasuredtheaveragebandwidthonanIBMSystem/370model155,andconcludedthattheflyeragebackingstoretocachetrafficislessthanthecaehetoCPUtraffic.Thedesignofcachememoryforminicomputersdemandedgreaterconcernforbusbandwidth.ThedesignersofthePDP11models80and70clearlyrecognizedthatsmallblocksizeswerenecessarytokeepmainmemorytraffictoaminimum\Bell78\.Loweringthebandwidthfrombackingstoretothecachecanbeaccomplishedinoneoftwoways1smallblocksofdataarebroughtfrombackingstoretothecache,or2longdelaysoccurwhileablockisbeingbroughtin,independentofandinadditiontotheaccesstimeofthebackingstore.Whileitispossibletobringinthewordrequestedinitiallyreadthrough,thusreducingthewaitonagivenreference,thelowbandwidthmemoryinterfacewillremainbusylongaftertheinitialtransferiscompleted,resultinginlongdelaysifasecondbackingstorageoperationisrequired.Wethereforehaveexploredtheeffectivenessofacachewhichexploitsprimarilyorexclusivelytemporallocality,i.e.,theblocksfetchedfrombackingstoreareonlythesizeneededbytheCPUorpossiblyslightlylarger.Inconsideringwaystoevaluatethisstrategy,weidentifiedacommercialenvironmentthatcontainedmanyofthesameconstraintsandseemedamenabletothesamekindsofsolutions.ThisenvironmentisthemarketplaceofthesingleboardcomputerrunningonastandardbussuchasMultibusorVersabus.sWehavechosentostudythisenvironmentinanattempttogaininsightintotheoriginal,generalscheme.2.TheSingleBoardComputerApplicationAsingleboardcomputertypicallycontainsamicroprocessorandasubstantialamountofmemory,thoughsmallenoughthatitmustbeusedcarefully.Ifneeded,accesstoadditionalrandomaccessmemoryisthroughthebus,whichisdesignedforgeneralityandsimplicity,notforhighperformance.Multibus,inparticular,wasdefinedintheearly70stoofferaninexpensivemeansofcommunicationamongavarietyofsubsystems.AlthoughoriginallyintroducedbyIntelCorporation,ithasfoundwideacceptance,havingbeenproposedinaslightlymodifiedformastheIEEEP796busstandard\/EEEB0\.Currently,severalhundredvendorsofferMultibuscompatiblecards.Whilethemarkethasrapidlydevelopedforproductsusingthisbus,itsapplicationsarelimitedbythesevereconstraintimposedbythebandwidthofMultibus.Clearlythebusbandwidthcouldbeincreasedbyincreasingthenumberofpins,andbymodifyingtheprotocol.Itsbroadpopularityandtheavailabilityofcomponentstoimplementitsprotocolmean,however,thatitislikelytosurvivemanyyearsinitspresentform.Thusalargemarketexistsforacomputeronacardwhich,muchasifitwereallonasinglechip.hasseverelimitationsonitscommunicationswiththerestofthesystem.WedecidedtodetermineifacachememorysystemcouldbeimplementedeffectivelyintheMultibusenvironment.Tothatendwehavedesignedacachetobeusedwithacurrentgenerationmicroprocessor.Inaddition,wehavedoneextensivesimulationofcacheperformance,drivenbymemorytracedata.WehaveidentifiedanewcomponentwhichisparticularlysuitedforVLSIimplementationandhavedemonstrateditsfeasibilitybydesigningit\RavishankarB3\.Thiscomponent,whichimplementsthetagmemoryforadynamicRAMcacheintendedforamicroprocessor,issimilarinmanyrespectstotherecentlyannouncedTMS2150\Tree\.Multibussystemshavegenerallydealtwiththeproblemoflimitedbusbandwidthbyremovingmostoftheprocessormemoryaccessesfromthebus.Eachprocessorcardhasitsownlocalmemory,whichmaybeaddressabletoothersthroughtheMultibus.Whilethisapproachhasmuchincommonwithours,webelievethattheallocationofmemorylocalorremoteshouldbehandledbythesystem,freeingtheprogrammerofthistask.IntypicalMultibusapplications,considerableeffortisexpendedguaranteeingthattheprogramrunningisprimarilyresidentonboard.Thisapproachisviableforastaticpartitioningoftasks.Resultstodatehavebeenmuchlesssatisfactory,however,forthemoregeneralsituationwhereanumberofprocessorsaredynamicallyallocated.Forefficiencyreasonsitalsoprecludestheuseofsharedcodesegments.eMultibusisatrademarkofIntelCorporation.SVersabusisatrademarkofMotorola.125Inmanyenvironments,asimplp,dynamichardwareallocationschemecanefficientlydeterminewhatmemorylocationsarebeingaccessedfrequentlyandshouldthereforebekeptinlocalmemorybetterthantheprogrammerwhooftenhaslittleinsightintothedynamiccharacteristicsofhisprogram.Thereareenvironmentswheretheprogrammerisintimatelyfamiliarwiththebehaviorofhisprogramandcangeneratecodetotakeadvantageofit.Inthisenvironmentthetimespentrunningaprogramisoftenmuchmoresubstantialthanthetimedevelopingtheprogram.Thisexplains,forexample,whyaninvisiblecacheisnotappropriateontheCRAY1.Webelievethatfreeingtheprogrammerfromconcernaboutmemoryallocationisessentialwhereprogrammerproductivityiscritical.B.i.ASingleBoardComputerwithCacheToevaluateourapproach,weproposedasingleboardcomputercontaining,possiblyalongwithotherthingsaCPUandnolocalmemoryexceptacache,withbackingstoreprovidedthroughMultibus.ThuswepickedanimportantprobleminitsownrightCanwebuildacachethatworkswithaMultibussystemsupportingmultipleprocessorsInparticular,howmanyprocessorscanwesupportrunninginparallelonMultibusWebelievethatasystemwhichcouldreasonablysupportfiveto10processorswouldbeasignificantadvance.ThiscantbecompareddirectlyagainstcurrentsystemsbecauseasingleprocessoroverloadstheMultibus.Thuslocalmemoriesmustbeheavilyexploitedifperformanceisimportant.Earlieranalyses\Kaplan73,Bell74,Rao78,Pate182\haveusedthecachehitratioorsomethingcloselyrelatedtomeasureperformance.Theimportantcriterionhereistomaximizeuseofthebus,notthehitratio,orevennecessarilytooptimizeprocessorperformance.Weoptimizesystemperformancebyoptimizingbusutilization,achievinghigherperformancebyminimizingindividualprocessorsbusrequirements,andtherebysupportingmoreprocessorsreasonablywell.Weallowindividualprocessorstositidleperiodicallyratherthantieupthebusfetchingdatawhichtheymightnotuse.Thisimpliesthatthecachestaledataproblemmustbesolvedeffectively.Wepresentanewsolutioninsection3.2.2.SwitchingContextsWherebusbandwidthislimited,ataskswitchisamajordisturbance,sincethecachemusteffectivelybereloadedatthistime.Theprocessorismomentarilyreducedtoaccessesattherateatwhichthebuscansupplythem.Whilethisproblemseemsunavoidable,itneednotbeseriousiftaskswitchingisminimized.Weareprovidinganenvironmentwhichallowsmanyprocessorstoworkoutofasinglemonolithicmemoryinparallel.Ifmoreparalleltasksarerequired,moreprocessorscanbeused.WepointoutthatthecurrentMultibusalternativeistomovetheprogramintolocalmemory,anoperationwhichalsoswampsthebus.Thetaskswitchmerelymakesthisoperationimplicit,andavoidsbringingacrossthebusdatawhichareneveractuallyused.Writingtheolddataoutisalsonoworsethanthealternative,sinceweonlywritethatwhichhasbeenchangedandwhichhasnotbeenalreadypurged.Theremaybecertaincasesaninterrupthandlingprogram,forexamplewhereaparticularprogramdoesnotflushthecache,butusesonlyasmallportionofit.Provisionscouldbemadetoallowsuchaprogramtobelockedinthecache.Alternatively,aseparatecachemightbeprovidedforsuchaprogramOurstudiesindicatethatarelativelysmallcachecanbeeffectiveforasingleprogram,soitmaybepossibletokeepseparatecachesaroundforindividualprocessesifthenumberissmall.Wewouldsug,,,tmgthisonestepfurtherandprovidinganadditionalprocessorforeachcache.Aninterestingquestionthenarisesastothecostofdynamicallyassigningprocessestoprocessors.Ourproposalallowsthisassignment,thoughclearlyatsomeperformancepenalty.3.CacheCoherencyItiswellknownthatmultiplecachespresentseriousproblemsbecauseoftheredundancyofstorageofasinglelogicalmemorylocation\Tang76,Censier78,Rao78\.Themostcommonmethodamongcommercialproductsfordealingwiththis,thestaledataproblem,istocreateaspecial,highspeedbusonwhlehaddressesaresentwheneverawriteoperationisperformedbyanyprocessor.Thissolutionhasweaknesses\CeLisier78\whichhavegenerallylimitedcommercialimplementationstotwoprocessors.Inthesinglechipprocessororsingleboardcomputerenvironments,ithastheaddedweaknessthatitrequiresanumberofextraI/0pins.Analternativeapproach,implementedinC,mmp\Hoogendoorn77\andproposedbyNorton\Norton82\.istorequiretheoperatingsystemtorecognizewheninconsistenciesmightoccurandtakestepstopreventthemunderthosecircumstances.Thissolutionisunappealingbecausethecacheisnormallyregardedasanarchitectureindependentfeature,invisibletothesoftware.Athirdapproach,variationsofwhichhavebeenproposedbyCensierandFeautrier\Censier78\,Tang\Tang76\,Widdoes\Widdoes79\,andYenandFu\Yen82\,istousesomeformoftaggedmainmemory,keepingtrackofindividualblocksinthiswaytopreventinconsistency.Individualblocksaretemporarilydesignatedasprivateforaparticularprocessorsothatitmaymodifyitrepeatedlywithoutreferencetomainmemory.Thetagmustbesetwheneversuchacriticalsectionisenteredandresetwheneverthecriticalsectionisleft,i.e.,themodifiedwordiswrittenbacktomainstorage.Thisapproachrequiressubstantialhardware,andappearsinfeasibleforalargenumberofcaches,sinceanoperationinacentralplaceisrequiredattheentryorexitofanycriticalsection.Ourapproachhasmuchincommonwiththethirdapproach,butallowsthecriticalsectioninformationtobedistributedamongthecaches,whereitalreadyresides.Inaddition,weusethenormalreadandwriteoperations,withnotagbitsinmainmemory,toaccomplishthesynchronization.Arelatedscheme\Arndah182\whichusesaspecialbustoconveythenoticeofentryorexitfromacriticalsection,hasbeenimplementedinacommercialproduct,buthasnotbeenpublishedtoourknowledge.Wecallourschemewr/teonce.3.1.WriteThroughorWriteBackWhilethechoicebetweenwriLethroughalsoknownasstorethroughandwritebackalsoknownasstorebackorcopybackhasnobearingonthereadhitratio,ithasamajorimpactonbustraffic,particularlyasthehitratioapproaches100.Inthelimit,whenthehitratioisi00.writebackresultsinnobustrafficatall.whilewritethroughrequiresatleastonebuscycleforeachwriteoperation.Norton\Norton82\concludedthatusingwritebackinsteadofwritethroughforahypotheticalprocessortypicallywouldreducethebustrafficbymorethan50andiftheprocessesrantocompletionbustrafficwouldbedecreasedbyafactorof8.Fortypicalreadtowriteandhitratiosandwhentaskswitchingisinfrequent,oursimulationshavegivenstrongevidencethatwritebackgeneratessubstantiallylessbustrafficthanwritethrough.126Butwritebackhasmoreseverecoherencyproblemsthanwritethrough,sinceevenmainmemorydoesnotalwayscontainthecurrentversionofaparticularmemorylocation.3.2.ANewWriteStrategyWriteOnceWeproposeanewwritestrategywhichsolvesthestaledataproblemandproducesminimalbustraffic.Thereplacementtechniquerequiresthefollowingstructure.AssociatedwitheachblockinthecachearetwobitsdefiningoneoffourstatesfortheassociateddataInvalidThereisnodataintheblock.VainThereisdataintheblockwhichhasbeenreadfrombackingstoreandhasnotbeenmodified.ReservedThedataintheblockhasbeenlocallymodifiedexactlyoncesinceitwasbroughtintothecacheandthechangehasbeentransmittedtobackingstore.D/tryThedataintheblockhasbeenlocallymodifiedmorethanoncesinceitwasbroughtintothecacheandthelatestchangehasnotbeentransmittedtobackingstore.WriteoncerequiresrapidaccesstotheaddresstagsandstatebitpairsconcurrentlywithaccessestotheaddresstagsbytheCPU.Thiscanmosteasilybeachievedbycreatingtwoidenticalcopiesofthetagmemory.Censier\Censier78\claimsthatduplicationistheusualwayoutforresolvingcollisionsbetweencacheinvalidationrequestsandnormalcachereferences.Thisisnotalargecost,sinceasinglechipdesignofthispartofthecacheusingpresenttechnologyisquitefeasible.Further,wehavediscoveredawaytoreducesubstantiallythenumberoftagsrequired.Inaddition,thesamechiptypecouldbeusedforbothinstances.ThisisanaturalwaytopartitionthecacheinVLSIbecauseitresultsinamaximallogictopinratio.Wehavedesignedandsubmittedforfabricationsuchachip\Ravisha_nkar83\.Thetwocopiesalwayscontainexactlythesameaddressdata,becausetheyarealwayswrittensimultaneously.WhileoneunitisusedintheconventionalwaytosupportaccessesbytheCPU,asecondmonitorsallaccessestomemoryviatheMultibus.Foreachsuchoperation,itchecksfortheaddressinthelocalcache.Ifamatchisfoundonawriteoperation,itnot\tiesthecachecontroller,andtheappropriateblockinthecacheismarkedinvalid.Ifamatchisfoundonareadoperation,nothingisdoneunlesstheblockhasbeenmodified,i.e.,itsstateisreservedordirty.Ifitisjustreserved,thestateischangedtovalid.Ifitisdirty,thelocalsystemsinhibitsthebackingstorefromsupplyingthedata.Itthensuppliesthedataitself.4Onthesamebusaccessorimmediatelyfollowingit,thedatamustbewrittentobackingstore.Inaddition,foreitherreservedordirtydata,thestateischangedtovalid.Thisschemeachievescoherencyinthefollowingway.Initiallywritethroughisemployed.However,anadditionalgoalisachieveduponwriting.Allothercachesarepurgedoftheblockbeingwritten,sothecachewritingthroughthebusnowisguaranteedtheonlycopyexceptforbackingstore.Itissoidentifiedbybeingmarkedreserved.Ifitispurgedatthispoint,nowriteisnecessarytobackingstore,sothisisessentiallywritethrough.\fanotherwriteoccurs,theblockismarkeddirty.Nowwritebackisemployedand,onpurging,thedatamustberewrittentobackingstore.4ThereisamechanisminMultibuswhichallowsthiscapability.Unfortunately,itisrarelyused,notwelldefined,andrequiresthat\ocalcachesrespondveryrapidly.Versabushasamuchcleanermechanismbywhichthisendcanbeaccomplished.Writeoncehasthedesirablefeaturethatunitsaccessingbackingstoreneednothaveacache,andneednotknowwhetherothersdoornot.Acacheisresponsibleformaintainingconsistencyexactlyforthosecaseswhereitmightcreateaviolation,i.e.,wheneveritwritestoalocation.ThusitispossibletomixinanarbitrarywaysystemswhichemployacacheandthosewhichdonotthelatterwouldprobablybeI/0devices.Considerablecaremustbeexercised,however,whenawriteoperationoverthebusmodifieslessthananentireblock.4.SimulationWedesignedacachememorysystemtoworkonMultibus.Tovalidateourdesignbeforebuildingitwedidextensivesimulationusingmemorytracedata.Todatewehaveperformedextensivesimulationsforsixtraces,allrunningunderUNIXEDCTheUNIXeditoredrunningascript.ROFFASTheoldUNIXtextprocessorprogramroff.TRACETheprogram,writteninassemblylanguage,whichgeneratedtheabovetracesforthePDP11.NROFFTheprogramtroFinterpretingtheBerkeleymacropackageme.CACHEThetracedrivencachesimulatorprogram.COMPACTAprogramusinganonlinealgorithmwhichcompressesfilesusinganadaptiveHuffmancode.ThefirstthreetracesareforaPDP11,whilethelatterthreeareforaVAX.WhilethePDP11doesnotrunonMultibus,itsinstructionsetissimilartomanymicroprocessorswhichdo,andtheprogramsusedfortracingwereofthekindweenvisionforsuchasystem.ThePDP11issimilarinmanywaystotheMC58000,andhasincommonwiththe8086alimitedaddressingcapability.WhiletheVAXalsodoesnotrunonMultibus,itisanexampleofamoderninstructionsetand,thereforeisareasonableexampleofthekindofprocessorlikelytoappearinasinglechipCPUinthefuture.Italsohasalargeraddressspacewhich,asshowninsection4.3,issignificant.Weareactuallyusingvirtualaddresses,butalloftheprogramsweranaresmallenoughtofitintomainmemory.Sincewearetracingonlyasingleprocess,weconcludethatthereisnosignificantdifferencebetweenvirtualandrealaddresses.Inadditiontocacheparameters,missratiosvarygreatlydependingontheprogramrunning.Fortheeachoftheabovetraces,awideandunpredictablevariationoccurredaswevariedasingleparameter.Thusplottingparametersfortheindividualtraceswasoftennotenlightening.Averagingoverthethreetracesineachcategorygavemuchmorerevealingresults,providingdatathatsuggestedacontinuousfunctionformanyofthevariablesstudied.Thusallourresultsareactuallytheaverageofthreeprograms,eachrunningalone.4.1.EffectofWriteStrategyonBusTrallieAitheughwritethroughnormallygenerateslessbustrafficthanwriteback,thelattercanbeworseifthehitratioislowandtheblocksizeislarge.Underwriteback,whenadirtyblockispurged,theentireblockmustbewrittenout.Withwritethrough,onlythatportionwhichwasmodifiedmustbewritten.Wefoundthatwritebackisdecisivelysuperiortowritethroughexcept1whencacheblocksareverylarge,or2whenthecachesizeisverysmall.SUNIXandNROFFaretrademarksofBellLaboratories.127Writeonceresultsinbustrafficroughlyequaltothebetterofthetwo.Wehavefoundcacheparametersforwhichitactuallyperformsbetterontheaveragethaneitherwritethroughorwritebackforanumberofprograms.Thiswasasurprisingresult,sincewriteoncewasdevelopedtoassurecoherency,nottominimizebustraffic.Thereplacementschemeoutperformsbothwritethroughandwritebackwheneverthetotalnumberofsetsisabout16.Forexample,fora4waysetassociative,2048bytecachewithablocksizeof32bytes,theaveragebustrafficforthreePDP11programsforwhichwehavetraceswas30.768Zforwritethrough,17.55Zforwriteback,and17.387oforwriteonce,4.2.ColdStartvs.WarmStartAnimportantconsiderationindeterminingcachehitratioandbustrafficisthecoldstartperiodknownasthelifetimefunction\Easton78\,duringwhichtimemanymissesoccurbecausethecacheisempty.Thisisdefinedastheperioduntilasmanycachemisseshaveoccurredasthetotalnumberofblocksinthecache.Thisinitialburstofmissesisamortizedoverallaccesses,sothelongerthetraceanalyzed,thelowerthemissratioobtained.Inadditiontotheinitiationofaprogramandoccasionalswitchesofenvironments,acoldstartgenerallyoccurswheneverthereisataskswitch.Thusanimportantassumptionintraditionalcacheevaluationisthefrequencyoftaskswitching.Wehavearguedthattaskswitchingmustbeveryinfrequentinoursystem.Thuswecanmorenearlyapproachinpracticethewarmstarthitratios,andthusitisappropriatetouseverylongtracesofasingleprogram,andassumeawarmstart.Wedidthatinitially,usingthefulllengthofthePDP11tracesavailabletous1,256,570memoryaccesses.Wenoted,however,thatforevenmuchshortertracesthanwewererunning,therewaslittledifferencebetweenwarmstartandcoldstartstatistics.Sincecoldstartstatisticsareeasiertogenerate,wenormallyusedthem.Unlessstatedotherwise,ourresultsarefromcoldstart,butatleast10timesthelifetimefunctionintotallength.4.3.CacheSizeIngeneral,weweresurprisedattheeffectivenessofasmallcache.ForthePDP11traceswithacacheof2Kbytesorlarger,wediscoveredthatessentiallynomissesoccurredafterthecoldstartperiod.Thesearenottrivialprograms,butwererunonamachinewhichhasonly64Kbytesforbothinstructionsanddata.Theprogramsareveryfrugalintheiruseofmemory,andtheentireworkingsetapparentlycanfitinthecache.TheVAXtracesdonotexhibitthesamelocalityobservedwiththePDP11,anda64Kbytecachewasnotlargeenoughtocontaintheentireworkingsetoftheprogram.Thismaybearesultofthelargeraddressspaceavailable,themorecomplexinstructionset,ormorecomplexprograms.InallcasestheprogramswerespreadoutoveramuchlargermemoryspacethanforthePDP11traces.Forthiscomparisonweusedasmallblocksizeof4bytes.ThismayhavehadagreaterimpactontheVAXthanonPDP11traces.Wefoundthatreducingthesizeofthecachebelow4KbytesforthePDP11increasedthemissratioandthebustrafficingeneralthetwocorrelatewellwithrespecttothisparameter.Fig.1showstheaveragemissratioandbustrafficasafunctionoftotalcachesizeforthePDP11traces.Forthisandallresultsgiven,themissratioincludeswrites.Thebustrafficisgivenasapercentofthenumberofaccessesthatwouldberequiredifnocachewerepresent.Fig.2showsthesamedatafortheVAXtraces.4.4.HIockSizeOurcachedesignincorporatesextremelysmallblocks,dependingheavilyontemporallocality.EastonandFagin\Easton78\claimthatpagesizeandmissratiosareindependentforwarmstart,buthighlydependentforcoldstart.Iftrue,thiscanbeexplainedbytheobservationthathitsinthecacheoncoldstartdependheavilyonspatiallocality,whiletemporallocalityprovidesmanyhitswhenitiswarm.Spatiallocality,however,isstronglycorrelatedtoblocksize.beingdirectlyproportionalintheextremecaseofstrictlysequentialmemoryaccesses.OursimulationspartiallyconfirmEastonsobservation.Inparticular,wefoundthat,asblocksizeisincreased,missratiosgenerallydeclineuptoapoint,thenincreaseforeitherwarmorcoldstarts.However,forsmallblocksizes,thewarmstartmissratioismarginallylowerthanforthecoldstartcase,whileforlargeblocksizes,thetwonumbersarenearlyidentical.Seefigs.37.Thisisencouragingsincewehavearguedforrestrictedtaskswitchesourenvironmentismorethatofawarmstartthanisthetraditionalenvironment.Inmanysimulationswewereabletogetveryhighhitratiosoncethecoldstartperiodended.Forsmallblockstransferred,however,thisperiodthelifetimefunctionislonger.Oursimulationsshowveryclearlythatreducingtheblocksizedowntoasingletransferacrossthebusdramaticallydecreasesthehitratio,particularlyforcoldstarts,butalsodecreasesbustrafficsignificantly.Ingeneral,weobservedthatincreasingthetransferblocksizefromonebuscycletotwotypicallydecreasesthemissratioby30to50,whileincreasingthebustrafficby10to20.Thisrelationholdsforthefirsttwodoublings.Theseresults,e.g.,fig.3,arerelativelymorepessimisticforsmallblocksizesthanthosereportedbyStreckerin\Bell78\.Wehavemadetheassumptionthataccesstimeisrelatedlinearlytoblocksize.Inmanycasesthisisnottrue.ItisessentiallytruefortheMultibus,sinceonlytwobytescanbefetchedatatime,andarbitrationisoverlappedwithbusoperations.Forasinglechipimplementation,itwouldalmostcertainlybeworthwhiletoprovidethecapabilityforefficientmultipletransfersoverasetofwiresintotheprocessor.Thishasnotbeenincorporatedinouranalysis,butwillundoubtedlysuggestasomewhatlargertransferblocksize.4.4.1.LoweringtheOverheadof\nallBlocksSmallblocksarecostlyinthattheygreatlyincreasetheoverheadofthecacheanaddresstagandthetwostatebitsarenormallystoredinthecacheforeachblocktransferred.Wereducedthisoverheadbysplittingthenotionofblockintotwoparts1Thetransferblockistheamountofdatatransferredfrombackingstoreintothecacheonareadmiss.2Theaddressblockisthequantumofstoragefor.whichatagismaintainedinthecache.Itisalwaysapoweroftwolargerthanatransferblock.Aneffectivecachecanbeimplementedbykeepingthetransferblocksmallbutmakingtheaddressblocklarger.Formostcommercialproductscontainingacache,theaddressblocksizeisthesameasthetransferblocksize,thoughweknowofoneexample\IBM74\wheretheaddressblockcontainedtwotransferblocks.TheIBMSystem/360Model85\Liptay68\infactisaspecialcaseofthis,v/z..,adirectmappedcache,wheretheModel85sector,consistingof1Kbytes,correspondstoouraddressblock.Eachsectorcontains16transferblocks,whichwerecalledsimplyblocks.4.4.2.TheEffectofLargeAddressBlocksTheuseofaddressblockslargerthantransferblocksmeansthatonlydatafromoneaddressblockin128backingstorecanoccupyanyofthetlocksmakingupanaddressblockinthecache.Therearecaseswheretheappropriatetransferblockisempty,butothertransferblocksinthesameaddressblockmustbepurgedsothatthenewaddressblockcanbeallocated.Weexaminedthisforvarioussizesofaddressblocksandfoundthatthemissratioincreasedveryslowlyuptoapoint.Forthesituationshowninfig.7,themissratiohadonlyrisenbyabout30whentheaddressblockcontained64bytesforthePDP11traces.ThatpointwasreachedfortheVAXtraceswhenitcontained32byteaddressblocks.Wepredictedthatthebustrafficwouldcorrelatewellwithmissratiowithrespecttothisparameter.Tooursurprise,thebustrafficactuallydeclinedinitiallyasweincreasedtheaddressblocksize.Thedeclinewassmall,butconsistentforthePDP11tracetapes,eventuallyclimbingoverthebaselinewhentheaddressblockwas16or32transferblocks.Thephenomenonwassmaller,butdiscerniblefortheVAXtracesaswell,thoughinallcasesthebustrafficstartedincreasingsooner.Thissituationisshowninfig.8.Weinitiallysuspectedthatoursimulationmightbefaulty.Thatwasnotthecase,andeventuallywewereabletoexplainitandverifyit.Thewriteoncealgorithmrequiresabusoperationwheneverablockismodifiedinitiallysettoreserved.However,reservationscouldbemadeonthebasisofeithertransferblocksoraddressblocks.Wehadputthechoiceintothesimulator,buthadnotexperimentedwithit,reservingattheaddressblocklevel.Thisinfactreducesthenumberofbuswritesnecessaryforreservationbecauseofspatiallocalityofwritesanaddressblockalreadyfeservedneedonlybemarkeddirtywhenanytransferblockwithinitismodified.Thiswouldseemtoincreasegreatlythetrafficwhenevertheblockispurgedfromthecache,butinfacttheeffectissmallonlythosetransferblockswhichhaveactuallybeenmodifiedneedbewrittenback.Wedemonstratedthatthiswasindeedresponsibleforthebehaviornotedbychangingthereservationstothetransferblocklevel.Thesimulationresultsthenexhibitedtheoriginallypredictedbehavior,correlatingcloselywiththemissratio.Weconcludethatminimumbustrafficisgeneratedwithminimumtransferblocksizes.Themissratiomaybesubstantiallyimprovedbyusingslightlylargertransferblocks,inwhichcasebustrafficdoesnotincreasegreatly.Usinglargeraddressblocksreducesthecostofthetagmemoryconsiderably.Itinitiallyhasonlyaminoreffectonmissratio,whichismorethanoffsetbythesavingsinwritesduetothemoreefficientreservationofmodifiedblocks4.5.OtherDesignAspectsStudied4.5.1.WriteAllocationWriteallocation,alsoknownasfetchonmte,meansthatablockisallocatedinthecacheonawritemissaswellasonareadmiss.Whileitseemsnaturalforwriteback,ittypicallyisnotusedwithwritethrough.Itisessentialforwriteoncetoassurecoherency.Ourearlysimulationsshowedthatitwashighlydesirableforwritebackandwriteonce,andsuperiorevenforwritethroughwithsmallblocks.Thiswastrueusingboththemeasuresofmissratioandbustraffic.Inallresultspresented,writeallocationwasemployed.4.5..AssociaUvityWerananumberofsimulationsvaryingtheassociativityallthewayfromdirectmappedtofullyassociative.Whilethisisclearlyanimportantparameter,wehavelittlenewtoreport,i.e.,fullyassociativecacheisthebest,but2waysetassociativeisnotmuchworse,and4waysetassociativeissomewhereinbetween.Butsee\Smith83\.Wehadhopedtofindthatahighdegreeofassoeiativitywouldimproveperformance,becausesuchanorganizationismuchmorefeasibleintheVLSIdomain,butresultswerenegative.Forresultsreportedherewehaveassumeda4way,setassociativecache.4.5.3.ReplacementAlgorithmReplacementstrategyhasbeenthesubjectofanotherstudyusingthesamesimulatorandtraces\Smith83\.Inordertolimititssignificance,whichseemstobeorthogonaltotheissuesraisedhere,wehaveassumedtrueLRUreplacementamongtheelementsofeachsetinallcases.4.5.4.BusWidthThewidthofthedatapathsbetweenunitsisanimportantparameterinthatitiscloselyrelatedtobandwidth.WehavethecapabilitytospecifythebuswidthbothfrombackingstoretocacheandfromcachetoCPU.Forthepurposesofthisstudy,wehaveassumedinallcasesthattheVAXmemorysupplies4bytestothecacheinonebuscycle,whilethePDP11memorysupplies2bytes.An8bytetransferthereforeiscountedastwocyclesfortheVAXand4forthePDP11.Weassumedthatthecachesuppliedoneword16bitsforthePDP11and32bitsfortheVAXtotheCPUoneachrequest.However,theintelligenceoftheprocessordetermineshowoftenthesamewordmustbefetched.Thetracetapescontainallmemoryreferences.Wefilteredthesewiththeassumptionthatoninstructionfetchesthesamewordwouldnotbefetchedwithoutaninterveninginstructionfetch.Nofilteringwasdoneondatafetches,5.SummaryOursimulationsindicatethatasingleboardcomputerwitha4Kbytecachecanperformreasonablywellwithlessthan10oftheaccessesrequiredtoitsprimarymemorywithoutacache.ThePDP11tracessuggestanumberaslowas3.WhiletheVAXnumbersarehigher,additionaldeclineswillbeexperiencedbyincreasingthesizeofthecachebeyond4Kbytes.Animportantresultistheuseofthewriteoncealgorithmtoguaranteeconsistentdataamongmultipleprocessors.Wehaveshownthatthisalgorithmcanbeimplementedinawaythatdegradesperformanceonlytriviallyignoringactualcollisions,whicharerare,andperformsbetterthaneitherpurewritebackorwritethroughinmanyinstances.Theuseofsmalltransferblocksizescanbecoupledwithlargeaddressblockstobuildaninexpensivecachewhichperformseffectivelyintheabsenceoffrequentprocessswitches.Thelowbusutilizationandthesolutiontothestaledataproblemmakepossibleanenvironmentforwhichthisconditionismet.Eventhoughthemissratioincreases,bustrafficinitiallydeclinesastheaddressblockisenlarged,holdingthetransferblockconstant.Thereforelargeraddressblocksshouldbeusedforreservingmemoryformodificationevenifsmallblocksareusedfortransferofdata.Theapproachadvocatedhereisappropriateonlyforasystemcontainingasinglelogicalmemory.Thisis.significantbecauseitdependsontheserializationofmemoryaccessestoassureconsistency.Ithasapplicationsbeyondthosestudiedhere,however.Forexample,theaccesspathtomemorycouldbeviaaringnetwork,oranyothertechniqueinwhicheveryrequestpasseseveryprocessor.Thisextensionseemsparticularlyapplicableattheformaintainingconsistencyforafilesystemoracommonvirtualmemorybeingsuppliedto129multipleprocessorsthroughacommonbussuchasEthernet.Clearlytherearemanyenvironmentsforwhichthismodelisinappropriateresponsetoindividualtasksmaybeunpredictable,forexample.However,webelievethatsuchaconfigurationhasmanypotentialapplicationsandcanheexploitedeconomicallyiftheappropriateVLSIcomponentsaredesigned.WehaveinvestigatedthedesignofsuchcomponentsandbelievethattheyarebothfeasibleandwellsuitedforVLSI\Ravishankar83\.OuranalysisindicatesthatthecacheapproachisreasonableforasystemwherebandwidthbetweentheCPUandmostofitsmemoryisseverelylimited.Wehavedemonstratedthroughsimulationofrealprogramsthatacachememorycanbeusedtosignificantlyreducetheamountofcommunicationaprocessorrequires.Whitewewereinterestedinthisforasinglechipmicrocomputerofthefuture,wehavealsodemonstratedthatsuchanapproachisfeasibleforoneormorecurrentlypopularcommercialmarkets.6.AcknowledgementsThismaterialisbaseduponworksupportedbytheNationalScienceFoundationunderGrantMCS8202952.WethankDr.A.J.SmithforprovidingthePDP11tracetapesuponwhichmuchofourearlyworkdepended.WealsowishtothankT.H.YangfordevelopingtheVAXtracefacility.P.VitaleandT.Doylecontributedmuchthroughdiscussionsandbycommentingonanearlydraftofthemanuscript.7.References\Amdahl82\C.Amdahl,privatecommuvticttiovt,March82.\Bell74\J.Bell,D.Casasent,andC.G.Bell,Aninvestigationofalternativecacheorganizations,IEEETrans.onComputers,Vol.C23,No.4,April1974,pp.348351.\Bell78\C.BeU,J.Judge,J.McNamara,ComputerengineeriyaDECviezuofttardaresystemdesign,DigitalPress,Bedford,Mass.,1978.\Censier78\L.M.CensierandP.Feautrier,Anewsolutiontocoherenceproblemsinmulticachesystems,IEEETrans.onComputers,Vol.C27,No.12,December1978,pp.11121118.\Easton78\M.C.EastonandR.Fagin,Coldstartvs.warmstartmissratios,CACM,Vol.21,No.10,October1978,pp.888872.\Groseh53\H.A.Grosch,HighSpeedArithmetictheDigitalComputerasaResearchTool,JournsdoftheOpticalSocietyofAmerica,Vol.43,No.4,April1953.\Hoogendoorn77\C.H.Hoogendoorn,Reductionofmemoryinterferenceinmultiprocessorsystems,Proc.4thAnnualSyrup.Comput.Arch.,1977,pp.179183.\IBM74\System/370model155theoryofoperation/diagramsmanualvolume5buffercontrolunit,IBMSystemProductsDivision,Poughkeepsie,N.Y.,1974.\IBM76\System/370model168theoryofoperation/diagramsmanualvolume1,DocumentNo.SY2269313,IBMSystemProductsDivision,Poughkeepsie,N.Y.,1978.\IEEE80\ProposedmicrocomputersysteLflusstandardP796bus,IEEEComputerSocietySubcommitteeMicrocomputerSystemGroup,0eLober1980.\Kaplan73\K.R.KaplanandR.0.Winder,Cachebasedcomputersystems,Computer,March1973,pp.3036.\Knight86\J.R.Knight,Changesinc.mputerperformance,Datamation,VoL12,No.9,September1966,pp.4054.\Lindsay81\CacheMemoryforMicroprocessors,ComputerArchitectureNevgs,ACMSIGARCH,Vol.9,No.5,August1981,pp.613.\Liptay68\J.S.Liptay,StructuralaspectsoftheSystem/360Model85,PartlIthecache,IBMSyst.J.,Vol.7,No.1,1968,pp.1521.\Norton82\R.L.NortonandJ.L.Abraham,Usingwritebackcachetoimproveperformanceofmultiusermultiprocessors,1982Int.Conf.onPar.Prec.,IEEEcat.no.82CH17947,1982,pp.326331.\Patel82\Analysisofmultiprocessorwithprivatecachememories,J.H.Patel,IEEETrans.onComputers,Vol.C31,No.4,April1982,pp.296304.\Rao78\G.S.Rao,PerformanceAnalysisofCacheMemories,JournaloftheACM,Vol.25,July1978,pp.378395.\Ravishankar83\C.V.RashankarandJ.Goodman,Cacheimplementationformultiplemicroprocessors,DigestofPapers,SpringCOMPCON83,IEEEComputerSocietyPress,March1983.\Siewiorek82\D.P.Siewiorek,C.G.Bell,andA.Newell,ComputerStructurestWgnciplesandExamples,McGrawHill,NewYork,N.Y.,1982.\Smith82\A.J.Smith,Cachememories,ComputingSurveys,Vol.14,No.3,September1982,pp.473530.\Smith83\J.E.SmithandJ.R.Goodman,Astudyofinstructioncacheorganizationsandreplacementpolicies,TenthAnnualSymposiumonComputerArchitecture,June1983.\Solomon68\M.B.Solomon,Jr.,EconomiesofScaleandtheIBMSystem/360,CACM,Vol.9,No.6,June1968,pp.435440.\Tang78\C.K.Tang,Cachesystemdesigninthetightlycoupledmulttproeessorsystem,AFIPSProc.,NCC,Vol.45,pp.749753,1976.\TI82\TexasInstrumentsMOSMemoryDtaBook,TexasInstruments,Inc.,MemoryDivision,Houston,Texas,pp.106111,1982.\Tredennick82\N.Tredennick,TheIBMmicro/370project,publiclectureforDistinguishedLecturerSeries,ComputerSciencesDepartment,UniversityofWisconsinMadison,March31,1982.\Widdoes79\L.C.Widdoes,S1MultiprocessorarchitectureMULT2,1979AnnualReporttheS1Project,Volume1Architecture,LawrenceLivermoreLaboratories,Tech.ReportUCID18819,1979.\Yen82\W.C.YenandK.S.Fu,Coherenceprobleminamuiticachesystem,1982Int.ConflonPar.Proc.,IEEEcat.no.82CH17947,1982,pp.332339.1301oilOS.10101........i................l......1010m1CP10Cae.he1sebytesg.I.BusTra.mderandMissRatiosvs.CacheSizeblocksare4bytesPDP11traces.Thebustransarratioisthenumberoftransfersbetweencacheandmainstorerelativetothosenecessary.iftherewerenocache.10il01t/Q10e.........1...................IO10s110s110CsQheglzeb71esFig.2.BusTransferandMissRatiosvs.CacheSize4bytebloclcsVAX11traces.i10.10l........i1010BlookSJLiIsbytesFit.3.MissRatiovs.BlockSizeforwarmandcoldstartsPDPI1traces.1010.....llol.....BlockebytesFig,4.MissRatiovs.BlockSizeforwarmandcoldstartsVACI1traces.2JJACeld0Wluets10a,1010jmoo,kellobyternFig.5.BusTransferRatiovs.BlockSizeforwarmandcoldstartsPDP11traces.10Ei110a,iownStartI0°100.......110BISLsobytelFig.6.BusTranserRatiovs.BlockSizeforwarmandcoldstartsVAXIItraces.10ilOJAaeoWnaLJ10................1010110AddLreloBloctkSlsebFt,esFig.7.Missratiovs,AddressBlockSizeforwarmandcoldstarts,1C10xWOa.VA.X11WOT\PDP11lo1010AdeLTemmBIQokSJLebyternFig.a.BusTransferRatiovs.AddressBlockSizeforwarmandcoldstartsWOAaddressblocksarereserved.WOTira_riskerblockarereserved.131

注意事项

本文(32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf)为本站会员(baixue100)主动上传,人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。 若此文所含内容侵犯了您的版权或隐私,请立即通知人人文库网([email protected]),我们立即给予删除!

温馨提示:如果因为网速或其他原因下载失败请重新下载,重复下载不扣分。

copyright@ 2015-2017 人人文库网网站版权所有
苏ICP备12009002号-5