Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy

Wen-Hann Wang, Jean-Loup Baer and Henry M. Levy
Department of Computer Science, FR-35
University of Washington
Seattle, WA 98195

Abstract

We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually-addressed cache is backed up by a large physically-addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. We show how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually-addressed first-level cache from irrelevant cache coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically-addressed caches in a multiprocessor environment.

Keywords: Caches, Virtual Memory, Multiprocessors, Memory Hierarchy, Cache Coherence.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

1 Introduction

Virtually-addressed caches are becoming commonplace in high-performance multiprocessors due to the need for rapid cache access [11, 3, 17]. A virtually-addressed cache can be accessed more quickly than a physically-addressed cache because it does not require a preceding virtual-to-physical address translation. However, virtually-addressed caches have several problems as well. For example:

1. They must be capable of handling synonyms, that is, multiple virtual addresses that map to the same physical address.
2. While address translation is not required before a virtual cache lookup, address translation is still needed following a miss.
3. In a multiprocessor system, the use of a virtually-addressed cache may complicate cache coherence because bus addresses are physical, therefore a reverse translation may be required.
4. I/O devices use physical addresses as well, also requiring reverse translation.
5. A virtual cache may need to be invalidated on a context switch because virtual addresses are unique to a single process.

None of these problems is insolvable by itself, and several schemes have been proposed for managing virtual caches. For example, dual tag sets, one virtual and one physical, can be used for each cache entry [7, 6]. As another example, the SPUR system restricts the use of address space, prohibits caching of I/O buffers, and requires bus transmission of both virtual and physical addresses [11]. However, these schemes tend to have performance shortcomings or unpleasant implications for system software.

Virtually-addressed caches are fundamentally complicated, and this time or space complexity reduces the ability of the cache to match the ever-increasing needs of modern processors. To attack this problem, we propose a two-level cache organization involving a virtually-addressed first-level cache and a physically-addressed second-level cache (recent studies of two-level uniprocessor and multiprocessor caches can be found in [4, 5, 12, 13]). The small first-level cache can be fast to meet the requirements of high-speed processors; it is virtually addressed to avoid the need for address translation. The large second-level cache will reduce miss ratios and memory traffic; it is physically addressed to simplify the I/O and multiprocessor coherence problems. Furthermore, we show how the second-level cache can be utilized to solve the synonym problem and to shield the first-level cache from irrelevant cache coherence traffic. Overall, we believe that this two-level virtual-real organization simplifies the design of the first level, where performance is crucial, while solving some of the difficult problems at the second level, where time and space are more easily available.

Our organization involves the use of pointers in the two caches to keep track of the mappings between virtual cache and physical cache entries [7]. We also provide a translation buffer at the second level which operates in parallel with first-level cache lookups in case a miss requires reverse translation. Trace-driven
simulations are used to demonstrate the advantages of a two-level V-R (virtual-real) cache over a hierarchy of real-addressed caches in a multiprocessor environment.

The rest of this paper is organized as follows. Section 2 describes the approaches taken in solving various problems related to virtual address caches and presents some design choices for high performance multiprocessor caches. Section 3 gives the specific organization of a V-R two-level cache hierarchy and its detailed operational description. Section 4 presents performance results from simulations, and conclusions are drawn in Section 5.

© 1989 ACM 0884-7495/89/0000/0140 $01.50

2 Design issues of two-level V-R caches for high performance multiprocessors

This section addresses some important issues in the design of two-level V-R caches and motivates our design choices. A more detailed operational description of our approach is given in the following section.

The proposed architecture for this evaluation is a shared-bus multiprocessor where each processor has a private, two-level, V-cache/R-cache hierarchy as shown in Figure 1.

Figure 1: Shared-bus organization

Write policies

For a two-level cache, the write policy can be selected independently at each level. In the literature, write-through has been proposed as the most reasonable write policy for the first-level cache in a two-level hierarchy, while write-back is advocated for the second level [10, 8, 13]. A major motivation for the choice of write-through at the first level is that cache coherence control is simplified. In this case, the first- and second-level caches will always contain identical values.

There are several problems, however, with using a first-level write-through cache. First, assuming no write-allocate, write-through caches will have smaller hit ratios than write-back caches. Second, a write takes longer under write-through because the second-level cache must be updated as well; primary memory may also need to be updated depending on the write policy for the second level.

This extra write latency under write-through can largely be hidden by the use of write buffers between the first and second levels, but several write buffers may be needed. Table 1, for example, shows that in the execution of the VAX program pops (cf. Section 4), 30% of writes are due to procedure calls, each of which typically generates six or more successive writes. Table 2 shows the inter-write interval distribution for a snapshot (411,237 references) of the same trace using a 16K direct-mapped cache with a 16-byte block size. As can be seen, the high percentage of short inter-write intervals confirms the need for several buffers.

Unfortunately, while write buffers can reduce the write latency of the first-level cache, they re-introduce a complexity that write-through was intended to avoid, namely cache coherence. Write buffers can hold modified data for which other processors might encounter a miss. Thus, cache coherency control must be provided for the write buffers on every cache coherence transaction. These difficulties lead us to favor the write-back policy for our virtually-addressed cache at the first level.

Table 1: Number of writes due to procedure call (columns: number of writes per call, count, total writes)

Table 2: Inter-write intervals (snapshot of 411,237 references)

The synonym problem

As previously noted, a two-level V-R organization can be used to solve the synonym problem. The solution requires the use of a reverse translation table [15] for detecting synonyms, and a natural place to put that table is at the second level. Our two-level organization permits and detects synonyms, but guarantees that at most one copy of a data element exists in the V-cache at any time.

Each second-level cache block will have a pointer to its first-level child block, if one exists. If we guarantee an inclusion property, where the R-cache contains a superset of the tags in the V-cache, the reverse translation information can be stored in log(V-cache size / page size) superset bits in each R-cache block. For each entry in the R-cache with a child in the V-cache, these extra bits, together with the page offset, provide the V-cache location of its child.

When a miss occurs in the V-cache, the virtual address is translated (using a second-level translation buffer) and the R-cache is accessed. If an R-cache hit occurs, the R-cache checks whether the data is also in the V-cache under another virtual address (a synonym). If so, it simply invalidates that V-cache copy and moves the data to the new virtual address in the V-cache. Thus, while a data element can have synonyms, it is always stored in the V-cache using the last virtual address with which it was accessed.

Note that our approach in dealing with the synonym problem has some similarities to Goodman's approach [7]. One can view our approach as moving Goodman's real directory from being just for snooping to being associated with the level-two cache. This move provides two benefits. First, it hides the cost of Goodman's extra real directory by making it the level-two cache directory. Second, it reduces the misses caused by real-address collisions by making the real directory much bigger.

Context switching

In a multiprogramming environment, addresses are unique to each process and therefore the V-cache must be flushed whenever a context switch occurs. This might be costly for a large virtually-addressed cache. For small caches we believe the penalty on hit ratios will be negligible, and this is confirmed by our simulation results (cf. Section 4). However, if a write-back policy is used for the V-cache, a substantial number of write-backs may occur at each context switch, which greatly increases context-switch latency.

Another solution to avoid the address mapping conflict is to attach a process identifier to each tag entry of the V-cache. This approach does not improve the hit ratio for a small V-cache [1], but can avoid the large number of write-backs at context switch time. Unfortunately, this approach increases the complexity of a two-level hierarchy because the V-cache needs to be purged or selectively flushed when a TLB entry of an inactive process is replaced by an entry of the active process, or a process-id is reassigned.

We wish to have the benefits of reduced context-switch latency without needing to flush the V-cache when a TLB entry changes. Our approach meets these goals by invalidating all V-cache blocks on a context switch but not writing them back at that time. Instead, each block is written back only when it is replaced, that is, when a new block is read into that cache slot. The writes are thus distributed in time, where the latency can be hidden using write-back buffers.

To implement this scheme, we add two new fields to each V-cache block. First, we add a swapped-valid bit, which is set for each V-cache block on a context switch. Upon a replacement, if the V-cache finds a block with swapped-valid set, it checks whether that block is also marked both dirty and valid; if so, that block must be written back. Second, we add an r-pointer, which is the low-order bits of the page number, to each V-cache block. The r-pointer, together with the page offset, is sufficient to link a V-cache entry to its corresponding location in the R-cache. This linkage makes a write-back or a state check efficient, since there is no need for an address translation. This approach uses space comparable to that of the process identifier scheme, but without its disadvantages.

Table 3 shows the effect of the swapped-valid bit; here we see the inter-write interval from the same benchmark as Table 2 when the swapped-valid bit is used. Because swapped write-backs are typically far apart from other (swapped) write-backs, a single write-back buffer is sufficient to overlap swapped write-backs with processor execution. Our simulations show that with a single buffer the amount of stalling on a swapped write-back is indeed negligible. On the other hand, if the incremental write-back is not used we need to write back over a hundred blocks at context switching time for this specific benchmark. Notice that the number of write-backs needed due to context switching is a function of cache size, cache organization, the duration of the running state of a process, and the workload.

Table 3: Write interval with write-back and swapped write-back (snapshot of 411,237 references)

Cache coherence

While two-level caches are attractive, cache coherence control is complicated by a two-level scheme. Without special attention to the coherence problem, the first-level cache will be disturbed by every coherency request on the bus. A solution to this problem is to use the second-level cache as a filter to shield the first-level cache from irrelevant interference. In order to achieve this, we need to impose an inclusion property where the tags of the second-level cache are a superset of the tags of its child cache. We say that a multilevel cache hierarchy has the inclusion property if this superset relation holds. Imposing inclusion is also essential for solving the synonym problem as stated above.

In a multiprocessor environment, the inclusion property cannot be held even with a global LRU replacement [4]. In [5] the following replacement algorithm was proposed as one of the conditions to impose inclusion.

• First level: Any replacement algorithm will do (e.g., LRU). Notify
the second-level cache of the block being replaced.
• Second level: Replace a block which does not exist in the first level (this is done by checking an inclusion bit; there is one inclusion bit per block to indicate whether the block is present in the first level).

The general problem with inclusion is its implications for a large set size in the second level (i.e., high associativity). By following the same approach as in [5], and letting $S_i$ be the number of sets, $B_i$ be the block size, and $size(i)$ be the cache size of a level $i$ cache, we can show that in order to impose inclusion under the above replacement algorithm, the set-associativity of the second-level cache $A_2$ must be:

$$A_2 \ge \frac{size(1)}{\text{page size}} \times \frac{B_2}{B_1}$$

under the usual practical situations where $S_2 \ge S_1$, $B_2 \ge B_1$, $size(2) \ge size(1)$ and $B_1 S_1 \ge$ page size.

In practical cases, this constraint can be too strict to be feasible. For example, if the V-cache is 16K bytes, the page size is 4K bytes, and $B_2$ is 4 times as large as $B_1$, even with a direct-mapped V-cache we need a 16-way R-cache to achieve inclusion.

To relax the strict constraint on the set-associativity of the R-cache, we change the replacement rule of the R-cache to operate as follows: replace a block with the inclusion bit clear if there is one; otherwise replace a block according to some predefined replacement algorithm and invalidate the corresponding V-cache block. Note that the latter will not happen very often since the R-cache is much larger than the V-cache. For example, the analysis of the multiprocessor trace pops (over 3 million memory references) shows that only 21 inclusion invalidations are needed if the V-cache is 16K bytes, 2-way set-associative with a 16-byte block size, and the R-cache is 256K bytes with the same set size and block size.

ifBISIBr)intheV-cache.

4 Performance

In this section, we compare the relative performance of virtual-real (V-R) and real-real (R-R) two-level caches. We also examine the merits of splitting the first-level virtually-addressed cache into I and D caches. Finally, we measure the effect of the R-cache in shielding the V-cache from irrelevant cache coherence interference.

To gather the performance figures, we use trace-driven simulations and three parallel program traces: pops, thor and abaqus [2, 14]. In pops and thor, context switches occur rarely, while they are frequent in abaqus. Table 5 gives a summary of some characteristics of these traces.

Relative performance of V-R and R-R two-level caches

To compare the performance of V-R and R-R two-level caches, we gather the hit ratios at different levels; the hit ratios are then used in generic memory access time equations to predict relative performances. We assume that the inclusion property defined previously also holds for the R-R two-level cache. For simplicity, we consider only direct-mapped caches at both levels.

The generic access time equation of a two-level cache hierarchy is as follows:

$$T_{acc} = \text{Prob(hit at level 1)} \times \text{access time at level 1} + \text{Prob(hit at level 2 and miss at level 1)} \times \text{access time at level 2} + \text{Prob(miss at levels 1 and 2)} \times \text{memory access time}$$

that is:

$$T_{acc} = h_1 t_1 + (1 - h_1) h_2 t_2 + (1 - h_1 - (1 - h_1) h_2)\, t_m$$

where $h_1$, $h_2$ are hit ratios at levels 1 and 2, $t_1$ and $t_2$ are access times at the two levels, and $t_m$ is the memory access time including the bus overhead.

Because the second-level caches are the same for both V-R and R-R organizations, and because inclusion holds, the number of misses and the traffic from the second-level cache are the same in both organizations. Therefore the third term in the above equation is the same for both V-R and R-R organizations. Assuming that handling a synonym has a cost equivalent to handling a miss in the first-level cache that hits in the second-level cache, the relative performance where there is a hit in the hierarchy can be estimated solely from the first two terms of the above equation.

Table 6 shows the hit ratios at both levels of V-R and R-R organizations for the three traces under three different pairs of first- and second-level cache sizes. Figures 4, 5 and 6 depict the relative performance of the two organizations under different degrees of assumed R-cache degradation due to address translation overhead. These figures plot the relative performance of the two hierarchies with $t_2 = 4 t_1$ vs. the percentage of slowdown due to address translation for various first-level/second-level cache sizes. The points on the y-axis correspond to no slowdown at all. From these figures we can draw the following conclusions.

Let us first assume that there is no time penalty involved in performing a virtual-real address translation in conjunction with the access to the first-level cache. When context switches occur rarely, as is the case for the first two traces (Figures 4 and 5), the performances of the V-R and R-R hierarchies are almost
indistinguishable (the points on the y-axis are the same). When context switches are frequent, as in the third trace (Figure 6), the V-R hierarchy is slower by 2 to 6% depending on the size of the V-cache (a larger V-cache seems to imply a larger relative degradation).

Now, let us assume a time penalty for the translation. There are two possible reasons for this penalty. The first is that TLB access and cache access cannot be completely overlapped as soon as the cache size is larger than the page size multiplied by the set associativity. Second, even if there were total overlap, there would still be an extra comparison necessary to check the validity of a cache hit.

From the observations of the previous paragraph, it is clear that the V-R hierarchy will perform better in the case of rare context switches. The relative improvement is approximately equal to the overhead of address translation. What is interesting is to see the cross-over point for the case of frequent context switches. From Figure 6, we see that the V-R hierarchy will have a better performance when the address translation slows down the first-level R-cache access by 6% or more. Since 6% is a conservative figure for the penalty due to the insertion of a TLB at the first level, it appears that the V-R hierarchy is a better solution. Its performance is as good as that of an R-R hierarchy and its cost is less since the TLB does not have to be implemented in fast logic. Another advantage is that problems such as TLB coherence can also be handled at the second level.

trace    num. of cpus  total refs  instr count  data read  data write  context switch count
thor     4             3283k       1517k        1390k      376k        21
pops     4             3286k       1718k        1285k      283k        7
abaqus   2             1196k       514k         600k       82k         292

Table 5: Characteristics of traces

Table 6: Hit ratios

Table 7: Hit ratios for small first-level caches

Figure 4: Average access time vs. slow-down of R-cache (thor)

Figure 5: Average access time vs. slow-down of R-cache (pops)

(x-axis of these figures: first-level R-cache slowdown percentage)

The results presented above assumed 4K to 16K first-level caches, which may be impractical for some advanced technologies, such as GaAs. However, we believe that the V-R organization is even more attractive for hierarchies with smaller first-level caches. Our results in Table 7 show that for smaller first-level caches (e.g., 0.5K to 2K), the first-level hit ratios of V-R and R-R organizations are nearly identical. Therefore, performance of a V-R hierarchy will be superior given any penalty for a TLB lookup. In addition, for technologies in which space is at a premium, we can trade the first-level TLB of an R-R hierarchy for a larger first-level cache in a V-R hierarchy. This in turn provides larger hit ratios and hence smaller average access time.

Splitting the first-level virtually-addressed cache

There are a number of reasons why it is advantageous to split the first-level cache into separate I and D caches. First, the bandwidth can almost be doubled for pipelined processors where an instruction fetch can occur at the same time as a data fetch of a previous instruction (e.g., the IBM 801 and Motorola 88000). Second, each I and D cache is smaller and has the potential to be optimized for its speed. Third, and this pertains mostly to V-caches, the I cache is simpler than the D cache since it does not need to handle the synonym and the cache coherence problems, provided that self-modifying programs are not permitted. A disadvantage, however, is that we need more wiring or pins for the processor and cache module.

It is important to assess, however, whether splitting the cache into I and D components will improve performance. Our results in Tables 8, 9 and 10 show that the hit ratios of split I and D caches are very close to those of a unified I&D cache and are not necessarily worse. In these tables, the separate I and D caches are of equal sizes (i.e., in the 4K example the I-cache and the D-cache are each 2K). Similar results have been found in [9, 13]. Thus, we would advocate such a split for a V-R hierarchy.

thor                   4K/64K  8K/128K  16K/256K
data read    split     0.924   0.937    0.945
             unified   0.913   0.938    0.950
data write   split     0.952   0.962    0.969
             unified   0.946   0.966    0.972
instruction  split     0.957   0.963    0.989
             unified   0.930   0.973    0.984
overall      split     0.942   0.952    0.968
             unified   0.925   0.957    0.968

Table 8: Hit ratios of level 1 caches for the thor trace

pops                   4K/64K  8K/128K  16K/256K
data read    split     0.902   0.912    0.923
             unified   0.900   0.915    0.926
data write   split     0.936   0.946    0.955
             unified   0.937   0.948    0.958
instruction  split     0.947   0.966    0.978
             unified   0.948   0.963    0.974
overall      split     0.928   0.944    0.955
             unified   0.928   0.943    0.954

Table 9: Hit ratios of level 1 caches for the pops trace

abaqus                 4K/64K  8K/128K  16K/256K
data read    split     0.795   0.818    0.837
             unified   0.806   0.829    0.845
data write   split     0.841   0.861    0.875
             unified   0.847   0.857    0.895
instruction  split     0.920   0.947    0.949
             unified   0.907   0.926    0.938
overall      split     0.852   0.876    0.888
             unified   0.852   0.873    0.888

Table 10: Hit ratios of level 1 caches for the abaqus trace

Shielding cache coherence interference

An important advantage of the two-level approach is that the R-cache can shield the V-cache from irrelevant cache coherence interference. For example, on a read-miss bus request, the R-cache needs to send a flush request to its V-cache only when the V-cache contains a modified copy of the data; otherwise the V-cache will not be disrupted.

Note that this shielding effect is achieved because the inclusion property holds in our V-R two-level cache. Imposing inclusion might not seem to be essential for an R-R two-level hierarchy because the synonym problem is not present. However, the results in Tables 11, 12 and 13, which give the number of coherence messages being percolated to each first-level cache, show that a V-R two-level cache has much less coherence interference at the first level than an R-R two-level cache without inclusion. The results also show that inclusion is important in an R-R two-level cache since it results in approximately the same savings in coherence messages to the first-level cache (4).

(4) We notice that RR with inclusion has over 10% fewer coherence messages than VR for the abaqus trace. This discrepancy is due to a large amount of inclusion invalidations incurred in this specific trace due to a large number of context switchings.

We believe that the shielding effect on cache coherence will be more prominent as the number of processors increases. This is due to the fact that more bus coherence requests will be generated from a larger number of processors, and without the shielding, a first-level cache will be disrupted more often. Our results in Tables 11, 12 (4 cpus) and 13 (2 cpus) reflect this effect. For example, on the average, the first-level cache of a V-R hierarchy encounters about half the coherence messages of the R-R hierarchy without inclusion for the two-processor trace (cf. Table 13), whereas for four-processor traces the first-level cache of the V-R hierarchy encounters from three to six times fewer coherence messages. We plan to further confirm this observation when we are in possession of larger-scale traces.

Table 11: Number of coherence messages to the first-level cache

Table 12: Number of coherence messages to the first-level cache

abaqus  4K/64K                          8K/128K                         16K/256K
CPU     VR     RR(incl)  RR(no incl)    VR     RR(incl)  RR(no incl)    VR     RR(incl)  RR(no incl)
0       10961  8436      18855          11677  9379      21295          11067  9853      22603
1       10527  8029      20726          10547  9528      24202          10599  10028     26845

Table 13: Number of coherence messages to the first-level cache

5 Conclusions

One of the most challenging issues in computer design is the support of high memory bandwidth. In this paper, we have proposed a two-level cache hierarchy to address this issue. We have argued that the first-level cache is best accessed directly by virtual addresses. We back up the small virtually-addressed cache by a large second-level cache. A virtually-addressed first-level cache does not require address translation and can be optimized to match the processor speed. Through the use of a swapped-valid bit, we avoid the clustering of write-backs at context switching time. The distribution of these write-backs is more evenly spread over time.

The large second-level cache provides a high hit ratio and reduces a large amount of memory traffic. We have shown how the second-level cache can be easily extended to solve the synonym problem resulting from the use of a virtually-addressed cache at the first level. Furthermore, the second-level cache can be used effectively to shield the virtually-addressed first-level cache from irrelevant cache coherence interference.

Our simulation results show that when context switches are rare, the virtually-addressed cache option has comparable performance to its physically-addressed counterpart, even assuming no address translation overhead. When context switches occur frequently, the virtually-addressed cache option has a performance edge when a small address translation penalty is taken into account, and the smaller the virtually-addressed cache the larger the relative performance edge. We also advocate splitting the virtually-addressed cache into separate instruction and data caches. This approach has the potential of doubling the memory bandwidth since our results show that the hit ratios of split instruction and data caches are very close to those of a single I&D cache.

As a final remark, we note that cache performance is workload dependent. In this study we have confined ourselves to a limited VAX multiprocessor workload. We plan to enlarge our workload sample as soon as we are in possession of other multiprocessor traces.

Acknowledgment

This work was supported in part by National Science Foun
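
The generic access-time equation of Section 4 lends itself to a small numerical sketch. The Python below is an illustration only, not the authors' simulator: the function names are hypothetical, and the hit ratios in the example are made-up placeholders rather than values from Table 6. It evaluates the full equation, the first two ("hit path") terms used for the V-R versus R-R comparison, and the first-level slowdown at which the two organizations break even when the global miss ratio is the same in both (as the inclusion argument implies).

```python
def access_time(h1, h2, t1, t2, tm):
    """Full equation: T_acc = h1*t1 + (1-h1)*h2*t2 + (1 - h1 - (1-h1)*h2)*tm."""
    hit_l2 = (1.0 - h1) * h2          # miss at level 1, hit at level 2
    miss_both = 1.0 - h1 - hit_l2     # miss at both levels
    return h1 * t1 + hit_l2 * t2 + miss_both * tm


def hit_path_time(h1, miss_global, t1, t2):
    """First two terms only. miss_global is the probability of missing at
    both levels, so Prob(level-1 miss, level-2 hit) = 1 - h1 - miss_global."""
    return h1 * t1 + (1.0 - h1 - miss_global) * t2


def crossover_slowdown(h1_vr, h1_rr, t2_over_t1=4.0):
    """Fractional slowdown s of the first-level R-cache at which the R-R
    hit-path time equals the V-R hit-path time, assuming equal global miss
    ratios. Derived by equating
        h1_vr*t1 + (1-h1_vr-m)*t2  and  h1_rr*t1*(1+s) + (1-h1_rr-m)*t2."""
    return (t2_over_t1 - 1.0) * (h1_rr - h1_vr) / h1_rr


if __name__ == "__main__":
    # Hypothetical hit ratios (NOT from the paper): V-R loses two points of
    # first-level hit ratio to context-switch invalidations.
    s = crossover_slowdown(h1_vr=0.88, h1_rr=0.90)
    print(f"break-even first-level slowdown: {s:.2%}")
```

With $t_2 = 4 t_1$ the break-even slowdown depends only on the two first-level hit ratios, which matches the paper's observation that the V-R hierarchy wins under rare context switches (equal hit ratios) for any nonzero translation penalty.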
