The Future of Microprocessors
KUNLE OLUKOTUN AND LANCE HAMMOND, STANFORD UNIVERSITY

Chip multiprocessors' promise of huge performance gains is now a reality.

The performance of microprocessors that power modern computers has continued to increase exponentially over the years for two main reasons. First, the transistors that are the heart of the circuits in all processors and memory chips have simply become faster over time on a course described by Moore's law,1 and this directly affects the performance of processors built with those transistors. Moreover, actual processor performance has increased faster than Moore's law would predict,2 because processor designers have been able to harness the increasing numbers of transistors available on modern chips to extract more parallelism from software. This is depicted in figure 1 for Intel's processors.

[Figure 1: Intel performance over time. Relative performance (log scale, 0.10 to 10,000) versus year, 1985-2003.]

An interesting aspect of this continual quest for more parallelism is that it has been pursued in a way that has been virtually invisible to software programmers. Since they were invented in the 1970s, microprocessors have continued to implement the conventional von Neumann computational model, with very few exceptions or modifications. To a programmer, each computer consists of a single processor executing a stream of sequential instructions and connected to a monolithic "memory" that holds all of the program's data. Because the economic benefits of backward compatibility with earlier generations of processors are so strong, hardware designers have essentially been limited to enhancements that have maintained this abstraction for decades. On the memory side, this has resulted in processors with larger cache memories, to keep frequently accessed portions of the conceptual "memory" in small, fast memories that are physically closer to the processor, and large register files to hold more active data values in an extremely small, fast, and compiler-managed region of "memory."

Within processors, this has resulted in a variety of modifications designed to achieve one of two goals: increasing the number of instructions from the processor's instruction sequence that can be issued on every cycle, or increasing the clock frequency of the processor faster than Moore's law would normally allow. Pipelining of individual instruction execution into a sequence of stages has allowed designers to increase clock rates as instructions have been sliced into larger numbers of increasingly small steps, which are designed to reduce the amount of logic that needs to switch during every clock cycle. Instructions that once took a few cycles to execute in the 1980s now often take 20 or more in today's leading-edge processors, allowing a nearly proportional increase in the possible clock rate. Meanwhile, superscalar processors were developed to execute multiple instructions from a single, conventional instruction stream on each cycle. These function by dynamically examining sets of instructions from the instruction stream to find ones capable of parallel execution on each cycle, and then executing them, often out of order with respect to the original program. Both techniques have flourished because they allow instructions to execute more quickly while maintaining the key illusion for programmers that all instructions are actually being executed sequentially and in order, instead of overlapped and out of order.

Of course, this illusion is not absolute. Performance can often be improved if programmers or compilers adjust their instruction scheduling and data layout to map more efficiently to the underlying pipelined or parallel architecture and cache memories, but the important point is that old or untuned code will still execute correctly on the architecture, albeit at less-than-peak speeds.
Unfortunately, it is becoming increasingly difficult for processor designers to continue using these techniques to enhance the speed of modern processors. Typical instruction streams have only a limited amount of usable parallelism among instructions,3 so superscalar processors that can issue more than about four instructions per cycle achieve very little additional benefit on most applications. Figure 2 shows how effective real Intel processors have been at extracting instruction parallelism over time. There is a flat region before instruction-level parallelism was pursued intensely, then a steep rise as parallelism was utilized usefully, followed by a tapering off in recent years as the available parallelism has become fully exploited.

[Figure 2: Intel performance from ILP. Relative performance per cycle versus year, 1985-2003.]

Complicating matters further, building superscalar processor cores that can exploit more than a few instructions per cycle becomes very expensive, because the complexity of all the additional logic required to find parallel instructions dynamically is approximately proportional to the square of the number of instructions that can be issued simultaneously. Similarly, pipelining past about 10-20 stages is difficult because each pipeline stage becomes too short to perform even a minimal amount of logic, such as adding two integers together, beyond which the design of the pipeline is significantly more complex. In addition, the circuitry overhead from adding pipeline registers and bypass path multiplexers to the existing logic combines with performance losses from events that cause pipeline state to be flushed, primarily branches. This overwhelms any potential performance gain from deeper pipelining after about 30 stages.
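The diminishing returns from deeper pipelines can be seen in a simple first-order model. The symbols and the assumption of a fixed per-stage overhead are introduced here only to illustrate the argument above; they are not taken from the article:

\[
T_{\mathrm{cycle}}(N) \;\approx\; \frac{t_{\mathrm{logic}}}{N} + t_{\mathrm{overhead}},
\qquad
f_{\mathrm{clock}}(N) \;=\; \frac{1}{T_{\mathrm{cycle}}(N)},
\]

where $t_{\mathrm{logic}}$ is the total logic delay of executing one instruction, $N$ is the number of pipeline stages, and $t_{\mathrm{overhead}}$ is the fixed per-stage cost of pipeline registers and bypass multiplexers. As $N$ grows, the $t_{\mathrm{logic}}/N$ term shrinks toward zero while $t_{\mathrm{overhead}}$ does not, so the achievable clock rate saturates; at the same time a branch that flushes the pipeline discards on the order of $N$ partially executed instructions, so beyond a few tens of stages the flush penalty outweighs the shrinking cycle-time gain.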
Further advances in both superscalar issue and pipelining are also limited by the fact that they require ever-larger numbers of transistors to be integrated into the high-speed central logic within each processor core: so many, in fact, that few companies can afford to hire enough engineers to design and verify these processor cores in reasonable amounts of time. These trends have slowed the advance in processor performance somewhat and have forced many smaller vendors to forsake the high-end processor business, as they could no longer afford to compete effectively.

Today, however, all progress in conventional processor core development has essentially stopped because of a simple physical limit: power. As processors were pipelined and made increasingly superscalar over the course of the past two decades, typical high-end microprocessor power went from less than a watt to over 100 watts. Even though each silicon process generation promised a reduction in power, as the ever-smaller transistors required less power to switch, this was true in practice only when existing designs were simply "shrunk" to use the new process technology. Processor designers, however, kept using more transistors in their cores to add pipelining and superscalar issue, and switching them at higher and higher frequencies. The overall effect was that exponentially more power was required by each subsequent processor generation (as illustrated in figure 3).

[Figure 3: Intel power over time. Power in watts (log scale) versus year, 1985-2003.]

Unfortunately, cooling technology does not scale exponentially nearly as easily. As a result, processors went from needing no heatsinks in the 1980s, to moderate-size heatsinks in the 1990s, to today's monstrous heatsinks, often with one or more dedicated fans to increase airflow over the processor. If these trends were to continue, the next generation of microprocessors would require very exotic cooling solutions, such as dedicated water cooling, that are economically impractical in all but the most expensive systems.

The combination of limited instruction parallelism suitable for superscalar issue, practical limits to pipelining, and a "power ceiling" set by practical cooling limitations has limited future speed increases within conventional processor cores to the basic Moore's law improvement rate of the underlying transistors. This limitation is already causing major processor manufacturers such as Intel and AMD to adjust their marketing focus away from simple core clock rate. Although larger cache memories will continue to improve performance somewhat, by speeding access to the single "memory" in the conventional model, the simple fact is that without more radical changes in processor design, microprocessor performance increases will slow dramatically in the future. Processor designers must find new ways to effectively utilize the increasing transistor budgets in high-end silicon chips to improve performance in ways that minimize both additional power usage and design complexity. The market for microprocessors has become stratified into areas with different performance requirements, so it is useful to examine the problem from the point of view of these different performance requirements.

THROUGHPUT PERFORMANCE IMPROVEMENT
With the rise of the Internet, the need for servers capable of handling a multitude of independent requests arriving rapidly over the network has increased dramatically. Since individual network requests are typically completely independent tasks, whether those requests are for Web pages, database access, or file service, they are typically spread across many separate computers built using high-performance conventional microprocessors (figure 4a), a technique that has been used at places like Google for years to match the overall computation throughput to the input request rate.4 As the number of requests increased over time, more servers were added to the collection.

[Figure 4: CMP implementation options. (a) conventional microprocessor; (b) simple chip multiprocessor; (c) shared-cache chip multiprocessor; (d) multithreaded, shared-cache chip multiprocessor. Each option shows one or more CPU cores with registers and L1 instruction/data caches, L2 cache, main memory, and I/O.]

It has also been possible to replace some or all of the separate servers with multiprocessors. Most existing multiprocessors consist of two or more separate processors connected using a common bus, switch hub, or network to shared memory and I/O devices. The overall system can usually be physically smaller and use less power than an equivalent set of uniprocessor systems because physically large components such as memory, hard drives, and power supplies can be shared by some or all of the processors.

Pressure has increased over time to achieve more performance per unit volume of data-center space and per watt, since data centers have finite room for servers and their electric bills can be staggering. In response, server manufacturers have tried to save space by adopting denser server packaging solutions, such as blade servers, and switching to multiprocessors that can share components. Some power reduction has also occurred through the sharing of more power-hungry components in these systems. These short-term solutions are reaching their practical limits, however, as systems are reaching the maximum component density that can still be effectively air-cooled. As a result, the next stage of development for these systems involves a new step: the CMP (chip multiprocessor).5

The first CMPs targeted toward the server market implement two or more conventional superscalar processors together on a single die.6,7,8,9 The primary motivation for this is reduced volume: multiple processors can now fit in the space where formerly only one could, so overall performance per unit volume can be increased. Some savings in power also occurs because all of the processors on a single die can share a single connection to the rest of the system, reducing the amount of high-speed communication infrastructure required, in addition to the sharing possible with a conventional multiprocessor. Some CMPs, such as the first ones announced from AMD and Intel, share only the system interface between processor cores (illustrated in figure 4b), but others share one or more levels of on-chip cache (figure 4c), which allows interprocessor communication between the CMP cores without off-chip accesses.

Further savings in power can be achieved by taking advantage of the fact that while server workloads require high throughput, the latency of each request is generally not as critical.10 Most users will not be bothered if their Web pages take a fraction of a second longer to load, but they will complain if the Web site drops page requests because it does not have enough throughput capacity. A CMP-based system can be designed to take advantage of this situation. When a two-way CMP replaces a uniprocessor, it is possible to achieve essentially the same or better throughput on server-oriented workloads with just half of the original clock speed. Each request may take up to twice as long to process because of the reduced clock rate. With many of these applications, however, the slowdown will be much less, because request processing time is more often limited by memory or disk performance than by processor performance. Since two requests can now be processed simultaneously, however, the overall throughput will now be the same or better, unless there is serious contention for the same memory or disk resources. Overall, even though performance is the same or only a little better, this adjustment is still advantageous at the system level. The lower clock rate allows us to design the system with a significantly lower power supply voltage, often a nearly linear reduction. Since power is proportional to the square of the voltage, however, the power required to obtain the original performance is much lower, usually about half (half of the voltage squared = a quarter of the power per processor, so the power required for both processors together is about half), although the potential savings could be limited by static power dissipation and any minimum voltage levels required by the underlying transistors.
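The arithmetic behind that "about half" figure can be written out explicitly. Assuming, as stated above, that halving the clock rate permits the supply voltage to be cut roughly in half, and using the article's relation that power scales with the square of the voltage:

\[
P_{\mathrm{core}} \;\propto\; V^{2},
\qquad
P_{\mathrm{core}}' \;\propto\; \left(\frac{V}{2}\right)^{2} = \frac{V^{2}}{4},
\qquad
P_{\mathrm{CMP}}' \;=\; 2 \times P_{\mathrm{core}}' \;\propto\; \frac{V^{2}}{2}.
\]

The two slower cores together therefore draw roughly half the power of the original uniprocessor while delivering comparable throughput; as noted above, static (leakage) power and the minimum operating voltage of the underlying transistors set a floor on how far this scaling can be pushed.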
For throughput-oriented workloads, even more power/performance and performance/chip area can be achieved by taking the "latency is unimportant" idea to its extreme and building the CMP with many small cores instead of a few large ones. Because typical server workloads have very low amounts of instruction-level parallelism and many memory stalls, most of the hardware associated with superscalar instruction issue is essentially wasted for these applications. A typical server will have tens or hundreds of requests in flight at once, however, so there is enough work available to keep many processors busy simultaneously. Therefore, replacing each large, superscalar processor in a CMP with several small ones, as has been demonstrated successfully with the Sun Niagara,11 is a winning policy. Each small processor will process its request more slowly than a larger, superscalar processor, but this latency slowdown is more than compensated for by the fact that the same chip area can be occupied by a much larger number of processors: about four times as many in the case of Niagara, which has eight single-issue SPARC processor cores in a technology that can hold only a pair of superscalar UltraSPARC cores.

Taking this idea one step further, still more latency can be traded for higher throughput with the inclusion of multithreading logic within each of the cores.12,13,14 Because each core tends to spend a fair amount of time waiting for memory requests to be satisfied, it makes sense to assign each core several threads by including multiple register files, one per thread, within each core (figure 4d). While some of the threads are waiting for memory to respond, the processor may still execute instructions from the others. Larger numbers of threads can also allow each processor to send more requests off to memory in parallel, increasing the utilization of the highly pipelined memory systems on today's processors. Overall, threads will typically have a slightly longer latency, because there are times when all are active and competing for the use of the processor core. The gain from performing computation during memory stalls and the ability to launch numerous memory accesses simultaneously more than compensates for this longer latency on systems such as Niagara, which has four threads per processor (or 32 for the entire chip), and Pentium chips with Intel's Hyperthreading, which allows two threads to share a Pentium 4 core.
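A rough way to see why a few thread contexts go a long way is to model each thread as independently stalled on memory some fraction p of the time. This idealized model is introduced here purely for illustration; it ignores contention among the threads for the pipeline and caches:

\[
U(N) \;\approx\; 1 - p^{N},
\]

where $U(N)$ is the fraction of cycles on which the core can issue useful work given $N$ hardware threads. With $p = 0.7$, for example, a single-threaded core would be busy only about 30 percent of the time, while four threads per core (as in Niagara) raise the estimate to $1 - 0.7^{4} \approx 0.76$, at the cost of each individual request finishing somewhat later.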
LATENCY PERFORMANCE IMPROVEMENT
The performance of many important applications is measured in terms of the execution latency of individual tasks instead of the overall throughput of many essentially unrelated tasks. Most desktop processor applications still fall in this category, as users are generally more concerned with their computers responding to their commands as quickly as possible than they are with their computers' ability to handle many commands simultaneously, although this situation is changing slowly over time as more applications are written to include many "background" tasks. Users of many other computation-bound applications, such as most simulations and compilations, are typically also more interested in how long the programs take to execute than in executing many in parallel.

Multiprocessors can speed up these types of applications, but it requires effort on the part of programmers to break up each long-latency thread of execution into a large number of smaller threads that can be executed on many processors in parallel, since automatic parallelization technology has typically functioned only on Fortran programs describing dense-matrix numerical computations. Historically, communication between processors was generally slow in relation to the speed of individual processors, so it was critical for programmers to ensure that threads running on separate processors required only minimal communication with each other. Because communication reduction is often difficult, only a small minority of users bothered to invest the time and effort required to parallelize their programs in a way that could achieve speedup, so these techniques were taught only in advanced, graduate-level computer science courses. Instead, in most cases programmers found that it was just easier to wait for the next generation of uniprocessors to appear and speed up their applications for "free" instead of investing the effort required to parallelize their programs. As a result, multiprocessors had a hard time competing against uniprocessors except in very large systems, where the target performance simply exceeded the power of the fastest uniprocessors available.

With the exhaustion of essentially all performance gains that can be achieved for "free" with technologies such as superscalar dispatch and pipelining, we are now entering an era where programmers must switch to more parallel programming models in order to exploit multiprocessors effectively, if they desire improved single-program performance. This is because there are only three real "dimensions" to processor performance increases beyond Moore's law: clock frequency, superscalar instruction issue, and multiprocessing. We have pushed the first two to their logical limits and must now embrace multiprocessing, even if it means that programmers will be forced to change to a parallel programming model to achieve the highest possible performance.

Conveniently, the transition from multiple-chip systems to chip multiprocessors greatly simplifies the problems traditionally associated with parallel programming. Previously it was necessary to minimize communication between independent threads to an extremely low level, because each communication could require hundreds or even thousands of processor cycles. Within any CMP with a shared on-chip cache memory, however, each communication event typically takes just a handful of processor cycles. With latencies like these, communication delays have a much smaller impact on overall system performance. Programmers must still divide their work into parallel threads, but do not need to worry nearly as much about ensuring that these threads are highly independent, since communication is relatively cheap. This is not a complete panacea, however, because programmers must still structure their inter-thread synchronization correctly, or the program may generate incorrect results or deadlock, but at least the performance impact of communication delays is minimized.
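As a concrete illustration of this style of programming, the following sketch divides a single long-running computation into a fixed number of POSIX threads. The workload (summing squares) and names such as NTHREADS, data, and worker are stand-ins chosen for brevity, not anything from the article. Each thread works on a private slice of the data and takes a mutex only when combining results, so the shared update cannot produce incorrect results regardless of thread timing.

/* Minimal sketch: splitting one long-latency task across POSIX threads.
 * Compile with a pthreads-enabled C compiler (for example, cc -pthread). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

struct range { int begin, end; };

static void *worker(void *arg)
{
    struct range *r = arg;
    double partial = 0.0;

    /* Each thread computes independently on its own slice of the array. */
    for (int i = r->begin; i < r->end; i++)
        partial += data[i] * data[i];

    /* Synchronization is confined to one small, clearly defined point. */
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range ranges[NTHREADS];

    for (int i = 0; i < N; i++)
        data[i] = (double)i;

    /* Divide the work into equal slices and launch one thread per slice. */
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].begin = t * (N / NTHREADS);
        ranges[t].end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("sum of squares = %f\n", total);
    return 0;
}

The structure, rather than the arithmetic, is the point: the work is partitioned into threads, and communication between them happens only when results are merged, which is exactly the pattern that cheap on-chip communication makes profitable even for fairly small slices of work.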
Parallel threads can also be much smaller and still be effective: threads that are only hundreds or a few thousand cycles long can often be used to extract parallelism with these systems, instead of the millions-of-cycles-long threads typically necessary with conventional parallel machines.

Researchers have shown that parallelization of applications can be made even easier with several schemes involving the addition of transactional hardware to a CMP.15,16,17,18,19 These systems add buffering logic that lets threads attempt to execute in parallel, and then dynamically determines whether they are actually parallel at runtime. If no inter-thread dependencies are detected at runtime, then the threads complete normally. If dependencies exist, then the buffers of some threads are cleared and those threads are restarted, dynamically serializing the threads in the process. Such hardware, which is only practical on tightly coupled parallel machines such as CMPs, eliminates the need for programmers to determine whether threads are parallel as they parallelize their programs; they need only choose potentially parallel threads.
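The programmer's view under such a scheme might look like the fragment below. The type record_t, the array totals, and the primitives spec_begin_iteration and spec_commit_iteration are invented placeholders standing in for whatever interface a transactional CMP and its runtime would expose; they do not correspond to a real library or to any specific proposal in the references, and the fragment is an illustrative sketch rather than working code.

/* Hypothetical sketch: the spec_* calls and types below are placeholders,
 * not a real API.  Stub declarations are included only so the shape of the
 * code is complete. */
typedef struct { int bucket; int count; double weight; } record_t;
extern double totals[];
void spec_begin_iteration(int i);   /* start buffering this iteration's loads/stores */
void spec_commit_iteration(int i);  /* commit, or squash and replay on a conflict */

void process_all(record_t *records, int n)
{
    for (int i = 0; i < n; i++) {
        /* Launch each iteration as a potentially parallel thread; its memory
         * accesses are buffered rather than committed immediately. */
        spec_begin_iteration(i);

        record_t *r = &records[i];
        r->count++;                      /* usually private to this record        */
        totals[r->bucket] += r->weight;  /* may occasionally collide with another
                                            iteration; the hardware detects this  */

        /* If no other in-flight iteration touched the same data, the buffered
         * state commits and the iteration completes normally; otherwise the
         * buffer is discarded and the iteration re-executes after its
         * predecessor, serializing the two automatically. */
        spec_commit_iteration(i);
    }
}

No locks appear in the code, and the programmer never has to prove the iterations independent; the only performance concern is choosing threads that are likely to be independent so that few of them are squashed and retried.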
Overall, the shift from conventional processors to CMPs should be less traumatic for programmers than the shift from conventional processors to multichip multiprocessors, because of the short CMP communication latencies and enhancements such as transactional memory, which should be commercially available within the next few years. As a result, this paradigm shift should be within the range of what is feasible for "typical" programmers, instead of being limited to graduate-level computer science topics.

HARDWARE ADVANTAGES
In addition to the software advantages now and in the future, CMPs have major advantages over conventional uniprocessors for hardware designers. CMPs require only a fairly modest engineering effort for each generation of processors. Each member of a family of processors just requires the stamping down of additional copies of the core processor and then making some modifications to the relatively slow logic connecting the processors together to accommodate the additional processors in each generation, and not a complete redesign of the high-speed processor core logic. Moreover, the system board design typically needs only minor tweaks from generation to generation, since externally a CMP looks essentially the same from generation to generation, even as the number of processors within it increases. The only real difference is that the board will need to deal with higher I/O bandwidth requirements as the CMPs scale. Over several silicon process generations, the savings in engineering costs can be significant, because it is relatively easy to stamp down a few more cores each time. Also, the same engineering effort can be amortized across a large family of related processors. Simply varying the numbers and clock frequencies of processors can allow essentially the same hardware to function at many different price/performance points.

AN INEVITABLE TRANSITION
As a result of these trends, we are at a point where chip multiprocessors are making significant inroads into the marketplace. Throughput computing is the first and most pressing area where CMPs are having an impact. This is because they can improve power/performance results right out of the box, without any software changes, thanks to the large numbers of independent threads that are available in these already multithreaded applications. In the near future, CMPs should also have an impact in the more common area of latency-critical computations. Although it is necessary to parallelize most latency-critical software into multiple parallel threads of execution to really take advantage of a chip multiprocessor, CMPs make this process easier than with conventional multiprocessors, because of their short interprocessor communication latencies.

Viewed another way, the transition to CMPs is inevitable because past efforts to speed up processor architectures with techniques that do not modify the basic von Neumann computing model, such as pipelining and superscalar issue, are encountering hard limits. As a result, the microprocessor industry is leading the way to multicore architectures; however, the full benefit of these architectures will not be harnessed until the software industry fully embraces parallel programming. The art of multiprocessor programming, currently mastered by only a small minority of programmers, is more complex than programming uniprocessor machines and requires an understanding of new computational principles, algorithms, and programming tools.

REFERENCES
1. Moore, G. E. 1965. Cramming more components onto integrated circuits. Electronics (April): 114-117.
2. Hennessy, J. L., and Patterson, D. A. 2003. Computer Architecture: A Quantitative Approach, 3rd Edition. San Francisco, CA: Morgan Kaufmann Publishers.
3. Wall, D. W. 1993. Limits of Instruction-Level Parallelism. WRL Research Report 93/6, Digital Western Research Laboratory, Palo Alto, CA.
4. Barroso, L., Dean, J., and Hölzle, U. 2003. Web search for a planet: the architecture of the Google cluster. IEEE Micro 23(2): 22-28.
5. Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K. 1996. The case for a single chip multiprocessor. Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII): 2-11.
6. Kapil, S. 2003. UltraSPARC Gemini: Dual CPU Processor. In Hot Chips 15 (August), Stanford, CA; /archives/.
7. Maruyama, T. 2003. SPARC64 VI: Fujitsu's next-generation processor. In Microprocessor Forum (October), San Jose, CA.
8. McNairy, C., and Bhatia, R. 2004. Montecito: the next product in the Itanium processor family. In Hot Chips 16 (August), Stanford, CA; /archives/.
9. Moore, C. 2000. POWER4 system microarchitecture. In Microprocessor Forum (October), San Jose, CA.
10. Barroso, L. A., Gharachorloo, K., McNamara, R., Nowatzyk, A., ...
