L02 Fundamentals of Computer Design (Computer Architecture).ppt
Chapter 1: Fundamentals of Computer Design

Outline
  • Why Such Change in 20 Years?
  • The End of the Uniprocessor Era
  • Sea Change in Chip Design
  • New Project in Berkeley
  • New Trends in Computer Design
  • What Computer Architecture Brings to the Table?
      1) Taking Advantage of Parallelism
      2) The Principle of Locality
      3) Focus on the Common Case
      4) Amdahl's Law
      5) Processor performance equation

Why Such Change in 20 Years?
  • Performance
      • Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
      • Computer architecture advances improve low-end systems: RISC, superscalar, RAID, ...
  • Price: lower costs due to
      • Simpler development. CMOS VLSI: smaller systems, fewer components (higher integration, more functionality)
      • Higher volumes. CMOS VLSI: same device cost for 10,000 vs. 10,000,000 units
  • Function
      • Rise of networking / local interconnection technology

Technology Trends: Microprocessor Capacity (transistor count)
  • CMOS improvements: die size 2X every 3 years; line width halves every 7 years
  • Alpha 21264: 15 million
  • Pentium Pro: 5.5 million
  • PowerPC 620: 6.9 million
  • Alpha 21164: 9.3 million
  • Sparc Ultra: 5.2 million
  • 1971: the first processor, the 4004 (a 4-bit microprocessor), had only 2,300 transistors; the P processor contains more than 20 million transistors

Memory Capacity (Single-Chip DRAM)

  Year   Size (Mb)   Cycle time
  1980   0.0625      250 ns
  1983   0.25        220 ns
  1986   1           190 ns
  1989   4           165 ns
  1992   16          145 ns
  1996   64          120 ns
  2000   256         100 ns

Current DRAM

Technology Trends (Summary)

          Capacity           Speed (latency)
  Logic   2x in 3 years      2x in 3 years
  DRAM    4x in 3-4 years    2x in 10 years
  Disk    4x in 2-3 years    2x in 10 years

Processor Performance Trends (figure; x-axis: Year)

Gates per clock
  • A typical pipeline has a fixed amount of work that is required to decode and execute an instruction. This work is performed by individual logical operations called gates.
  • Gates per clock is how many gates in a pipeline may change state in a single clock cycle.
  • If we increase clock speed faster than gate speed improves, we can reduce the gates per clock and add more pipeline stages.
  • Gates per clock can be reduced by inserting latches into the datapath: when the number of gates between latches is reduced, a higher clock rate is possible.

Processor Performance (1.35X before, 1.55X in the 90s)
  • Before the mid-1980s, performance growth was technology-driven (circuit technology). Since then it has come from advanced architectural ideas: pipelining, out-of-order execution, superscalar issue, and multilevel caches.

Processor Performance Trends (Summary)
  • Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
  • Improvement in cost-performance estimated at 70% per year
  • Supplement: SPEC benchmark programs
  • The severe challenge limiting microprocessor design and implementation is not manufacturing capability but power density

Crossroads: Uniprocessor Performance
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: 20%/year, 2002 to present
  • From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006

Recent Intel Processors
  • "We are dedicating all of our future product development to multicore designs. We believe this is a key inflection point for the industry." Intel President Paul Otellini, IDF 2005

The End of the Uniprocessor Era
  • Single biggest change in the history of computing systems

Conventional Wisdom in Computer Architecture
  • Old Conventional Wisdom (CW): Power is free, transistors expensive
  • New CW: "Power wall". Power expensive, transistors free (can put more on chip than can afford to turn on)
  • Old CW: Sufficiently increasing Instruction-Level Parallelism (ILP) via compilers, innovation (pipelining, superscalar, out-of-order, speculation, VLIW, ...)
  • New CW: "ILP wall". Law of diminishing returns on more HW for ILP
  • Old CW: Multiplies are slow, memory access is fast
  • New CW: "Memory wall". Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
  • Old CW: Uniprocessor performance 2X / 1.5 years
  • New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
      • Uniprocessor performance now 2X / 5(?) years
      • Sea change in chip design: multiple "cores" (2X processors per chip / 2 years)
      • More, simpler processors are more power efficient

Prediction for x86 processors, from Hennessy
  • TLP: 2+ cores / 2 years
  • DLP: 2x width / 4 years
  • 2X CPUs, 1.2X clock rate

  • HW research community does logic design ("gate shareware") to create an out-of-the-box Massively Parallel Processor that runs standard binaries of OS, apps
  • Gateware: Processors, Caches, Coherency, Ethernet Interfaces, Switches, Routers, ... (some free from open source hardware) (/)
  • E.g., 1000-processor, standard-ISA (IBM POWER) binary-compatible, 64-bit, cache-coherent supercomputer at 200 MHz/CPU in 2007

Advantages of FPGAs
  • Over the past 20 years, CPU speed kept following Moore's Law, and parallelism in computers received little attention. Since 2005, power consumption and heat dissipation have brought single-CPU speed growth nearly to a halt, multicore chips have appeared, and parallel computer system design has become a research hot spot.
  • Because FPGAs keep advancing in speed, price, and integration density in line with Moore's Law, parallel computer systems implemented on FPGAs will soon become practical.
  • As semiconductor processes move to deep-submicron (nanometer-scale) nodes, products grow more complex and technology costs keep rising. At the same time, the market window for a product has shrunk from 3-5 years to 1 year, so programmability has become a necessary feature for shortening time to market.
  • The advantages of FPGAs are increasingly evident: the price per FPGA logic cell falls 25% per year. The FPGA market is growing twice as fast as ASSPs (application-specific standard products) and three times as fast as ASICs (application-specific integrated circuits).

Viewpoints
  • Richard Sevcik, Executive Vice President, Xilinx: Will the development of multi-platform FPGAs end the ASIC era?
  • Dave Bursky, digital IC/DSP editor, Electronic Design: Advances in FPGA technology pave the way to true SoC solutions.
  • Justin R. Rattner, Intel CTO: One area that interests me is reconfigurable hardware. A more common design is to integrate an FPGA into the processor; we will add such a research project some time this year and run large-scale experiments.

RAMP
  • Since the goal is to ramp up research in multiprocessing, it is called Research Accelerator for Multiple Processors
  • To learn more, read "RAMP: Research Accelerator for Multiple Processors - A Community Vision for a Shared Experimental Parallel HW/SW Platform," Technical Report UCB/CSD-05-1412, Sept 2005
  • Source: IEEE Micro, vol. 27, no. 2 (March 2007), pages 46-57
  • Authors: John Wawrzynek (University of California, Berkeley), David Patterson (University of California, Berkeley), Mark Oskin (University of Washington), Shih-Lien Lu (Intel), Christoforos Kozyrakis (Stanford University), James C. Hoe (Carnegie Mellon University), Derek Chiou (University of Texas at Austin), Krste Asanovic (Massachusetts Institute of Technology), W

RAMP: Research Accelerator for Multiple Processors
  • ABSTRACT: The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs - for transactional memory, distributed systems, and distributed-shared memory - demonstrate the platform's potential.

The stone soup of architecture research platforms
  • I/O: Patterson
  • Monitoring: Kozyrakis
  • Net Switch: Oskin
  • Coherence: Hoe
  • Cache: Asanovic
  • PPC: Arvind
  • x86: Lu
  • Glue-support: Chiou
  • Hardware: Wawrzynek

RAMP uses (internal)
  • Internet-in-a-Box: Patterson
  • BlueSpec: Arvind
  • Net-uP: Chiou
  • Wawrzynek
  • BEE

Why RAMP Good for Research?

RAMP 1 Hardware
  • Completed Dec. 2004 (14 x 17 inch, 22-layer PCB)
  • Module: FPGAs, memory, 10GigE connections, Compact Flash
  • Administration/maintenance ports: 10/100 Enet, HDMI/DVI, USB
  • 4K/module without FPGAs or DRAM
  • Called "BEE2" for Berkeley Emulation Engine 2

Multiple Module RAMP 1 Systems
  • 8 compute modules (plus power supplies) in 8U rack-mount chassis
  • 500-1000 emulated processors
  • Many topologies possible
  • 2U single-module tray for developers
  • Disk storage: disk emulator + Network Attached Storage
  • Multi-gigabit transceivers (MGT)

RAMP's ISA
  • Got it: Power 405 (32b), SPARC v8 (32b), Xilinx MicroBlaze (32b)
  • Very likely: SPARC v9 (64b)
  • Likely: IBM Power 64b
  • Probably (haven't asked): MIPS32, MIPS64
  • Not likely: x86
  • Even less likely: x86-64
  • We'll sue you: ARM

Vision: Multiprocessing Watering Hole
  • RAMP attracts many communities to a shared artifact, which drives cross-disciplinary interactions and accelerates innovation in multiprocessing
  • RAMP as next Standard Research Platform? (e.g., VAX/BSD Unix in 1980s, x86/Linux in 1990s)
  • Topics gathered around RAMP: Parallel file system, Thread scheduling, Multiprocessor switch design, Fault insertion to check dependability, Data center in a box, Internet in a box, Dataflow language/computer, Security enhancements, Router design, Compile to FPGA, Parallel languages

Papers and Technical Reports of RAMP Project
  • /index.php?publications

Berkeley's New Focus
  • A View of the Parallel Computing Landscape (Communications of the ACM, Oct. 2009, vol. 52, no. 10)
  • The paper reviews the issues and, as an example, describes an integrated approach we're developing at the Parallel Computing Laboratory to tackle the parallel challenge. A key research objective is to enable programmers to easily write programs that run as efficiently on many-core systems as on sequential ones.
  • 12 computational patterns in 7 general application areas and 5 Par Lab applications
  • Two long technical reports:
      • Asanovic, K. et al. The Parallel Computing Laboratory at U.C. Berkeley: A Research Agenda Based on the Berkeley View. UCB/EECS-2008-23, University of California, Berkeley, Mar. 21, 2008.
      • Asanovic, K. et al. The Landscape of Parallel Computing Research: A View from Berkeley. UCB/EECS-2006-183, University of California, Berkeley, Dec. 18, 2006.
  • Parallel Computing Research at Illinois: The UPCRC Agenda

New Trends in Computer Design

Top500
  • An organization that compiles statistics on high-performance computers, aimed mainly at manufacturers, users, and potential users of such machines.
  • Since 1993, Top500 has benchmarked high-performance computers with the Linpack program and published a list of the 500 best systems on the Top500 website.
  • Linpack: a computer program that solves a system of 100 linear equations, used to benchmark high-performance computers.

Top 10 Supercomputers - 06/2012

Graphic Processing Unit (GPU)
  • A dedicated graphics display device for personal computers, workstations, and game consoles
  • Sits on a graphics card or is integrated on the motherboard; nVidia and ATI (now AMD) are the main manufacturers

Differences between GPU and CPU
  • GPU: aimed at compute-intensive, massively data-parallel computation; most transistors go to compute units
  • General-purpose CPU: aimed at general-purpose computation; most transistors go to cache and control circuitry

Peak speed comparison of GPU and CPU
  • Based on slide 7 of S. Green, "GPU Physics," SIGGRAPH 2007 GPGPU Course. /s2007/slides/15-GPGPU-physics.pdf

The surging number of GPU (CUDA) cores
  • GeForce GTX 690: Shader Processors, 2 x 1536

Comparison of CPU, GPU, and FPGA implementations

APU: merging CPU and GPU into one
  • APU is short for "Accelerated Processing Units", the product of AMD's Fusion concept. It is the first to place the processor and a discrete-class graphics core on a single die.
  • The APU microarchitecture fuses five parts: CPU, GPU, north bridge, memory controller, and I/O controller.

Intel's Many Core and Multi-core
  • Intel 80-core TeraScale Processor (Vangal et al. 2008)
  • Developed a solver (single precision) for this chip that ran at 1 TFLOP with only 97 Watts
  • Source: Tim Mattson, Intel Labs

Trends are putting all onto one chip
  • The future belongs to heterogeneous, many-core SOC as the standard building block of computing
  • SOC = system on a chip
  • Source: Tim Mattson, Intel Labs

Xeon Phi
  • Xeon Phi is the first 60-core processor, officially launched by Intel on November 12, 2012.
  • Xeon Phi is not a CPU in the traditional sense; it is more like a GPU that works alongside the CPU. It is based on Larrabee, Intel's consumer GPU technology, although that project was cancelled in 2009.
  • Intel needs the Larrabee technology to compete with Nvidia in the supercomputer market, because simpler, more specialized GPU processors can handle certain supercomputing tasks more efficiently, raising performance and reducing energy consumption.

TOP10 List June 2013

TOP10 List June 2016

TOP10 List June 2018

Challenge of Parallel Programming
  • John Hennessy (Computing Legend, leading inventor of RISC and President of Stanford University): "when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced."
  • "A conversation with Hennessy and Patterson," ACM Queue Magazine, 4:10, 1/07

What Computer Architecture Brings to the Table?
  • Other fields often borrow ideas from architecture
  • Quantitative Principles of Design (the design perspective)
      • Take Advantage of Parallelism
      • Principle of Locality
      • Focus on the Common Case
      • Amdahl's Law
      • The Processor Performance Equation
  • Careful, quantitative comparisons (the analysis perspective)
      • Define, quantify, and summarize relative performance
      • Define and quantify relative cost
      • Define and quantify dependability
      • Define and quantify power
  • Culture of anticipating and exploiting advances in technology
  • Culture of well-defined interfaces that are carefully implemented and thoroughly checked

1) Taking Advantage of Parallelism
  • Increasing throughput of a server computer via multiple processors or multiple disks
  • Detailed HW design
      • Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
      • Multiple memory banks are searched in parallel in set-associative caches
  • Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence
      • Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible
      • Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)

Pipelined Instruction Execution

Limits to pipelining
  • Hazards prevent the next instruction from executing during its designated clock cycle
      • Structural hazards: attempt to use the same hardware to do two different things at once
      • Data hazards: instruction depends on the result of a prior instruction still in the pipeline
      • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

2) The Principle of Locality
  • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time
  • Two different types of locality:
      • Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
      • Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
  • For the last 30 years, HW has relied on locality for memory performance
  • Diagram: P (processor), $ (cache), MEM (memory)

Levels of the Memory Hierarchy

  Level             Capacity          Access time               Cost
  CPU Registers     100s Bytes        300-500 ps (0.3-0.5 ns)
  L1 and L2 Cache   10s-100s KBytes   1 ns - 10 ns              $1000s/GByte
  Main Memory       GBytes            80 ns - 200 ns            $100/GByte
  Disk              10s TBytes        10 ms (10,000,000 ns)     $1/GByte
  Tape              infinite          sec-min                   $1/GByte

  Between levels         Staging/transfer unit   Managed by       Size
  Registers - L1 Cache   Instr. operands         prog./compiler   1-8 bytes
  L1 Cache - L2 Cache    Blocks                  cache cntl       32-64 bytes
  L2 Cache - Memory      Blocks                  cache cntl       64-128 bytes
  Memory - Disk          Pages                   OS               4K-8K bytes
  Disk - Tape            Files                   user/operator    Mbytes

  • Upper levels are faster; lower levels are larger

3) Focus on the Common Case
  • Common sense guides computer design; since it's engineering, common sense is valuable
  • In making a design trade-off, favor the frequent case over the infrequent case
      • E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
      • E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
  • The frequent case is often simpler and can be done faster than the infrequent case
      • E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
      • This may slow down overflow, but overall performance improves by optimizing for the normal case
  • What is the frequent case, and how much does performance improve by making that case faster? Amdahl's Law answers this

4) Amdahl's Law
  1. Fraction enhanced: the fraction of the computation time in the original computer that can be converted to take advantage of the enhancement. For example, if 20 s of the execution time of a program (which takes 60 s in total) can use an enhancement, the fraction is 20/60. Fraction enhanced is always less than or equal to 1.
  2. Speedup enhanced: the improvement gained by the enhanced execution mode; this v
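
The two quantities defined above combine in the standard statement of Amdahl's Law: overall speedup = 1 / ((1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced). The sketch below applies it to the slide's 20 s out of 60 s example. The function name `amdahl_speedup` and the 10x value for Speedup enhanced are assumptions made only for illustration; the slide does not give a number for the enhancement.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup predicted by Amdahl's Law.

    fraction_enhanced: share of the ORIGINAL execution time that can use
                       the enhancement (between 0 and 1).
    speedup_enhanced:  how many times faster that share runs once enhanced.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # The unenhanced share keeps its original time; the enhanced share shrinks.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    total_s = 60.0           # total execution time from the slide's example
    enhanceable_s = 20.0     # part of that time that can use the enhancement
    fraction = enhanceable_s / total_s
    assumed_speedup = 10.0   # hypothetical value, not taken from the slides

    overall = amdahl_speedup(fraction, assumed_speedup)
    print(f"overall speedup: {overall:.3f}")
    print(f"new execution time: {total_s / overall:.1f} s")
```

Under these assumptions the 40 s that cannot be enhanced still dominate: even an arbitrarily large Speedup enhanced cannot push the overall speedup past 1 / (1 - 20/60) = 1.5.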
