Slide 1: CS267: Applications of Parallel Computers
/~demmel/cs267_Spr13/
Lecture 1: Introduction
Jim Demmel, EECS & Math Departments, demmel@

Slide 2: Outline
- Why powerful computers must be parallel processors (all of them, including your laptops and handhelds)
- Large Computational Science and Engineering (CSE) problems require powerful computers (commercial problems too)
- Why writing (fast) parallel programs is hard (but things are improving)
- Structure of the course

Slide 3: Units of Measure
High Performance Computing (HPC) units are:
- Flop: floating point operation, usually double precision unless noted
- Flop/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions, ...
- Mega: Mflop/s = 10^6 flop/sec; Mbyte = 2^20 = 1048576 ~ 10^6 bytes
- Giga: Gflop/s = 10^9 flop/sec; Gbyte = 2^30 ~ 10^9 bytes
- Tera: Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
- Peta: Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
- Exa: Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
- Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
- Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes
Current fastest (public) machine: ~27 Pflop/s. Up-to-date list at the TOP500 website.

Slide 4: Why powerful computers are parallel, circa 1991-2006 (all of them since ~2007)

Slide 5: Tunnel Vision by Experts
- "I think there is a world market for maybe five computers." Thomas Watson, chairman of IBM, 1943.
- "There is no reason for any individual to have a computer in their home." Ken Olson, president and founder of Digital Equipment Corporation, 1977.
- "640K [of memory] ought to be enough for anybody." Bill Gates, chairman of Microsoft, 1981.
- "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it." Ken Kennedy, CRPC Director, 1994.
Slide source: Warfield et al.

Slide 6: Technology Trends: Microprocessor Capacity
- 2X transistors per chip every 1.5 years, called "Moore's Law"

Moore’sLawMicroprocessorshavebecomesmaller,denser,andmorepowerful.GordonMoore(co-founderofIntel)predictedin1965thatthetransistordensityofsemiconductorchipswoulddoubleroughlyevery18months.Slidesource:JackDongarra7MicroprocessorTransistors/Clock(1970-2000)8ImpactofDeviceShrinkageWhathappenswhenthefeaturesize(transistorsize)shrinksbyafactorofx?Clockrategoesupbyxbecausewiresareshorteractuallylessthanx,becauseofpowerconsumptionTransistorsperunitareagoesupbyx2Diesizealsotendstoincreasetypicallyanotherfactorof~xRawcomputingpowerofthechipgoesupby~x4!typically

Slide 9: Manufacturing Issues Limit Performance
- Moore's 2nd law (Rock's law): fabrication costs go up
  [Image: demo of 0.06 micron CMOS. Source: Forbes Magazine]
- Yield: what percentage of the chips are usable? E.g., the Cell processor (PS3) was sold with 7 out of 8 cores "on" to improve yield
- Manufacturing costs and yield problems limit the use of density

Power Density Limits Serial Performance
[Chart: power density (W/cm^2), log scale, vs. year 1970-2010 for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium(R), and P6, rising toward the power density of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shenkar Bokar, Intel]

Scaling clock speed (business as usual) will not work
- High performance serial processors waste power
  - Speculation, dynamic dependence checking, etc. burn power
  - Implicit parallelism discovery
- More transistors, but not faster serial processors
- Concurrent systems are more power efficient
  - Dynamic power is proportional to V^2 f C
  - Increasing frequency (f) also increases supply voltage (V): roughly a cubic effect on power
  - Increasing the number of cores increases capacitance (C), but only linearly
  - Save power by lowering the clock speed
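As a hedged illustration of the bullets above (my own sketch, not from the slides): with dynamic power ~ C V^2 f, and the slide's assumption that raising f by a factor s also raises V by about s, one faster core costs roughly s^3 in power while s slower cores cost only about s for (ideally) the same s-fold throughput.

#include <stdio.h>

/* Sketch of the dynamic-power argument: P ~ C * V^2 * f.
 * Assume (as the slide does) that pushing f up by a factor s also
 * requires V to scale by ~s, so single-core power grows like s^3.
 * Adding s cores at the original f and V grows capacitance ~s,
 * so power grows only like s.  Baseline values are normalized to 1. */
int main(void) {
    const double C = 1.0, V = 1.0, f = 1.0;   /* normalized baseline */
    for (double s = 1.0; s <= 4.0; s += 1.0) {
        double p_fast_core  = C * (V * s) * (V * s) * (f * s); /* ~s^3 */
        double p_many_cores = (C * s) * V * V * f;             /* ~s   */
        printf("speedup %.0fx: one faster core ~%2.0fx power, %d slower cores ~%2.0fx power\n",
               s, p_fast_core, (int)s, p_many_cores);
    }
    return 0;
}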

Slide 11: Revolution in Processors
- Chip density is continuing to increase, ~2x every 2 years
- Clock speed is not
- Number of processor cores may double instead
- Power is under control, no longer growing

Slide 12: Parallelism in 2013?
- These arguments are no longer theoretical
- All major processor vendors are producing multicore chips
- Every machine will soon be a parallel machine
- To keep doubling performance, parallelism must double
- Which (commercial) applications can use this parallelism? Do they have to be rewritten from scratch? Will all programmers have to be parallel programmers?
- New software model needed: try to hide complexity from most programmers (eventually); in the meantime, need to understand it
- The computer industry is betting on this big change, but does not have all the answers; the Berkeley ParLab was established to work on this

Memory is Not Keeping Pace
- Technology trends work against a constant or increasing memory per core
- Memory density is doubling every three years; processor logic is doubling every two
- Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
Source: David Turek, IBM

Cost of Computation vs. Memory
- Question: can you double concurrency without doubling memory?
- Strong scaling: fixed problem size, increase the number of processors
- Weak scaling: grow the problem size proportionally to the number of processors
Source: IBM
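To make the strong/weak scaling vocabulary above concrete, here is a small sketch of my own (the problem sizes are arbitrary placeholders, not taken from the slides): strong scaling holds total memory fixed as P grows, weak scaling holds memory per processor fixed and lets total memory grow.

#include <stdio.h>

/* Strong scaling: total problem size N is fixed as processors P grow,
 *   so memory per processor falls like N/P (concurrency doubles
 *   without total memory doubling).
 * Weak scaling: problem size grows with P (N = n_per_proc * P),
 *   so memory per processor stays constant but total memory doubles
 *   whenever P doubles. */
int main(void) {
    const double N_fixed = 1e9;      /* strong-scaling problem size (words), illustrative */
    const double n_per_proc = 1e6;   /* weak-scaling work per processor, illustrative     */
    for (int P = 1; P <= 16; P *= 2) {
        printf("P=%2d  strong: mem/proc=%.2e total=%.2e   weak: mem/proc=%.2e total=%.2e\n",
               P, N_fixed / P, N_fixed, n_per_proc, n_per_proc * P);
    }
    return 0;
}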

The TOP500 Project
- Listing the 500 most powerful computers in the world
- Yardstick: Rmax of Linpack: solve Ax=b, a dense problem with a random matrix; dominated by dense matrix-matrix multiply
- Updated twice a year: ISC'xy in June in Germany, SCxy in November in the U.S.
- All information available from the TOP500 website

The TOP10 in November 2012
Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Pflops] | Power [MW]
1 | Oak Ridge National Laboratory | Cray | Titan (Cray XK7, Opteron 16C 2.2 GHz, Gemini, NVIDIA K20x) | USA | 560,640 | 17.59 | 8.21
2 | Lawrence Livermore National Laboratory | IBM | Sequoia (BlueGene/Q, Power BQC 16C 1.6 GHz, Custom) | USA | 1,572,864 | 16.32 | 7.89
3 | RIKEN Advanced Institute for Computational Science | Fujitsu | K computer (SPARC64 VIIIfx 2.0 GHz, Tofu Interconnect) | Japan | 705,024 | 10.51 | 12.66
4 | Argonne National Laboratory | IBM | Mira (BlueGene/Q, Power BQC 16C 1.6 GHz, Custom) | USA | 786,432 | 8.16 | 3.95
5 | Forschungszentrum Juelich (FZJ) | IBM | JUQUEEN (BlueGene/Q, Power BQC 16C 1.6 GHz, Custom) | Germany | 393,216 | 4.14 | 1.97
6 | Leibniz Rechenzentrum | IBM | SuperMUC (iDataPlex DX360M4, Xeon E5 8C 2.7 GHz, Infiniband FDR) | Germany | 147,456 | 2.90 | 3.42
7 | Texas Advanced Computing Center / UT | Dell | Stampede (PowerEdge C8220, Xeon E5 8C 2.7 GHz, Intel Xeon Phi) | USA | 204,900 | 2.66 | (not listed)
8 | National SuperComputer Center in Tianjin | NUDT | Tianhe-1A (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C) | China | 186,368 | 2.57 | 4.04
9 | CINECA | IBM | Fermi (BlueGene/Q, Power BQC 16C 1.6 GHz, Custom) | Italy | 163,840 | 1.73 | 0.82
10 | IBM | IBM | DARPA Trial Subset (Power 775, Power7 8C 3.84 GHz, Custom) | USA | 63,360 | 1.52 | 3.58

The TOP10 in November 2012, plus one
The same list as above, with one addition:
19 | Lawrence Berkeley National Laboratory | Cray | Hopper (Cray XE6, 6C 2.1 GHz) | USA | 153,408 | 1.054 | 2.91

[Chart: Performance Development (Nov 2012). N=1, N=500, and SUM curves on a log scale from 100 Mflop/s to 1 Eflop/s, growing from 59.7 Gflop/s (N=1), 400 Mflop/s (N=500), and 1.17 Tflop/s (SUM) to 17.6 Pflop/s, 76.5 Tflop/s, and 162 Pflop/s respectively.]
[Chart: Projected Performance Development (Nov 2012), extrapolating the N=1, N=500, and SUM curves on the same log scale toward 1 Eflop/s.]
[Chart: Core Count]

Moore's Law reinterpreted
- The number of cores per chip can double every two years
- Clock speed will not increase (and may possibly decrease)
- Need to deal with systems with millions of concurrent threads
- Need to deal with inter-chip parallelism as well as intra-chip parallelism

Slide 21: Outline (repeated): Why powerful computers must be parallel processors; Large CSE problems require powerful computers; Why writing (fast) parallel programs is hard; Structure of the course

Slide 22: Computational Science - News
- Nature, March 23, 2006
- "An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing ... to the integration of computer science concepts, tools, and theorems into the very fabric of science." (Science 2020 Report, March 2006)

Slide 23: Drivers for Change
- Continued exponential increase in computational power: we can simulate what theory and experiment cannot do
- Continued exponential increase in experimental data: Moore's Law applies to sensors too, and all that data needs to be analyzed
- Continued exponential increase in computational power: simulation is becoming the third pillar of science, complementing theory and experiment
- Continued exponential increase in experimental data: techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications

Slide 24: Simulation: The Third Pillar of Science
- Traditional scientific and engineering method: (1) do theory or paper design; (2) perform experiments or build a system
- Limitations:
  - Too difficult: build large wind tunnels
  - Too expensive: build a throw-away passenger jet
  - Too slow: wait for climate or galactic evolution
  - Too dangerous: weapons, drug design, climate experimentation
- Computational science and engineering paradigm: (3) use computers to simulate and analyze the phenomenon
  - Based on known physical laws and efficient numerical methods
  - Analyze simulation results with computational tools and methods beyond what is possible manually
[Diagram: Theory, Experiment, Simulation]

Slide 25: Data Driven Science
- Scientific data sets are growing exponentially; the ability to generate data is exceeding our ability to store and analyze it
- Simulation systems and some observational devices grow in capability with Moore's Law
- Petabyte (PB) data sets will soon be common:
  - Climate modeling: estimates of the next IPCC data are in 10s of petabytes
  - Genome: JGI alone will have 0.5 petabyte of data this year and double each year
  - Particle physics: LHC is projected to produce 16 petabytes of data per year
  - Astrophysics: LSST and others will produce 5 petabytes/year (via a 3.2 Gigapixel camera)
- Create scientific communities with "Science Gateways" to data

Slide 26: Some Particularly Challenging Computations
- Science: global climate modeling; biology (genomics, protein folding, drug design); astrophysical modeling; computational chemistry; computational material sciences and nanosciences
- Engineering: semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design); crash simulation
- Business: financial and economic modeling; transaction processing, web services and search engines
- Defense: nuclear weapons (test by simulations); cryptography

Slide 27: Economic Impact of HPC
- Airlines: system-wide logistics optimization systems on parallel systems. Savings: approx. $100 million per airline per year.
- Automotive design: major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics; one company has a 500+ CPU parallel system. Savings: approx. $1 billion per company per year.
- Semiconductor industry: semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation. Savings: approx. $1 billion per company per year.
- Energy: computational modeling improved the performance of current nuclear power plants, equivalent to building two new power plants.

Slide 28: $5B World Market in Technical Computing in 2004
Source: IDC 2004, from NRC Future of Supercomputing Report

Slide 29: What Supercomputers Do: Two Examples
- Climate modeling: simulation replacing an experiment that is too slow
- Cosmic microwave background radiation: analyzing massive amounts of data with new tools

“weather”=(temperature,pressure,humidity,windvelocity)Approach:Discretizethedomain,e.g.,ameasurementpointevery10kmDeviseanalgorithmtopredictweatherattimet+dtgiventUses:Predictmajorevents,e.g.,ElNinoUseinsettingairemissionsstandardsEvaluateglobalwarmingscenariosSource:/chammp/chammp.html31GlobalClimateModelingComputationOnepieceismodelingthefluidflowintheatmosphereSolveNavier-StokesequationsRoughly100Flopspergridpointwith1minutetimestepComputationalrequirements:Tomatchreal-time,need5x1011flopsin60seconds=8Gflop/sWeatherprediction(7daysin24hours)56Gflop/sClimateprediction(50yearsin30days)4.8Tflop/sTouseinpolicynegotiations(50yearsin12hours)288Tflop/sTodoublethegridresolution,computationis8xto16xStateoftheartmodelsrequireintegrationofatmosphere,clouds,ocean,sea-ice,landmodels,pluspossiblycarboncycle,geochemistryandmoreCurrentmodelsarecoarserthanthis32HighResolutionClimateModelingonNERSC-3–P.Duffy,etal.,LLNL33U.S.A.HurricaneSource:DatafromM.Wehner,visualizationbyPrabhat,LBNL34NERSCUserGeorgeSmootwins2006NobelPrizeinPhysicsSmootandMather1992COBEExperimentshowedanisotropyofCMBCosmicMicrowaveBackgroundRadiation(CMB):animageoftheuniverseat400,000years35TheCurrentCMBMapUniqueimprintofprimordialphysicsthroughthetinyanisotropiesintemperatureandpolarization.Extractingthese

Slide 32: High Resolution Climate Modeling on NERSC-3 (P. Duffy et al., LLNL)

Slide 33: U.S.A. Hurricane
Source: data from M. Wehner, visualization by Prabhat, LBNL

Slide 34: NERSC User George Smoot wins 2006 Nobel Prize in Physics
- Smoot and Mather's 1992 COBE experiment showed the anisotropy of the CMB
- Cosmic Microwave Background radiation (CMB): an image of the universe at 400,000 years

Slide 35: The Current CMB Map
- Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization
- Extracting these µKelvin fluctuations from inherently noisy data is a serious computational challenge
Source: J. Borrill, LBNL

Slide 36: Evolution of CMB Data Sets: Cost > O(Np^3)
Experiment | Nt | Np | Nb | Limiting Data | Notes
COBE (1989) | 2x10^9 | 6x10^3 | 3x10^1 | Time | Satellite, Workstation
BOOMERanG (1998) | 3x10^8 | 5x10^5 | 3x10^1 | Pixel | Balloon, 1st HPC/NERSC (4 yr)
WMAP (2001) | 7x10^10 | 4x10^7 | 1x10^3 | ? | Satellite, Analysis-bound
Planck (2007) | 5x10^11 | 6x10^8 | 6x10^3 | Time/Pixel | Satellite, Major HPC/DA effort
POLARBEAR (2007) | 8x10^12 | 6x10^6 | 1x10^3 | Time | Ground, NG-multiplexing
CMBPol (~2020) | 10^14 | 10^9 | 10^4 | Time/Pixel | Satellite, Early planning/design
(data compression)

Which commercial applications require parallelism?
- Analyzed in detail in the "Berkeley View" report: /Pubs/TechRpts/2006/EECS-2006-183.html

Motif/Dwarf: Common Computational Methods
[Heat-map figure: red = hot, blue = cool]
What do commercial and CSE applications have in common?

Slide 39: Outline (repeated): Why powerful computers must be parallel processors; Large CSE problems require powerful computers; Why writing (fast) parallel programs is hard; Structure of the course

Slide 40: Principles of Parallel Computing
- Finding enough parallelism (Amdahl's Law)
- Granularity: how big should each parallel task be
- Locality: moving data costs more than arithmetic
- Load balance: don't want 1K processors to wait for one slow one
- Coordination and synchronization: sharing data safely
- Performance modeling / debugging / tuning
All of these things make parallel programming even harder than sequential programming.

Slide 41: "Automatic" Parallelism in Modern Machines
- Bit level parallelism: within floating point operations, etc.
- Instruction level parallelism (ILP): multiple instructions execute per clock cycle
- Memory system parallelism: overlap of memory operations with computation
- OS parallelism: multiple jobs run in parallel on commodity SMPs
There are limits to all of these; for very high performance, the user needs to identify, schedule and coordinate parallel tasks.

Slide 42: Finding Enough Parallelism
- Suppose only part of an application seems parallel
- Amdahl's law: let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable; P = number of processors
- Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part
- Top500 list: the currently fastest machine has P ~ 560K; the 2nd fastest has ~1.57M
(A small numeric sketch of this bound appears after the Processor-DRAM gap chart below.)

Slide 43: Overhead of Parallelism
- Given enough parallel work, this is the biggest barrier to getting the desired speedup
- Parallelism overheads include: cost of starting a thread or process; cost of communicating shared data; cost of synchronizing; extra (redundant) computation
- Each of these can be in the range of milliseconds (= millions of flops) on some systems
- Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Slide 44: Locality and Parallelism
- Large memories are slow, fast memories are small
- Storage hierarchies are large and fast on average
- Parallel processors, collectively, have large, fast caches; the slow accesses to "remote" data we call "communication"
- The algorithm should do most work on local data
[Diagram: conventional storage hierarchy (Proc, cache, L2 cache, L3 cache, memory) replicated per processor, with potential interconnects between them]

Slide 45: Processor-DRAM Gap (latency)
[Chart, 1980-2000: CPU performance improves ~60%/yr ("Moore's Law") while DRAM latency improves ~7%/yr, so the processor-memory performance gap grows ~50%/year.]
Goal: find algorithms that minimize communication, not necessarily arithmetic.
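As promised under the "Finding Enough Parallelism" slide above, here is a minimal sketch (mine, not course code) of the Amdahl's Law bound Speedup(P) <= 1/(s + (1-s)/P), using an illustrative 1% serial fraction and the Titan and Sequoia core counts from the TOP10 table: even enormous P buys a speedup of at most about 1/s = 100.

#include <stdio.h>

/* Amdahl's Law from the "Finding Enough Parallelism" slide:
 * with serial fraction s, Speedup(P) <= 1 / (s + (1-s)/P) <= 1/s. */
static double amdahl(double s, double P) {
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void) {
    const double s = 0.01;   /* illustrative: 1% of the work is serial */
    const double procs[] = {1, 16, 1024, 560640, 1572864};  /* last two: Titan, Sequoia core counts */
    for (int i = 0; i < 5; i++)
        printf("P = %9.0f  speedup <= %8.1f  (limit 1/s = %.0f)\n",
               procs[i], amdahl(s, procs[i]), 1.0 / s);
    return 0;
}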

Slide 46: Load Imbalance
- Load imbalance is the time that some processors in the system are idle, due to: insufficient parallelism (during that phase); unequal size tasks
- Examples of the latter: adapting to "interesting parts of a domain"; tree-structured computations; fundamentally unstructured problems
- The algorithm needs to balance the load
  - Sometimes the workload can be determined and divided up evenly before starting: "static load balancing"
  - Sometimes the workload changes dynamically and needs to be rebalanced dynamically: "dynamic load balancing"

Slide 47: Parallel Software Eventually: the ParLab view
- 2 types of programmers, hence 2 layers of software
- Efficiency Layer (10% of programmers): expert programmers build libraries implementing kernels, "frameworks", OS, ...; highest fraction of peak performance possible
- Productivity Layer (90% of programmers): domain experts / non-expert programmers productively build parallel applications by composing frameworks and libraries; hide as many details of the machine and parallelism as possible; willing to sacrifice some performance for productive programming
- Expect students may want to work at either level
- In the meantime, we all need to understand enough of the efficiency layer to use parallelism effectively

Slide 48: Outline (repeated): Why powerful computers must be parallel processors; Large CSE problems require powerful computers; Why writing (fast) parallel programs is hard; Structure of the course

Slide 49: Course Mechanics
- Web page: /~demmel/cs267_Spr13/
- Normally a mix of CS, EE, and other engineering and science students
- Please fill out the survey on the web page (posted)
- Grading:
  - Warmup assignment (homework 0 on the web): build a web page on an interest of yours in CSE
  - Three programming assignments in the first half of the semester; we will team up CS / non-CS students for HW1
  - Final projects: could be parallelizing an application, building or evaluating a tool, etc.; we encourage interdisciplinary teams, since this is the way parallel scientific software is generally built
- Class computer accounts on Hopper and Dirac at NERSC; fill out forms next time
- Remote instruction (preparing an experiment): lectures will be webcast and archived, as in past semesters; see the class web page for details
- XSEDE is a nationwide project supporting users of NSF supercomputer facilities; XSEDE plans to offer CS267 to students nationwide, starting 2/14, based on videos from the Spring 2012 offering
- Challenges to "scaling up" education: Q&A (piazza for CS267, moodle for XSEDE); autograding for correctness (run test cases, not as easy as it sounds) and for performance (timing on a suitable platform); ditto for Kurt Keutzer's CS194 class

Slide 51: Rough List of Topics
- Basics of computer architecture, memory hierarchies, performance
- Parallel programming models and machines: shared memory and multithreading; distributed memory and message passing; data parallelism, GPUs; cloud computing
- Parallel languages and libraries: shared memory threads and OpenMP; MPI; other languages and frameworks (UPC, CUDA, PETSc, "Pattern Language", ...)
- "Seven Dwarfs" of scientific computing: dense and sparse linear algebra; structured and unstructured grids; spectral methods (FFTs) and particle methods
- 6 additional motifs: graph algorithms, graphical models, dynamic programming, branch and bound, FSM, logic
- General techniques: autotuning, load balancing, performance tools
- Applications: climate modeling, materials science, astrophysics, ... (guest lecturers)
(A minimal OpenMP example appears after the reading list below.)

Slide 52: Reading Materials
- What does Google recommend? Pointers on the class web page
- Must read: "The Landscape of Parallel Processing Research: The View from Berkeley", /Pubs/TechRpts/2006/EECS-2006-183.pdf
- Some on-line texts:
  - Demmel's notes from CS267 Spring 1999, which are similar to 2000 and 2001 (however, they contain links to html notes from 1996): /~demmel/cs267_Spr99/
  - Ian Foster's book "Designing and Building Parallel Programs": /dbpp/
- Potentially useful texts:
  - "Sourcebook for Parallel Computing", by Dongarra, Foster, Fox, ...: a general overview of parallel computing methods
  - "Performance Optimization of Numerically Intensive Codes" by Stefan Goedecker and Adolfy Hoisie: a practical guide to optimization, mostly for those of you who have never done any optimization

Slide 53: Reading Materials (cont.)
Recent books with papers about the current state of the art:
- David Bader (ed.), "Petascale Computing, Algorithms and Applications", Chapman & Hall/CRC, 2007
- Michael Heroux, Padma Raghavan, Horst Simon (eds.), "Parallel Processing for Scientific Computing", SIAM, 2006
- M. Sottile, T. Mattson, C. Rasmussen, "Introduction to Concurrency in Programming Languages", Chapman & Hall/CRC, 2009
More pointers on the web page.
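Since the topics above include "shared memory threads and OpenMP" and "MPI", here is a minimal OpenMP sketch of my own (not taken from the course materials) showing the kind of loop-level parallelism meant; compile with a flag such as cc -fopenmp.

#include <stdio.h>
#include <omp.h>

/* Minimal OpenMP example: a parallel reduction over a vector.
 * This is only a preview of the "shared memory threads and OpenMP"
 * topic listed above, not an assignment from the course. */
int main(void) {
    const int n = 1 << 20;
    static double x[1 << 20];
    for (int i = 0; i < n; i++) x[i] = 1.0 / (i + 1);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}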
Instructors
- Jim Demmel, EECS & Mathematics
- GSIs: David Sheffield, EECS; Michael Driscoll, EECS
- Contact information on the web page

Slide 54: Students
- 56 registered or on the waitlist (45 grad, 11 undergrad)
- 24 CS or EECS grad students; the rest are from Bioengineering, Business Administration, Chemical Engineering, Chemistry, Civil & Environmental Engineering, Earth & Planetary Science, Material Science & Engineering, Mechanical Engineering, and Physics
- 10 CS or EECS undergrads, 1 applied math

Slide 56: What you should get out of the course
In-depth understanding of:
- When is parallel computing useful?
- Parallel computing hardware options
- Overview of programming models (software) and tools, and experience using some of them
- Some important parallel applications and their algorithms
- Performance analysis and tuning
- Exposure to various open research questions

Slide 57: Extra slides

Slide 58: More Exotic Solutions on the Horizon
- Graphics and game processors: Graphics Processing Units (GPUs), e.g., NVIDIA and ATI/AMD; game processors, e.g., Cell for PS3; a parallel processor attached to the main processor; originally special purpose, getting more general; programming model not yet mature
- FPGAs (Field Programmable Gate Arrays): inefficient use of chip area; more efficient than multicore for some domains; the programming challenge now includes hardware design, e.g., layout; wire routing heuristics still troublesome
- Dataflow architectures: considerable experience with dataflow from the 1980's; programming with functional languages?

Slide 59: More Limits: How fast can a serial computer be?
- Consider a 1 Tflop/s sequential machine: data must travel some distance, r, to get from memory to the processor
- To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s; thus r < c/10^12 = 0.3 mm
- Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: each bit occupies about 1 square Angstrom, or the size of a small atom
- No choice but parallelism
[Diagram: a 1 Tflop/s, 1 Tbyte sequential machine of radius r = 0.3 mm]
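The "More Limits" argument above is just unit arithmetic; this small check (my own, using only the numbers on the slide) recomputes it: at 10^12 fetches per second, light-speed travel limits data to within about 0.3 mm, and a terabyte packed into that square leaves roughly one square Angstrom per bit.

#include <stdio.h>

/* Recomputes the "how fast can a serial computer be?" slide:
 * a 1 Tflop/s, 1 Tbyte sequential machine would need all data within
 * r = c / 1e12 of the processor, and the area per bit is atomic scale. */
int main(void) {
    const double c = 3.0e8;            /* speed of light, m/s           */
    const double rate = 1.0e12;        /* one operand fetched per cycle */
    const double bytes = 1.0e12;       /* 1 Tbyte of storage            */

    double r = c / rate;                               /* max distance, m     */
    double area_per_bit = (r * r) / (bytes * 8.0);     /* m^2 per bit         */
    double angstrom2 = 1.0e-10 * 1.0e-10;              /* one square Angstrom */

    printf("max radius r = %.2f mm\n", r * 1e3);
    printf("area per bit = %.2f square Angstroms\n", area_per_bit / angstrom2);
    return 0;
}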

36th List: The TOP10
Rank | Site | Manufacturer | Computer | Country | Cores | Rmax [Tflops] | Power [MW]
1 | National SuperComputer Center in Tianjin | NUDT | Tianhe-1A (NUDT TH MPP, Xeon 6C, NVidia, FT-1000 8C) | China | 186,368 | 2,566 | 4.04
2 | Oak Ridge National Laboratory | Cray | Jaguar (Cray XT5, HC 2.6 GHz) | USA | 224,162 | 1,759 | 6.95
3 | National Supercomputing Centre in Shenzhen | Dawning | Nebulae (TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU) | China | 120,640 | 1,271 | 2.58
4 | GSIC, Tokyo Institute of Technology | NEC/HP | TSUBAME-2 (HP ProLiant, Xeon 6C, NVidia, Linux/Windows) | Japan | 73,278 | 1,192 | 1.40
5 | DOE/SC/LBNL/NERSC | Cray | Hopper (Cray XE6, 6C 2.1 GHz) | USA | 153,408 | 1,054 | 2.91
6 | Commissariat a l'Energie Atomique (CEA) | Bull | Tera 100 (Bull bullx super-node S6010/S6030) | France | 138,368 | 1,050 | 4.59
7 | DOE/NNSA/LANL | IBM | Roadrunner (BladeCenter QS22/LS21) | USA | 122,400 | 1,042 | 2.34
8 | University of Tennessee | Cray | Kraken (Cray XT5 HC 2.36 GHz) | USA | 98,928 | 831.7 | 3.09
9 | Forschungszentrum Juelich (FZJ) | IBM | Jugene (Blue Gene/P Solution) | Germany | 294,912 | 825.5 | 2.26
10 | DOE/NNSA/LANL/SNL | Cray | Cielo (Cray XE6, 6C 2.4 GHz) | USA | 107,152 | 816.6 | 2.95

[Chart: Performance Development. N=1, N=500, and SUM curves on a log scale from 100 Mflop/s to 100 Pflop/s, growing from 59.7 Gflop/s (N=1), 400 Mflop/s (N=500), and 1.17 Tflop/s (SUM) to 2.57 Pflop/s, 31.12 Tflop/s, and 43.66 Pflop/s respectively.]
[Chart: Projected Performance Development]
