CUDA Optimization
Peng Wang, Ph.D., HPC Developer Technology, NVIDIA

Optimization Overview
- GPU architecture
- Kernel optimization
  - Latency optimization
  - Memory optimization
  - Instruction optimization
- Data transfer optimization
- Overlapped execution using streams

GPU Architecture

GPU High-Level View
[Diagram: streaming multiprocessors, each with shared memory (SMEM), connected through a crossbar to global memory; the GPU attaches to the CPU chipset over PCIe.]

Fermi Multiprocessor
- Control unit: 2 warp schedulers; up to 1536 concurrent threads
- Execution units: 32 CUDA cores with full IEEE 754-2008 FP32 and FP64; 32 FP32 ops/clock, 16 FP64 ops/clock; SFUs and LSUs
- Memory: 32K 32-bit registers; L1 cache / shared memory (64 KB); texture cache (8 KB); constant cache (8 KB)

GPU and Programming Model
- Thread -> CUDA core: threads are executed by scalar processors
- Thread block -> multiprocessor: thread blocks are executed on multiprocessors and do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid -> device: a kernel is launched as a grid of thread blocks; up to 16 kernels can execute on a device at one time

Warp and SIMT
- Blocks divide into groups of 32 threads called warps
- Warps are the basic scheduling units
- A warp always performs the same instruction for all its threads: Single Instruction, Multiple Threads (SIMT)
- A lot of warps can hide memory latency, and context switching is free

Fermi Memory Hierarchy
- Three levels, very similar to a CPU:
  - Registers (spill to local memory)
  - Caches: shared memory, L1 cache, L2 cache, constant cache, texture cache
  - Global memory

Fermi Memory Hierarchy Review
[Diagram: each SM (0..N) holds registers, L1 / shared memory, and texture and constant caches; all SMs share the L2 cache in front of global memory.]

General Optimization Strategies: Measurement
- Find out the limiting factor in kernel performance:
  - Memory bandwidth bound -> memory optimization
  - Instruction throughput bound -> instruction optimization
  - Latency bound -> configuration optimization
- Measure effective memory/instruction throughput and optimize for the peak
- Finding the bottleneck is typically an iterative process

Kernel Optimization Workflow
- Find the limiter:
  - Memory bound (compare to peak GB/s) -> memory optimization
  - Instruction bound (compare to peak inst/s) -> instruction optimization
  - Latency bound -> latency optimization
- Repeat until none of the three limits performance, then done
Latency Optimization

- When the code is latency bound, both memory and instruction throughput are far from the peak
- Latency hiding on a GPU means switching threads; the purpose is to have enough threads to hide latency
- Major technique: adjust resource usage to increase the number of resident threads

Understanding Latency Bound
- Hardware (instruction and memory pipelines) is pipelined: latency is the "length" of the pipeline, throughput is the "width"
- Little's Law: concurrency = throughput x latency
- The basic units that are active in the pipeline ("in flight") are instructions and cache lines
- Pipeline example: latency = 4, throughput = 2, so concurrency = 8
- C2050: 150 GB/s x 600 cycles / (1.15 GHz x 128 B x 14 SMs) ~ 44 cache lines per SM in flight

Little's Law on GPU
[Plot: achieved bandwidth as a function of concurrent requests per SM.]

Enough Blocks
- # of blocks >> # of SMs; > 100 blocks to scale well to future devices
- Block size should be a multiple of 32 (the warp size); minimum 64
- I generally use 128 or 256, but use whatever is best for your app; it depends on the problem, so do experiments

Enough Threads
- GPU threads are very cheap: use as many as you want
- Occupancy: the ratio of active warps per SM to the maximum number of allowed warps (48 on Fermi)
- We need the occupancy to be high enough to hide latency

What Limits Occupancy
- Partitioning of SM resources:
  - Shared memory is partitioned among blocks
  - Registers are partitioned among threads (<= 63 per thread)
  - Thread block slots (<= 8); thread slots (<= 1536)
- Any of those can be the limiting factor on how many threads can be launched at the same time on an SM

Occupancy Optimizations
- Know the current occupancy: the Visual Profiler, or compile with --ptxas-options=-v to output resource usage as input to the Occupancy Calculator
- Adjust resource usage to change occupancy:
  - Adjust the block size
  - Limit register usage: the compiler option -maxrregcount=n (per file), __launch_bounds__ (per kernel), or templates
  - Reduce shared memory usage; allocate it dynamically

Occupancy Calculator
[Screenshot: the CUDA Occupancy Calculator spreadsheet.]

Latency Hiding: an Occupancy Calculation
- Assume global memory takes 400 cycles: we need 400 / 2 = 200 arithmetic instructions to hide the latency
- For example, assume the code has 8 independent arithmetic instructions for every global memory access; then about 200 / 8 ~ 26 warps would be enough (54% occupancy)
- Lessons: the required occupancy depends on BOTH the architecture and the application; in this example, beyond 54%, higher occupancy won't lead to further performance increase

Memory Optimization

- If the code is memory bound and the effective memory throughput is much lower than the peak
- Purpose: access only the data that are absolutely necessary
- Major techniques:
  - Improve access patterns to reduce wasted transactions (coalescing)
  - Reduce redundant accesses (shared memory)

Global Memory Throughput Metric
- Measure the effective memory throughput from the app's point of view: "useful" bytes, i.e. the number of bytes needed by the algorithm, divided by kernel time
- Compare to the theoretical bandwidth; 70-80% is very good
- Finding the bottleneck: start with the global memory operations and achieve good throughput; then add arithmetic, shared memory, etc., measuring performance as you go

Coalescing
- Global memory latency: 400-800 cycles; the single most important performance consideration
- Coalescing: global memory accesses from a warp can be coalesced into a single transaction
- Criterion: requests from a warp falling in one L1 cache line -> one transaction; # of transactions = # of L1 lines accessed

Load scenarios (memory addresses 0, 32, 64, ..., 448; 128-byte cache lines):
- Warp requests 32 aligned, consecutive 4-byte words: addresses fall within 1 cache line; the warp needs 128 bytes and 128 bytes move across the bus on a miss
- Warp requests 32 aligned, permuted 4-byte words: addresses fall within 1 cache line; the warp needs 128 bytes and 128 bytes move across the bus on a miss
- Warp requests 32 misaligned, consecutive 4-byte words: addresses fall within 2 cache lines; the warp needs 128 bytes but 256 bytes move across the bus on misses
- All threads in a warp request the same 4-byte word: addresses fall within a single cache line; the warp needs 4 bytes but 128 bytes move across the bus on a miss
- Warp requests 32 scattered 4-byte words: addresses fall within N cache lines; the warp needs 128 bytes but N x 128 bytes move across the bus on misses
Shared Memory
- Low latency: a few cycles
- High throughput: 73.6 GB/s per SM (1.03 TB/s per GPU)
- Usage: declare with __shared__
- Main uses: sharing data among threads of the same block; a user-managed cache

Shared Memory Example: Matrix Multiplication
- C = A x B; every thread computes one entry of C

Naive kernel:

```
__global__ void simpleMultiply(float *a, float *b, float *c, int N)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum += a[row * N + i] * b[i * N + col];
    c[row * N + col] = sum;
}
```

Blocked Matrix Multiplication
- Data reuse in the blocked version: each TILE_DIM x TILE_DIM tile of A and B is loaded into shared memory once and reused TILE_DIM times

Blocked and cached kernel:

```
__global__ void coalescedMultiply(float *a, float *b, float *c, int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    __shared__ float bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; k += TILE_DIM) {
        aTile[threadIdx.y][threadIdx.x] = a[row * N + k + threadIdx.x];
        bTile[threadIdx.y][threadIdx.x] = b[(k + threadIdx.y) * N + col];
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i++)
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        __syncthreads();
    }
    c[row * N + col] = sum;
}
```

Performance Results
[Chart: naive vs. blocked kernel throughput for M = N = K = 512.]

Memory Optimizations: Summary
- Strive for perfect coalescing:
  - Transpose the data structure, e.g. AoS to SoA
  - Padding
  - Change the parallelization scheme: from 1-thread-per-task to 1-warp-per-task
- Use shared memory to reduce global memory accesses and to avoid non-coalesced accesses
- Bind to the texture cache for unpredictable, uncoalesced accesses
- Use the constant cache if all threads in a warp will access the same constant data

Instruction Optimization

- If you find the code is instruction bound
- A compute-intensive algorithm can easily become memory bound if you are not careful enough, so typically worry about instruction optimization only after memory and execution-configuration optimizations
- Purpose: reduce the instruction count; use fewer instructions to get the same job done
- Major techniques:
  - Reduce wasted instructions: branch divergence, etc.
  - Use high-throughput instructions

Control Flow
- Single Instruction, Multiple Threads (SIMT) model: a single instruction is issued for a warp at a time
- Different code paths within a warp are handled by hardware
- Divergent branches: threads within a single warp take different paths, e.g. if (threadIdx.x > 2) { ... } else { ... }
- Different execution paths within a warp are serialized; different warps can execute different code with no impact on performance
[Diagram: warps with no branch divergence run back to back; within a divergent warp, the two paths are serialized over time.]

Using High-Throughput Instructions
- Avoid automatic conversion of double to float: add the "f" suffix to floating-point literals (e.g. 1.0f), because the default is double
- Fermi defaults: -ftz=false, -prec-div=true, -prec-sqrt=true, for IEEE compliance
- Fast math functions: there are two types of runtime math library functions:
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): fast but lower accuracy (see the programming guide for full details)
  - -use_fast_math forces every func() to __func()
- Integer divide and modulo are expensive:
  - Divide by 2^n: use >> n
  - Modulo 2^n: use & (2^n - 1)

Kernel Optimization Workflow
- Find the limiter:
  - Memory bound (compare to peak GB/s) -> memory optimization
  - Instruction bound (compare to peak inst/s) -> instruction optimization
  - Latency bound -> latency optimization
- Repeat until none of the three limits performance, then done

Data Transfer Optimization

Minimizing CPU-GPU Data Transfer
- Host-device data transfer has much lower bandwidth than global memory access: 8 GB/s (PCIe x16 Gen2) vs. 156 GB/s and 515 Ginst/s (C2050)
- Minimize transfers:
  - Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
  - Sometimes it's even better to recompute on the GPU
  - Move CPU code to the GPU even if it shows no performance gain there, if doing so reduces data transfer
- Group transfers: one large transfer is much better than many small ones; with ~10 us latency and 8 GB/s, a transfer is latency-dominated if the data size is < 80 KB
- Overlap memory transfer with computation: double buffering
- Overlap kernel and memory copy; requirements: a D2H or H2D memcpy ...
