




CUDA Optimization
Peng Wang, Ph.D.
HPC Developer Technology, NVIDIA

Optimization Overview
- GPU architecture
- Kernel optimization
  - Latency optimization
  - Memory optimization
  - Instruction optimization
- Data transfer optimization
  - Overlapped execution using streams

GPU Architecture

GPU High-Level View
[Figure: several streaming multiprocessors, each with its own shared memory (SMEM), connected to global memory through a crossbar; the CPU and chipset attach to the GPU over PCIe.]

Fermi Multiprocessor
- Control unit
  - 2 warp schedulers
  - Up to 1536 concurrent threads
- Execution unit
  - 32 CUDA cores
  - Full IEEE 754-2008 FP32 and FP64
  - 32 FP32 ops/clock, 16 FP64 ops/clock
  - SFU, LSU
- Memory
  - Registers: 32K 32-bit
  - Cache: L1 + shared memory (64 KB), texture (8 KB), constant (8 KB)

GPU and Programming Model
- Thread (software) maps to CUDA core (GPU)
  - Threads are executed by scalar processors
- Thread block maps to multiprocessor
  - Thread blocks are executed on multiprocessors
  - Thread blocks do not migrate
  - Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid maps to device
  - A kernel is launched as a grid of thread blocks
  - Up to 16 kernels can execute on a device at one time

Warp and SIMT
- Blocks divide into groups of 32 threads called warps
- Warps are the basic scheduling units
- All threads of a warp always perform the same instruction: Single Instruction Multiple Threads (SIMT)
- A lot of warps can hide memory latency
- Context switching is free
[Figure: a block split into warps of 32 threads, with warps interleaved over time.]

Fermi Memory Hierarchy
- 3 levels, very similar to a CPU
- Registers
  - Spill to local memory
- Caches
  - Shared memory
  - L1 cache
  - L2 cache
  - Constant cache
  - Texture cache
- Global memory

Fermi Memory Hierarchy Review
[Figure: each SM (SM-0, SM-1, ..., SM-N) has registers, a constant cache (C), L1, SMEM and TEX; all SMs share the L2 cache, which sits in front of global memory.]

General Optimization Strategies: Measurement
- Find out the limiting factor in kernel performance
  - Memory bandwidth bound (memory optimization)
  - Instruction throughput bound (instruction optimization)
  - Latency bound (configuration optimization)
- Measure effective memory/instruction throughput
- Optimize for peak memory/instruction throughput
  - Finding out the bottleneck
  - Typically an iterative process

Kernel Optimization Workflow
[Flowchart: find the limiter. If memory bound, compare to peak GB/s, then apply memory optimization. If instruction bound, compare to peak inst/s, then apply instruction optimization. If latency bound, apply latency optimization. Repeat until done.]

Latency Optimization
- Applies when the code is latency bound
  - Both the memory and instruction throughputs are far from the peak
- Latency hiding: switching threads
- Purpose: have enough threads to hide latency
- Major techniques: adjust resource usage to increase the number of threads

Understanding Latency Bound
- Hardware (instruction, memory) is pipelined
  - Latency: "length" of the pipeline
  - Throughput: "width" of the pipeline
- Little's Law: Concurrency = Throughput x Latency
  - Concurrency is counted in basic units (instructions, cache lines) that are active in the pipeline, i.e. "on the fly"
  - C2050: 150 GB/s x 600 cycles / 1.15 GHz / 128 B / 14 SMs = about 44 cache lines per SM on the fly
- Example: latency = 4, throughput = 2, concurrency = 8
[Figure: a pipeline with input and output illustrating the example above.]

Example: Little's Law on GPU
[Figure: achieved bandwidth as a function of requests per SM.]

Enough Blocks
- The number of blocks should be much larger than the number of SMs: more than 100 blocks to scale well to future devices
- Block size should be a multiple of 32 (the warp size)
- Minimum: 64. I generally use 128 or 256, but use whatever is best for your app
- Depends on the problem: do experiments!

Enough Threads
- GPU threads are very cheap: use as many as you want
- Occupancy: ratio of active warps per SM to the maximum number of allowed warps
  - Maximum number: 48 in Fermi
- We need the occupancy to be high enough to hide latency

What Limits Occupancy
- Partitioning of SM resources
  - Shared memory is partitioned among blocks
  - Registers are partitioned among threads: at most 63 per thread
  - Thread block slots: at most 8
  - Thread slots: at most 1536
- Any of those can be the limiting factor on how many threads can be launched at the same time on an SM

Occupancy Optimizations
- Know the current occupancy
  - Visual profiler
  - --ptxas-options=-v: outputs resource usage info, which is the input to the Occupancy Calculator
- Adjust resource usage to change occupancy
  - Adjust block size
  - Limit register usage
    - Compiler option -maxrregcount=n (per file)
    - __launch_bounds__ (per kernel)
    - Use templates
  - Reduce shared memory
    - Dynamic allocation

Occupancy Calculator
[Figure: screenshot of the Occupancy Calculator.]

Latency Hiding Occupancy Calculation
- Assume global memory takes 400 cycles: we need 400 / 2 = 200 arithmetic instructions to hide the latency
- For example, assume the code has 8 independent arithmetic instructions for every one global memory access. Then 200 / 8 = 25 warps, which the slide rounds to about 26 warps (54% occupancy out of 48 warps), would be enough
- Lessons
  - Required occupancy depends on BOTH architecture and application
  - In this example, occupancy beyond 54% won't lead to a further performance increase

Memory Optimization
- Applies if the code is memory bound and effective memory throughput is much lower than the peak
- Purpose: access only data that are absolutely necessary
- Major techniques
  - Improve the access pattern to reduce wasted transactions: coalescing
  - Reduce redundant access: shared memory

Global Memory Throughput Metric
- Measuring effective memory throughput
  - From the app point of view ("useful" bytes): number of bytes needed by the algorithm divided by kernel time
  - Compare to the theoretical bandwidth: 70-80% is very good
- Finding out the bottleneck
  - Start with global memory operations and achieve good throughput
  - Add arithmetic, shared memory, etc., measuring perf as you go

Coalescing
- Global memory latency: 400-800 cycles
  - The single most important performance consideration!
- Coalescing: global memory accesses from a warp can be coalesced into a single transaction
- Criterion: requests from a warp falling in one L1 cache line cost one transaction
  - Number of transactions = number of L1 lines accessed

Coalescing cases
[Figures: in each case the addresses requested by a warp are drawn as arrows onto a memory address axis running from 0 to 448.]
- Warp requests 32 aligned, consecutive 4-byte words
  - Addresses fall within 1 cache line
  - Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Warp requests 32 aligned, permuted 4-byte words
  - Addresses fall within 1 cache line
  - Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Warp requests 32 misaligned, consecutive 4-byte words
  - Addresses fall within 2 cache lines
  - Warp needs 128 bytes; 256 bytes move across the bus on misses
- All threads in a warp request the same 4-byte word
  - Addresses fall within a single cache line
  - Warp needs 4 bytes; 128 bytes move across the bus on a miss
- Warp requests 32 scattered 4-byte words
  - Addresses fall within N cache lines
  - Warp needs 128 bytes; N x 128 bytes move across the bus on a miss

Shared Memory
- Low latency: a few cycles
- High throughput: 73.6 GB/s per SM (1.03 TB/s per GPU)
- Usage: __shared__
- Main use
  - Sharing data among threads of the same block
  - User-managed cache

Shared Memory Example: Matrix Multiplication
- C = A x B
- Every thread corresponds to one entry in C
[Figure: matrices A, B and C.]

Naive Kernel

    // Assumes the launch grid covers exactly N x N threads.
    __global__ void simpleMultiply(float *a, float *b, float *c, int N)
    {
        int row = threadIdx.x + blockIdx.x * blockDim.x;
        int col = threadIdx.y + blockIdx.y * blockDim.y;
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            sum += a[row * N + i] * b[i * N + col];
        c[row * N + col] = sum;
    }

- Every thread corresponds to one entry in C

Blocked Matrix Multiplication
- C = A x B
- Data reuse in the blocked version
[Figure: matrices A, B and C divided into tiles.]

Blocked and cached kernel

    // Assumes N is a multiple of TILE_DIM and blockDim.x == blockDim.y == TILE_DIM.
    __global__ void coalescedMultiply(float *a, float *b, float *c, int N)
    {
        __shared__ float aTile[TILE_DIM][TILE_DIM];
        __shared__ float bTile[TILE_DIM][TILE_DIM];
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < N; k += TILE_DIM) {
            // Stage one tile of a and one tile of b in shared memory.
            aTile[threadIdx.y][threadIdx.x] = a[row * N + k + threadIdx.x];
            bTile[threadIdx.y][threadIdx.x] = b[(k + threadIdx.y) * N + col];
            __syncthreads();
            for (int i = 0; i < TILE_DIM; i++)
                sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
            __syncthreads();  // all reads must finish before the next tile is loaded
        }
        c[row * N + col] = sum;
    }

Performance Results
[Figure: performance results for M = N = K = 512.]

Memory Optimizations
- Strive for perfect coalescing
  - Transpose the data structure, e.g. AOS to SOA
  - Padding
  - Change the parallelization scheme: 1 thread per task to 1 warp per task?
- Use shared memory to reduce global memory access and avoid non-coalesced access
- Bind to the texture cache for unpredictable uncoalesced access
- Use the constant cache if all threads in a warp will access the same constant data

Instruction Optimization
- Applies if you find out the code is instruction bound
  - A compute-intensive algorithm can easily become memory bound if you are not careful enough
  - Typically, worry about instruction optimization after memory and execution configuration optimizations
- Purpose: reduce instruction count
  - Use fewer instructions to get the same job done
- Major techniques
  - Reduce wasted instructions: branch divergence, etc.
  - Use high-throughput instructions

Control Flow
- Single Instruction Multiple Threads (SIMT) model
  - A single instruction is issued for a warp at a time
- Different code paths within a warp are handled by the hardware
- Divergent branches: threads within a single warp take different paths
  - Example with divergence: if (threadIdx.x > 2) {...} else {...}
  - Different execution paths within a warp are serialized
- Different warps can execute different code with no impact on performance
[Figures: timelines for an if/else example, one with no branch divergence within a warp and one with branch divergence within a warp.]

Using High-Throughput Instructions
- Avoid automatic conversion of double to float
  - Add "f" to floating-point literals (e.g. 1.0f), because the default is double
- Fermi defaults: -ftz=false, -prec-div=true, -prec-sqrt=true, for IEEE compliance
- Fast math functions
  - Two types of runtime math library functions
    - func(): slower but higher accuracy (5 ulp or less)
    - __func(): fast but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
- Integer divide and modulo are expensive
  - Divide by 2^n: use ">> n"
  - Modulo 2^n: use "& (2^n - 1)"

Data Transfer Optimization

Minimizing CPU-GPU Data Transfer
- Host-device data transfer has much lower bandwidth than global memory access
  - 8 GB/s (PCIe x16 Gen2) vs 156 GB/s and 515 Ginst/s (C2050)
- Minimize transfer
  - Intermediate data can be allocated, operated on and deallocated directly on the GPU
  - Sometimes it is even better to recompute on the GPU
  - Move CPU code to the GPU even if it has no performance gain there, if doing so reduces data transfer
- Group transfers
  - One large transfer is much better than many small ones
  - 10 microseconds latency at 8 GB/s: latency dominates if the data size is below 80 KB
- Overlap memory transfer with computation
  - Double buffering

Overlap Kernel and Memory Copy
- Requirements: D2H or H2D mem