CUDA超大规模并行程序设计.ppt

上传人：y*** IP属地：四川上传时间：2020-05-17 格式：PPT 页数：93 大小：2.15MB 积分：15 举报 版权申诉

已阅读5页，还剩88页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

CUDA超大规模并行程序设计,邓仰东dengyd清华大学微电子学研究所,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,GraphicProcessingUnit(GPU),用于个人计算机、工作站和游戏机的专用图像显示设备显示卡nVidia和ATI(nowAMD)是主要制造商Intel准备通过Larrabee进入这一市场主板集成Intel,3维图像流水线,Framebuffer,Texture,CPU,GPU,实时3维高速图形处理,一帧典型图像1Mtriangles3Mvertices25Mfragments30frames/s30Mtriangles/s90Mvertices/s750Mfragments/s,传统GPU架构,Graphicsprogram,Vertexprocessors,Fragmentprocessors,Pixeloperations,Outputimage,GPU的强大运算能力,数据级并行:计算一致性,专用存储器通道有效隐藏存储器延时,GeneralPurposeComputingonGPU(GPGPU),GPGPU,核心思想用图形语言描述通用计算问题把数据映射到vertex或者fragment处理器但是硬件资源使用不充分存储器访问方式严重受限难以调试和查错高度图形处理和编程技巧,NVidiaG200Architecture,CUDA:ComputeUnifiedDeviceArchitecture,通用并行计算模型单指令、多数据执行模式(SIMD)所有线程执行同一段代码(1000sthreadsonthefly)大量并行计算资源处理不同数据隐藏存储器延时提升计算通信比例合并相邻地址的内存访问快速线程切换1cycleGPUvs.1000cyclesCPU,混合计算模型,CUDA:集成CPU+GPUC应用程序CPU:顺序执行代码GPU=超大规模数据并行协处理器“批发”式执行大量细粒度线程,kernel0,CPUSerialCode,CPUSerialCode,GPUParallelCode,GPUParallelCode,Concurrentexecution!,kernel1,CUDA成功案例,CUDA性能,BLAS3:127GFLOPS/基本线性代数:matrix-matrixFFT:52benchFFT*GFLOPSFDTD:1.2Gcells/sec/计算电动力学SSEARCH:5.2Gcells/sec/Smith-Waterman基因序列比较BlackScholes:4.7GOptions/sec/期权定价模型VMD:290GFLOPS/分子动力学图形显示,ProblemInstancesforSparseMatrixVectorProduct(SMVP),SPMVThroughputonGTX280,SMVPApplication:StaticTimingAnalysis,AdaptedfromRamalingam,A.et.al.AnAccurateSparseMatrixBasedFrameworkforStatisticalStaticTimingAnalysis.ICCAD.2006.,StaticTimingAnalysisResultsonGTX280,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,并行性的维度,1维y=a+b/y,a,bvectors2维P=MN/P,M,Nmatrices3维CTorMRIimaging,=,并行线程组织结构,Thread:并行的基本单位Threadblock:互相合作的线程组CooperativeThreadArray(CTA)允许彼此同步通过快速共享内存交换数据以1维、2维或3维组织最多包含512个线程Grid:一组threadblock以1维或2维组织共享全局内存Kernel:在GPU上执行的核心程序Onekernelonegrid,ParallelProgramOrganizationinCUDA,Thread,Threadblock,Grid,SP,Software,Hardware,SM,GPU,并行线程执行,调用kernelfunction需要指定执行配置Threads和blocks具有IDsthreadIdx:1D,2D,or3DblockIdx:1D,or2D由此决定相应处理数据,_global_voidkernel(.);dim3DimGrid(3,2);/6threadblocksdim3DimBlock(16,16);/256threadsperblockkernel(.);,实例1:Element-WiseAddition,/CPUprogram/sumoftwovectorsaandbvoidadd_cpu(float*a,float*b,intN)for(intidx=0;idx(a,b,N);,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,CUDAProcessingFlow,并行线程执行,SM内以(warp即32threads)为单位并行执行Warp内线程执行同一条指令Half-warp是存储操作的基本单位,Warp,GPU负载分配,Globalblockscheduler管理threadblock级并行从CPU获得线程组织信息根据硬件结构分配threadblock到SM,StreamingMultiprocessor(SM),StreamingMultiprocessor执行ThreadBlocks,线程以block为单位分配到SM视资源需求,一个SM分配至多8个blockSMinG80可以接受768个线程256(threads/block)*3blocks或128(threads/block)*6blocks,etc.线程并发(concurrently)运行SM分配并维护线程IDSM管理并调度线程,ThreadLifeCycle,Grid在GPU上启动Threadblocks顺序分配到SMs一般SM应有1threadblockSM把线程组织为warpsSM调度并执行就绪的warpWarps和threadblocks执行结束后释放资源GPU继续分发threadblocks,ExampleofHidingMemoryLatency,G80:执行warp全部线程的一条指令需要8个时钟cycle假定1globalmemoryaccess/8instructionsA400-cycleglobalmemorylatencyHowmanywarpsareneededtotoleratethelatency?,400cycles*1MEM/8cycles=50cyclesperinstructiononaverage50cycles/4cyclesperwarp=12.513warpstokeepanSMbusy,ArithmeticInstructionThroughput,4clockcyclesSingle-precisionfloating-pointadd,multiply,andmultiply-add,Integeradd,24-bitintegermultiplicationBitwiseoperations,compare,min,max,typeconversioninstruction;,Note.1.Awarpisissuedin4cycles2.Arithmeticopsarepipelined.3.Stillpossibletohave8opsreadyineachcycle.,ArithmeticInstructionThroughput,16clockcyclesReciprocal,reciprocalsquareroot,32-bitIntegermultiplicationOtherfunctionsarecombinationsoftheabovey/x=rcp(x)*y/20cyclesperwarpsqrt(x)=rcp(rsqrt(x)/32cyclesperwarpIntegerdivisionandmodulooperationarecostly!,浮点数精度,GT200之前的GPUIEEE-754FloatingPointStandard单精度浮点数GT200增加双精度浮点数支持SP仍然只支持单精度浮点数每个SM配置一个双精度浮点单元双精度比单精度运算慢8-12倍,控制流(ControlFlow),同一warp内的分支语句可能执行不同的指令路径不同指令路径的线程只能顺序执行每次执行warp中一条可能的路径N条指令路径1/Nthroughput只需要考虑同一warp即可，不同warp的不同的指令路径不具相关性G80上使用指令预测技术加速指令执行,控制流(ControlFlow),常见情况:分支条件是threadID的函数时,容易导致divergenceExamplewithdivergence:If(threadIdx.x2)在threadblock产生两条不同指令路径Branchgranularity2)也在threadblock产生两条不同指令路径Branchgranularityisawholemultipleofwarpsize同一warp的所有线程具备相同指令路径,线程同步,void_syncthreads();Barriersynchronization同步threadblock之内的所有线程避免访问共享内存时发生RAW/WAR/WAW冒险(hazard),_shared_floatscratch256;scratchthreadID=beginthreadID;_syncthreads();intleft=scratchthreadID-1;,在此等待，直至所有线程到达才开始执行下面的代码,Dead-Lockwith_syncthreads,Dead-lockifSomethreadshavevallargerthanthresholdAndothersnot,_global_voidcompute(.)/dosomecomputationforvalif(valthreshold)return;_syncthreads();/workwithval,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,CUDA扩展语言结构,Declspecsglobal,device,shared,local,constantKeywordsthreadIdx,blockIdxthreadDim,blockDimIntrinsics_syncthreadsRuntimeAPIMemory,symbol,executionmanagementFunctionlaunch,_device_floatfilterN;_global_voidconvolve(float*image)_shared_floatregionM;.regionthreadIdx=imagei;_syncthreads().imagej=result;/AllocateGPUmemoryvoid*myimage=cudaMalloc(bytes)/100blocks,10threadsperblockfoo(parameters);,存储器空间,R/Wper-threadregisters1-cyclelatencyR/Wper-threadlocalmemorySlowregisterspillingtoglobalmemoryR/Wper-blocksharedmemory1-cyclelatencyButbankconflictsmaydragdownR/Wper-gridglobalmemory500-cyclelatencyButcoalescingaccessingcouldhidelatencyReadonlyper-gridconstantandtexturememories500-cyclelatencyButcached,GPUGlobalMemory分配,cudaMalloc()分配显存中的globalmemory两个参数对象数组指针和数组尺寸cudaFree()释放显存中的globalmemory对象数组指针,intblk_sz=64;float*Md;intsize=blk_sz*blk_sz*sizeof(float);cudaMalloc(void*),HostDevice数据交换,cudaMemcpy()MemorydatatransferRequiresfourparametersPointertodestinationPointertosourceNumberofbytescopiedTypeoftransferHosttoHost,HosttoDevice,DevicetoHost,DevicetoDevice,cudaMemcpy(Md,M.elements,size,cudaMemcpyHostToDevice);cudaMemcpy(M.elements,Md,size,cudaMemcpyDeviceToHost);,CUDA引入的新变量类型,_device_储存于GPU上的globalmemory空间和应用程序具有相同的生命期(lifetime)可被grid中所有线程存取,CPU代码通过runtime函数存取_constant_储存于GPU上的constantmemory空间和应用程序具有相同的生命期(lifetime)可被grid中所有线程存取,CPU代码通过runtime函数存取_shared_储存于GPU上threadblock内的共享存储器和threadblock具有相同的生命期(lifetime)只能被threadblock内的线程存取Local变量储存于SM内的寄存器和localmemory和thread具有相同的生命期(lifetime)Thread私有,CUDA函数定义,_global_定义kernel函数必须返回void_device_函数不能用intheight;float*elements;Matrix;,A,B,C,WM.width=N.heightI,M.height,M.width,N.width,实例2:矩阵相乘,C=ABofsizeWIDTHxWIDTH一个线程处理一个矩阵元素简化:假定WIDTHxWIDTH(Ad,Bd,Cd,M.width);/ReadPfromthedevicecopyFromDeviceMatrix(C.elements,Cd);cudaMemCopy(C,Cd,N*size,cudaMemcpyDeviceToHost);/FreedevicematricescudaFree(Ad);cudaFree(Bd);cudaFree(Cd);,CUDAImplementationKernel,/Matrixmultiplicationkernelthreadspecification_global_voidMuld(float*Ad,float*Bd,float*Cd,intwidth)/2DThreadIDinttx=threadIdx.x;intty=threadIdx.y;/cvalueisusedtostoretheelementofthematrix/thatiscomputedbythethreadfloatcvalue=0;,CUDAImplementationKernel,A,B,C,WIDTH,WIDTH,WIDTH,WIDTH,ty,tx,for(intk=0;kwidth;+k)floatae=Adty*width+k;floatbe=Bdtx+k*width;cvalue+=ae*be;/Writethematrixtodevicememory;/eachthreadwritesoneelementCdty*width+tx=cvalue;,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器SharedmemoryGlobalmemoryCUDA程序设计工具程序优化,共享存储器(SharedMemory),设置于streamingmultiprocessor内部由一个线程块内部全部线程共享完全由软件控制访问一个地址只需要1个时钟周期,共享存储器结构,G80的共享存储器组织为16banksAddressedin4bytesBankID=4-byteaddress%16相邻4-byte地址映射相邻banks每一bank的带宽为4bytesperclockcycle对同一bank的同时访问导致bankconflict只能顺序处理仅限于同一线程块内的线程,BankAddressing实例,NoBankConflictsLinearaddressingstride=1(s=1),NoBankConflictsRandom1:1Permutation,_shared_floatshared256;floatfoo=sharedthreadIdx.x;,BankAddressing实例,2-waybankconflictsLinearaddressingstride=2(s=2),8-waybankconflictsLinearaddressingstride=8(s=8),_shared_floatshared256;floatfoo=shared2*threadIdx.x;,_shared_floatshared256;floatfoo=shared8*threadIdx.x;,常见BankConflict模式,Sharedmemory存放2D浮点数组16x16-elelmentsharedmemory1个线程处理矩阵的一行循环处理一行16个元素同一block的线程同时访问一列即column1inpurple16-waybankconflicts,BankIndiceswithoutPadding,Bank,t15,解决方案,方案1:padtherows在每行最后添加一个元素方案2:transposebeforeprocessingSufferbankconflictsduringtransposeButpossiblysavethemlater,BankIndiceswithPadding,Transpose,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器SharedmemoryGlobalmemoryCUDA程序设计工具程序优化,全局内存(GlobalMemory),全局内存在G80上没有缓存Constantmemory和texturememory有少量缓存存取延时400-600clockcycles非常容易成为性能瓶颈优化是提高性能的关键!,CoalescedGlobalMemoryAccesses,在half-warp层次对访问globalmemory进行协调访问连续globalmemory区域:64bytes-eachthreadreadsaword:int,float,128bytes-eachthreadreadsadouble-word:int2,float2,256byteseachthreadreadsaquad-word:int4,float4,额外限制:Globalmemory区域的起始地址必须是该区域数据类型尺寸的整数倍Warp中第k个线程访问第k个地址例外:可以有某些中间线程不参加Predicatedaccess,divergencewithinawarp,CoalescedGlobalMemoryAccesses,Non-CoalescedGlobalMemoryAccesses,Non-CoalescedGlobalMemoryAccesses,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器SharedmemoryGlobalmemoryCUDA程序设计工具程序优化,下载CUDA软件,软件环境,GPU硬件和CUDA软件安装后:,CPU(Host),CUDALibraries(CUFFTfloat3a=d_inindex;a.x+=2;a.y+=2;a.z+=2;d_outindex=a;,Uncoalescedfloat3Code,float3需要12bytes:float3f=d_inthreadIdx.x;Eachthreadendsupexecuting3readssizeof(float3)4,8,or16Half-warpreadsthree64Bnon-contiguousregions,Coalescingfloat3Access,A3-stepapproach(256threads/block),GlobalMemory,Sharedmemory,Sharedmemory,Coalescingfloat3Access,Usesharedmemorytoallowcoalescing256threadsperblockAthreadblockneedssizeof(float3)x256bytesofSMEMEachthreadreads3scalarfloats:Offsets:0,(threads/block),2*(threads/block)Thesewilllikelybeprocessedbyotherthreads,sosyncProcessingEachthreadretrievesitsfloat3fromSMEMarrayCasttheSMEMpointerto(float3*)UsethreadIDasindexRestofthecomputecodedoesnotchange!,Coalescingfloat3Access代码,MatrixTranspose,SDKSample(“transpose”)解释通过sharedmemory实现coalescing在小尺度数据即可显示优化的明显效果,UncoalescedTranspose,_global_voidtranspose_naive(float*odata,float*idata,intwidth,intheight)unsignedintxIndex=blockDim.x*blockIdx.x+threadIdx.x;unsignedintyIndex=blockDim.y*blockIdx.y+threadIdx.y;if(xIndexwidth,tx,ty,Height,Width,Height,Width,UncoalescedTranspose,CoalescedTranspose,假设:矩阵已被分解为方块(tile)Threadblock(bx,by):Readthe(bx,by)inputtile,storeintoSMEMWritetheSMEMdatato(by,bx)outputtileThread(tx,ty):Readselement(tx,ty)frominputtileWriteselement(tx,ty)intooutputtileCoalescingisachievedif:Block/tiledimensionsaremultiplesof16,CoalescedTranspose,CoalescedTranspose,_global_voidtranspose(float*odata,float*idata,intwidth,intheight)_shared_floatblockBLOCK_DIM*BLOCK_DIM;unsignedintxBlock=blockDim.x*blockIdx.x;unsignedintyBlock=blockDim.y*blockIdx.y;unsignedintxIndex=xBlock+threadIdx.x;

人人文库> 全部分类> 教育资料 > 课件下载

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

CUDA超大规模并行程序设计.ppt

文档简介

温馨提示

最新文档

评论

CUDA超大规模并行程序设计.ppt

文档简介

温馨提示

最新文档

评论

相关文档