翻译以.原文和在同一文件中前_第1页
翻译以.原文和在同一文件中前_第2页
翻译以.原文和在同一文件中前_第3页
翻译以.原文和在同一文件中前_第4页
翻译以.原文和在同一文件中前_第5页
已阅读5页,还剩48页未读 继续免费阅读

付费下载

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

CUDAGPU3D图像,是一个耗时的过程。在本文中我们将提出一种基于GPUCUDA编写的盒计数算法的快速并行版本。优化的GPU版本与串行CPU28CPU7倍。我们改进的盒计数算法已在3D模型下通过不同复杂程度、特征和大小进试,其有效性和算法精度已经过已知的FD值模型验证。在这个研究中,一个3D的包含一些脑组织数据的FD分析用于我们的GPU盒计数算法。1.法主要包括计数非零像素(2D)或素立体(3D)这类由图像模型填补,在方框网格中所载形网格大小变化中不断重复。数据中FD的值,根据选择体素的类型(白、灰、黑、黑+示盒中该类型体素的数量N(l。最终得到的FD线性回归比率为: FD的估计值。为找到盒的尺寸的上下界已进行了许多研究,而事实上这仍是一个开放性问题。有些理论研究[3,4]中,了这些框大小限制的参 研究了FD计算在生物医学和医学影像学中的应用。Lopes和Betrouni了一篇有关计算FD方法和他们在外地医学图像分析中的应用的有趣文章。更硬化,并检测在早产儿和无生长受限变化情况。在[15]的程序,用于计算的MRI数据的三维FD呈现。他们在优化这个软件的执行时间时需要通过加速3D盒子计数算近来,GPUGPGPU的使用已成倍增长,主要是由于新的架构和编程模型的外观的出现,如NVIDIACUDA技术[16]Khronos的OpenCL的[17]。使用众所周知的语言(C语言)进行简单的扩展。GPUNVIDIACUDA,作者使用GPU来有效地执行内部患者的生物医学图像弹性配准。在文献[22]中的一个GPULiebotich等人开发的。[23][24]。第二种算法[25]由萨卡和乔胡瑞研制,称为差分盒计数(DBC。这些算法虽然有效,却只使用一个线程执行。最近,曾等人。利用多CPU[26]完善而且并行化了一个DBC盒计数算法。然而,据我们所知,以目前的GPU的高度并行计算能力来实现这些算法的性能改进的在本文接下来的部分,我们首先会在硬件和软件两个主要方面描述GPU处理和CUDA。然后,我们描述了我们在并行GPU上的盒子计数方法。接下来,展示我们FD的有效性。在我们的快速盒计数算法在生物医学方面的测试中,我们展示出三个脑组织三维FD分析。结果最后归纳在第6部分。2、算法和理CUDA编程模型的特点。其次,我们描述了我们并行的盒计数算法,讨论为什么这种算法被选定为并行GPU上。我们也解释了当在CPU上执行所CUDA编程模GPGPU编程模型是NVIDIA的CUDA和Khronos能,我们选择了NVIDIA的CUDA开发我们的软件执行相同的功能,这就是所谓的内核。只需要一个传统语言(通常是C/C++)编程,使支持CUDA的GPU的硬件规格。每个CUDA设备有几个多处理器(MPS,每个包含流处理器(SP)SIMD(多数据流)架构。这些处理器负责以并行的方式执行所有线程。CUDA的内存架构是由程序员显式管理。每个设备都有大量慢速全局内存(CPUGPU均可写入,一个全局只读(GPU方面)常量内存,和一个特殊的纹理32GF100架构[29]的出现,一个真正的高1,23维的线程块。线程块被GPU的线程可以同时启动,并在GPU上执行。每个线程块被分配到一个多处理器,所以它的(SIMT中线程数为warp大小的倍数[30]时达到最佳的性能。随着CUDA调度执行,这一结构允许更高水平的扩展,因为线程块被自动分配到空闲侯氏盒计数算[23],由侯等人优化[24],具有O(N*ln(N))间复杂度。这将是我们在本文中使用在GPU上的盒。其他比较熟知的盒子计数算法是由萨卡和乔杜里[25]的差本文中,我们假设3D素体化模型(例如,通过一个2DMR图像的堆栈形成的3D矩 侯氏盒子计数算法通过k位二进制数表示唯一确定一素坐标的全局坐标,其中2𝑘为能够完全包含整个模型边界的体素个数。框格以迭代方式生成,每一个的尺寸,2𝑚素,各不相同,其中0≤M≤k。因此,考虑到n为在一个方框内网格的非零体素的数量,如果n=(2𝑚 1侯氏盒计数算法及其相关CUDAD表示m=1标包含k位。的串。位于([𝑥𝑘−1…𝑥2𝑥1𝑥0]𝑏𝑖𝑛,[𝑦𝑘−1…𝑦2𝑦1𝑦0]𝑏𝑖𝑛,[𝑧𝑘−1…𝑧2𝑧1𝑧0]𝑏𝑖𝑛)的体素其插位字符串为[𝑥𝑘−1𝑦𝑘−1𝑧𝑘−1…𝑥 𝑧]𝑏𝑖𝑛。因此插位字符串长度一定为k*3位。从而我们可以获得一系列的插位字符串,其每一比特串唯一的标识一个3*m02个或以上标记过程反复进行直至m≤k。体素)3D模型,其分辨率k=264(1中仅示出了程中当m1这一算法在CPU上有一些实现,例如由Kruger的[31],以及最近的 在这一部分中,介绍侯氏盒计数算法在NVDIAGPU和CUDA执行时如何并行2GPU盒维数算法的主要模块和数据流。需要注意的是图二2阴影部分的功能和数据是在GPU端执行(函数)或的。另一方面,对于进一步代码。此代码直接对应的流程图显示在图2。2PCI-Express总线传输GPU方面(全局内存,在任何GPGPU的应用(1的第5行)GPU1示为图1的一个立方体。如果它是一个空坐标(图1中灰色立方体,该线程一个在3 的值(例如(2𝑘),在图1和图3中表示为无限大。我像素坐标(X,Y,Z)三个二进制值([𝑥𝑘−1…𝑥2𝑥1𝑥0]𝑏𝑖𝑛,[𝑦𝑘−1…𝑦2𝑦1𝑦0]𝑏𝑖𝑛,𝑧210𝑏𝑖𝑛3D模型的坐标中运行(18-10行。迭代的数目和网格尺寸直接依赖步,很多研究,根据GPU的性能优势提出了一些排序算法,以提高性能。因此,我们研模块。在文献[32],Peters等。提出了一种基于Batcher的传统双调排序算法(双调排序也是CUDA软件开发包中包含的一个样本)的一个高性能的排序算法。最近Kolonias等为每个插位字符串是一个独特的值)的数据。最后,Merrill等,在基数排序算法的基础装的。如果在未来的一个新的,更好的GPU排序算法被开发,它可以被直接引入到我们标,从而提高了算法的性能,因为较少的线程必须在连续的内核推出。其次,t在盒子计数0(m0时)t盒被标记为黑色,并且没有白色或灰色的确定的值的数目。所以一旦t被计算,我们就可以直接在第一个框中计数的值(0个白色,0个灰色以及t个黑盒。随后,进入到主循环(参见图2和列表1中的17-30行。请记住,2𝑘是完的=1,且M≤k时我们反复地启动两个GPU的核函数,标记位字符串(MaskBit Boxes,算盒增量。两个内核都在相同的CUDA配置环境下启动:一维线程块中每个包含512个线程以及二维网格,包含启动T个线程必须数量的线程快。通过启动t个线程,只有真正需要的工作流程在执行,从而忽视了位串数组的无用的位,如前面提到的。这也示于图1。 Strings(列表1,20行。每个线程将位串最右边m*3位标记为0,如第2.2节,步骤D的第一行所示。这个内核执(MaskedArray Boxes(盒是黑色或灰色。为了更好地理解,检查盒过程图解详见图3,其内核代码如列表2所线程无法计数任何黑色或灰色的盒。这些线程在图3中被表示为“什么也不做”。反之,如果是这样的v的第一次出现,则线程“跳跃”2𝑚3−1坐标(这是在一个方框内的体素的最后一个问题涉及到的检查盒的核函数,这个问题是是如何的黑色和灰色框的数列表加前一个共享内存值,最后它们将此中间值在全局器。列表2显示了这个内核的模型的x和y坐标,因此不得不在每次迭代掩盖2*m位,而不是3*m位。结当我们在GPU上实现了并行化的盒计数算法,对比原始侯氏算法。我们还要性。最后,我们使用我们的GPU盒计数算法对三个脑组织进行三维FD分析。硬件环境和测我们使用的GPU为一块NVIDIAGeGTX580(GTX580),驱动版本为286.19,CUDA4.1,CUDAcomputingcapability2.0。GTX580FermiMP1536个线MP81024MP最大线程过1536。在内存方面,对于GTX580每个MP拥有32K个32位的寄存存,16KB64KB1536MB的DDR5全局内存。在CPU方面,我们使用InXeonE5620@2.40GHz四核处理器,可以同时仿真4个线程。我们配备了12GB的内存,系统为Windows7Professional64位。使用C++windows.h的PerfTimer类度量运行时间。我们尽量实现了侯氏单线程算法,GPU和CPU的运算效率,我们实现了一个多线程的侯氏盒计数算法并在多核CPU上运行。这个多线程算法使用C语言实现使用了process.h[36](在VisualStudio中包含。对于CPU并行计算的排序算法,我们选用In的TBB排序算法[37,38]。TBB算法可以改善多核运算的效率,对于测试过程中使用的模型,我们选择一套包含不同复杂度、特征和大小的3D体素女性骨盆,绳结环来自于[40],另外纳芙蒂蒂王后像来自于[41],图4展示了每个模型在而后,盒计数算法的准确性通过使用两个熟知的3DFD值分型比较准确性:一个了20年代Aubert-Broche的的正常人脑解剖模型[42,43]。这组模型可以在此实现,比较了他们相关的GPU运行时间,我们执行了从128x128x128体素到512x512x512体素的六组模型。测试模型平均GPU时间包括CPU-GPU-CPU的数据传输。线程,因此多核CPU得到充分利用。本的增长,而GPU运行时间的增长则较慢。事实上,如在图5中所示,加速度几乎以一的GPU盒计数算法缩放很理想。7倍。表2示出了在两个不同的分辨率下各个模型的加速可以达到的情况。这些结果表6对每一个大小为2𝑚速比示于图6,再一次,加速随着模型尺寸的增大而呈几乎线性趋势缩放。势,即在表2中的3D模型的结果。7显示了当执行我们改进的盒计数算法时设备级的使用比例。这个数字清楚地表明GPUNVIDIAVisualProfiler工具。要注意,这些值与GPU运行时间严格相关。因此,CPU参与的指令,如内核启CUDAGPU实现的盒计数算法,快速程序,这项工作我们已完成。为了测量运行时间,他们所使用的“门格尔海境下)实现。我们也将在该环境下执行我们的CUDA“门格尔海绵”分形。他们的3D512×512×512的固体立方体(3DFD=3729×729×729(3D形“谢尔宾斯毯”(理论二维FD=1.8928)检查了我们的算法的精度。点的范围要考虑到最低数量,能够最大化线性回归的相关值R以及更符合理论值,每个点2.7212.7268。类似的结果可以在固体立方体测试得宾斯毯”而获得的二维FD为1.8629(参照图10),也非常接近理论1的值,320个人(受试者编号如[40])三个脑组织(血管,骨髓和脑灰质)的详细尺FD3DFD的平均值,执行3DFDCPU和GPU的运行时间和相应的加速比。每个组织类的平均执行时间和加速比如图12FD2,这是一个逻辑的结果,因为骨髓形状类似后,灰质3DFD值因偏差较小而脱颖而出。为在三维FD的平均值(方形)SEM(块程序的可用结GPU实现的侯氏盒子计数算法。我们已经解释了每GPU上进行的优化。与此相关,我们已经简要地介绍了主流的GPU排序算法,选择最适合我们的算法。盒计数算法的有效性是明显的,因为它是用于执行对FD的计算基础,以及许多生物和医学成像研究和应用领域的基本GPU实现的盒子计数算法。在我们的程序中,最终的平3D2D图像,并能够获得已将我们的快速盒计数算法应用于脑组织的3DFD分析。致 ERDF。通过研究项目UJA2009/13/04和PI10-TIC-5807支参考文 Russel,J. 1175–1178.1242computermethodsand F.Normant, Dean,V. in Afast program (2011)241–249. Y.T. H.Y. W.Y. 61(2010)Y.T.Wu, T.R.Chen, W.Y.Guo, Sahgal,X. fyingdegenerationofwhitematterinnormalagingusingfractaldimension,NeurobiologyofAging28(2007)Y.T.Wu,K.K.Shyu,C.W.Jao,Z.Y.Wang,B.W.Soong, P.S.Wang, change NeuroImage49 F.J.Esteban, Go˜ni, P.Villoslada, Bargalló,P.Villoslada,E.Grós,Fractal-ysisdetectscerebralchangesinpreterminfantswithandwithoutintrauterinegrowthrestriction,NeuroImage43(2010)1225–1232.J.RuizdeMiras,P.Villoslada,J.Navas,F.J.Esteban,UJA- (2011)452–460.NVIDIA OPENCL T.Aach, –possibilitiesand P. andmulti-core . T.Toth, Afast A141 box-countingapproachtocomputefractaldimensionofimage,IEEETransactionsonGeoscienceandRemoteSensing32(1994)1096–1102.Y.C.Tzeng,K.T.Fan,Y.J.Su,K.S.Chen,Aparalleldifferentialboxcountingalgorithmappliedtohyperspectralimageclassifications,in:IEEEIGARSS, (1987)NVIDIACUDALibrary /compute/DevZone/docs/html/C/doc/html/index.html. (2011)50–59. withBitonic V.Kolonias, Voyiatzis,G. High AParallel InThreading In 2011,. 3DVIA Anew computermethodsandprogramsinbiomedicine108(2012)journalhomepage: /journals/cmpb Fastbox-countingalgorithmonJ.Jiménez,J.RuizdeMirasDepartmentofComputerSciences,UniversityofJaén,Jaén,articl inf abstracArticleReceived17December2011Receivedinrevisedform28JuneAccepted30July:BoxcountingFractal

Thebox-countingalgorithmisoneofthemostwidelyusedmethodsforcalculatingthefractaldimension(FD).TheFDhasmanyimage ysisapplicationsinthebiomedicalfield,whereithasbeenusedextensivelytocharacterizeawiderangeofmedicalsignals.However,computingtheFDforlargeimages,especiallyin3D,isatimeconsumingprocess.Inthispaperwepresentafastparallelversionofthebox-countingalgorithm,whichhasbeencodedinCUDAforexecutionontheGraphicProcessingUnit(GPU).TheoptimizedGPUimplementationachievedanaveragespeedupof28times(28×)comparedtoamono-threadedCPUimplementation,andanaveragespeedupof7times(7×)comparedtoamulti-threadedCPUimplementation.Theperformanceofourimprovedbox-countingalgorithmhasbeentestedwith3Dmodelswithdifferentcomplexity,featuresandsizes.Thevalidityandaccuracyofthealgorithmhasbeenconfirmedusingmodelswithwell-knownFDvalues.Asacasestudy,a3DFD ysisofseveralbraintissueshasbeenperformedusingourGPUbox-countingalgorithm.Thebox-countingalgorithm[1]isusedforcalculatinganesti-mateofthefractaldimension(FD)ofanimage,bothintwo(2DFD)andthreedimensions(3DFD).Thebox-countingalgorithmbasicallyconsistsofcountinghowmanynon-zeropixels(2D)orvoxels(3D),thosethatarefilledbytheimagemodel,arecontainedinagridofsquareboxes.Eachoneoftheseboxesislabelledasblack,grayorwhite,dependingonhowmanypixelsorvoxelsitcontains.Thisprocessisrepeatedforgridswithdifferentboxsizes.Fromthesedata,thevalueoftheFD,foraselectedtypeofvoxel(white,gray,black,black+gray),iscalcu-latedthroughalog–loglinearregressioninwhichtheXaxisrepresentstheinverseboxsize,l,andtheYaxisrepresentsthenumberofboxesforthattypeofvoxel,N(l).ThefinalvaluefortheFDcorrespondstotheslopeofthelinearregression:FD=−((lnN(l))/(lnl))Severalstudies[2]haveshowedthatthebox-methodisvalidforstatisticallyself-similarmodels.

©2012Elsevier .All locationofthegridineachi tionand,especially,theselectionoftheboxsizeaffectthevalueoftheestimatedFD.Manystudiesweredonetofindtheupperandlowerboundsfortheboxsize;infactthisisstillanopenissue.Therearetheoreticalstudies[3,4]thatestablishtheseboxsizelimitsintermsofparameterssuchastheimagesize,itsdimensionorthenumberofpointsitcontains.Thesemethodsaregenerallyapplicableonlyforidealfractals.Mostauthorsselecttheboundsfortheboxsizeaccordingtotheirparticularmethodforcalculatingthebox-countingandtheapplicationfield.Inthesecases,thelinearportionofthelog–logplotusedtoestimatethevalueoftheFDisnormallydeterminedbyyzingdifferentvoxelsizerangeslookingforthosethatizethecorrelationvalue[5,6].Despitethesedraw-backs,thebox-countingalgorithmisboththemostfrequentlyusedandthemostpopularmethodtocalculatetheFD.Inrecentyears,severalpapershavestudiedtheutilityoftheFDcalculationinbiomedicineandmedicalimaging.LopesandBetrounipresentin[7]aninterestingreviewofthemostrelevantmethodstocalculatethefractaldimensionandtheir∗Correspondingauthorat:Dpt.ComputerSciences,UniversityofJaén,CampusLasLagunillass/n,23071Jaén,Spain..: E-mailaddress:demiras@ujaen.es(J.Ruizde0169-2607/$–seefrontmatter©2012Elsevier .All computermethodsandprogramsinbiomedicine108(2012)applicationinthefieldofmedicalimageysis.Inmoredetail,in[8,9],anysisofthecomplexityofthefetalcorti-calsurfaceisperformedbycalculatingthe3DFD.In[5,10,11]thecerebralstructureofwhitematteranditsdegenerationwithageingisfied,thanksagaintothe3DFD.Wuetal.larstructurecouldbefiedby3DFDcalculation.Finally,in[13,14]3DFDisusedtoyzethegraymatterinmulti-plesclerosisandtodetectingchangesinpreterminfantswithandwithoutintrauterinegrowthrestriction,respectively.In[15]aprogramforcalculatingthe3DFDofMRIdataispre-sented.Theyexposedtheneedtooptimizethissoftwareintermsofexecutiontimebyacceleratingthe3Dbox-countingalgorithm.Duetotheincreasingamountandsizeofdata,forexampleinthefieldofmedicalimaging,thistypeofimageprocessingalgorithmhastobeoptimized.Recently,theuseoftheGraphicProcessingUnit(GPU)forgeneralpurposes(GPGPU)hasgrownexponentially,prin-cipallyduetotheappearanceofnewarchitecturesandprogrammingmodels,suchasNVIDIACUDA[16]orKhronosOpenCL[17].TheseprogrammingmodelsdonotrequireanypreviousknowledgenoronprogrammingGPU,neitheronthegraphicsvisualizationpipeline,andusesimpleextensionsofwellknownlanguageslikeC.GPUcomputing,andespeciallyNVIDIACUDA,hastakenanimportantroleinresolvingcom-plexproblemsinbiomedicine,asrecentlyexpoundedin[18].Asforrecentexamples,[19,20]presentedstudiesforusingtheGPUparallelcapabilitytoyzefunctionalmagneticresonanceimaging(fMRI)data;in[21]theauthorsusetheGPUtoefficientlyperformelasticregistrationofintra-patientbiomedicalimages.In[22]aGPUskeletonizationalgorithmispresented.Thisalgorithmhasmanybiomedicalapplications;seeforexample[18].Somestudieshavefocusedonimprovingthebox-algorithm.Therearetwowidelyusedalgorithmsforper-formingbox-counting.ThefirstalgorithmwasdevelopedbyLiebotichetal.[23]andwaslaterimprovedbyHouetal.[24].SarkarandChaudhuridevelopedthesecondalgo-rithm[25],knownasdifferentialbox-counting(DBC).Thesealgorithms,althoughefficient,onlyuseonethreadintheirexecution.Recently,Tzengetal.improvedandparallelizedaDBCbox-countingalgorithmbyusingamulti-coreCPU[26].However,toourknowledge,therearenostudiesexploitingthehighlyparallelcomputingcapabilityofthecurrentGPUsinordertoachieveaperformanceimprovementforthesealgo-Anotherwidelyusedalgorithminmedicalimagevisual-izationistheMarchingCubesalgorithm[27].Thisalgorithmfromaregulargridwhereeachvertexisassociatedwithsomescalarvalue.Thenumberofgridcubeswhosevaluesatthever-intersectedbytheiso-surface.Thus,bysubsamplingthegridwithcubesoflargersizes,thebox-countingoftheiso-surfacecanalsobeobtained.However,afundamentaldisadvantageofdirectlyusingthistechniquetoobtainthebox-countingistheadditionalcomputingcostfortheysisofthevalues

theverticescontainedinthesubsampledcubes,alsorequir-ingcountingthevoxelsoncethecubesareclassifiedasblack,grayorwhite.IntherestofthepaperwefirstdescribethemainhardwareandsoftwareaspectsofGPUprocessingandCUDA.Thenwedescribethebox-countingmethodthatweparallelizeontheGPU.Next,weshowthekernelsofoursoftware,detailinghowwehaveoptimizedeachofthem.InSection4,weyzethespeedupandefficiencyofourimplementations.WealsotestitsvalidityforcalculatingtheFD.Withtheproposaloftestingourfastbox-countingalgorithminabiomedicalcontext,a3DFDysisofthreebraintissueshasbeenperformed.TheresultsarefinallysummarizedinSection6.ComputationalmethodsandInthissection,wefirstsummarizethecharacteristicsoftheCUDAprogrammingmodel.Second,wedescribethebox-countingalgorithmthatwehaveparallelized,discussingwhythisalgorithmwaschosenforparallelizationonGPU.Wealsoexinhowtheselectedbox-countingalgorithmworkswhenexecutedonaCPU.CUDAprogrammingAspreviouslymentioned,themostcommonGPGPUpro-grammingmodelsareNVIDIACUDAandKhronosOpenCL.Currently,andmainlyduetoCUDAbeingamoreestablishedtechnology,thereisahighernumberofwell-optimizedAPIsandlibraries(whichimplementfunctionssuchas:parallelreductions,prefix-sums,sortingalgorithms...)designedforCUDAthanforOpenCL.So,withtheaimofaccessingtheseexistingwell-optimizedmodulesandfunctions,thusobtain-ingthefastestalgorithmaspossible,wechoseNVIDIACUDAfordeveloourTheCUDAprogrammingmodel[16]allowsthemertosimultaneouslylaunchthousandsofGPUthreads.Eachthreadexecutesthesamefunction,whichiscalledkernel,onadataset.Itisonlynecessarytoprogramwithasupportedtradi-tionallanguage(usuallyC/C++),usingtheCUDAAPI[28]toharnesstheGPUresources.InadditiontotheispresentinallCUDA-capableGPUs.EachCUDAdevicehasseveralMultiProcessors(MPs),eachconsistingofsomeStreamingProcessors(SPs)withSIMD(SimpleInstruction–MultipleData)architec-ture.Theseprocessorsareresponsibleforexecutingallthethreadsinaparallelway.TheCUDAmemoryarchitectureisexplicitlymanagedbytheprogrammer.EachdevicehasalargeamountofslowGlobalMemory(writablebyboththeCPUandtheGPU),aglobalread-only(intheGPUside)ConstantMemory,andaspecialTextureMemory.Somesmall-sizeandfast-accessmemorymodules,calledSharedMemory,arepresentineachMP.TheSPscanalsomanageseveral32-bitregisters.Inaddi-tion,sincethearrivalofGF100architecture[29],arealcachehierarchyhasbeenintroduced.

Allthreadsareorganizedintoseverallevels.Individualthreadsaregroupedin1-,2-or3-dimensionalthread-blocks.Thread-blocksaregroupedinamono-orbi-dimensionalgrid,andalso,onlyonGF100GPUs,inthree-dimensionalgrids.Eachthread-blockhasalimitof1024threads(onGF100GPUs),andeachgridsupportsupto65,535blocksineachdimen-sion.Therefore,thousandsofthreadscouldbesimultaneouslycomputermethodsandprogramsinbiomedicine108(2012) launchedandexecutedontheGPU.Eachthread-blockisassignedtoanMP,soitssharedmemoryisonlyaccessibleforthreadswithinthisthread-block.Residentthreadsaresimul-taneouslyexecutedineachSPingroupsof32,calledwarps.Awarpistheminimumschedulingunit.Thedeviceexecutesaninstructionforallthreadswithinawarpbeforelaunchingthenextinstruction.ThisisknownasSingleInstructionMulti-pleThreads(SIMT).Threadswithinawarpphysicallyexecutethesameinstructionineachcycle.Thebestperformanceisachievedwhenthenumberofthreadsinathread-blockisamultipleofthewarpsize[30].Thisorganizationphilosophy,alongwiththeCUDAsched-ulerexecution,allowsahighscalabilitylevel,becausethread-blocksareautomaticallyassignedtoidleMPs,indepen-dentlyofcompiledprogramcanbeexecutedondifferentandheterogeneousCUDAcapabledevices.CUDAoffersspe-cialbarrierinstructionsinordertoestablishsynchronizationpointsbetweenthreadsinthesameblock.However,globalmemoryisneededtosharedatabetweenthread-blocksorsetasynchronizationpointforthreadslocatedindifferentHou’sbox-countingAspreviouslymentioned,therearetwodifferentbox-countingalgorithmsthatarewidelyusedinliture.Liebotich’salgo-rithm[23],improvedin[24]byHouetal.,hasatimecomplexityofO(N*ln(N)).Thisisthebox-countingmethodthatweuseinthispaperandimplementontheGPU.Theotherremark-ablebox-countingalgorithmisthedifferentialbox-counting(DBC)presentedin[25]bySarkarandChaudhuri.ThisDBCalgorithmwasalsorecentlyimprovedbyTzengetal.in[26],whoattemptedtoadjustitforanefficientandfastexecutiononmulti-coreCPUs.TheDBCalgorithmisbasedongraylevelimages,whichfallsoutsidethescopeofourwork.Inthispaperweassume3Dvoxelizedmodels(forexample,a3Dmatrix

formedbyastackof2DMRimages),sofromnowonwewillreferonlytovoxelsand3Dboxes.Hou’sbox-countingalgorithmuniquelydeterminestheglobalpositionofavoxelthroughthebinaryrepresentationofitscoordinateswithkbits,where2kvoxelsistheedgesizeoftheboxthatcompleycoversthemodel.Gridsofboxesitivelygenerated,eachonewithadifferentedgesizeofvoxels,where0≤m≤k.Soconsideringthatnisthenumberofnon-zerovoxelswithinaboxofthegrid,thisboxwilllabelledasblackifn=(2m)3,asgrayif0<n<(2m)3,andlabelledaswhiteifn=ThedetailedstepsofHou’sbox-countingalgorithmforthe3Dcaseare(seeFig.1):Thevoxelcoordinatesx,yandz,withrange0...2k−1,Thus,eachcoordinatewillbecomposedofkForeachvoxel,itsbinarycoordinatesarethenintercalated,formingalargebitstring.Theinterca-latedbitstringofavoxelsituatedintheposition([xk−1...x2x1x0]bin,[yk−1...y2y1y0]bin,[zk−1...z2z1z0]bin)is(xk−1yk−1zk−1...x2y2z2x1y1x1x0y0z0)bin.Thus,theinterca-latedbitstringmusthaveasizeequaltok*3bits.Asresult,alistofintercalatedbitstringsisobtained,eachbitstringuniquelyidentifiesoneThelistofintercalatedbitstringsissortedaccordingtothevalueofeachintercalatedbitstring.Finally,m*3bitsontherightofeachbitstringaremaskedto0.Iftwoormoremaskedbitstringshavethesamevalue,thentheircorrespondingvoxelsbelongtothesameedgesize2mbox.Asaresultofthismaskingstep,thenumberofdistinctvaluesinthelistgivesusthenumberofboxes(blackboxes+grayboxes)ofedgesize2mnecessarytocompleycoverthemodel,thusperform-ingthebox-countingforanedgesizeof2m.Whiteboxesarethenobtainedtrivially.WedifferentiatebetweenblackandgrayboxesusingthefixednumberofvoxelsrequiredFig.1–Hou’sbox-countingalgorithmanditsrelatedCUDAkernels.StepDisrepresentedform= computermethodsandprogramsinbiomedicine108(2012)tofillabox.Thisprocessis tivelyrepeatedm≤beencompleted,yieldingthenumberofblack,grayandboxes,foredgesize2mwith0≤m≤k,tocoverthewhole3Dvoxelizedmodel.TheHou’sbox-countingalgorithmisgraphicallyrepre-sentedinFig.1.Inthiscase,weusea3Dmodelformedby10voxels(redvoxels)insidea3Dgridwitharesolutionofk=2voxelsofthewholegrid).Inthisfigure,weonlyrepresentthebox-countingprocessform=1.TherearesomeimplementationsofthisalgorithmonCPU,liketheoneperformedbyKrugerin[31],orarecentMAT-LABimplementationin[6].But,asmentionedthroughoutthispaper,toourknowledge,therearenoGPUimplementationsofHou’salgorithm,noranyotherbox-countingalgorithm.GPUbox-countingInthissection,wedescribehowHou’sbox-countingalgorithmhasbeenparallelizedinordertoachieveanoptimalperfor-mancewhenexecutingitonNVIDIAGPUswithCUDA.Ontheonehand,Fig.2showsthemainmodulesanddataflowofourGPUbox-countingalgorithm.ItisimportanttonotethatshadedfunctionsanddatainFig.2areexecuted(func-tions)orstored(data)ontheGPUside.Ontheotherhand,andforfurtherunderstanding,Listing1showsthesimplified

mainfunctioncodeofourbox-countingalgorithm.ThisdirectlycorrespondswiththeflowchartshowedinFig.Asaninitializationstep,thedataset(a3Darrayof0’sand1’srepresentingthevoxelizedmodel)istransferredfromthehostside(RAMmemory)totheGPUside(globalmemory)throughthePCI-expressbus,asinanyGPGPUapplication(line5inListing1).OncethedataisontheGPU,kernelscanoperateonit.Forabetterunderstanding,Figs.1and3showhoweachmainkerneloperatesandtransformsthedataalongthewholeThefirstkerneliscalledBitIntercalate.Eachthreadofthiskernelaccessesonlyonepositionofthe3Dmodel.Eachposi-tionisrepresentedasacubeinFig.1.Ifitisanemptyposition(graycubeinFig.1),thethreadstoresapredefinedandunreachablevalue(like(2k)3,representedasinfiniteinFigs.1and3forclarification)inthebitstringsarray.Weben-efitfromthisoperationinthefollowingkernel.Otherwise,ifthepositioninthemodelisnotempty(redcubeinFig.1)(Forinterpretationofthereferencestocolorinthissentence,theFig.2–Dataflowchartfortheparallelbox-countingreaderisreferredtothewebversionofthearticle.),eachthreadinterpretsthevoxelcoordinates(x,y,z)asbinary([xk−1...x1x0]bin,[yk−1...y1y0]bin,[zk−1...z1z0]bin).ThethreadthenintercalatesthesebitsasshowninstepSectionFinallyeachthreadstoresthefinalintercalatedbitstringinanewglobalmemoryarray(“BitStringsArray”inFig.1).Sowhenallthethreadsfinalizetheirexecution,thearrayofinter-calatedcoordinates(bitstringsarray)iscompleygenerated.RegardingtheCUDAlaunchconfigurationoftheBitInter-calatekernel,weselecta3Dthread-blockwithsize8×88threads,whichistheum3Dsizewhichisofthewarpsize.Also,thisisthethread-blockconfigurationthatworksbestinourcase,accordingtoourtests.Inaddi-tion,this3Dthread-blockconfigurationdirectlycorrespondswiththe3Dcoordinatesofthemodel,whichhelptoimprovethekernelperformance.Althoughthekernelworksovera3Dmodel,weselecta2Dgridofblocksinsteadofa3Dgridbecauseweachievehigherspeedupvaluesaccordingtoourtests.Therefore,wehavetoitivelylaunchtheBitIntercalatekernelinordertoexecuteitoverall3Dmodelpositions(lines8–10inListing1).BoththenumberofitionsandthegridFig.3–CheckBoxesprocedure.Itstartsfromanarrayofmaskedbitstrings,andthenitdetectswhichboxesareblackandwhicharegray.ThissamplecorrespondswithFputermethodsandprogramsinbiomedicine108(2012) sizedirectlydependonthe3Dmodelsize,andarecalculatedontheThenextstepconsistsofsortingthebitstringsarray.Sincesortingdataisafundamentalandnecessarystepinmanyalgorithms,therearemanystudiesthatproposesortingmethodswhichtakeadvantageoftheGPUcharacteristics,toimprovetheperformance.Therefore,wehavestudiedsomecurrentparallelsortingalgorithmsinordertodeterminewhichoneisthebest,andthenwehaveincludeditasamoduleofourfastbox-countingalgorithm.In[32],Petersetal.presentahigh-performancesortingalgorithmbasedonBatcher’straditionalbitonicsort(bitonicsortisalsooneofthesamplesincludedintheCUDASDK).MorerecentlyKoloniasetal.havedesignedandimplementedanefficientcountsort

algorithminCUDAGPUs[33].Thisisoneofthefastestsortingalgorithmswhenalargeamountofasmallrangeofvalueshastobesorted.Unfortunaythisisnotourcase,sincewehavealargeamountofdatawithahighrangeofvalues(sinceeachintercalatedbitstringisauniquevalue).Finally,Merrilletal.presentin[34]anewimplementationbasedontheradixsortalgorithm.Theyachieveaspeedupofupto3.8times,withrespecttoothercurrentGPUsortingalgorithms.ThisCUDAsortingalgorithmiscurrentlytheonethatoffersthebestper-formance,thereforewehaveuseditinourwork.TheauthorsoffertheirsortingalgorithmpackedintheThrustLibrary[35].ThislibraryisautomaticallyinstalledtogetherwiththeCUDAToolkit4.0orgreater.IfinthefutureanewandbetterGPUsortingalgorithmisdeveloped,itcouldbedirectlyintroducedintoouralgorithmtoimprovethegeneralperformance.##GENERALPROCEDURE- log2(modeledge modeledge//CopyinputdatafromCPUtocudaMemcpy(dev3DModel,3DModel,//Bitfor(griddev3DModel,dev10.//Sort-thrust::sort(devintedCoords,devintedcCoords+//Count"Non-ZeroVoxels"- size-thrust::count(devdevintedCoords+size,for(m1;m<k;//MaskBitmask size,mask//Check CheckBoxes<<<nblocks,nthreads>>>(devintedCoords,size,t,pow(2.0.masksize),dblacks,dgrays);//ThrustReduceCounters(BlackBoxesarray&GrayBoxesthrust::inclusive_scan(dblacks,ddthrust::inclusive_scan(dgrays,dd//Saveresults(BlackBoxesvalue&GrayBoxescudaMemcpy(&dBCB[m-1],&dblacks[ncounters-sizeof(int),cudaMemcpy(&dBCG[m-1],&dgrays[ncounters-sizeof(int),31.//CopyoutputdatafromGPUtocudaMemcpy(BCB,dBCB,cudaMemcpy(BCG,dBCG,Listing1.MainprocedureoftheGPUBox-Countingalgorithm computermethodsandprogramsinbiomedicine108(2012)Afterthisstep(line13inListing1),thebitstringsissortedinascendingorder,i.e.“SortedBitStringsArray”in1.Itisimportanttonotethatuselessbitstrings,corre-spondingtoemptyvoxels(representedasinfiniteinFig.1),arethenmovedtotheendofthearray.Oncethearrayissorted,wehavetop

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论