全文预览已结束
下载本文档
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
SearchingSingleNucleotidePolymorphismMarkerstoComplexDiseasesusingGeneticAlgorithmFrameworkandaBoostModeSupportVectorMachineKhantharatAnekboon,SuphakantPhimoltares,andChidchanokLursinsapAVIC,DepartmentofMathematics,ChulalongkornUniversity,Bangkok,ThailandKhantharat.AStudent.chula.ac.th,suphakant.pchula.ac.th,andlchidchachula.ac.thSissadesTongsimaGenomeInstitute,NationalCenterforGeneticEngineeringandBiotechnology,Pathumtani,Thailandsissadesbiotec.or.thSuthatFucharoenThalassemiaResearchCenter,InstituteofMolecularBiosciences,MahidolUniversity,SalayaCampus,Nakhonpathom,Thailandgrsfcmahidol.ac.thAbstractWiththeadventoflarge-scalehighdensitysinglenucleotidepolymorphism(SNP)arrays,case-controlassociationstudieshavebeenperformedtoidentifypredisposinggeneticfactorsthatinfluencemanycommoncomplexdiseases.ThesegenotypingplatformsprovideverydenseSNPcoverageperonechip.Muchresearchhasbeenfocusingonmultivariategeneticmodeltoidentifygenesthatcanpredictthediseasestatus.However,increasingthenumberofSNPsgenerateslargenumberofcombinedgeneticoutcomestobetested.ThisworkpresentsanewmathematicalalgorithmforSNPanalysiscalledIFGAthatusesa“BoostMode”supportvectormachine(SVM)toselectthebestsetofSNPmarkersthatcanpredictastateofcomplexdiseases.Theproposedalgorithmhasbeenappliedtotestfortheassociationstudyintwodiseases,namelyCrohnsandseverityspectrumof0/HbEThalassemiadiseases.TheresultsrevealedthatourpredictedSNPscanrespectivelybestclassifybothdiseasesat71.57%and71.06%accuracyusing10-foldcrossvalidationcomparingwiththeoptimumrandomforest(ORF)andclassificationandregressiontrees(CART)techniques.Keywords-SingleNucleotidePolymorphism;SupportVectorMachine;GeneticAlgorithmI.INTRODUCTIONScientistshavelongbeeninterestedinidentifyinggeneticfactorsthatinfluencetheoccurrenceofcomplexdiseases.Withtheadventofparallelgenotypingtechnology,costandtimeinfindingSNPsarenotoutofreach.Largecase-controlcohortsgeneratedfromverydenseSNParrays(DNAchipcontainsdensearrayofSNPs)challengingresearcherstosearchforSNPsthatareassociatedwiththediseases.Incontrasttothesinglegenedisorders,thestateofcomplexdiseasescouldbetriggeredfrommultiplegeneswhenexposingtocertainenvironmentalfactors1,2.However,searchingformultiplemarkerinteractionsfromalargepoolofSNPsimposeshighcomputationalandmemorycomplexity.Atechniqueofselectingsubsetofrelevantfeatures,namedFeatureSelection3,hasbeenwidelyusedinalmostfields,includingbioinformatics.Thistechniqueprovidesmoreeffectivewaytoimprovelearningaccuracytounderstandtheimportanceofthefeaturesbyremovingirreverentorredundantones.II.THEPROPOSEDIFGAMETHODInthissection,weintroduceanewencodingmethodcalledIFGA.Fig.1demonstratesthesummaryoftheIFGAmethod.Thefirstpopulationisconstructedbyourproposedintegerencodingapproach.Thedatainthechromosome(inGeneticAlgorithm(GA)context)arerepresentedbyasetofselectedfeatures.Afterthepopulationisgenerated,eachchromosomeisevaluatedbyafitnessscore.ThisscoreisobtainedbyusingtheBoostMode-SVMapproach.Then,theIFGAre-generatesthenextpopulationbyIFGAselection,IFGAcross-over,andIFGAmutationuntilaterminationcriterionissatisfied.A.TheIntegerEncodingMethodUtilizingGAtoperformfeatureselectioncanbedonebyconvertinginputdatausingbinaryencoding4.Thelengthof978-1-4244-4713-8/10/$25.002010IEEEFigure1.TheoverallIFGAflowchar.achromosomeequalsanumberofallfeatures.Thesizeofencodedchromosomecorrespondsdirectlytothenumberofinputfeatures.This,however,presentsaproblemduetotworeasons.First,therunningtimehighlydependsonthelengthofchromosome.Second,ageneralbinaryencodingdoesnotfixanumberofselectedfeatures.Itfixesonlythelengthofthechromosome.TheIFGAintegerencodingmethodisproposedtosolvetheseproblems.Assumethatacase-controldatausedinthisstudyhavemnumberofgenotypes.LetQibetheithchromosomeprocessedinthealgorithm.ThelengthofQi,denotedby|Qi|,issettoaconstantlessthanorequaltom.Then,random|Qi|numbers,representthelocationstoselectthecorrespondinggenotypesfromagivenfeaturesequence.DuringtheIFGA,thelengthofeachchromosomeisnotnecessarilyidentical.Forexample,supposem=7,thechromosomesize(|Qi|)issetto3,andtherandomlyselectedlocationsare1,5,and6.So,thechromosomeQi=1,5,6.B.IFGASelectionEachindividualchromosomeisselectedbasedonitsfitnessscoreintoamatingpoolbyastochasticuniversalsamplingmethod(SUS)5.TheIFGAalsousesanelitismtechnique6,inwhichthenextgenerationchromosomederivesfromthebestchromosomeinacurrentgeneration.C.IFGACross-OverThecross-overfunctionoftraditionalGArandomlyselectstherecombinationpointandswapsthetwochromosomesflankingthispoint.Cross-overfromtheoriginalGA,however,cannotbeappliedtotheIFGAapproachbecauseallchromosomesmusthavethesamesizeandfeaturesfromthesamelocicannotbeonthesamechromosome.WemustdeviseanIFGAcross-overtechniquetoovercomethisproblem.Assumethat,parent1andparent2aretheparentalchromosomeswhereeachlocusisthepositionofselectedfeature.Eithernumberofparent1sorparent2slocusmustbemorethan1.Numberofbothparentsloci(parent1andparent2)mustbegreaterthanorequaltoone.Outputsfromthisalgorithmareoffspring1sandoffspring2.1:x2:y3:tmp1parent14:fori=0to|parent1|do5:v|tmp1|6:selrandom(1,2,.,v)7:xxsel8:tmp1tmp1parent1selsuppress9:endfor10:tmp2parent211:fori=0to|parent2|do12:v|tmp2|13:selrandom(1,2,.,v)14:yysel15:tmp2tmp2parent2sel16:endfor17:crandom1,min(|parent1|,|parent2|)118:offspring1x1,x2,.,xc,yc+1,.,y|parent2|19:offspring2y1,y2,.,yc,xc+1,.,x|parent1|D.IFGAMutationMutationfunctionaltersthevalueofaspecifiedlocus.Ithardlyoccurswhencomparingwiththecross-overprocess.IFGAmutationispresentedhere.Letmdenotethelengthofagivengenotypesequence,input_chromisachromosomethatwillbemutated,andoutput_chromisamutatedchromosome.Eachelementinachromosomeisaselectedfeature.1:pos_outrandom1,|input_chrom|2:pos_inrandom1,m3:fori=1to|input_chrom|do4:ifi=pos_outthen5:output_chromipos_in6:else7:output_chromiinput_chromi8:endif9:endforE.GeneratingaPopulationTherearetwokindsofpopulation,theinitialpopulationandthenextgenerationpopulation.TogeneratetheinitialpopulationwithPchromosomes,wherePisauser-definednumberofchromosomesinthepopulation,thealgorithmrepeatedlygeneratesthechromosomesbyintegerencodingmethodandaddsthemintothesetofpopulationuntilthenumberofthechromosomesinthepopulationisequaltoP.Ontheotherhand,thepopulationinthenextgenerationconsistsofthechromosomeb,thebestfitnessscorefromthecurrentgeneration,egroupsoffeaturesfromevolution,cross-overandmutation,andrgroupsofthefeaturesfromthenewre-selectedfeatures.Afteraddingbandetothenextgeneration,thosechromosomesarecheckedforredundancy.Eachchromosomemustbeidenticalinthenextgeneration.Duplicatedchromosomeswillberemoved.Ifthenumberofchromosomesinthenextgenerationislessthanthenumberofchromosomesinthecurrentgenerationthenanewsubsetsoffeatures,r,willberandomlycreatedandaddedtothenextgeneration.F.TerminationThisIFGAalgorithmconsistsofasetofrecursivestepsforgeneratingthepopulation,evaluationbyaBoostMode-SVM,IFGAselection,IFGAcross-over,andIFGAmutation.Thesestepsareexecuteduntilthenumberofthebestresultsremainsconstantinthenext300iterations.III.THEPROPOSEDBOOSTMODE-SVMMETHODThegoalofSVM7istofindamaximalseparatinghyperplane:eitherfor(1)linearlyseparablecaseor(2)thenonlinearlyseparablecase.Notedthat,wTisatransposevectorofweight,xiisaninputvector,isamappingfunction,andbisabiasvalue.yi=sign(wxi+b)(1)yi=sign(w(xi)+b)(2)Theseequationsfacethesameproblemoccurredwhentheinputdataareimbalanced.Thelearnedseparatinghyperplanefromimbalanceddatasetmayshifttoomuchinthedirectiontowardsthesmallergroupcomparedwiththetrueseparatinghyperplane8.Tosolvethisproblem,thedecisionhyperplaneshouldbeadjusted.Itcanbeseenfrom(1)and(2)thattheparameterweffectstheclassificationoutput.So,modifyingwwilladjustthedecisionhyperplane,whichmayimprovetheclassifier.A.BoostMode-SVMAnewtechniqueofoversamplingfornominalfeatureisproposedtoimprovetheperformanceoftheSVM.TheBoostMode-SVM(Fig.1)generatestwoSVMs,namelySVM1andSVM2.TheSVM1isconstructedforgeneratingthescoreofthetrainingdatasetwhereastheSVM2isthefinalSVMmodelforclassificationthetestset.First,onlythetrainingsetisusedtoconstructtheSVM1andtofindtheBoostMode.ThisBoostModeistheindicatorvectoroftheminoritydataset.ItisbroughttotestwiththeSVM1.Twoscoringmethods,anUnbiasedScoring(US)andaBiasScoring(BS),areproposedtofindthescoringvalue.TheUSmethodisperformedwhentheSVM1correctlyclassifiestheBoostMode,otherwisetheBSmethodisperformed.Afterthat,aScoringOver-Samplingapproach(SOS)isprocessedforaddingartificialdatatominoritygroupbysamplingthedataoftheminoritygroupuntilanumberofdataofbothgroupsareequal.Theminoritygroupinthispapermeansthegroupofdatahavingfewerelements.ThenewSVM2isconstructedfortheclassificationbytheprevioustrainingdatasetandnewsetofdatafromtheSOStechnique.Finally,thetestsetisrunintheSVM2fortheevaluation.TheerrorrateforthetestsetisthefitnessscorevalueusingintheIFGAsectionabove.B.FindingtheBoostModeTobalancethesizeofdatafrombothclasses,someadditionaldataintheminoritygroupmustbegenerated.Theselectedgeneratingmethod(eitherUSorBS)willdependuponaBoostModevector.ThefollowingproceduredescribeshowtocomputetheBoostModevector.Letnminorbethenumberofdataintheminoritygroup.Boostrapsamplingwithreplacementisappliedontheminoritygrouptogeneratetdatasets,i.e.BoostGroup1,.,BoostGroupt.EachBoostGroupicontainsnminordata.1:fori=1totdo.2:allmodeimode(BoostGroupi)3:endfor4:BoostModemode(allmodei)iC.TheUnbiasedScoringMethodThistechniqueisprocessedwhentheSVM1classifiestheBoostModecorrectly.Alldatapointshaveequalchances(equalscoringvalues)tobeselectedfortheover-samplingtechnique.ThefollowingalgorithmdescribestheprocessoffindingthescoringvaluebytheUStechnique.ThescoreValisanoutputfromthisalgorithm.1:fori=1tonminordo2:scoreVali=1/nminor3:endforD.TheBiasScoringMethodTheBStechniqueisrunwhentheSVM1incorrectlyclassifiesbytheBoostMode.Thescoringvalueiscalculatedfromthedistanceofitspointtothedecisionhyperplaneby(3)forlinearseparabilityor(4)fornonlinearseparability.distancei=wxi+b(3)distancei=w(xi)+b(4)Thedatapointthatiscorrectlyclassifiedhaslesserchance(lessscoringvalue)tobeselectedfortheoversamplingprocessthantheonethatiswronglyclassified.Therefore,increasinginnumberofincorrectclassificationswouldinfluencethehigherchanceofsamplestobechosenandviceversa.ThescoringvaluefortheBSmethodisdescribedbythefollowingalgorithm.Letdistancebeasetofdistancesofalldatapointsintheminoritygroup.TheoutputfromthisalgorithmisasetofscoreVal.1:sumSV102:minValmin(distancei)i3:addVal=absolute(minVal)+14:fori=1tonminordo5:tmpi=distancei+addVal6:sumSV1=sumSV1+tmpi7:endfor8:iftheminoritygroupisthecontrolgroupthen9:fori=1tonminordo10:tmpi=2tmpi11:endfor12:endif13:fori=1tonminordo14:sumSV2=015:forj=1toido16:sumSV2=sumSV2+tmpj17:endfor18:scoreValisumSV2/sumSV119:endforE.TheScoredOver-SamplingMethodTheobjectiveoftheSOSalgorithmistoselectdatafromtheminoritygroupdependingonthescoreVal,computedbyeitherUSalgorithmorBSalgorithm.LetMDidenotedataith,for1ithnminor.Thenumberofdatainminoritygroupandmajoritygroupsarenminorandnmajor,respectively.Anoutputofthisalgorithmisasetofadditionaldataaddedtotheminoritygroup,samp_data.1:z=nmajornminor2:fori=1to|scoreVal|do3:sumSV1=sumSV1+scoreVali4:endfor5:fori=1to|scoreVal|do6:sumSV2=07:forj=1toido8:sumSV2=sumSV2+scoreValj9:endfor10:mapScorei=sumSV2/sumSV111:endfor12:fori=1tozdo13:selectPos=rand(1)14:ifselectPos0andselectPosmapScore1then15:samp_datai=MD116:else17:forj=2to|scoreVal|do18:ifselectPosmapScorej1andselectPosmapScorejthen19:samp_datai=MDj20:endif21:endfor22:endif23:endforIV.EXPERIMENTSANDRESULTSTableIIshowsthecomparisonoftheIFGA-BoostMode-SVM,ORF9,andCART10by10-foldcrossvalidationofThalassemiasandCrohnsdiseases.OurIFGA-BoostMode-SVMperformsbetterclassificationthanthestandardORFandCARTmethods.Notethat,nofeat.,acc.,sen.,andspec.inTableIIarethenumberoffeatures,accuracy,sensitivity,andspecificity,respectively.Thalassemiadataset(503patientswith835SNPs)wereobtainedfromtheThalassemiaResearchCenter,MahidolUniversityandtheCrohndataset(357patientswith103SNPs)areobtainedfrom11.Missingdatawereinferredby2SNPphasingmethod12.ForSVM,asoftmarginRBFkernelfunctionwith=0.5wasdeployedtoanalyzebothCrohnsandThalasemiasdataset.DummyencodingisappliedforSVMasvectors100,010,and001whereagenotypeismajorhomozygote,minorhomozygote,andheterozygote,respectively.InIFGA,eachchromosomesizeisvariedfrom1to10.Therefore,featureselectionfrom1featureto10featuresisprocessed.ParametersintheIFGAweresetasfollows:thenumberofchromosomesis1000,thecross-overrateis0.7forThalassemiasand0.8forCrohnsdiseases,andthemutationrateis0.035forThalassemiasand0.001forCrohnsdiseases.TABLEI.THEEXPERIMENTALRESULTSDataset+Algorithmnofeat.acc.(%)sen.(%)spec.(%)Thal.+IFGA-BoostMode-SVM671.5776.3964.14Thal.+ORF654.2769.8430.30Thal.+CART669.3876.0759.09Crohn+IFGA-BoostMode-SVM871.0664.5874.90Crohn+ORF857.8820.1480.25Crohn+CART863.3123.6186.83V.CONCLUSIONAnewIFGAwithBoostMode-SVMwasproposedtoidentifythesusceptiblelocifromthecase-controlassociationstudies.TheIFGAtechniqueencodeschromosomesasdifferentintegersizes.TheSOStechniquesamplestheminoritydatasetbytwoscoringapproaches(USandBS)areproposed.Thismethodcanverywellbeappliedinthecase-controlassociationstudies.Theexperimentalresultsfromtworealdatasets:CrohnsandThalassemiasdiseasesshowthatfeatureselectionandclassificationbytheIFGAwithBoostMode-SVMoutperformsthestandardORF,andCARTtechniques.REFERENCES1J.Marchini,P.Donnelly,andL.R.Cardon,“Genome-widestrategiesfordetectingmultiplelocithatinfluencecomplexdiseases,”NatureGenetics,vol.37,pp.413417,March2005.2D.J.Weatherall,“Science,medicine,andthefuture:Singlegenedisordersorcomplextraits:Lessonsfromthethalassaemiasandothermonogenicdiseases,”BMJ,v
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024年配电屏项目调研分析报告
- 2024年汽车故障项目调研分析报告
- 部编道德与法治七年级下册1.1《悄悄变化的我》课时练习
- 【可行性报告】2023年非公路矿用车项目可行性研究分析报告
- 2023年隔音降噪设备:隔音吸声材料相关行业项目操作方案
- 2025版高考语文一轮总复习复习任务群3古诗文阅读专题6古代诗歌阅读分点突破3鉴赏诗歌的表达技巧课件
- 2024八年级英语下册Module9FriendshipUnit3Languageinuse课件新版外研版
- 【初中语文】++《紫藤萝瀑布》课件+统编版语文七年级下册+
- 开源软件之消息系统-kafka安装配置
- 2024年微晶氧化铝陶瓷耐磨材料项目资金申请报告代可行性研究报告
- 质量责任损失赔偿及质量事故处理办法
- 紧固件验收规范
- 东南大学基础工程试题库
- 三年级下册数学两位数乘两位数竖式计算
- 名著导读《简爱》ppt课件(58页)
- 单位请假条模板(共1页)
- 广州牛津版八年级上册英语新版单词表(默写版)
- 工程报验申请表—GD-B1-228
- 硫酸购销合同
- 某某部工作指引模板
- 五卅惨案简介
评论
0/150
提交评论