外文资料--Searching Single Nucleotide Polymorphism Markers.PDF_第1页
外文资料--Searching Single Nucleotide Polymorphism Markers.PDF_第2页
外文资料--Searching Single Nucleotide Polymorphism Markers.PDF_第3页
外文资料--Searching Single Nucleotide Polymorphism Markers.PDF_第4页
全文预览已结束

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

SearchingSingleNucleotidePolymorphismMarkerstoComplexDiseasesusingGeneticAlgorithmFrameworkandaBoostModeSupportVectorMachineKhantharatAnekboon,SuphakantPhimoltares,andChidchanokLursinsapAVIC,DepartmentofMathematics,ChulalongkornUniversity,Bangkok,ThailandKhantharat.AStudent.chula.ac.th,suphakant.pchula.ac.th,andlchidchachula.ac.thSissadesTongsimaGenomeInstitute,NationalCenterforGeneticEngineeringandBiotechnology,Pathumtani,Thailandsissadesbiotec.or.thSuthatFucharoenThalassemiaResearchCenter,InstituteofMolecularBiosciences,MahidolUniversity,SalayaCampus,Nakhonpathom,Thailandgrsfcmahidol.ac.thAbstractWiththeadventoflarge-scalehighdensitysinglenucleotidepolymorphism(SNP)arrays,case-controlassociationstudieshavebeenperformedtoidentifypredisposinggeneticfactorsthatinfluencemanycommoncomplexdiseases.ThesegenotypingplatformsprovideverydenseSNPcoverageperonechip.Muchresearchhasbeenfocusingonmultivariategeneticmodeltoidentifygenesthatcanpredictthediseasestatus.However,increasingthenumberofSNPsgenerateslargenumberofcombinedgeneticoutcomestobetested.ThisworkpresentsanewmathematicalalgorithmforSNPanalysiscalledIFGAthatusesa“BoostMode”supportvectormachine(SVM)toselectthebestsetofSNPmarkersthatcanpredictastateofcomplexdiseases.Theproposedalgorithmhasbeenappliedtotestfortheassociationstudyintwodiseases,namelyCrohnsandseverityspectrumof0/HbEThalassemiadiseases.TheresultsrevealedthatourpredictedSNPscanrespectivelybestclassifybothdiseasesat71.57%and71.06%accuracyusing10-foldcrossvalidationcomparingwiththeoptimumrandomforest(ORF)andclassificationandregressiontrees(CART)techniques.Keywords-SingleNucleotidePolymorphism;SupportVectorMachine;GeneticAlgorithmI.INTRODUCTIONScientistshavelongbeeninterestedinidentifyinggeneticfactorsthatinfluencetheoccurrenceofcomplexdiseases.Withtheadventofparallelgenotypingtechnology,costandtimeinfindingSNPsarenotoutofreach.Largecase-controlcohortsgeneratedfromverydenseSNParrays(DNAchipcontainsdensearrayofSNPs)challengingresearcherstosearchforSNPsthatareassociatedwiththediseases.Incontrasttothesinglegenedisorders,thestateofcomplexdiseasescouldbetriggeredfrommultiplegeneswhenexposingtocertainenvironmentalfactors1,2.However,searchingformultiplemarkerinteractionsfromalargepoolofSNPsimposeshighcomputationalandmemorycomplexity.Atechniqueofselectingsubsetofrelevantfeatures,namedFeatureSelection3,hasbeenwidelyusedinalmostfields,includingbioinformatics.Thistechniqueprovidesmoreeffectivewaytoimprovelearningaccuracytounderstandtheimportanceofthefeaturesbyremovingirreverentorredundantones.II.THEPROPOSEDIFGAMETHODInthissection,weintroduceanewencodingmethodcalledIFGA.Fig.1demonstratesthesummaryoftheIFGAmethod.Thefirstpopulationisconstructedbyourproposedintegerencodingapproach.Thedatainthechromosome(inGeneticAlgorithm(GA)context)arerepresentedbyasetofselectedfeatures.Afterthepopulationisgenerated,eachchromosomeisevaluatedbyafitnessscore.ThisscoreisobtainedbyusingtheBoostMode-SVMapproach.Then,theIFGAre-generatesthenextpopulationbyIFGAselection,IFGAcross-over,andIFGAmutationuntilaterminationcriterionissatisfied.A.TheIntegerEncodingMethodUtilizingGAtoperformfeatureselectioncanbedonebyconvertinginputdatausingbinaryencoding4.Thelengthof978-1-4244-4713-8/10/$25.002010IEEEFigure1.TheoverallIFGAflowchar.achromosomeequalsanumberofallfeatures.Thesizeofencodedchromosomecorrespondsdirectlytothenumberofinputfeatures.This,however,presentsaproblemduetotworeasons.First,therunningtimehighlydependsonthelengthofchromosome.Second,ageneralbinaryencodingdoesnotfixanumberofselectedfeatures.Itfixesonlythelengthofthechromosome.TheIFGAintegerencodingmethodisproposedtosolvetheseproblems.Assumethatacase-controldatausedinthisstudyhavemnumberofgenotypes.LetQibetheithchromosomeprocessedinthealgorithm.ThelengthofQi,denotedby|Qi|,issettoaconstantlessthanorequaltom.Then,random|Qi|numbers,representthelocationstoselectthecorrespondinggenotypesfromagivenfeaturesequence.DuringtheIFGA,thelengthofeachchromosomeisnotnecessarilyidentical.Forexample,supposem=7,thechromosomesize(|Qi|)issetto3,andtherandomlyselectedlocationsare1,5,and6.So,thechromosomeQi=1,5,6.B.IFGASelectionEachindividualchromosomeisselectedbasedonitsfitnessscoreintoamatingpoolbyastochasticuniversalsamplingmethod(SUS)5.TheIFGAalsousesanelitismtechnique6,inwhichthenextgenerationchromosomederivesfromthebestchromosomeinacurrentgeneration.C.IFGACross-OverThecross-overfunctionoftraditionalGArandomlyselectstherecombinationpointandswapsthetwochromosomesflankingthispoint.Cross-overfromtheoriginalGA,however,cannotbeappliedtotheIFGAapproachbecauseallchromosomesmusthavethesamesizeandfeaturesfromthesamelocicannotbeonthesamechromosome.WemustdeviseanIFGAcross-overtechniquetoovercomethisproblem.Assumethat,parent1andparent2aretheparentalchromosomeswhereeachlocusisthepositionofselectedfeature.Eithernumberofparent1sorparent2slocusmustbemorethan1.Numberofbothparentsloci(parent1andparent2)mustbegreaterthanorequaltoone.Outputsfromthisalgorithmareoffspring1sandoffspring2.1:x2:y3:tmp1parent14:fori=0to|parent1|do5:v|tmp1|6:selrandom(1,2,.,v)7:xxsel8:tmp1tmp1parent1selsuppress9:endfor10:tmp2parent211:fori=0to|parent2|do12:v|tmp2|13:selrandom(1,2,.,v)14:yysel15:tmp2tmp2parent2sel16:endfor17:crandom1,min(|parent1|,|parent2|)118:offspring1x1,x2,.,xc,yc+1,.,y|parent2|19:offspring2y1,y2,.,yc,xc+1,.,x|parent1|D.IFGAMutationMutationfunctionaltersthevalueofaspecifiedlocus.Ithardlyoccurswhencomparingwiththecross-overprocess.IFGAmutationispresentedhere.Letmdenotethelengthofagivengenotypesequence,input_chromisachromosomethatwillbemutated,andoutput_chromisamutatedchromosome.Eachelementinachromosomeisaselectedfeature.1:pos_outrandom1,|input_chrom|2:pos_inrandom1,m3:fori=1to|input_chrom|do4:ifi=pos_outthen5:output_chromipos_in6:else7:output_chromiinput_chromi8:endif9:endforE.GeneratingaPopulationTherearetwokindsofpopulation,theinitialpopulationandthenextgenerationpopulation.TogeneratetheinitialpopulationwithPchromosomes,wherePisauser-definednumberofchromosomesinthepopulation,thealgorithmrepeatedlygeneratesthechromosomesbyintegerencodingmethodandaddsthemintothesetofpopulationuntilthenumberofthechromosomesinthepopulationisequaltoP.Ontheotherhand,thepopulationinthenextgenerationconsistsofthechromosomeb,thebestfitnessscorefromthecurrentgeneration,egroupsoffeaturesfromevolution,cross-overandmutation,andrgroupsofthefeaturesfromthenewre-selectedfeatures.Afteraddingbandetothenextgeneration,thosechromosomesarecheckedforredundancy.Eachchromosomemustbeidenticalinthenextgeneration.Duplicatedchromosomeswillberemoved.Ifthenumberofchromosomesinthenextgenerationislessthanthenumberofchromosomesinthecurrentgenerationthenanewsubsetsoffeatures,r,willberandomlycreatedandaddedtothenextgeneration.F.TerminationThisIFGAalgorithmconsistsofasetofrecursivestepsforgeneratingthepopulation,evaluationbyaBoostMode-SVM,IFGAselection,IFGAcross-over,andIFGAmutation.Thesestepsareexecuteduntilthenumberofthebestresultsremainsconstantinthenext300iterations.III.THEPROPOSEDBOOSTMODE-SVMMETHODThegoalofSVM7istofindamaximalseparatinghyperplane:eitherfor(1)linearlyseparablecaseor(2)thenonlinearlyseparablecase.Notedthat,wTisatransposevectorofweight,xiisaninputvector,isamappingfunction,andbisabiasvalue.yi=sign(wxi+b)(1)yi=sign(w(xi)+b)(2)Theseequationsfacethesameproblemoccurredwhentheinputdataareimbalanced.Thelearnedseparatinghyperplanefromimbalanceddatasetmayshifttoomuchinthedirectiontowardsthesmallergroupcomparedwiththetrueseparatinghyperplane8.Tosolvethisproblem,thedecisionhyperplaneshouldbeadjusted.Itcanbeseenfrom(1)and(2)thattheparameterweffectstheclassificationoutput.So,modifyingwwilladjustthedecisionhyperplane,whichmayimprovetheclassifier.A.BoostMode-SVMAnewtechniqueofoversamplingfornominalfeatureisproposedtoimprovetheperformanceoftheSVM.TheBoostMode-SVM(Fig.1)generatestwoSVMs,namelySVM1andSVM2.TheSVM1isconstructedforgeneratingthescoreofthetrainingdatasetwhereastheSVM2isthefinalSVMmodelforclassificationthetestset.First,onlythetrainingsetisusedtoconstructtheSVM1andtofindtheBoostMode.ThisBoostModeistheindicatorvectoroftheminoritydataset.ItisbroughttotestwiththeSVM1.Twoscoringmethods,anUnbiasedScoring(US)andaBiasScoring(BS),areproposedtofindthescoringvalue.TheUSmethodisperformedwhentheSVM1correctlyclassifiestheBoostMode,otherwisetheBSmethodisperformed.Afterthat,aScoringOver-Samplingapproach(SOS)isprocessedforaddingartificialdatatominoritygroupbysamplingthedataoftheminoritygroupuntilanumberofdataofbothgroupsareequal.Theminoritygroupinthispapermeansthegroupofdatahavingfewerelements.ThenewSVM2isconstructedfortheclassificationbytheprevioustrainingdatasetandnewsetofdatafromtheSOStechnique.Finally,thetestsetisrunintheSVM2fortheevaluation.TheerrorrateforthetestsetisthefitnessscorevalueusingintheIFGAsectionabove.B.FindingtheBoostModeTobalancethesizeofdatafrombothclasses,someadditionaldataintheminoritygroupmustbegenerated.Theselectedgeneratingmethod(eitherUSorBS)willdependuponaBoostModevector.ThefollowingproceduredescribeshowtocomputetheBoostModevector.Letnminorbethenumberofdataintheminoritygroup.Boostrapsamplingwithreplacementisappliedontheminoritygrouptogeneratetdatasets,i.e.BoostGroup1,.,BoostGroupt.EachBoostGroupicontainsnminordata.1:fori=1totdo.2:allmodeimode(BoostGroupi)3:endfor4:BoostModemode(allmodei)iC.TheUnbiasedScoringMethodThistechniqueisprocessedwhentheSVM1classifiestheBoostModecorrectly.Alldatapointshaveequalchances(equalscoringvalues)tobeselectedfortheover-samplingtechnique.ThefollowingalgorithmdescribestheprocessoffindingthescoringvaluebytheUStechnique.ThescoreValisanoutputfromthisalgorithm.1:fori=1tonminordo2:scoreVali=1/nminor3:endforD.TheBiasScoringMethodTheBStechniqueisrunwhentheSVM1incorrectlyclassifiesbytheBoostMode.Thescoringvalueiscalculatedfromthedistanceofitspointtothedecisionhyperplaneby(3)forlinearseparabilityor(4)fornonlinearseparability.distancei=wxi+b(3)distancei=w(xi)+b(4)Thedatapointthatiscorrectlyclassifiedhaslesserchance(lessscoringvalue)tobeselectedfortheoversamplingprocessthantheonethatiswronglyclassified.Therefore,increasinginnumberofincorrectclassificationswouldinfluencethehigherchanceofsamplestobechosenandviceversa.ThescoringvaluefortheBSmethodisdescribedbythefollowingalgorithm.Letdistancebeasetofdistancesofalldatapointsintheminoritygroup.TheoutputfromthisalgorithmisasetofscoreVal.1:sumSV102:minValmin(distancei)i3:addVal=absolute(minVal)+14:fori=1tonminordo5:tmpi=distancei+addVal6:sumSV1=sumSV1+tmpi7:endfor8:iftheminoritygroupisthecontrolgroupthen9:fori=1tonminordo10:tmpi=2tmpi11:endfor12:endif13:fori=1tonminordo14:sumSV2=015:forj=1toido16:sumSV2=sumSV2+tmpj17:endfor18:scoreValisumSV2/sumSV119:endforE.TheScoredOver-SamplingMethodTheobjectiveoftheSOSalgorithmistoselectdatafromtheminoritygroupdependingonthescoreVal,computedbyeitherUSalgorithmorBSalgorithm.LetMDidenotedataith,for1ithnminor.Thenumberofdatainminoritygroupandmajoritygroupsarenminorandnmajor,respectively.Anoutputofthisalgorithmisasetofadditionaldataaddedtotheminoritygroup,samp_data.1:z=nmajornminor2:fori=1to|scoreVal|do3:sumSV1=sumSV1+scoreVali4:endfor5:fori=1to|scoreVal|do6:sumSV2=07:forj=1toido8:sumSV2=sumSV2+scoreValj9:endfor10:mapScorei=sumSV2/sumSV111:endfor12:fori=1tozdo13:selectPos=rand(1)14:ifselectPos0andselectPosmapScore1then15:samp_datai=MD116:else17:forj=2to|scoreVal|do18:ifselectPosmapScorej1andselectPosmapScorejthen19:samp_datai=MDj20:endif21:endfor22:endif23:endforIV.EXPERIMENTSANDRESULTSTableIIshowsthecomparisonoftheIFGA-BoostMode-SVM,ORF9,andCART10by10-foldcrossvalidationofThalassemiasandCrohnsdiseases.OurIFGA-BoostMode-SVMperformsbetterclassificationthanthestandardORFandCARTmethods.Notethat,nofeat.,acc.,sen.,andspec.inTableIIarethenumberoffeatures,accuracy,sensitivity,andspecificity,respectively.Thalassemiadataset(503patientswith835SNPs)wereobtainedfromtheThalassemiaResearchCenter,MahidolUniversityandtheCrohndataset(357patientswith103SNPs)areobtainedfrom11.Missingdatawereinferredby2SNPphasingmethod12.ForSVM,asoftmarginRBFkernelfunctionwith=0.5wasdeployedtoanalyzebothCrohnsandThalasemiasdataset.DummyencodingisappliedforSVMasvectors100,010,and001whereagenotypeismajorhomozygote,minorhomozygote,andheterozygote,respectively.InIFGA,eachchromosomesizeisvariedfrom1to10.Therefore,featureselectionfrom1featureto10featuresisprocessed.ParametersintheIFGAweresetasfollows:thenumberofchromosomesis1000,thecross-overrateis0.7forThalassemiasand0.8forCrohnsdiseases,andthemutationrateis0.035forThalassemiasand0.001forCrohnsdiseases.TABLEI.THEEXPERIMENTALRESULTSDataset+Algorithmnofeat.acc.(%)sen.(%)spec.(%)Thal.+IFGA-BoostMode-SVM671.5776.3964.14Thal.+ORF654.2769.8430.30Thal.+CART669.3876.0759.09Crohn+IFGA-BoostMode-SVM871.0664.5874.90Crohn+ORF857.8820.1480.25Crohn+CART863.3123.6186.83V.CONCLUSIONAnewIFGAwithBoostMode-SVMwasproposedtoidentifythesusceptiblelocifromthecase-controlassociationstudies.TheIFGAtechniqueencodeschromosomesasdifferentintegersizes.TheSOStechniquesamplestheminoritydatasetbytwoscoringapproaches(USandBS)areproposed.Thismethodcanverywellbeappliedinthecase-controlassociationstudies.Theexperimentalresultsfromtworealdatasets:CrohnsandThalassemiasdiseasesshowthatfeatureselectionandclassificationbytheIFGAwithBoostMode-SVMoutperformsthestandardORF,andCARTtechniques.REFERENCES1J.Marchini,P.Donnelly,andL.R.Cardon,“Genome-widestrategiesfordetectingmultiplelocithatinfluencecomplexdiseases,”NatureGenetics,vol.37,pp.413417,March2005.2D.J.Weatherall,“Science,medicine,andthefuture:Singlegenedisordersorcomplextraits:Lessonsfromthethalassaemiasandothermonogenicdiseases,”BMJ,v

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论