Foreign-language material: Searching Single Nucleotide Polymorphism Markers.PDF


Searching Single Nucleotide Polymorphism Markers to Complex Diseases using Genetic Algorithm Framework and a BoostMode Support Vector Machine

Khantharat Anekboon, Suphakant Phimoltares, and Chidchanok Lursinsap
AVIC, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand
Khantharat.A@Student.chula.ac.th, suphakant.p@chula.ac.th, and lchidcha@chula.ac.th

Sissades Tongsima
Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand
sissades@biotec.or.th

Suthat Fucharoen
Thalassemia Research Center, Institute of Molecular Biosciences, Mahidol University, Salaya Campus, Nakhonpathom, Thailand
grsfc@mahidol.ac.th

Abstract: With the advent of large-scale, high-density single nucleotide polymorphism (SNP) arrays, case-control association studies have been performed to identify predisposing genetic factors that influence many common complex diseases. These genotyping platforms provide very dense SNP coverage on a single chip. Much research has focused on multivariate genetic models to identify genes that can predict the disease status. However, increasing the number of SNPs generates a large number of combined genetic outcomes to be tested. This work presents a new mathematical algorithm for SNP analysis, called IFGA, that uses a BoostMode support vector machine (SVM) to select the best set of SNP markers that can predict the state of a complex disease. The proposed algorithm has been applied to association studies of two diseases, namely Crohn's disease and the severity spectrum of β⁰/HbE Thalassemia. The results show that the selected SNPs classify Thalassemia with 71.57% accuracy and Crohn's disease with 71.06% accuracy under 10-fold cross-validation, compared with the optimum random forest (ORF) and classification and regression trees (CART) techniques.

Keywords: Single Nucleotide Polymorphism; Support Vector Machine; Genetic Algorithm

978-1-4244-4713-8/10/25.00 ©2010 IEEE

I. INTRODUCTION

Scientists have long been interested in identifying genetic factors that influence the occurrence of complex diseases. With the advent of parallel genotyping technology, the cost and time of finding SNPs are no longer out of reach. Large case-control cohorts generated from very dense SNP arrays (a DNA chip contains a dense array of SNPs) challenge researchers to search for SNPs that are associated with the diseases. In contrast to single-gene disorders, the state of a complex disease can be triggered by multiple genes upon exposure to certain environmental factors [1], [2]. However, searching for multiple marker interactions in a large pool of SNPs imposes high computational and memory complexity. A technique for selecting a subset of relevant features, named Feature Selection [3], has been widely used in almost all fields, including bioinformatics. This technique provides a more effective way to improve learning accuracy and to understand the importance of the features by removing irrelevant or redundant ones.

II. THE PROPOSED IFGA METHOD

In this section, we introduce a new encoding method called IFGA. Fig. 1 summarizes the IFGA method. The first population is constructed by our proposed integer encoding approach. In the Genetic Algorithm (GA) context, the data in a chromosome are represented by a set of selected features. After the population is generated, each chromosome is evaluated by a fitness score. This score is obtained by using the BoostMode SVM approach. Then, the IFGA regenerates the next population by IFGA selection, IFGA crossover, and IFGA mutation until a termination criterion is satisfied.

Figure 1. The overall IFGA flowchart.

A. The Integer Encoding Method

Using a GA to perform feature selection can be done by converting the input data using binary encoding [4]. The length of a chromosome equals the number of all features, so the size of the encoded chromosome corresponds directly to the number of input features. This, however, presents a problem for two reasons. First, the running time depends heavily on the length of the chromosome. Second, a general binary encoding does not fix the number of selected features. It fixes only the length of the chromosome. The IFGA integer encoding method is proposed to solve these problems.

Assume that the case-control data used in this study have $m$ genotypes. Let $Q_i$ be the $i$th chromosome processed in the algorithm. The length of $Q_i$, denoted by $|Q_i|$, is set to a constant less than or equal to $m$. Then, $|Q_i|$ random numbers represent the locations at which the corresponding genotypes are selected from a given feature sequence. During the IFGA, the lengths of the chromosomes are not necessarily identical. For example, suppose $m = 7$, the chromosome size $|Q_i|$ is set to 3, and the randomly selected locations are 1, 5, and 6. Then the chromosome is $Q_i = \{1, 5, 6\}$.

B. IFGA Selection

Each individual chromosome is selected into a mating pool, based on its fitness score, by the stochastic universal sampling (SUS) method [5]. The IFGA also uses an elitism technique [6], in which the next-generation chromosome derives from the best chromosome in the current generation.

C. IFGA Crossover

The crossover function of a traditional GA randomly selects a recombination point and swaps the parts of the two chromosomes flanking this point. Crossover from the original GA, however, cannot be applied to the IFGA approach, because it assumes that all chromosomes have the same size, and because features from the same locus cannot appear on the same chromosome. We must devise an IFGA crossover technique to overcome this problem. Assume that parent1 and parent2 are the parental chromosomes, where each locus is the position of a selected feature. At least one of parent1 and parent2 must have more than one locus, and both must have at least one locus. The outputs of this algorithm are offspring1 and offspring2.

    x ← ø
    y ← ø
    tmp1 ← parent1
    for i = 1 to |parent1| do
        v ← |tmp1|
        sel ← random{1, 2, ..., v}
        x ← x ∪ {tmp1[sel]}
        tmp1 ← tmp1 - {tmp1[sel]}
    end for
    tmp2 ← parent2
    for i = 1 to |parent2| do
        v ← |tmp2|
        sel ← random{1, 2, ..., v}
        y ← y ∪ {tmp2[sel]}
        tmp2 ← tmp2 - {tmp2[sel]}
    end for
    c ← random[1, min(|parent1|, |parent2|) - 1]
    offspring1 ← {x[1], x[2], ..., x[c], y[c+1], ..., y[|parent2|]}
    offspring2 ← {y[1], y[2], ..., y[c], x[c+1], ..., x[|parent1|]}

D. IFGA Mutation

The mutation function alters the value of a specified locus. It occurs rarely compared with the crossover process. IFGA mutation is presented here. Let $m$ denote the length of a given genotype sequence, let input_chrom be the chromosome that will be mutated, and let output_chrom be the mutated chromosome. Each element in a chromosome is a selected feature.

    pos_out ← random[1, |input_chrom|]
    pos_in ← random[1, m]
    for i = 1 to |input_chrom| do
        if i = pos_out then
            output_chrom[i] ← pos_in
        else
            output_chrom[i] ← input_chrom[i]
        end if
    end for

E. Generating a Population

There are two kinds of population: the initial population and the next-generation population. To generate the initial population with $P$ chromosomes, where $P$ is a user-defined number of chromosomes in the population, the algorithm repeatedly generates chromosomes by the integer encoding method and adds them to the population until the number of chromosomes in the population equals $P$. The population in the next generation, on the other hand, consists of the chromosome $b$ with the best fitness score from the current generation, $e$ groups of features from evolution (crossover and mutation), and $r$ groups of features from newly reselected features. After $b$ and $e$ are added to the next generation, those chromosomes are checked for redundancy. Each chromosome must be unique in the next generation, so duplicated chromosomes are removed. If the number of chromosomes in the next generation is then less than the number of chromosomes in the current generation, new subsets of features, $r$, are randomly created and added to the next generation.

F. Termination

The IFGA algorithm consists of a set of repeated steps: generating the population, evaluation by a BoostMode SVM, IFGA selection, IFGA crossover, and IFGA mutation. These steps are executed until the best result remains constant over the next 300 iterations.

III. THE PROPOSED BOOSTMODE SVM METHOD

The goal of an SVM [7] is to find a maximal separating hyperplane for either (1) the linearly separable case or (2) the nonlinearly separable case. Note that $w^{T}$ is the transpose of the weight vector, $x_i$ is an input vector, $\phi$ is a mapping function, and $b$ is a bias value.

$$ y_i = \mathrm{sign}(w^{T} \cdot x_i + b) \tag{1} $$

$$ y_i = \mathrm{sign}(w^{T} \cdot \phi(x_i) + b) \tag{2} $$

Both equations face the same problem when the input data are imbalanced. The separating hyperplane learned from an imbalanced dataset may shift too much in the direction of the smaller group compared with the true separating hyperplane [8]. To solve this problem, the decision hyperplane should be adjusted. It can be seen from (1) and (2) that the parameter $w$ affects the classification output. So, modifying $w$ will adjust the decision hyperplane, which may improve the classifier.

A. BoostMode SVM

A new oversampling technique for nominal features is proposed to improve the performance of the SVM. The BoostMode SVM (Fig. 1) generates two SVMs, namely SVM1 and SVM2. SVM1 is constructed to generate the scores of the training dataset, whereas SVM2 is the final SVM model for classifying the test set. First, only the training set is used to construct SVM1 and to find the BoostMode. This BoostMode is the indicator vector of the minority dataset. It is tested with SVM1. Two scoring methods, Unbiased Scoring (US) and Bias Scoring (BS), are proposed to find the scoring values. The US method is performed when SVM1 correctly classifies the BoostMode; otherwise the BS method is performed. After that, a Scoring Over-Sampling (SOS) approach adds artificial data to the minority group by sampling the data of the minority group until both groups contain the same number of data. The minority group in this paper means the group of data having fewer elements. The new SVM2 is constructed for classification from the previous training dataset and the new set of data from the SOS technique. Finally, the test set is run through SVM2 for the evaluation. The error rate on the test set is the fitness score used in the IFGA section above.

B. Finding the BoostMode

To balance the size of the data from both classes, some additional data in the minority group must be generated. The generating method selected, either US or BS, depends upon a BoostMode vector. The following procedure describes how to compute the BoostMode vector. Let $n_{minor}$ be the number of data in the minority group. Bootstrap sampling with replacement is applied to the minority group to generate $t$ datasets, i.e. {BoostGroup[1], ..., BoostGroup[t]}. Each BoostGroup[i] contains $n_{minor}$ data.

    for i = 1 to t do
        allmode[i] ← mode(BoostGroup[i])
    end for
    BoostMode ← mode over i of allmode[i]

C. The Unbiased Scoring Method

This technique is used when SVM1 classifies the BoostMode correctly. All data points have equal chances (equal scoring values) of being selected for the oversampling technique. The following algorithm describes the process of finding the scoring values by the US technique. scoreVal is the output of this algorithm.

    for i = 1 to n_minor do
        scoreVal[i] ← 1 / n_minor
    end for

D. The Bias Scoring Method

The BS technique is run when SVM1 classifies the BoostMode incorrectly. The scoring value is calculated from the distance of each point to the decision hyperplane by (3) for linear separability or (4) for nonlinear separability.

$$ distance_i = w^{T} \cdot x_i + b \tag{3} $$

$$ distance_i = w^{T} \cdot \phi(x_i) + b \tag{4} $$

A data point that is correctly classified has a smaller chance (a lower scoring value) of being selected for the oversampling process than one that is wrongly classified. Therefore, wrongly classified samples have a higher chance of being chosen, and vice versa. The scoring value for the BS method is described by the following algorithm. Let distance be the set of distances of all data points in the minority group. The output of this algorithm is a set of scoreVal.

    sumSV1 ← 0
    minVal ← min over i of distance[i]
    addVal ← absolute(minVal) + 1
    for i = 1 to n_minor do
        tmp[i] ← distance[i] + addVal
        sumSV1 ← sumSV1 + tmp[i]
    end for
    if the minority group is the control group then
        for i = 1 to n_minor do
            tmp[i] ← 2 - tmp[i]
        end for
    end if
    for i = 1 to n_minor do
        sumSV2 ← 0
        for j = 1 to i do
            sumSV2 ← sumSV2 + tmp[j]
        end for
        scoreVal[i] ← sumSV2 / sumSV1
    end for

E. The Scored Over-Sampling Method

The objective of the SOS algorithm is to select data from the minority group depending on the scoreVal computed by either the US algorithm or the BS algorithm. Let MD[i] denote the $i$th datum, for $1 \le i \le n_{minor}$. The numbers of data in the minority and majority groups are $n_{minor}$ and $n_{major}$, respectively. The output of this algorithm is a set of additional data added to the minority group, samp_data.

    z ← n_major - n_minor
    sumSV1 ← 0
    for i = 1 to |scoreVal| do
        sumSV1 ← sumSV1 + scoreVal[i]
    end for
    for i = 1 to |scoreVal| do
        sumSV2 ← 0
        for j = 1 to i do
            sumSV2 ← sumSV2 + scoreVal[j]
        end for
        mapScore[i] ← sumSV2 / sumSV1
    end for
    for i = 1 to z do
        selectPos ← rand(1)
        if selectPos ≥ 0 and selectPos ≤ mapScore[1] then
            samp_data[i] ← MD[1]
        else
            for j = 2 to |scoreVal| do
                if selectPos > mapScore[j-1] and selectPos ≤ mapScore[j] then
                    samp_data[i] ← MD[j]
                end if
            end for
        end if
    end for

IV. EXPERIMENTS AND RESULTS

Table I compares the IFGA-BoostMode SVM, ORF [9], and CART [10] under 10-fold cross-validation on the Thalassemia and Crohn's disease datasets. Our IFGA-BoostMode SVM achieves higher accuracy than the standard ORF and CART methods on both datasets. Note that no. feat., acc., sen., and spec. in Table I are the number of features, accuracy, sensitivity, and specificity, respectively. The Thalassemia dataset (503 patients with 835 SNPs) was obtained from the Thalassemia Research Center, Mahidol University, and the Crohn dataset (357 patients with 103 SNPs) was obtained from [11]. Missing data were inferred by the 2SNP phasing method [12]. For the SVM, a soft-margin RBF kernel function with $\sigma = 0.5$ was used to analyze both the Crohn's and Thalassemia datasets. Dummy encoding is applied for the SVM, with the vectors 100, 010, and 001 denoting a major homozygote, minor homozygote, and heterozygote genotype, respectively. In IFGA, the chromosome size is varied from 1 to 10, so feature selection is performed for 1 to 10 features. The IFGA parameters were set as follows: the number of chromosomes is 1000, the crossover rate is 0.7 for Thalassemia and 0.8 for Crohn's disease, and the mutation rate is 0.035 for Thalassemia and 0.001 for Crohn's disease.

TABLE I. THE EXPERIMENTAL RESULTS

    Dataset  Algorithm            no. feat.  acc. (%)  sen. (%)  spec. (%)
    Thal.    IFGA-BoostMode SVM   6          71.57     76.39     64.14
    Thal.    ORF                  6          54.27     69.84     30.30
    Thal.    CART                 6          69.38     76.07     59.09
    Crohn    IFGA-BoostMode SVM   8          71.06     64.58     74.90
    Crohn    ORF                  8          57.88     20.14     80.25
    Crohn    CART                 8          63.31     23.61     86.83

V. CONCLUSION

A new IFGA with BoostMode SVM was proposed to identify susceptible loci from case-control association studies. The IFGA technique encodes chromosomes as sets of integers of different sizes. The SOS technique samples the minority dataset using one of two proposed scoring approaches, US and BS. This method can readily be applied in case-control association studies. The experimental results from two real datasets, Crohn's disease and Thalassemia, show that feature selection and classification by the IFGA with BoostMode SVM outperform the standard ORF and CART techniques in accuracy.

REFERENCES

[1] J. Marchini, P. Donnelly, and L. R. Cardon, "Genome-wide strategies for detecting multiple loci that influence complex diseases," Nature Genetics, vol. 37, pp. 413–417, March 2005.
[2] D. J. Weatherall, "Science, medicine, and the future: Single gene disorders or complex traits: Lessons from the thalassaemias and other monogenic diseases," BMJ, vol. 321, pp. 1117–1120, November 2000.
[3] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, pp. 2507–2517, October 2007.
[4] X. P. Zeng, Y. M. Li, and J. Qin, "A dynamic chain-like agent genetic algorithm for global numerical optimization and feature selection," Neurocomputing, vol. 72, pp. 1214–1228, January 2009.
[5] J. E. Baker, "Reducing bias and inefficiency in the selection algorithm," in Proc. the Second International Conference on Genetic Algorithms and their Application, Hillsdale, NJ, USA, 1987, pp. 14–21.
[6] A. K. Bhatia and S. K. Basu, "Implicit elitism in genetic search," in Proc. ICONIP 13th Int. Conf., Hong Kong, China, 2006, pp. 781–788.
[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, September 1995.
[8] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. ECML the 15th European Conf. on Machine Learning, Pisa, Italy, 2004, pp. 39–50.
[9] W. Mao and S. Kelly, "An optimum random forest model for prediction of genetic susceptibility to complex diseases," in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, China, 2007, pp. 193–204.
[10] A. Gerger et al., "A multigenic approach to predict breast cancer risk," Epidemiology, vol. 104, pp. 159–164, August 2007.
[11] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, "High-resolution haplotype structure in the human genome," Nature Genetics, vol. 29, pp. 229–232, October 2001.
[12] D. Brinza and A. Zelikovsky, "2SNP: Scalable phasing method for trios and unrelated individuals," Journal of IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, pp. 313–318, April 2008.
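
As a supplementary illustration of the integer encoding, crossover, and mutation steps of Section II, the following Python sketch may help. It is not the authors' implementation. The function names, the use of a seeded `random.Random`, and the requirement that both parents have at least two loci (so that the cut point range is never empty) are assumptions made for the example. The shuffle-then-cut crossover follows the pseudocode in Section II-C: each parent is randomly permuted, then the tails are exchanged at a random cut point c.

```python
import random


def encode_chromosome(m, size, rng):
    """Integer encoding: choose `size` distinct feature locations from 1..m."""
    return sorted(rng.sample(range(1, m + 1), size))


def ifga_crossover(parent1, parent2, rng):
    """Randomly permute each parent, then exchange tails at cut point c.

    Assumption of this sketch: both parents have at least two loci.
    """
    if min(len(parent1), len(parent2)) < 2:
        raise ValueError("this sketch needs at least two loci per parent")
    x = rng.sample(parent1, len(parent1))  # random permutation of parent1
    y = rng.sample(parent2, len(parent2))  # random permutation of parent2
    c = rng.randint(1, min(len(x), len(y)) - 1)  # inclusive bounds
    offspring1 = x[:c] + y[c:]  # has the length of parent2
    offspring2 = y[:c] + x[c:]  # has the length of parent1
    return offspring1, offspring2


def ifga_mutation(input_chrom, m, rng):
    """Replace one randomly chosen locus with a random location in 1..m."""
    pos_out = rng.randrange(len(input_chrom))  # 0-based position to overwrite
    pos_in = rng.randint(1, m)                 # new feature location
    output_chrom = list(input_chrom)
    output_chrom[pos_out] = pos_in
    return output_chrom


if __name__ == "__main__":
    rng = random.Random(0)
    q = encode_chromosome(m=7, size=3, rng=rng)
    child1, child2 = ifga_crossover([1, 5, 6], [2, 3, 4, 7], rng)
    print(q, child1, child2, ifga_mutation(q, 7, rng))
```

Note that an offspring or a mutated chromosome can contain the same location twice in this sketch, since the pseudocode itself does not filter repeated loci; a fuller version would need an extra repair or rejection step.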

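The unbiased scoring and scored over-sampling steps of Section III can also be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, `score_val` is treated as a list of non-negative per-item weights, and the linear scan over mapScore in the pseudocode is replaced by a binary search with the standard `bisect` module, which selects the same index j satisfying mapScore[j-1] < selectPos ≤ mapScore[j].

```python
import bisect
import random
from itertools import accumulate


def unbiased_scores(n_minor):
    """US method: every minority item gets the same scoring value."""
    return [1.0 / n_minor] * n_minor


def scored_oversample(minority, score_val, n_major, rng):
    """SOS method: draw n_major - n_minor extra items from the minority group.

    Each draw picks item j with probability proportional to score_val[j].
    """
    z = n_major - len(minority)
    cumulative = list(accumulate(score_val))
    total = cumulative[-1]
    # mapScore: normalized cumulative scores, ending exactly at 1.0
    map_score = [c / total for c in cumulative]
    samp_data = []
    for _ in range(z):
        select_pos = rng.random()  # uniform in [0, 1)
        # first index j with map_score[j] >= select_pos
        j = bisect.bisect_left(map_score, select_pos)
        samp_data.append(minority[j])
    return samp_data


if __name__ == "__main__":
    rng = random.Random(1)
    minority = ["s1", "s2", "s3"]
    extra = scored_oversample(minority, unbiased_scores(3), n_major=8, rng=rng)
    print(len(minority) + len(extra))  # minority group now matches n_major
```

With unbiased scores every minority item is equally likely to be duplicated; passing unequal weights instead shifts the sampling toward the items with larger scores.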