Searching Single Nucleotide Polymorphism Markers to Complex Diseases using Genetic Algorithm Framework and a BoostMode Support Vector Machine

Khantharat Anekboon, Suphakant Phimoltares, and Chidchanok Lursinsap
AVIC, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand
Khantharat.A@Student.chula.ac.th, suphakant.p@chula.ac.th, and lchidcha@chula.ac.th

Sissades Tongsima
Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand
sissades@biotec.or.th

Suthat Fucharoen
Thalassemia Research Center, Institute of Molecular Biosciences, Mahidol University, Salaya Campus, Nakhonpathom, Thailand
grsfc@mahidol.ac.th

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

Abstract

With the advent of large-scale, high-density single nucleotide polymorphism (SNP) arrays, case-control association studies have been performed to identify predisposing genetic factors that influence many common complex diseases. These genotyping platforms provide very dense SNP coverage on a single chip. Much research has focused on multivariate genetic models to identify genes that can predict the disease status. However, increasing the number of SNPs generates a large number of combined genetic outcomes to be tested. This work presents a new mathematical algorithm for SNP analysis, called IFGA, that uses a BoostMode support vector machine (SVM) to select the best set of SNP markers that can predict the state of a complex disease. The proposed algorithm has been applied to association studies of two diseases, namely Crohn's disease and the severity spectrum of β0/HbE thalassemia. The results revealed that our predicted SNPs can respectively best classify both diseases at 71.57% and 71.06% accuracy using 10-fold cross-validation, compared with the optimum random forest (ORF) and classification and regression trees (CART) techniques.

Keywords: Single Nucleotide Polymorphism; Support Vector Machine; Genetic Algorithm

I. INTRODUCTION

Scientists have long been interested in identifying genetic factors that influence the occurrence of complex diseases. With the advent of parallel genotyping technology, the cost and time of finding SNPs are no longer out of reach. Large case-control cohorts generated from very dense SNP arrays (a DNA chip contains a dense array of SNPs) challenge researchers to search for SNPs that are associated with the diseases. In contrast to single-gene disorders, the state of a complex disease can be triggered by multiple genes when exposed to certain environmental factors [1], [2]. However, searching for multiple-marker interactions in a large pool of SNPs imposes high computational and memory complexity. A technique for selecting a subset of relevant features, named Feature Selection [3], has been widely used in almost all fields, including bioinformatics. This technique provides a more effective way to improve learning accuracy and to understand the importance of the features by removing irrelevant or redundant ones.

II. THE PROPOSED IFGA METHOD

In this section, we introduce a new encoding method called IFGA. Fig. 1 summarizes the IFGA method. The first population is constructed by our proposed integer encoding approach. In the Genetic Algorithm (GA) context, the data in a chromosome are represented by a set of selected features. After the population is generated, each chromosome is evaluated by a fitness score, which is obtained using the BoostMode SVM approach. Then the IFGA regenerates the next population by IFGA selection, IFGA crossover, and IFGA mutation until a termination criterion is satisfied.

Figure 1. The overall IFGA flowchart.

A. The Integer Encoding Method

Using a GA to perform feature selection can be done by converting the input data with binary encoding [4]. In that scheme the length of a chromosome equals the number of all features, so the size of the encoded chromosome corresponds directly to the number of input features. This presents a problem for two reasons. First, the running time depends heavily on the length of the chromosome. Second, a general binary encoding does not fix the number of selected features; it fixes only the length of the chromosome. The IFGA integer encoding method is proposed to solve these problems.

Assume that the case-control data used in this study have m genotypes. Let Q_i be the i-th chromosome processed in the algorithm. The length of Q_i, denoted by |Q_i|, is set to a constant less than or equal to m. Then |Q_i| random numbers are drawn to represent the locations at which the corresponding genotypes are selected from the given feature sequence. During the IFGA, the lengths of the chromosomes are not necessarily identical. For example, suppose m = 7, the chromosome size |Q_i| is set to 3, and the randomly selected locations are 1, 5, and 6. Then the chromosome is Q_i = {1, 5, 6}.

B. IFGA Selection

Each individual chromosome is selected into a mating pool, based on its fitness score, by the stochastic universal sampling (SUS) method [5]. The IFGA also uses an elitism technique [6], in which the next-generation chromosome derives from the best chromosome in the current generation.

C. IFGA Crossover

The crossover function of a traditional GA randomly selects a recombination point and swaps the parts of the two chromosomes flanking this point. This operator cannot be applied to the IFGA approach, because it assumes that all chromosomes have the same size, and because features from the same locus cannot appear on the same chromosome. We must therefore devise an IFGA crossover technique to overcome this problem.

Assume that parent1 and parent2 are the parental chromosomes, where each locus is the position of a selected feature. The number of loci in either parent1 or parent2 must be more than 1, and the number of loci in both parents must be greater than or equal to one. The outputs of this algorithm are offspring1 and offspring2.

    x ← ø
    y ← ø
    tmp1 ← parent1
    for i = 0 to |parent1| do
        v ← |tmp1|
        sel ← random(1, 2, ..., v)
        x ← x ∪ sel
        tmp1 ← tmp1 - parent1_sel
    end for
    tmp2 ← parent2
    for i = 0 to |parent2| do
        v ← |tmp2|
        sel ← random(1, 2, ..., v)
        y ← y ∪ sel
        tmp2 ← tmp2 - parent2_sel
    end for
    c ← random(1, min(|parent1|, |parent2|) - 1)
    offspring1 ← {x_1, x_2, ..., x_c, y_{c+1}, ..., y_{|parent2|}}
    offspring2 ← {y_1, y_2, ..., y_c, x_{c+1}, ..., x_{|parent1|}}

D. IFGA Mutation

The mutation function alters the value of a specified locus. It occurs far less often than crossover. The IFGA mutation is presented here. Let m denote the length of a given genotype sequence, input_chrom be the chromosome that will be mutated, and output_chrom be the mutated chromosome. Each element in a chromosome is a selected feature.

    pos_out ← random(1, |input_chrom|)
    pos_in ← random(1, m)
    for i = 1 to |input_chrom| do
        if i = pos_out then
            output_chrom_i ← pos_in
        else
            output_chrom_i ← input_chrom_i
        end if
    end for

E. Generating a Population

There are two kinds of population: the initial population and the next-generation population. To generate the initial population with P chromosomes, where P is a user-defined number of chromosomes in the population, the algorithm repeatedly generates chromosomes by the integer encoding method and adds them to the population until the number of chromosomes in the population equals P.

The population of the next generation, on the other hand, consists of the chromosome b, which has the best fitness score in the current generation; e groups of features from evolution (crossover and mutation); and r groups of newly reselected features. After b and e are added to the next generation, those chromosomes are checked for redundancy. Each chromosome in the next generation must be unique, so duplicated chromosomes are removed. If the number of chromosomes in the next generation is then less than the number of chromosomes in the current generation, new subsets of features, r, are randomly created and added to the next generation.

F. Termination

The IFGA algorithm consists of a set of repeated steps: generating the population, evaluation by a BoostMode SVM, IFGA selection, IFGA crossover, and IFGA mutation. These steps are executed until the best result remains unchanged for 300 consecutive iterations.

III. THE PROPOSED BOOSTMODE SVM METHOD

The goal of an SVM [7] is to find a maximal separating hyperplane for either the linearly separable case (1) or the nonlinearly separable case (2). Note that w^T is the transpose of the weight vector, x_i is an input vector, ϕ is a mapping function, and b is a bias value.

    $y_i = \mathrm{sign}(w^{T} \cdot x_i + b)$    (1)

    $y_i = \mathrm{sign}(w^{T} \cdot \phi(x_i) + b)$    (2)

Both equations face the same problem when the input data are imbalanced. The separating hyperplane learned from an imbalanced data set may shift too far in the direction of the smaller group compared with the true separating hyperplane [8]. To solve this problem, the decision hyperplane should be adjusted. It can be seen from (1) and (2) that the parameter w affects the classification output. Modifying w therefore adjusts the decision hyperplane, which may improve the classifier.

A. BoostMode SVM

A new over-sampling technique for nominal features is proposed to improve the performance of the SVM. The BoostMode SVM (Fig. 1) generates two SVMs, namely SVM1 and SVM2. SVM1 is constructed to generate the scores of the training data set, whereas SVM2 is the final SVM model for classifying the test set.

First, only the training set is used to construct SVM1 and to find the BoostMode. This BoostMode is the indicator vector of the minority data set, and it is tested with SVM1. Two scoring methods, Unbiased Scoring (US) and Bias Scoring (BS), are proposed to find the scoring values. The US method is performed when SVM1 correctly classifies the BoostMode; otherwise the BS method is performed. After that, a Scored Over-Sampling (SOS) approach adds artificial data to the minority group by sampling the data of the minority group until both groups contain the same number of data points. The minority group in this paper means the group of data having fewer elements. The new SVM2 is then constructed for classification from the previous training data set together with the new data from the SOS technique. Finally, the test set is run through SVM2 for evaluation. The error rate on the test set is the fitness score used in the IFGA section above.

B. Finding the BoostMode

To balance the size of the data from both classes, some additional data in the minority group must be generated. The generating method selected, either US or BS, depends on a BoostMode vector. The following procedure describes how to compute the BoostMode vector. Let nminor be the number of data points in the minority group. Bootstrap sampling with replacement is applied to the minority group to generate t data sets, i.e. {BoostGroup_1, ..., BoostGroup_t}. Each BoostGroup_i contains nminor data points.

    for i = 1 to t do
        allmode_i ← mode(BoostGroup_i)
    end for
    BoostMode ← mode_i(allmode_i)

C. The Unbiased Scoring Method

This technique is applied when SVM1 classifies the BoostMode correctly. All data points have equal chances (equal scoring values) of being selected by the over-sampling technique. The following algorithm describes how the scoring value is found by the US technique. The output of this algorithm is scoreVal.

    for i = 1 to nminor do
        scoreVal_i = 1 / nminor
    end for

D. The Bias Scoring Method

The BS technique is run when SVM1 classifies the BoostMode incorrectly. The scoring value of each point is calculated from its distance to the decision hyperplane, by (3) for linear separability or by (4) for nonlinear separability.

    $\mathrm{distance}_i = w^{T} \cdot x_i + b$    (3)

    $\mathrm{distance}_i = w^{T} \cdot \phi(x_i) + b$    (4)

A data point that is correctly classified has a smaller chance (a lower scoring value) of being selected for the over-sampling process than one that is wrongly classified. Therefore, an increase in the number of incorrect classifications gives the samples a higher chance of being chosen, and vice versa. The scoring value for the BS method is described by the following algorithm. Let distance be the set of distances of all data points in the minority group. The output of this algorithm is a set of scoreVal.

    sumSV1 ← 0
    minVal ← min_i(distance_i)
    addVal = absolute(minVal) + 1
    for i = 1 to nminor do
        tmp_i = distance_i + addVal
        sumSV1 = sumSV1 + tmp_i
    end for
    if the minority group is the control group then
        for i = 1 to nminor do
            tmp_i = 2 - tmp_i
        end for
    end if
    for i = 1 to nminor do
        sumSV2 = 0
        for j = 1 to i do
            sumSV2 = sumSV2 + tmp_j
        end for
        scoreVal_i ← sumSV2 / sumSV1
    end for

E. The Scored Over-Sampling Method

The objective of the SOS algorithm is to select data from the minority group depending on scoreVal, computed by either the US algorithm or the BS algorithm. Let MD_i denote the i-th data point, for 1 ≤ i ≤ nminor. The numbers of data points in the minority and majority groups are nminor and nmajor, respectively. The output of this algorithm is a set of additional data added to the minority group, samp_data.

    z = nmajor - nminor
    for i = 1 to |scoreVal| do
        sumSV1 = sumSV1 + scoreVal_i
    end for
    for i = 1 to |scoreVal| do
        sumSV2 = 0
        for j = 1 to i do
            sumSV2 = sumSV2 + scoreVal_j
        end for
        mapScore_i = sumSV2 / sumSV1
    end for
    for i = 1 to z do
        selectPos = rand(1)
        if selectPos ≥ 0 and selectPos ≤ mapScore_1 then
            samp_data_i = MD_1
        else
            for j = 2 to |scoreVal| do
                if selectPos > mapScore_{j-1} and selectPos ≤ mapScore_j then
                    samp_data_i = MD_j
                end if
            end for
        end if
    end for

IV. EXPERIMENTS AND RESULTS

Table I compares the IFGA BoostMode SVM, ORF [9], and CART [10] by 10-fold cross-validation on the thalassemia and Crohn's disease data sets. Our IFGA BoostMode SVM classifies better than the standard ORF and CART methods. Note that no. feat., acc., sen., and spec. in Table I are the number of features, accuracy, sensitivity, and specificity, respectively.

The thalassemia data set (503 patients with 835 SNPs) was obtained from the Thalassemia Research Center, Mahidol University, and the Crohn's disease data set (357 patients with 103 SNPs) was obtained from [11]. Missing data were inferred by the 2SNP phasing method [12]. For the SVM, a soft-margin RBF kernel function with σ = 0.5 was used to analyze both the Crohn's disease and thalassemia data sets. Dummy encoding is applied for the SVM as the vectors 100, 010, and 001, where a genotype is major homozygote, minor homozygote, and heterozygote, respectively. In the IFGA, the chromosome size is varied from 1 to 10, so feature selection from 1 feature to 10 features is processed. The IFGA parameters were set as follows: the number of chromosomes is 1000, the crossover rate is 0.7 for thalassemia and 0.8 for Crohn's disease, and the mutation rate is 0.035 for thalassemia and 0.001 for Crohn's disease.

TABLE I. THE EXPERIMENTAL RESULTS

    Data set  Algorithm            no. feat.  acc. (%)  sen. (%)  spec. (%)
    Thal.     IFGA BoostMode SVM   6          71.57     76.39     64.14
    Thal.     ORF                  6          54.27     69.84     30.30
    Thal.     CART                 6          69.38     76.07     59.09
    Crohn     IFGA BoostMode SVM   8          71.06     64.58     74.90
    Crohn     ORF                  8          57.88     20.14     80.25
    Crohn     CART                 8          63.31     23.61     86.83

V. CONCLUSION

A new IFGA with BoostMode SVM was proposed to identify susceptible loci from case-control association studies. The IFGA technique encodes chromosomes as integer sets of different sizes. The SOS technique over-samples the minority data set using two proposed scoring approaches, US and BS. This method can be applied well in case-control association studies. The experimental results on two real data sets, Crohn's disease and thalassemia, show that feature selection and classification by the IFGA with BoostMode SVM outperform the standard ORF and CART techniques.

REFERENCES

[1] J. Marchini, P. Donnelly, and L. R. Cardon, "Genome-wide strategies for detecting multiple loci that influence complex diseases," Nature Genetics, vol. 37, pp. 413–417, March 2005.
[2] D. J. Weatherall, "Science, medicine, and the future: Single gene disorders or complex traits: lessons from the thalassaemias and other monogenic diseases," BMJ, vol. 321, pp. 1117–1120, November 2000.
[3] Y. Saeys, I. Inza, and P. Larranaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, pp. 2507–2517, October 2007.
[4] X. P. Zeng, Y. M. Li, and J. Qin, "A dynamic chain-like agent genetic algorithm for global numerical optimization and feature selection," Neurocomputing, vol. 72, pp. 1214–1228, January 2009.
[5] J. E. Baker, "Reducing bias and inefficiency in the selection algorithm," in Proc. the Second International Conference on Genetic Algorithm and their Application, Hillsdale, NJ, USA, 1987, pp. 14–21.
[6] A. K. Bhatia and S. K. Basu, "Implicit elitism in genetic search," in Proc. ICONIP 13th Int. Conf., Hong Kong, China, 2006, pp. 781–788.
[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, September 1995.
[8] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. ECML the 15th European Conf. on Machine Learning, Pisa, Italy, 2004, pp. 39–50.
[9] W. Mao and S. Kelly, "An optimum random forest model for prediction of genetic susceptibility to complex diseases," in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, China, 2007, pp. 193–204.
[10] A. Gerger et al., "A multigenic approach to predict breast cancer risk," Epidemiology, vol. 104, pp. 159–164, August 2007.
[11] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, "High-resolution haplotype structure in the human genome," Nature Genetics, vol. 29, pp. 229–232, October 2001.
[12] D. Brinza and A. Zelikovsky, "2SNP: Scalable phasing method for trios and unrelated individuals," Journal of IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, pp. 313–318, April 2008.
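
The Unbiased Scoring and Scored Over-Sampling steps of Section III can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `unbiased_scores` and `scored_over_sampling` are hypothetical, `score_val` is treated as one non-negative weight per minority point, and the inner search loop of the SOS pseudocode is replaced by an equivalent binary search over the cumulative `mapScore` values.

```python
import bisect
import itertools
import random


def unbiased_scores(n_minor):
    """US method: every minority point gets the same score 1 / n_minor."""
    return [1.0 / n_minor] * n_minor


def scored_over_sampling(minority, score_val, n_major, rng=None):
    """Draw n_major - len(minority) extra points from the minority group.

    Assumption: score_val[i] is a non-negative weight for minority[i];
    point i is chosen with probability score_val[i] / sum(score_val).
    """
    rng = rng or random.Random()
    z = n_major - len(minority)
    total = sum(score_val)
    # mapScore in the paper: normalised cumulative scores in (0, 1].
    map_score = [s / total for s in itertools.accumulate(score_val)]
    samples = []
    for _ in range(z):
        select_pos = rng.random()  # uniform in [0, 1)
        # First index j with select_pos <= map_score[j], which matches
        # the test map_score[j-1] < select_pos <= map_score[j].
        j = bisect.bisect_left(map_score, select_pos)
        j = min(j, len(minority) - 1)  # guard against rounding at the top end
        samples.append(minority[j])
    return samples
```

With equal scores every minority point is equally likely to be duplicated; with bias scores, points carrying larger weights are drawn more often, and the returned list has exactly nmajor - nminor elements, so both groups end up the same size.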