A Co-clustering Technique for Gene Expression Data using Bipartite Graph Approach

Suvendu Kanungo
Dept. of Comp. Sc., BIT Mesra, Ranchi, Allahabad Campus, UP, India
fkanungo@rediff.com

Gadadhar Sahoo
Dept. of IT & MCA, BIT Mesra, Ranchi, Jharkhand, India
gsahoo@bitmesra.ac.in

Manoj Madhava Gore
Dept. of Comp. Sc. & Engg., MNNIT Allahabad, UP, India
gore@mnnit.ac.in

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

Abstract

Mining microarray datasets is vital in bioinformatics research and medical applications. There has been extensive research on co-clustering of gene expression data generated using cDNA microarrays. Co-clustering is an important analysis tool in gene expression measurement when some genes have multiple functions and the experimental conditions are diverse. In this paper, we introduce a new framework for co-clustering microarray gene expression data. The basis of this framework is a bipartite graph representation of 2-dimensional gene expression data. We construct this bipartite graph by partitioning the sample set into two disjoint sets. The key property of this representation is that, for a gene-sample matrix, it constructs the range bipartite graph, a compact representation of all similar value ranges between sample columns. To produce the set of co-clusters, the method searches for constrained maximal cliques in this bipartite graph. Our method is scalable to practical gene expression data and can find some interesting co-clusters in real microarray datasets that meet specific input conditions.

Keywords: Microarray; Co-clustering; Gene expression Data; Bipartite Graph

I. INTRODUCTION

Clustering is an unsupervised learning technique [1] used for grouping a set of objects into subsets, or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters. It is one of the most commonly performed analyses on gene expression data. Gene expression data are a set of measurements accumulated via the cDNA microarray or the oligonucleotide chip experiment. In gene expression analysis, clustering groups the genes into biologically relevant clusters with similar expression patterns, so that the genes that are clustered together tend to be functionally related.

Standard clustering techniques consider the value of each point in all dimensions in order to form groups of similar points. One-way clustering techniques of this type [2] are based on similarity between subjects across all variables. However, genes may be co-regulated under limited conditions and show little similarity outside these conditions.

Co-clustering [3] is traditionally applied to a matrix of data values, where the rows are data points and the columns are features; e.g., in microarray data, the rows are genes and the columns are experiments. Each element of this matrix represents the expression level [4] of a gene under a specific condition and is represented by a real number, which is usually the logarithm of the relative abundance of the mRNA of the gene under the specific condition. Unlike clustering, which seeks similar rows or columns, co-clustering, also called biclustering, seeks blocks of rows and columns that are interrelated. Co-clustering has been proved to be of great value for finding the interesting patterns in microarray expression data, which record the expression levels of many genes for different biological samples. Moreover, co-clustering can identify overlapping patterns and hence allows for the possibility that a gene may be a member of multiple pathways [5]. It is an approach that finds local patterns, where a subset of objects might be similar to each other based on only a subset of attributes.

II. RELATED WORK

Co-clustering, or biclustering, is an interesting paradigm for unsupervised data analysis, as it is more informative, has fewer parameters, is scalable and is able to effectively intertwine row and column information. Several approaches have been proposed for solving the co-clustering problem, but we concentrate on graph-based clustering approaches [2, 5, 6, 7, 10]. As the problem is known to be NP-hard, several algorithms for mining co-clusters use heuristic methods or probabilistic approximation, which decreases the accuracy of the final clustering results. An illustrative discussion of many of these algorithms can be found in [8, 9]. There are four major classes of co-cluster:

(i) Co-cluster with constant values
(ii) Co-cluster with constant values on rows or columns
(iii) Co-cluster with coherent values
(iv) Co-cluster with coherent evolution

A large number of such algorithms assume either additive or multiplicative models. The bicluster defined by Cheng & Church [3] is a subset of rows and a subset of columns with a high similarity score. This similarity score, called the mean squared residue, $H$, is used as a measure of the coherence of the rows and columns in the bicluster. A submatrix $(I, J)$ is considered a $\delta$-bicluster if $H(I, J) \le \delta$ for some $\delta \ge 0$. In order to assess the overall quality of a $\delta$-bicluster, Cheng & Church defined the mean squared residue, $H$, of a bicluster $(I, J)$ in terms of the sum of the squared residues:

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} r(a_{ij})^2, \qquad r(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ},$$

where $a_{iJ}$, $a_{Ij}$ and $a_{IJ}$ represent the row mean, column mean and bicluster mean respectively. Here they assume that there are no missing values in the data matrix, and hence they replace the missing values by random numbers during a preprocessing phase.

Tanay et al. [7] modeled the data matrix as a bigraph and used statistical models to solve the problem in order to identify bicliques. They used a merit function to evaluate the quality of a computed bicluster. Ahsan et al. [2] proposed a graph-drawing-based biclustering technique using the crossing minimization paradigm, which employs a static discretization of the input data matrix. Ahmad et al. [5] proposed a graph-drawing-based biclustering technique using spectral partitioning based on the crossing minimization paradigm. They showed that minimization of Hall's energy function corresponds to finding the normalized cut of the bigraph.

III. BASIC CONCEPTS

A gene expression dataset can be represented by a real-valued expression matrix

$$D = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

where $n$ is the number of genes, $d$ is the number of experimental conditions or samples, and $x_{ij}$ is the measured expression level of gene $i$ in sample $j$.

Let $G = \{g_0, g_1, \ldots, g_{n-1}\}$ be a set of $n$ genes and $S = \{s_0, s_1, \ldots, s_{m-1}\}$ be a set of $m$ biological samples. A 2-D microarray dataset is a real-valued $n \times m$ matrix $D = G \times S = \{d_{ij}\}$, where $i \in [0, n-1]$ and $j \in [0, m-1]$. Let $B$ be a submatrix of dataset $D$. $B = X \times Y = \{b_{ij}\}$, where $X \subseteq G$ and $Y \subseteq S$, is a co-cluster provided certain conditions of homogeneity are satisfied.

Let $\begin{bmatrix} b_{ip} & b_{iq} \\ b_{jp} & b_{jq} \end{bmatrix}$ be any arbitrary $2 \times 2$ submatrix of $B$. Then $B$ is a scaling cluster iff $b_{iq} = \alpha_i b_{ip}$ and $b_{jq} = \alpha_j b_{jp}$ and $|\alpha_i - \alpha_j| \le \rho$, where $\alpha$ is a constant multiplicative factor. $B$ is a shifting cluster iff $b_{iq} = \beta_i + b_{ip}$ and $b_{jq} = \beta_j + b_{jp}$ and $|\beta_i - \beta_j| \le \rho$, where $\beta$ is a constant additive factor.

We say that the cluster $B' = X' \times Y'$ is a subset of $B = X \times Y$ iff $X' \subseteq X$ and $Y' \subseteq Y$. Let $\mathcal{S}$ be the set of all co-clusters that satisfy the given homogeneity conditions; then $B \in \mathcal{S}$ is called a maximal co-cluster iff there does not exist another cluster $B' \in \mathcal{S}$ such that $B \subset B'$. We call $B$ a valid cluster iff it is a maximal co-cluster satisfying the following conditions:

(a) Let $GM_i = \left( \prod_{j=1}^{M} b_{ij} \right)^{1/M}$ be the geometric mean of the two specified column values for a given row, and let $W_i = GM_i / \sum_i GM_i$ be the weight of the row for these two specified column values.

(b) Let $r_i = W_i\,|b_{ip} - b_{iq}|$ and $r_j = W_j\,|b_{jp} - b_{jq}|$ be the weighted differences of the two column values for a given row $i$ or $j$. We need that $\max(r_i, r_j) - \min(r_i, r_j) \le \rho$, where $\rho$ is a weight in the corresponding gene set.

(c) We need that $|X| \ge \sigma_x$ and $|Y| \ge \sigma_y$, where $\sigma_x$ and $\sigma_y$ denote minimum cardinality thresholds for each dimension. The minimum size constraints $\sigma_x$ and $\sigma_y$ are imposed in order to mine large enough clusters.

DEFINITION 3.1 (Bipartite Graph). A graph $G = (V, E)$ is called bipartite if its vertex set $V$ can be decomposed into two disjoint subsets $V_1$ and $V_2$, i.e. $V = V_1 \cup V_2$ and $V_1 \cap V_2 = \emptyset$, such that every edge in $G$ joins a vertex in $V_1$ with a vertex in $V_2$. We consider the weighted bipartite graph $G = (V_1, V_2, E, W)$ with $W = (w_{ij})$, where $w_{ij} \ge 0$ denotes the weight of the edge $\{i, j\}$ between vertices $i$ and $j$.

IV. THE CO-CLUSTER ALGORITHM

From the data matrix, the co-clustering algorithm mines arbitrarily positioned and overlapping scaling and shifting patterns. The algorithm has two steps:

(i) For the $G \times S$ matrix, find the valid weighted difference ranges for all pairs of samples and construct a range bipartite graph.
(ii) Identify co-clusters from the weighted range bipartite graph.

A. Constructing Range Bipartite Graph

For a given dataset $D$, the minimum size thresholds $\sigma_x$ and $\sigma_y$, and the maximum weighted difference threshold $\rho$, let $s_u$ and $s_v$ be any two sample columns of $D$ with $u \ne v$, and let $r_x^{uv} = w_x\,|d_{xu} - d_{xv}|$ be the weighted difference of the expression values of gene $g_x$ in columns $s_u$ and $s_v$, where $x \in [0, n-1]$. In order to incorporate the idea of mutual importance between two columns, we compute the weight of every row for the specified columns.

A difference range is defined as an interval of difference values, $[r_l, r_h]$, with $r_l \le r_h$. Let $J^{uv}([r_l, r_h]) = \{ g_x : r_x^{uv} \in [r_l, r_h] \}$ be the set of genes whose differences w.r.t. columns $s_u$ and $s_v$ lie in the given weighted difference range. A difference range is called valid iff $\max(r_l, r_h) - \min(r_l, r_h) \le \rho$, where $\rho$ is the row weight in the valid range.

Normally, for microarray experiment data, genes and samples are represented by the vertex sets $V_1$ and $V_2$ respectively, and the edge weight $w_{ij}$ represents the response of the $i$-th gene to the $j$-th sample. However, in order to have a very compact representation, in this paper we construct the weighted undirected bipartite graph by partitioning the sample set into two disjoint sets, called the upper layer $V_1$ and the lower layer $V_2$. Samples that do not have any data values are not considered in the formation of the disjoint sets. Each edge in the range bipartite graph has associated with it the weight and the gene set corresponding to the range on that edge. Different bipartite graphs emerge for different threshold values, where the threshold is any weight value in the corresponding gene set. Consequently, we obtain different types of co-clusters.

In order to include edges with a large gene set, we assign a rank, EdgeRank, to each valid edge. The rank takes the value 2 when a condition on $|J([r_l, r_h])|$, its frequency count and $S$ holds, and 1 otherwise; the exact condition is not legible in this copy. The inclusion and deletion of edges depends upon the value of this EdgeRank and on the order in which we process the samples. We compute a frequency count for each edge, which is the number of occurrences of the cardinality of its gene set.

Table 1. Example of Microarray Dataset
(columns s0 to s6; the column positions of the missing entries are not preserved in this copy, so each row lists its recorded values in column order)

g0: 3.6 1.0 1.0 1.0 1.0 1.0
g1: 3.0 2.5 2.0 1.0
g2: 5.0 5.0 5.0
g3: 6.6 5.5 2.0
g4: 9.0 7.5 6.0 3.0
g5: 6.6 4.4 2.0
g6: 3.0 3.0 3.0
g7: 8.0 8.0 8.0 8.0
g8: 6.0 5.0 4.0 2.0
g9: 4.0 4.0 4.0 4.0 4.0

Figure 1 shows the weighted difference values for different genes using columns $s_4$ and $s_6$ of Table 1. Here $\rho = 0.04$, which is the minimum row weight in the gene set. With $\sigma_x = 3$ and $\sigma_y = 2$, there is only one valid weighted difference range, $[0.0, 0.0]$, and the corresponding gene set is $J^{s_4 s_6}([0.0, 0.0]) = \{g_0, g_2, g_6, g_9\}$. In this case, the number of valid ranges depends on the value of $\rho$. Working on the sorted difference values, the algorithm finds all valid weighted difference ranges for all pairs of columns $s_u, s_v \in S$. Different ranges may overlap.

The algorithm for partitioning the vertex set into $V_1$ and $V_2$ for the construction of the bipartite graph is given in Figure 2. From Table 1, we have constructed a maximal weighted range bipartite graph (Figure 3). Let $S'$ be the set of columns with a missing value in each row. We take the weight of an edge to be the maximum weight in the corresponding gene set. The algorithm gives emphasis to the edges having a large gene set in order to compensate for the deletion of a few valid edges while partitioning the sample set into two disjoint sets.

As we deal with noisy data, additive and multiplicative methods of finding clusters may not always lead to good results. Therefore, instead of comparing two column values independently [10], we compute the weight of each row for any two specified column values. We build the bipartite graph model of the data after properly conditioning the input data.

Using columns $s_4$ and $s_6$ for different genes in Table 1, we have, in sorted order:

W_i |s4 - s6|:  0.0   0.0   0.0   0.0   0.06  0.24  0.29  0.51
Row:            g0    g2    g6    g9    g1    g8    g5    g4

Figure 1

Figure 2. Construction of Bipartite Graph
(20-line pseudocode listing; the statement order and most symbols are not legible in this copy. The recoverable structure is: initialise a working list $L$ from the sample set and set $V_1 = \emptyset$, $V_2 = \emptyset$; in a first pass over pairs $s_a, s_b \in L$, test EdgeRank($s_a$, $s_b$), add $s_a$, $s_b$ and those of their adjacent vertices not already placed to the layers, and remove $s_a$, $s_b$, adj($s_a$) and adj($s_b$) from $L$; then, while $V_1 \cup V_2$ does not yet cover the remaining samples, make further passes over pairs in $L$, testing EdgeRank($s_a$, $s_b$), adding $s_a$ and $s_b$ to the layers and removing them from $L$.)

B. Co-Cluster Identification

We incorporate the concept of mutual importance of any two sample columns by finding the weight of each row for those specified columns. This compact representation of gene expression data using the range bipartite graph can be used to mine significant co-clusters and hence filters out most of the unrelated data. In order to mine all significant co-clusters, the algorithm COCLUSTER applies a depth-first search on the range bipartite graph, as shown in the pseudocode in Figure 4. It requires as input the values of $\rho$, $\sigma_x$ and $\sigma_y$, the range bipartite graph $R$, the set of all genes $G$ and the sample sets $V_1$ and $V_2$. The algorithm outputs the final set of co-clusters in $E$.

For example, let us consider the parameter values $\sigma_x = 3$ and $\sigma_y = 2$. The COCLUSTER algorithm starts at vertex $s_0$ with reference cluster $\{g_0, g_1, \ldots, g_9\} \times \{s_0\}$. When we explore the vertices $s_1$ and $s_6$, we get the new reference clusters $\{g_3, g_5, g_8\} \times \{s_0, s_6\}$ and $\{g_3, g_4, g_8\} \times \{s_0, s_1\}$. As we explore other vertices, the corresponding edges do not have enough genes, and these reference clusters, $E_1$ and $E_2$, become final clusters. Then we start exploring from vertex $s_2$ and get a final cluster $E_3 = \{g_0, g_7, g_9\} \times \{s_1, s_2, s_4, s_5\}$. Similarly, when we start from $s_4$, we get another cluster $E_4 = \{g_0, g_2, g_6, g_9\} \times \{s_1, s_4, s_6\}$. We search for these maximal cliques and put them in the final co-cluster set $E$.

Figure 3. Weighted, undirected range bipartite graph. Here, the upper layer is $V_1 = \{s_1, s_5, s_6\}$ and the lower layer is $V_2 = \{s_0, s_2, s_4\}$.

Figure 5. (plot; axes: time (sec) and number of genes)
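
The weighted difference and valid-range steps of Section IV-A can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `weighted_differences` and `valid_ranges` are hypothetical, the `s4` and `s6` values are one reading of the flattened Table 1 (the column positions of missing entries are uncertain), expression values are assumed positive, and applying the $\sigma_x$ size filter at the range stage is an assumption taken from the worked example rather than from the definition of a valid range.

```python
import math

# Hypothetical readings of columns s4 and s6 (gene -> expression value).
# A gene absent from a column stands for a missing value.
s4 = {"g0": 1.0, "g1": 2.0, "g2": 5.0, "g4": 6.0,
      "g5": 4.4, "g6": 3.0, "g8": 4.0, "g9": 4.0}
s6 = {"g0": 1.0, "g1": 1.0, "g2": 5.0, "g3": 2.0, "g4": 3.0,
      "g5": 2.0, "g6": 3.0, "g8": 2.0, "g9": 4.0}


def weighted_differences(col_u, col_v):
    """Return {gene: W_x * |d_xu - d_xv|} for genes present in both columns.

    W_x is the row's geometric mean over the two columns divided by the
    sum of the geometric means of all shared rows.
    """
    shared = [g for g in col_u if g in col_v]
    gm = {g: math.sqrt(col_u[g] * col_v[g]) for g in shared}
    total = sum(gm.values())
    return {g: gm[g] / total * abs(col_u[g] - col_v[g]) for g in shared}


def valid_ranges(diffs, rho, sigma_x):
    """Return maximal ranges [r_l, r_h] with r_h - r_l <= rho.

    Only ranges holding at least sigma_x genes are kept (assumed filter).
    Each result is a tuple (r_l, r_h, gene_set).
    """
    items = sorted(diffs.items(), key=lambda kv: kv[1])
    found, covered = [], -1
    for lo in range(len(items)):
        hi = lo
        # Extend the window while its width stays within rho.
        while hi + 1 < len(items) and items[hi + 1][1] - items[lo][1] <= rho:
            hi += 1
        # Skip windows already contained in an earlier, wider one.
        if hi > covered and hi - lo + 1 >= sigma_x:
            genes = {g for g, _ in items[lo:hi + 1]}
            found.append((items[lo][1], items[hi][1], genes))
        covered = max(covered, hi)
    return found


diffs = weighted_differences(s4, s6)
ranges = valid_ranges(diffs, rho=0.04, sigma_x=3)
for low, high, genes in ranges:
    print(f"[{low:.2f}, {high:.2f}] -> {sorted(genes)}")
```

With these assumed inputs the sketch reports a single range, $[0.0, 0.0]$, with gene set $\{g_0, g_2, g_6, g_9\}$, which agrees with the worked example for columns $s_4$ and $s_6$. Gene $g_3$ is ignored because it has no value in `s4`, mirroring how rows with a missing entry cannot contribute a difference for that column pair.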