Foreign-language reference material: A Co-clustering Technique for Gene Expression.PDF


A Co-clustering Technique for Gene Expression Data using Bipartite Graph Approach

Suvendu Kanungo, Dept. of Comp. Sc., BIT Mesra, Ranchi, Allahabad Campus, UP, India, fkanungorediff.com
Gadadhar Sahoo, Dept. of IT & MCA, BIT Mesra, Ranchi, Jharkhand, India, gsahoobitmesra.ac.in
Manoj Madhava Gore, Dept. of Comp. Sc. & Engg., MNNIT, Allahabad, UP, India, goremnnit.ac.in

Abstract. Mining microarray data sets is vital in bioinformatics research and medical applications. There has been extensive research on co-clustering of gene expression data generated using cDNA microarrays. Co-clustering is an important analysis tool in gene expression measurement when some genes have multiple functions and the experimental conditions are diverse. In this paper, we introduce a new framework for co-clustering microarray gene expression data. The basis of this framework is a bipartite graph representation of 2-dimensional gene expression data. We construct this bipartite graph by partitioning the sample set into two disjoint sets. The key property of this representation is that, for a gene-sample matrix, it constructs the range bipartite graph, a compact representation of all similar value ranges between sample columns. To produce the set of co-clusters, the method searches for constrained maximal cliques in this bipartite graph. Our method is scalable to practical gene expression data and can find interesting co-clusters in real microarray datasets that meet specific input conditions.

Keywords: Microarray; Co-clustering; Gene expression data; Bipartite graph

I. INTRODUCTION

Clustering is an unsupervised learning technique [1] used for grouping a set of objects into subsets, or clusters, such that the objects within each cluster are more closely related to one another than to objects assigned to different clusters. It is one of the most commonly performed analyses on gene expression data. Gene expression data are a set of measurements accumulated via the cDNA microarray or the oligonucleotide chip experiment. In gene expression analysis, clustering groups the genes into biologically relevant clusters with similar expression patterns, so that genes clustered together tend to be functionally related. Standard clustering techniques consider the value of each point in all dimensions in order to form groups of similar points. One-way clustering techniques of this type [2] are based on similarity between subjects across all variables. However, genes may be co-regulated under a limited set of conditions and show little similarity outside these conditions.

Co-clustering [3] is traditionally applied to a matrix of data values, where the rows are data points and the columns are features; in microarray data, for example, the rows are genes and the columns are experiments. Each element of this matrix represents the expression level [4] of a gene under a specific condition and is represented by a real number, usually the logarithm of the relative abundance of the mRNA of the gene under that condition. Unlike clustering, which seeks similar rows or columns, co-clustering, also called biclustering, seeks blocks of rows and columns that are interrelated. Co-clustering has proved to be of great value for finding interesting patterns in microarray expression data, which record the expression levels of many genes for different biological samples. Moreover, co-clustering can identify overlapping patterns and hence allows for the possibility that a gene is a member of multiple pathways [5]. It is an approach that finds local patterns, where a subset of objects may be similar to each other based on only a subset of attributes.

II. RELATED WORK

Co-clustering, or biclustering, is an interesting paradigm for unsupervised data analysis: it is more informative, has fewer parameters, is scalable, and is able to effectively intertwine row and column information. Several approaches have been proposed for solving the co-clustering problem, but we concentrate on graph-based clustering approaches [2, 5, 6, 7, 10]. As co-clustering is known to be an NP-hard problem, several algorithms for mining co-clusters use heuristic methods or probabilistic approximation, which decreases the accuracy of the final clustering results. An illustrative discussion of many of these algorithms can be found in [8, 9]. There are four major classes of co-cluster: (i) co-clusters with constant values, (ii) co-clusters with constant values on rows or columns, (iii) co-clusters with coherent values, and (iv) co-clusters with coherent evolution. A large number of such algorithms assume either additive or multiplicative models.

The bicluster defined by Cheng and Church [3] is a subset of rows and a subset of columns with a high similarity score. This similarity
score, called the mean squared residue $H$, was used as a measure of the coherence of the rows and columns in the bicluster. A submatrix $(I, J)$ is considered a $\delta$-bicluster if $H(I, J) \le \delta$ for some $\delta \ge 0$. In order to assess the overall quality of a $\delta$-bicluster, Cheng and Church defined the mean squared residue $H$ of a bicluster $(I, J)$ as the normalized sum of the squared residues

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} r(a_{ij})^2, \qquad r(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ},$$

where $a_{iJ}$, $a_{Ij}$ and $a_{IJ}$ represent the row mean, column mean and bicluster mean respectively. They assume that there are no missing values in the data matrix, and hence they replace the missing values by random numbers during a preprocessing phase.

978-1-4244-4713-8/10/\$25.00 ©2010 IEEE

Tanay et al. [7] modeled the data matrix as a bigraph and used statistical models to solve the problem of identifying bicliques. They used a merit function to evaluate the quality of a computed bicluster. Ahsan et al. [2] proposed a graph-drawing-based biclustering technique using the crossing minimization paradigm, which employs a static discretization of the input data matrix. Ahmad et al. [5] proposed a graph-drawing-based biclustering technique using spectral partitioning based on the crossing minimization paradigm. They showed that minimization of Hall's energy function corresponds to finding the normalized cut of the bigraph.

III. BASIC CONCEPTS

A gene expression dataset can be represented by a real-valued expression matrix

$$D = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}$$

where $n$ is the number of genes, $d$ is the number of experimental conditions or samples, and $x_{ij}$ is the measured expression level of gene $i$ in sample $j$.

Let $G = \{g_0, g_1, \ldots, g_{n-1}\}$ be a set of $n$ genes and $S = \{s_0, s_1, \ldots, s_{m-1}\}$ be a set of $m$ biological samples. A 2-D microarray dataset is a real-valued $n \times m$ matrix $D = G \times S = \{d_{ij}\}$, where $i \in [0, n-1]$ and $j \in [0, m-1]$. Let $B$ be a submatrix of dataset $D$. $B = X \times Y = \{b_{ij}\}$, where $X \subseteq G$ and $Y \subseteq S$, is a co-cluster provided certain conditions of homogeneity are satisfied.

Let

$$\begin{bmatrix} b_{ip} & b_{iq} \\ b_{jp} & b_{jq} \end{bmatrix}$$

be any arbitrary $2 \times 2$ submatrix of $B$. Then $B$ is a scaling cluster iff $b_{iq} = \alpha_i b_{ip}$ and $b_{jq} = \alpha_j b_{jp}$ and $|\alpha_i - \alpha_j| \le \rho$, where $\alpha$ is a constant multiplicative factor. $B$ is a shifting cluster iff $b_{iq} = \beta_i + b_{ip}$ and $b_{jq} = \beta_j + b_{jp}$ and $|\beta_i - \beta_j| \le \rho$, where $\beta$ is a constant additive factor.

We say that the cluster $B = X \times Y$ is a subset of $B' = X' \times Y'$ iff $X \subseteq X'$ and $Y \subseteq Y'$. Let $\mathcal{S}$ be the set of all co-clusters that satisfy the given homogeneity conditions; then $B \in \mathcal{S}$ is called a maximal co-cluster iff there does not exist another cluster $B' \in \mathcal{S}$ such that $B \subset B'$. We call $B$ a valid cluster iff it is a maximal co-cluster satisfying the following conditions:

(a) Let $GM_i = \left\{ \prod_{j=1}^{M} b_{ij} \right\}^{1/M}$ be the geometric mean of the two specified column values for a given row, and let $W_i = GM_i / \sum_i GM_i$ be the weight of the row for these two column values.

(b) Let $r_i = W_i\,|b_{ip} - b_{iq}|$ and $r_j = W_j\,|b_{jp} - b_{jq}|$ be the weighted differences of the two column values for rows $i$ and $j$. We need that $\max(r_i, r_j) - \min(r_i, r_j) \le \rho$, where $\rho$ is a weight in the corresponding gene set.

(c) We need that $|X| \ge \sigma_x$ and $|Y| \ge \sigma_y$, where $\sigma_x$ and $\sigma_y$ denote minimum cardinality thresholds for each dimension. The minimum size constraints $\sigma_x$ and $\sigma_y$ are imposed in order to mine large enough clusters.

DEFINITION 3.1 (Bipartite Graph). A graph $G = (V, E)$ is called bipartite if its vertex set $V$ can be decomposed into two disjoint subsets $V_1$ and $V_2$, i.e. $V = V_1 \cup V_2$ and $V_1 \cap V_2 = \phi$, such that every edge in $G$ joins a vertex in $V_1$ with a vertex in $V_2$. We consider the weighted bipartite graph $G = (V_1, V_2, E, W)$ with $W = (w_{ij})$, where $w_{ij} \ge 0$ denotes the weight of the edge $\{i, j\}$ between vertices $i$ and $j$.

IV. THE CO-CLUSTER ALGORITHM

From the data matrix, the co-clustering algorithm mines arbitrarily positioned and overlapping scaling and shifting patterns. The algorithm has two steps:

(i) For the $G \times S$ matrix, find the valid weighted difference ranges for all pairs of samples and construct a range bipartite graph.
(ii) Identify co-clusters from the weighted range bipartite graph.

A. Constructing Range Bipartite Graph

For a given dataset $D$, the minimum size thresholds $\sigma_x$ and $\sigma_y$, and the maximum weighted difference threshold $\rho$, let $s_u$ and $s_v$ be any two distinct sample columns of $D$ and let $r_x^{uv} = w_x\,|d_{xu} - d_{xv}|$ be the weighted difference of the expression values of gene $g_x$ in columns $s_u$ and $s_v$, where $x \in [0, n-1]$. In order to incorporate the idea of mutual importance between two columns, we compute the weight of every row for the specified columns. A difference range is defined as an interval of difference values $[r_l, r_h]$ with $r_l \le r_h$. Let $J_{uv}(r_l, r_h) = \{g_x : r_x^{uv} \in [r_l, r_h]\}$ be the set of genes whose difference with respect to columns $s_u$ and $s_v$ lies in the given weighted difference range. A difference range is called valid iff $\max(r_l, r_h) - \min(r_l, r_h) \le \rho$, where $\rho$ is the row weight in the valid range.

Normally, for microarray experiment data, genes and samples are represented by the vertex sets $V_1$ and $V_2$ respectively, and the edge weight $w_{ij}$ represents the response of the $i$th gene to the $j$th sample. However, in order to have a very compact representation, in this paper we construct the weighted undirected bipartite graph by partitioning the sample set into two disjoint sets called the upper layer $V_1$ and the lower layer $V_2$. Samples that do not have any data values are not considered in the formation of the disjoint sets. Each edge in the range bipartite graph has associated with it the weight and the gene set corresponding to the range on that edge. Different bipartite graphs emerge for different threshold values, where the threshold may be any weight value in the corresponding gene set. Consequently, we obtain different types of co-clusters. In order to include edges with large gene sets, we assign a rank to valid edges:

EdgeRank = [piecewise definition with the two values 1 and 2; its conditions involve the gene-set cardinality $|J(r_l, r_h)|$, the frequency count and $S$, but are not recoverable from this copy]

The inclusion and deletion of edges depends on the value of this EdgeRank and on the order in which we process the samples. We compute a frequency count for each edge, which is the number of occurrences of the cardinality of its gene set.

Table 1. Example of microarray dataset

      s0    s1    s2    s3    s4    s5    s6
g0    3.6   1.0   1.0   -     1.0   1.0   1.0
g1    3.0   2.5   -     -     2.0   -     1.0
g2    -     5.0   -     -     5.0   -     5.0
g3    6.6   5.5   -     -     -     -     2.0
g4    9.0   7.5   -     -     6.0   -     3.0
g5    6.6   -     -     -     4.4   -     2.0
g6    -     3.0   -     -     3.0   -     3.0
g7    -     8.0   8.0   -     8.0   8.0   -
g8    6.0   5.0   -     -     4.0   -     2.0
g9    -     4.0   4.0   -     4.0   4.0   4.0

Figure 1 shows the weighted difference values for different genes using columns $s_4$ and $s_6$ of Table 1. Here $\rho = 0.04$, which is the minimum row weight in the gene set. Considering $\sigma_x = 3$ and $\sigma_y = 2$, there is only one valid weighted difference range, $[0.0, 0.0]$, and the corresponding gene set is $J_{s_4 s_6}(0.0, 0.0) = \{g_0, g_2, g_6, g_9\}$. In this case, the number of valid ranges depends on the value of $\rho$. From the sorted difference values, the algorithm finds all valid weighted difference ranges for all pairs of columns $s_u, s_v \in S$. Different ranges may overlap. The algorithm for partitioning the vertex set into $V_1$ and $V_2$ for the construction of the bipartite graph is given in Figure 2. From Table 1, we have constructed a maximal weighted range bipartite graph (Figure 3). Let $S'$ be the set of columns with a missing value in each row. We take the weight of an edge to be the maximum weight in the corresponding gene set. The algorithm gives emphasis to edges having a large gene set in order to compensate for the deletion of a few valid edges while partitioning the sample set into two disjoint sets.

As we deal with noisy data, additive and multiplicative methods of finding clusters may not always lead to good results. Therefore, instead of comparing two column values independently [10], we compute the weight of each row for any two specified column values. We build the bipartite graph model of the data after properly conditioning the input data. Using columns $s_4$ and $s_6$ for the different genes in Table 1, we have, in sorted order:

W_i |s4 - s6|   0.0   0.0   0.0   0.0   0.06   0.24   0.29   0.51
Row             g0    g2    g6    g9    g1     g8     g5     g4

Figure 1. Weighted difference values for columns s4 and s6 of Table 1, in sorted order.

    Initialization: S'' = S - S'; let L = S''; let V1 = φ; let V2 = φ
    for each s_a ∈ L do
      for each s_b ∈ L do
        if EdgeRank(s_a, s_b) = 1 then
          add s_a and adj(s_b), such that adj(s_b) ∉ V2, to V1
          add s_b and adj(s_a), such that adj(s_a) ∉ V1, to V2
          L = L - {s_a, s_b, adj(s_a), adj(s_b)}
        end if
      end for
    end for
    while V1 ∪ V2 ≠ S'' do
      for each s_a ∈ L do
        for each s_b ∈ L do
          if EdgeRank(s_a, s_b) = 2 then
            add s_a to V1
            add s_b to V2
            L = L - {s_a, s_b}
          end if
        end for
      end for
    end while

Figure 2. Construction of bipartite graph.

B. Co-Cluster Identification

We have incorporated the concept of mutual importance of any two sample columns by finding the weight of each row for those specified columns. This compact representation of gene expression data using the range bipartite graph can be used to mine significant co-clusters and hence filters out most of the unrelated data. In order to mine all significant co-clusters, the algorithm COCLUSTER applies a depth-first search on the range bipartite graph, as shown in the pseudocode in Figure 4. It requires as input the values of $\rho$, $\sigma_x$ and $\sigma_y$, the range bipartite graph $R$, the set of all genes $G$, and the sample sets $V_1$ and $V_2$. The algorithm outputs the final set of co-clusters in $E$.

For example, consider the parameter values $\sigma_x = 3$ and $\sigma_y = 2$. The COCLUSTER algorithm starts at vertex $s_0$ with the reference cluster $\{g_0, g_1, \ldots, g_9\} \times \{s_0\}$. When we explore the vertices $s_1$ and $s_6$, we get the new reference clusters $\{g_3, g_5, g_8\} \times \{s_0, s_6\}$ and $\{g_3, g_4, g_8\} \times \{s_0, s_1\}$. As we explore other vertices, the corresponding edges do not have enough genes, and these reference clusters become the final clusters $E_1$ and $E_2$. Then we start exploring from vertex $s_2$ and get the final cluster $E_3 = \{g_0, g_7, g_9\} \times \{s_1, s_2, s_4, s_5\}$. Similarly, when we start from $s_4$, we get another cluster $E_4 = \{g_0, g_2, g_6, g_9\} \times \{s_1, s_4, s_6\}$. We search for these maximal cliques and put them in the final co-cluster set $E$.

Figure 3. Weighted, undirected range bipartite graph. Here, the upper layer is $V_1 = \{s_1, s_5, s_6\}$ and the lower layer is $V_2 = \{s_0, s_2, s_4\}$.

Figure 5. [Plot of time (sec) against number of genes.]
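
The weighting scheme of Section III and the worked example of Section IV.A (columns s4 and s6, $\rho = 0.04$, $\sigma_x = 3$) can be sketched as below. This is a minimal illustration, not the authors' implementation. The function names, the `PAIRS` dictionary and the small tolerance `EPS` are assumptions introduced here. The (s4, s6) value pairs are the ones consistent with Table 1 and Figure 1. Rounding each row weight to two decimals before multiplying is also an assumption, chosen because it reproduces the Figure 1 values 0.06, 0.24, 0.29 and 0.51; with unrounded weights the values for g8 and g4 come out as 0.23 and 0.52.

```python
from math import sqrt

EPS = 1e-9  # assumed tolerance for floating-point comparisons

# (s4, s6) expression values of the eight genes that have data in both columns.
# Hypothetical container name; the values are those consistent with Table 1 / Figure 1.
PAIRS = {
    "g0": (1.0, 1.0), "g1": (2.0, 1.0), "g2": (5.0, 5.0), "g4": (6.0, 3.0),
    "g5": (4.4, 2.0), "g6": (3.0, 3.0), "g8": (4.0, 2.0), "g9": (4.0, 4.0),
}

def row_weights(pairs):
    """W_i = GM_i / sum_i GM_i, where GM_i is the geometric mean of the two column values."""
    gm = {g: sqrt(a * b) for g, (a, b) in pairs.items()}
    total = sum(gm.values())
    return {g: v / total for g, v in gm.items()}

def weighted_differences(pairs, digits=2):
    """r_i = W_i * |a - b|, with W_i rounded to `digits` places first (an assumption)."""
    w = row_weights(pairs)
    return {g: round(round(w[g], digits) * abs(a - b), digits)
            for g, (a, b) in pairs.items()}

def valid_ranges(diffs, rho, sigma_x):
    """Maximal windows of the sorted differences with width <= rho and at least sigma_x genes."""
    items = sorted(diffs.items(), key=lambda kv: kv[1])
    found = []
    lo = 0
    for hi in range(len(items)):
        # Shrink from the left until the window [lo, hi] is no wider than rho.
        while items[hi][1] - items[lo][1] > rho + EPS:
            lo += 1
        # The window is maximal if it cannot be extended to the right.
        is_last = hi == len(items) - 1
        if is_last or items[hi + 1][1] - items[lo][1] > rho + EPS:
            if hi - lo + 1 >= sigma_x:
                genes = sorted(g for g, _ in items[lo:hi + 1])
                found.append(((items[lo][1], items[hi][1]), genes))
    return found

diffs = weighted_differences(PAIRS)
print(valid_ranges(diffs, rho=0.04, sigma_x=3))
# [((0.0, 0.0), ['g0', 'g2', 'g6', 'g9'])]
```

The sliding window works because the differences are sorted: a window that cannot be extended on either side without exceeding $\rho$ is a maximal candidate range, and the size check applies the $\sigma_x$ threshold.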

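
The mean squared residue of Cheng and Church discussed in Section II can be illustrated with a short sketch. It follows the definition $H(I, J) = \frac{1}{|I||J|} \sum (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$. The function name and the two small matrices are hypothetical and chosen only for illustration; the sketch assumes a complete matrix with no missing values.

```python
def mean_squared_residue(a, rows, cols):
    """Mean squared residue H(I, J) of the submatrix of `a` given by `rows` x `cols`."""
    sub = [[a[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_mean = [sum(r) / n_c for r in sub]                       # a_iJ
    col_mean = [sum(sub[i][j] for i in range(n_r)) / n_r         # a_Ij
                for j in range(n_c)]
    all_mean = sum(row_mean) / n_r                               # a_IJ
    total = sum((sub[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in range(n_r) for j in range(n_c))
    return total / (n_r * n_c)

# A purely additive (shifting) pattern has residue 0 up to rounding error.
shifting = [[1.0, 2.0, 3.0],
            [2.0, 3.0, 4.0],
            [4.0, 5.0, 6.0]]
print(mean_squared_residue(shifting, rows=[0, 1, 2], cols=[0, 1, 2]))

# A checkerboard pattern is not coherent and gives H = 0.25.
checker = [[1.0, 0.0],
           [0.0, 1.0]]
print(mean_squared_residue(checker, rows=[0, 1], cols=[0, 1]))
```

A submatrix would then count as a $\delta$-bicluster when the returned value does not exceed the chosen $\delta$.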