Using spectral components for predicting treatment effects on time series microarray gene expression profiles

Qian Xu, Bioengineering Program, HKUST, Clear Water Bay, Kowloon, Hong Kong. Email: fleurxq@ust.hk
Hong Xue, Dept. of Biochemistry, HKUST, Clear Water Bay, Kowloon, Hong Kong. Email: hxue@ust.hk
Qiang Yang, Dept. of Computer Science and Engineering, HKUST, Clear Water Bay, Kowloon, Hong Kong. Email: qyang@cse.ust.hk

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

Abstract

Analyzing time series gene expression profiles is an increasingly popular method for understanding the behavior of a wide range of biological systems. One could study the status of a disease by analyzing the induction or repression activity and effects of a number of genes or of a specific group of genes. In such a scenario, it is natural for biological researchers to ask whether treatment effects can be predicted from such time series microarray gene expression profiles. This problem is challenging because of the specific nature of the data: such profiles are usually short and their sampling rates are not uniform. Our experiments with a real-world dataset show that traditional machine learning methods such as the support vector machine do not perform well in this case. In this paper, we decompose a time series gene expression profile into frequency components and apply machine learning algorithms to improve the prediction accuracy. Experimental results show that our algorithm is both accurate and effective.

Keywords: time series gene expression; treatment prediction; spectral components

I. INTRODUCTION

Gene expression is the process by which inheritable information from a gene is made into a functional gene product, such as protein or RNA. In molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, so that a global picture of cellular function can be created and analyzed via the expression profiles. Microarray technology measures the relative activity of previously identified target genes. Time series expression profiling captures the temporal process of gene expression, whereas static expression profiling provides only a single snapshot of the genes involved. The successive points of a time series are clearly not independent and identically distributed; hence, when analyzing such data, understanding the correlation between successive data points is also very important.

One application of analyzing time series microarray gene expression data is to predict treatment effects. For many chronic diseases, treatment effects are often non-negligible and the side effects caused by improper treatment are very serious. For example, the dataset analyzed in this paper reflects the treatment effects of interferon and ribavirin on patients infected with HCV (Hepatitis C Virus). HCV is one of the causes of chronic hepatitis, cirrhosis, and hepatocellular carcinoma. The current method of HCV treatment is a combination of pegylated interferon alpha and the antiviral drug ribavirin for 24 or 48 weeks. Nevertheless, using these two drugs together may lead to side effects: patients may get headaches or even myeloid disorders and neuropsychiatric symptoms. It is therefore natural to ask whether the treatment effects can be predicted at an early stage, instead of after 24 or 48 weeks, when the patients may already have shown symptoms of side effects. Conventional methods in biology cannot handle such problems in a satisfying way. In [1], it is suggested that machine learning methods could help predict the treatment effects from time series microarray gene expression profiles.

Understanding what is hidden behind these gene expression profiles is an important problem in bioinformatics, and many researchers have proposed algorithms for analyzing gene expression. Earlier work on time series gene expression data frequently used the same methods as for static expression [2]. Later, algorithms were developed that specifically target time series data [3]. However, time series data pose many specific challenges. Since it is very expensive to perform time series experiments, many time series are very short. It is shown in [4] that more than 80% of all time series datasets in the Stanford Microarray Database (SMD) contain fewer than 8 time points. The number of genes profiled, by contrast, is rather large, usually in the thousands. The mismatch between such a large number of genes and so few time points makes the analysis even more challenging.

Another challenge is that our specific problem, predicting treatment effects from microarray time series gene expression profiles, is a classification problem, whereas a large part of current research focuses on clustering time series data [5], [6]. Even though one could adapt some density-based clustering methods to a classification framework, many of these clustering algorithms will overfit in our case, where the number of time points is extremely small. It is therefore both necessary and difficult to design an algorithm that accurately predicts the treatment effects from short time series microarray gene expression profiles.

There has also been much previous research on classifying gene expression data; however, most of these methods focus on static expression. Using Support Vector Machines, Furey et al. [7] classified cancer tissue samples. Bicciato et al. [8] used Principal Component Analysis for multiclass cancer analysis. A specific challenge for time series gene expression classification, as pointed out by [9], is that disease development or treatment response is not uniform and is patient specific. The overall trajectory may be similar between patients, but different patients will progress at different speeds, even given the same treatment. A classifier should therefore take the varying response rates and development speeds into account. Hence, traditional machine learning methods, such as Support Vector Machines, will not perform well on this specific problem. Our experimental results in a later section confirm this finding.

In this paper, we present an algorithm for predicting treatment effects from time series microarray gene expression data by transforming the original gene expression data into its spectral component counterpart. We then employ a traditional SVM for classification. We compare our algorithm with an SVM applied directly to the original data in the time domain on a real-world dataset, and the experiments confirm that our algorithm is simple and effective.

The rest of this paper is organized as follows. In Section 2, we describe related work on clustering gene expression data, for both static and time series expression; on classification with gene expression data; and on other data mining methods for analyzing time series. In Section 3, we describe our algorithm for predicting treatment effects via spectral components. In Section 4, we conduct experiments and show the effectiveness of our algorithm. Finally, we draw conclusions and discuss some possible directions for future research.

II. RELATED WORK

A. Clustering Gene Expression Data

Many general clustering approaches have already been applied to gene expression data [10]. In [11], Eisen et al. developed a clustering method based on the widely known hierarchical clustering algorithm. A K-means-based clustering algorithm was developed by Herwig et al. [12] to cluster cDNA oligo-fingerprints. This algorithm does not require a predefined number of clusters. The HCS [13] algorithm represents the data as a similarity graph and then recursively partitions the current set of elements into two subsets, depending on whether the subgraph induced by the current set of elements satisfies the stopping criterion. However, these algorithms are largely based on general clustering methods from data mining and do not take the specific nature of time series gene expression data into consideration.

Many clustering algorithms have been designed specifically for time series gene expression data and take its sequential property into account. In [14], a Bayesian method for model-based clustering of gene expression dynamics was proposed, which represents gene expression dynamics as autoregressive equations and searches for the most probable set of clusters given the available data. In this way, the dynamic nature of time series gene expression data is taken into account. In practice, experiments show that such an algorithm works for long time series gene expression data but not for short ones. Ziv Bar-Joseph [5] proposed a clustering algorithm that uses splines to cluster a continuous representation of time series gene expression, yet it still cannot handle short time series very well. In [4], a clustering algorithm designed specifically for short time series gene expression data was proposed, which uses a set of model profiles to cluster the results of these experiments.

There are many other clustering algorithms for time series gene expression data. For example, a gene clustering algorithm based on a mixture of HMMs was proposed in [6]; genes are associated with the HMM most likely to generate the time courses of the corresponding expression data. In [15], a multi-step approach for clustering time series gene expression data was introduced, consisting of nonlinear PCA and probabilistic principal surfaces based on negentropy. In [16], gene expression data is decomposed into frequency components and the correlation between the data from a pair of genes is measured in the frequency domain. An extensive review of clustering methods for gene expression data is beyond the scope and page limit of this paper.

B. Classification with Gene Expression Data

Another important topic related to our problem is the classification of gene expression data. One of the most important problems in this category is tumor classification. For example, several multicategory classification algorithms based on support vector machines have been proposed in recent years, showing that some multiclass SVMs perform well in isolated gene expression cancer diagnostic experiments [17]. Moreover, the final performance of the classifiers can be expected to improve when the results of different kinds of classifiers are combined; hence, ensemble learning algorithms may be used in such a scenario. In [18], traditional ensemble learning methods such as bagging and boosting were applied to tumor classification problems. Other applications include the work by Furey et al. [7] on classifying cancer tissue samples and by Bicciato et al. [8] on multiclass cancer analysis using Principal Component Analysis. However, as mentioned above, these gene expression classification algorithms cannot be directly applied to time series gene expression classification, since they do not handle the temporal relationship between different time slices of the gene expression data.

C. Time Series Data Classification

Time series data classification is also highly relevant to our problem, since our work focuses on predicting treatment effects from time series gene expression profiles. However, general time series classification algorithms are often applied only to long time series and do not perform well on short ones. One of the most important techniques in time series data classification is dynamic time warping (DTW), which aligns time series data and measures the dissimilarity between different sequences. [19] used a DTW-based decision tree for classifying time series sequences. In [20], first-order logic rules with boosting were employed for classifying time series. Much research work in this area has been conducted by Eamonn Keogh. In [21], a modification of DTW that operates on a higher-level abstraction of the data, namely Piecewise Aggregate Approximation, was proposed and shown to outperform DTW by one or two orders of magnitude. A new symbolic representation of time series, SAX, was proposed in [22]. It allows dimensionality or numerosity reduction, and it also allows distance measures to be defined on the symbolic representation that lower-bound the corresponding distance measures defined on the original series. [23] proposed the first semi-supervised time series classification algorithm, with which accurate time series classifiers were built when only a small set of labeled examples was available. Self-training methods that use unlabeled data therefore have the potential for significant benefits in time series classification. In [24], iSAX, a representation that supports indexing of massive datasets of up to terabytes, was proposed and shown to be able to index up to one hundred million time series. It allows both fast exact search and approximate search.

Despite the large number of papers on gene expression data clustering, gene expression data classification and time series data classification, to our knowledge no paper has formally proposed a solution to the problem of time series gene expression data classification and prediction. Thus, our work is the first to address this problem in this area.

III. PROPOSED METHODS

In this section, we describe our proposed algorithm for the spectral transformation of time series gene expression data and for classification in the resulting continuous domain. We define a time series gene expression profile as a vector x of length N. This vector can be represented as

\[
x[n] = \sum_{k=1}^{2K} c_k z_k^{n} = \sum_{k=1}^{2K} c_k e^{(\sigma_k + j\omega_k) n} \tag{1}
\]

In such a representation, the following relationship holds:

\[
\begin{bmatrix}
x[0] & x[1] & \cdots & x[2K-1] \\
x[1] & x[2] & \cdots & x[2K] \\
\vdots & \vdots & \ddots & \vdots \\
x[N-2K-1] & x[N-2K] & \cdots & x[N-2]
\end{bmatrix}
\begin{bmatrix}
p_{2K} \\ p_{2K-1} \\ \vdots \\ p_1
\end{bmatrix}
= -
\begin{bmatrix}
x[2K] \\ x[2K+1] \\ \vdots \\ x[N-1]
\end{bmatrix} \tag{2}
\]

Here \(p_k\) (\(1 \le k \le 2K\)) are the coefficients of the polynomial

\[
p(z) = \prod_{k=1}^{2K} (z - z_k) = \sum_{k=0}^{2K} p_k z^{2K-k}, \qquad p_0 = 1. \tag{3}
\]

Equation 2 is the AutoRegressive (AR) model of the time series x. In Equation 1, the damping rates \(\sigma_k\) and frequencies \(\omega_k\) can be determined from the roots of the polynomial in Equation 3 once the \(p_k\) have been calculated from Equation 2. Given \(z_k = e^{\sigma_k + j\omega_k}\), we can calculate \(z_k\), and \(c_k\) can then be derived from Equation 1. Since x is real valued, \(z_k\) and \(c_k\) occur in complex conjugate pairs. We therefore let \(c_k = \alpha_k e^{j\phi_k}\) and rewrite Equation 1 as

\[
x[n] = \sum_{k} x_k[n; \phi_k] = \sum_{k=1}^{2K} \alpha_k e^{\sigma_k n} \cos(\omega_k n + \phi_k), \qquad 0 \le n \le N-1.
\]

Here \(\alpha_k\) and \(\phi_k\) are the amplitude and phase of the kth spectral component, so each spectral component has the form

\[
x_k[n; \phi_k] = \alpha_k e^{\sigma_k n} \cos(\omega_k n + \phi_k).
\]

By applying these steps, we can directly transform the time series gene expression data into its corresponding spectral components.

We use this spectral component representation for the classification task for the following reasons. Firstly, it takes the dependence between successive data points into account. This is easy to verify from the transformation steps above: after the spectral transformation, each spectral component is related to many successive data points, so the representation overcomes the original drawback of the loose connection between successive data points in the temporal domain. Secondly, we can estimate the parameters of all spectral components and set the phase of each component to zero, which solves the phase shift problem encountered with time series gene expression data. Such a representation is also insensitive to noise, as described in [25].

Subsequently, we classify the original time series gene expression data with a conventional classifier, the support vector machine, using our spectral component representation. Since SVMs are conventional classification algorithms, we omit the details; interested readers can find them in [26].

IV. EXPERIMENTAL RESULTS

In this section, we describe the dataset used in this paper, analyze the performance of our proposed method, and compare our algorithm with the baseline method. Our objective is to show that traditional classification methods cannot handle such a short time series gene expression classification problem well, and to illustrate the advantage of our algorithm over the baseline.

A. Dataset Descriptions

Our time series microarray gene expression data was published by M. Taylor [27] and is publicly available for download (http://www.ncbi.nlm.nih.gov/geo) with accession number GSE7123. The dataset records the gene expression data of 33 African American and 36 Caucasian American patients with HCV genotype 1 infection on days 1, 2, 4, 7, 14 and 28 of pegylated interferon and ribavirin therapy. The global gene expression in peripheral blood mononuclear cells (PBMC) was analyzed via 22283 probes on the HG-U133A GeneChip. Note that for some patients the dataset does not contain gene expression profiles for all 6 treatment days, either because data was lost or because the patient did not receive the treatment on that day. We included only the patients with all 6 profiles in our classification problem. For African Americans, we kept 28 patients, with 19 good responses and 9 poor responses; for Caucasian Americans, we kept 30 patients, with 17 good responses and 13 poor responses. Whether the treatment effect is positive is determined by the difference between the HCV RNA level at day 0 and the corresponding level at day 28. A good response requires a decrease in HCV RNA level of at least 1.4 log10 IU/ml; a decrease of less than 1.4 log10 IU/ml is denoted as a poor response.

B. Analysis of our results

We used our proposed method to transform the time series gene expression data into spectral components and employed a support vector machine for classification. The SVM package we used is SVMLight, implemented by Joachims (http://www.cs.cornell.edu/People/tj/). We used the Radial Basis Function (RBF) kernel \(K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}\) in our experiments. We classified our dataset while tuning the parameter \(\gamma\) of the RBF function and calculated the proportion of accurately classified data for both AA (African Americans) and CA (Caucasian Americans), as shown in Table I.

γ        AA Accuracy (%)   AA TP Number   CA Accuracy (%)   CA TP Number
0.25     85.7              24             93.3              28
0.5      89.3              25             93.3              28
1.0      89.3              25             90.0              27
2.0      85.7              24             96.7              29
4.0      96.4              27             100.0             30
8.0      96.4              27             100.0             30
16.0     96.4              27             93.3              28
32.0     92.9              26             96.7              29
64.0     92.9              26             96.7              29
128.0    89.3              25             90.0              27
256.0    85.7              24             93.3              28

TABLE I. PERFORMANCE ON AA AND CA CATEGORIES WITH TUNING PARAMETER γ

As shown in Table I, our accuracies are quite high and the algorithm is relatively stable, since changing \(\gamma\) does not affect the accuracies much. However, is this high accuracy mainly caused by the support vector machine classification framework or by transforming the time series gene expression data into spectral components? In the next subsection, to analyze the effectiveness of our approach, we compare our algorithm with the SVM approach in which the time series microarray gene expression data is represented in its original temporal domain.

C. Comparison with the baseline

As mentioned above, in order to determine whether the high performance is achieved by our proposed approach using spectral components or simply by the SVM classification algorithm, we compared our algorithm with an SVM in which the time series gene expression profiles are represented in their original temporal domain. The following two tables show the performance of the SVM using the RBF kernel and the polynomial kernel, respectively, on the AA and CA categories with tuning parameters.

γ        AA Category (%)   CA Category (%)
0.25     67.8571           56.6667
0.5      67.8571           56.6667
1.0      67.8571           56.6667
2.0      67.8571           56.6667
4.0      67.8571           56.6667
8.0      67.8571           56.6667
16.0     67.8571           56.6667
32.0     67.8571           56.6667
64.0     67.8571           56.6667
128.0    67.8571           56.6667
256.0    67.8571           56.6667

TABLE II. PERFORMANCE OF SVM USING RBF KERNEL ON AA AND CA CATEGORIES WITH TUNING PARAMETER γ

The results in Table II and Table III demonstrate that applying an SVM directly to predict treatment effects from time series gene expression data is not promising. The performance improvement mainly relies on embedding spectral components into our algorithm.

V. CONCLUSION AND FUTURE WORK

In this paper, we proposed an algorithm that uses time series gene expression data to predict treatment effects in advance. Our algorithm is based on a spectral component transformation together with a support vector machine, so as to deal with the challenge of classifying short time series gene expression data. Our experimental results on a real-world dataset of HCV-infected patients under pegylated interferon and ribavirin therapy have confirmed the effectiveness of our algorithm.

We plan to extend our work in the following directions. Firstly, it is noteworthy that time series data has strong correlations between successive time points. However, transforming the original data into the spectral domain may lose this kind of information. It is natural to ask whether it is possible to employ other graphical models in machine learning
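
The transformation described in Section III (Equations 1 to 3) can be illustrated with a short sketch. The recipe of solving the linear prediction system, taking the polynomial roots and then solving for the complex amplitudes resembles what is usually called Prony's method, although the paper does not use that name. The function names `prony_components` and `reconstruct`, the use of NumPy least squares, and the `zero_phase` switch are assumptions made for illustration; the text does not state how the authors implemented the steps or which parameters they passed to the SVM.

```python
import numpy as np


def prony_components(x, K):
    """Estimate 2K damped spectral components of a real series x.

    Hypothetical helper: follows Equations 1-3 of Section III, solved
    with ordinary least squares (an assumption, not the authors' code).
    """
    x = np.asarray(x, dtype=float)
    N, M = len(x), 2 * K
    if N < 2 * M:
        raise ValueError("need at least 4K samples to solve Equation 2")
    # Equation 2: row i is [x[i], ..., x[i+M-1]], unknowns are
    # [p_M, ..., p_1], right-hand side is -x[i+M].
    A = np.array([x[i:i + M] for i in range(N - M)])
    b = -x[M:]
    p_rev, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Equation 3: p(z) = z^M + p_1 z^(M-1) + ... + p_M, with p_0 = 1.
    z = np.roots(np.concatenate(([1.0], p_rev[::-1])))
    # Equation 1: x[n] = sum_k c_k z_k^n, a Vandermonde system in c_k.
    V = np.vander(z, N, increasing=True).T  # V[n, k] = z_k ** n
    c, *_ = np.linalg.lstsq(V, x.astype(complex), rcond=None)
    sigma, omega = np.log(np.abs(z)), np.angle(z)  # z_k = exp(sigma_k + j*omega_k)
    alpha, phi = np.abs(c), np.angle(c)            # c_k = alpha_k * exp(j*phi_k)
    return sigma, omega, alpha, phi


def reconstruct(sigma, omega, alpha, phi, N, zero_phase=False):
    """Sum the components alpha_k * exp(sigma_k n) * cos(omega_k n + phi_k)."""
    n = np.arange(N)[:, None]
    ph = 0.0 if zero_phase else phi  # zero_phase drops the phase of every component
    return (alpha * np.exp(sigma * n) * np.cos(omega * n + ph)).sum(axis=1)
```

With six time points, as in the dataset of Section IV, only K = 1 satisfies the sample-count check in this sketch. Calling `reconstruct(..., zero_phase=True)` corresponds to setting the phase of each component to zero, as the text suggests for handling phase shifts between patients.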