Using spectral components for predicting treatment effects on time series microarray gene expression profiles

Qian Xu
Bioengineering Program, HKUST, Clear Water Bay, Kowloon, Hong Kong
Email: fleurxq@ust.hk

Hong Xue
Dept. of Biochemistry, HKUST, Clear Water Bay, Kowloon, Hong Kong
Email: hxue@ust.hk

Qiang Yang
Dept. of Computer Science and Engineering, HKUST, Clear Water Bay, Kowloon, Hong Kong
Email: qyang@cse.ust.hk

Abstract: Analyzing time series gene expression profiles is an increasingly popular method for understanding the behavior of a wide range of biological systems. One could study the status of a disease by analyzing the induction or repression activity and the effects of a number of genes or a specific group of genes. In such a scenario, it is natural for biological researchers to ask whether one could predict treatment effects from such time series microarray gene expression profiles. However, this problem is a big challenge given the specific nature of the data: such time series gene expression profiles are usually short and their sampling rates are not uniform. Our experiments with a real-world dataset show that traditional machine learning methods such as support vector machines do not perform well in such a case. In this paper, we decompose a time series gene expression profile into frequency components and apply machine learning algorithms to improve the prediction accuracy. Experimental results show that our algorithm is both accurate and effective.

Keywords: time series gene expression; treatment prediction; spectral components

978-1-4244-4713-8/10/$25.00 ©2010 IEEE

I. INTRODUCTION

Gene expression is the process by which inheritable information from a gene is made into a functional gene product, such as protein or RNA. In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once. A global picture of cellular function can therefore be created and analyzed via the expression profiles. Microarray technology measures the relative activity of previously identified target genes. Time series expression profiling, unlike static expression profiling, captures the temporal process of gene expression, whereas static expression profiling provides only a single snapshot of the genes involved. For time series data, it is intuitive to see that successive points are not independent and identically distributed; hence, when analyzing such data, understanding the correlation between successive data points is also very important.

One of the applications of analyzing time series microarray gene expression data is to predict treatment effects. For many chronic diseases, treatment effects are often non-negligible and the side effects caused by improper treatment are very serious. For example, the dataset analyzed in this paper reflects the treatment effects of interferon and ribavirin on HCV (Hepatitis C Virus) infected patients. HCV is one of the causes of chronic hepatitis, cirrhosis, and hepatocellular carcinoma. The current method of HCV treatment is a combination of pegylated interferon alpha and the antiviral drug ribavirin for 24 or 48 weeks. Nevertheless, using these two kinds of drugs together may lead to side effects: patients may get headaches or even myeloid disorders and neuropsychiatric symptoms. Therefore, it is natural to ask whether we could predict the treatment effects at an early stage, instead of after 24 or 48 weeks, when the patients may already have shown symptoms of side effects. However, conventional methods in biology cannot handle such problems in a satisfying way. In [1], it is suggested that machine learning methods could help predict treatment effects from time series microarray gene expression profiles.

Successful analysis and comprehension of what is hidden behind these gene expression profiles is an important problem in bioinformatics, and many researchers have proposed algorithms for analyzing gene expression. Earlier work on analyzing time series gene expression data frequently used the same methods as for static expression [2]. Later, algorithms were developed that specifically target time series data [3]. However, time series data pose many specific challenges. Since it is very expensive to perform time series experiments, many time series are very short. It is shown in [4] that more than 80% of all time series datasets in the Stanford Microarray Database (SMD) contain fewer than 8 time points. The number of genes profiled is rather large, usually in the thousands. The conflict between such a large number of genes and the small number of time points poses an even greater challenge for analyzing such time series data.

Another challenge is that our specific problem of predicting treatment effects based on microarray time series gene expression profiles is a classification problem; nevertheless, a large amount of current research instead focuses on clustering methods for time series data [5], [6]. Even though one could try to use some density-based clustering methods and adapt them to a classification framework, many of these clustering algorithms will overfit in our case, where the number of time points is extremely small. Therefore, it is both necessary and difficult to design an algorithm that accurately predicts treatment effects from short time series microarray gene expression profiles.

There has also been much previous research on classifying gene expression; however, most of these methods focus on static expression. Using Support Vector Machines, Furey et al. [7] classified cancer tissue samples. Bicciato et al. [8] used Principal Component Analysis for multiclass cancer analysis.

A specific challenge for time series gene expression classification, as pointed out by [9], is that disease development or treatment response is not uniform and is patient-specific. The overall trajectory may be similar between patients, but different patients will progress at different speeds, even given the same treatment. Therefore, a classifier should be able to take the varying response rates and development speeds into account. Hence, traditional machine learning methods, such as Support Vector Machines, will not perform so well on this specific problem. Our experimental results in a later section also confirm this finding.

In this paper, we present an algorithm for predicting treatment effects based on time series microarray gene expression data by transforming the original gene expression data into its spectral component counterpart. We then employ a traditional SVM for classification. We compare our algorithm with direct classification of the original time-domain data via SVM on a real-world dataset and confirm by experiments that our algorithm is simple and effective.

The rest of this paper is organized as follows. In Section II, we describe related work on clustering gene expression data in both static and time series expression, classification with time series gene expression data, and other data mining methods for analyzing time series. In Section III, we describe our algorithm for predicting treatment effects via spectral components. In Section IV, we conduct experiments and show the effectiveness of our algorithm. Finally, we draw conclusions and discuss possible directions for future research.

II. RELATED WORK

A. Clustering Gene Expression Data

Many general clustering approaches have already been applied to cluster gene expression data [10]. In [11], Eisen et al. developed a clustering method based on the widely known hierarchical clustering algorithm. A K-means based clustering algorithm was developed by Herwig et al. [12] to cluster cDNA oligo-fingerprints. This algorithm does not require a predefined number of clusters. The HCS [13] algorithm represents the data as a similarity graph and then recursively partitions the current set of elements into two subsets by considering whether the subgraph induced by the current set of elements satisfies the stopping criterion. However, these algorithms are largely based on general clustering methods from the field of data mining and do not take the specific nature of time series gene expression data into consideration.

Taking the sequential property of time series gene expression data into consideration, many clustering algorithms specifically designed for time series gene expression data have been proposed. In [14], a Bayesian method for model-based clustering of gene expression dynamics was proposed, which represents gene expression dynamics as autoregressive equations and searches for the most probable set of clusters given the available data. In this way, the dynamic nature of time series gene expression data is taken into account. In practice, experiments show that such an algorithm works for long time series gene expression data but not for short ones. Ziv Bar-Joseph [5] proposed a clustering algorithm that uses splines to cluster a continuous representation of time series gene expression, yet it still cannot handle short time series gene expression data very well. In [4], a clustering algorithm specifically designed for short time series gene expression data was proposed, which uses a set of model profiles to cluster the results of these experiments.

There are many other clustering algorithms dealing with time series gene expression data. For example, a gene clustering algorithm based on a mixture of HMMs was proposed in [6]. Genes are associated with the HMM most likely to generate the time courses of the corresponding expression data. In [15], a multi-step approach for clustering time series gene expression data was introduced, consisting of non-linear PCA, probabilistic principal surfaces based on Negentropy. In [16], gene expression data are decomposed into frequency components and the correlation between the data from a pair of genes is measured in the frequency domain. An extensive review of clustering methods for gene expression data is beyond the scope and page limit of this paper.

B. Classification with Gene Expression Data

Another important topic related to our problem is the classification of gene expression data. One of the most important problems in this category is tumor classification. For example, several multicategory classification algorithms using support vector machines have been proposed in recent years, showing that some multiclass SVMs perform well in isolated gene expression cancer diagnostic experiments [17]. Moreover, it is believed that the final performance of the classifiers will improve when the classification results of different kinds of classifiers are combined; hence, ensemble learning algorithms may be used in such a scenario. In [18], traditional ensemble learning methods such as bagging and boosting were applied to tumor classification problems. Other applications include the work by Furey et al. [7] to classify cancer tissue samples and by Bicciato et al. [8] to analyze multiclass cancer using Principal Component Analysis. However, as mentioned above, these gene expression classification algorithms cannot be directly applied to time series gene expression classification, since they do not handle the temporal relationship between different time slices of the gene expression data.

C. Time Series Data Classification

Furthermore, the time series data classification task is also highly relevant to our problem, since our work focuses on predicting treatment effects from time series gene expression profile data. However, general time series classification algorithms are often applied only to long time series and will not perform so well on short time series.

One of the most important works in time series data classification is dynamic time warping (DTW), which aligns time series data and measures the dissimilarity between different sequences. [19] used a DTW-based decision tree for classifying time series sequences. In [20], first-order logic rules with boosting were employed for classifying time series. Much research work in this area has been conducted by Eamonn Keogh. In [21], a modification of DTW that operates on a higher-level abstraction of the data, namely Piecewise Aggregate Approximation, was proposed and shown to outperform DTW by one to two orders of magnitude. A new symbolic representation of time series, SAX, was proposed in [22]. It allows dimensionality or numerosity reduction and also allows distance measures to be defined on the symbolic approach that lower-bound the corresponding distance measures defined on the original series. [23] proposed a semi-supervised time series classification algorithm for the first time, in which accurate time series classifiers are built when only a small set of labeled examples is available. Therefore, self-training methods that use unlabeled data have the potential for significant benefits in time series classification. In [24], iSAX, a representation that supports indexing of massive datasets up to terabytes in size, was proposed and shown to be able to index up to one hundred million time series. It allows both fast exact search and approximate search.

Despite the large number of papers available in the areas of gene expression data clustering, gene expression data classification and time series data classification, to our knowledge no paper has formally attempted to solve the problem of time series gene expression data classification and prediction. Thus, our work is the first aiming to deal with this problem in this area.

III. PROPOSED METHODS

In this section, we describe our proposed algorithm for the spectral transformation of time series gene expression data and for classification in such a continuous domain.

We define a time series gene expression profile as a vector x. This vector x can be represented as:

$$x_n = \sum_{k=1}^{2K} c_k z_k^{n} = \sum_{k=1}^{2K} c_k e^{\sigma_k n + j\omega_k n} \tag{1}$$

In such a representation, the following relationship holds:

$$\begin{bmatrix} x_0 & x_1 & \cdots & x_{2K-1} \\ x_1 & x_2 & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ x_{N-2K-1} & x_{N-2K} & \cdots & x_{N-2} \end{bmatrix} \begin{bmatrix} p_{2K} \\ p_{2K-1} \\ \vdots \\ p_1 \end{bmatrix} = - \begin{bmatrix} x_{2K} \\ x_{2K+1} \\ \vdots \\ x_{N-1} \end{bmatrix} \tag{2}$$

Here $p_k$ ($1 \le k \le 2K$) are the coefficients of the polynomial:

$$p(z) = \prod_{k=1}^{2K} (z - z_k) = \sum_{k=0}^{2K} p_k z^{2K-k} \quad (p_0 = 1). \tag{3}$$

Equation 2 is the AutoRegressive (AR) model of the time series x. In Equation 1, the damping rates $\sigma_k$ and frequencies $\omega_k$ can be determined from the roots of the polynomial in Equation 3 once the $p_k$ have been calculated from Equation 2. Therefore, given $z_k = e^{\sigma_k + j\omega_k}$, we can calculate $z_k$, and then $c_k$ can be derived from Equation 1. Since x is real-valued, $z_k$ and $c_k$ occur in complex conjugate pairs. So we let $c_k = \alpha_k e^{j\theta_k}$. Then we can rewrite Equation 1 as:

$$x_n = \sum_k x^{(k)}_{n;\theta_k} = \sum_{k=1}^{2K} \alpha_k e^{\sigma_k n} \cos(\omega_k n + \theta_k) \quad (0 \le n \le N-1)$$

Here $\alpha_k$ and $\theta_k$ are the amplitude and phase of the k-th spectral component; therefore, each spectral component can be written as:

$$x^{(k)}_{n;\theta_k} = \alpha_k e^{\sigma_k n} \cos(\omega_k n + \theta_k).$$

By applying these steps, we can directly transform the time series gene expression data into its corresponding spectral components. We plan to use this spectral component representation for the classification task for the following reasons. Firstly, it takes the dependence between successive data points into account. It is easy to verify this claim from the above transformation steps: after the spectral transformation, each spectral component is related to many successive data points, and therefore such a representation overcomes the original drawback of loose connection between successive data points in the temporal domain. Secondly, we can estimate the parameters of all spectral components and can set the phase of each component to zero. Therefore, the phase shift problem encountered by time series gene expression data can be solved. Such a representation is also insensitive to noise, as described in [25].

Subsequently, we classify the original time series gene expression data with the conventional support vector machine classifier using our spectral component representation. Since SVMs are conventional classification algorithms, we omit the details of SVM; interested readers can find the technical details in [26].

IV. EXPERIMENTAL RESULTS

In this section, we describe the dataset used in this paper, analyze the performance of our proposed method and compare our algorithm with the baseline method. Our objective is to show that traditional and conventional classification methods cannot handle such a short time series gene expression data classification problem well, and to illustrate the advantage of our algorithm over the baseline.

A. Dataset Descriptions

Our time series microarray gene expression data was published by M. Taylor [27], and it is publicly available for download¹ with accession number GSE7123. This dataset records the gene expression data of 33 African-American and 36 Caucasian-American patients with HCV genotype 1 infection on days 1, 2, 4, 7, 14 and 28 of pegylated interferon and ribavirin therapy. The global gene expression in peripheral blood mononuclear cells (PBMC) was analyzed via 22283 probes on the HG-U133A GeneChip. Note that the dataset does not include some patients who did not have gene expression profile data for all 6 treatment days, which may be caused either by loss of data or because the patient did not receive the specific treatment on that day. We only included patients with all 6 treatment time points in our classification problem; therefore, for African-Americans we retained 28 patients, with 19 good responses and 9 poor responses, while for Caucasian-Americans we retained 30 patients, with 17 good responses and 13 poor responses.

Whether the treatment effect is positive is determined by the difference between the HCV RNA level at day 0 and the corresponding HCV RNA level at day 28. A treatment is labeled as a good response if the HCV RNA level decreases by at least 1.4 log10 IU/ml. Otherwise, if the HCV RNA level decrease is less than 1.4 log10 IU/ml, it is denoted a poor response.

B. Analysis of our results

We used our proposed method to transform the time series gene expression data into spectral components and employed a support vector machine for classification. The SVM package we use is SVMLight, implemented by Joachims². We used the Radial Basis Function (RBF) kernel in our experiments:

$$K(x_i, x_j) = e^{-\gamma \lVert x_i - x_j \rVert^2}$$

We classified our dataset while tuning the parameter $\gamma$ of the RBF function and calculated the proportion of accurately classified data in both the AA (African-American) and CA (Caucasian-American) groups, shown in the table below.

γ        AA Accuracies (TP Number)    CA Accuracies (TP Number)
0.25     85.7% (24)                   93.3% (28)
0.5      89.3% (25)                   93.3% (28)
1.0      89.3% (25)                   90.0% (27)
2.0      85.7% (24)                   96.7% (29)
4.0      96.4% (27)                   100.0% (30)
8.0      96.4% (27)                   100.0% (30)
16.0     96.4% (27)                   93.3% (28)
32.0     92.9% (26)                   96.7% (29)
64.0     92.9% (26)                   96.7% (29)
128.0    89.3% (25)                   90.0% (27)
256.0    85.7% (24)                   93.3% (28)

TABLE I. PERFORMANCE ON AA AND CA CATEGORIES WITH TUNING PARAMETER γ

As shown in the table above, our accuracies are quite high and the algorithm is relatively stable, since changing γ does not affect the accuracies much. However, is this high accuracy mainly caused by the support vector machine classification framework or by transforming the time series gene expression data into spectral components? In the next subsection, to analyze the effectiveness of our approach, we will compare our algorithm with the SVM approach, where the representation

¹ /geo
² /People/tj/
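
The spectral transformation of Section III (solve the AR system of Equation 2, take the roots of the polynomial in Equation 3, then fit the coefficients of Equation 1) can be illustrated with a short sketch. The code below is a minimal Python/NumPy illustration under stated assumptions, not the authors' implementation: the function names `spectral_components` and `zero_phase_features` are hypothetical, the linear systems are solved by ordinary least squares, and the series is assumed to be uniformly sampled with at least 4K points.

```python
import numpy as np

def spectral_components(x, K):
    """Decompose a real series into 2K damped sinusoids (hypothetical helper).

    Returns amplitudes, damping rates, frequencies and phases, one per root.
    """
    x = np.asarray(x, dtype=float)
    N, M = len(x), 2 * K
    if N < 2 * M:
        raise ValueError("need at least 4K samples to solve the AR system")
    # Equation 2: row n is [x_n, ..., x_{n+2K-1}]; unknowns are [p_2K, ..., p_1].
    A = np.array([x[n:n + M] for n in range(N - M)])
    b = -x[M:]
    p_rev, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Equation 3: p(z) = z^2K + p_1 z^(2K-1) + ... + p_2K, with p_0 = 1.
    coeffs = np.concatenate(([1.0], p_rev[::-1]))
    z = np.roots(coeffs).astype(complex)
    # Equation 1: x_n = sum_k c_k z_k^n, solved for c_k by least squares.
    V = np.vander(z, N, increasing=True).T  # V[n, k] = z_k ** n
    c, *_ = np.linalg.lstsq(V, x.astype(complex), rcond=None)
    damping = np.log(np.abs(z))   # damping rate of each root
    freq = np.angle(z)            # angular frequency of each root
    amp = np.abs(c)               # amplitude of each component
    phase = np.angle(c)           # phase of each component
    return amp, damping, freq, phase

def zero_phase_features(x, K):
    """Rebuild each spectral component with its phase set to zero."""
    amp, damping, freq, _ = spectral_components(x, K)
    n = np.arange(len(x))
    # One row per component: amp * exp(damping * n) * cos(freq * n)
    return amp[:, None] * np.exp(damping[:, None] * n) * np.cos(freq[:, None] * n)
```

For a six-point profile such as the ones in this dataset, the 4K-sample assumption limits the sketch to K = 1. The order in which `np.roots` returns the roots is not fixed, so a practical feature vector would probably need a consistent ordering (for example by frequency) before being passed to a classifier.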
