Online Feature Selection for Mining Big Data

Steven C. H. Hoi∗, Jialei Wang∗, Peilin Zhao∗, Rong Jin†
∗School of Computer Engineering, Nanyang Technological University, Singapore
†Department of Computer Science and Engineering, Michigan State University, USA
{chhoi, JL.Wang, zhao0106}@.sg, rongjin@

ABSTRACT

Most studies of online learning require accessing all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of Online Feature Selection (OFS), in which the online learner is only allowed to maintain a classifier involving a small and fixed number of features. The key challenge of online feature selection is how to make accurate predictions using a small and fixed number of active features. This is in contrast to the classical setup of online learning, where all the features are active and can be used for prediction. We address this challenge by studying sparsity regularization and truncation techniques. Specifically, we present an effective algorithm to solve the problem, give a theoretical analysis, and evaluate the empirical performance of the proposed algorithms for online feature selection on several public datasets. We also demonstrate the application of our online feature selection technique to tackle real-world problems of big data mining, where it is significantly more scalable than some well-known batch feature selection algorithms. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques for large-scale applications.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—data mining; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

Feature Selection, Online Learning, Classification

1. INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
BigMine'12, August 12, 2012, Beijing, China.
Copyright 2012 ACM 978-1-4503-1547-0/12/08 ...$10.00.

Feature selection is an important topic in data mining and machine learning and has been studied for many years in the literature [10, 19, 26, 31]. The objective of feature selection is to select a subset of relevant features for building effective prediction models. By removing irrelevant and redundant features, feature selection can improve the performance of prediction models by alleviating the effect of the curse of dimensionality, enhancing the generalization performance, speeding up the learning process, and improving the model interpretability. Feature selection has found applications in many domains, especially for problems involving high-dimensional data.

Despite being studied extensively, most existing studies on feature selection assume that the feature selection task is conducted in an off-line learning fashion and that all the features of training instances are given a priori. Such assumptions may not always hold for real-world applications where training examples arrive in an online manner or it is expensive to collect the full information of the training data. For example, in an online spam email detection system, training data usually arrive sequentially, making it difficult to deploy a regular batch feature selection technique in a timely, efficient, and scalable manner. Another example is feature selection in bioinformatics, where acquiring the entire set of features/attributes for every instance is expensive due to the high cost of conducting wet lab experiments.

Unlike the existing studies on feature selection, in this paper we investigate the problem of Online Feature Selection (OFS), which aims to solve the feature selection problem by an online learning approach. The goal
of online feature selection is to develop online classifiers that involve only a small and fixed number of features. In order to mine big data in real-world applications, we must be able to efficiently identify a fixed number of relevant features for building accurate prediction models in the online learning process. In this paper, we propose algorithms to solve the OFS task, analyze their theoretical properties, and validate their empirical performance by extensive experiments. Finally, we apply our technique to solve applications with large-scale datasets.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the problem and the proposed algorithms as well as the theoretical analysis. Section 4 discusses our empirical studies and Section 5 concludes this work.

2. RELATED WORK

Our work is closely related to the studies of online learning and feature selection in machine learning and data mining. Below we review some important related work in both areas.

For online learning, many techniques have been proposed for solving different tasks in the literature [30, 6, 39, 20, 41, 25]. The most well-known method is the Perceptron algorithm [30, 16], which updates the model by adding a new example with some constant weight into the current set of support vectors when the incoming example is misclassified. Moreover, many new online learning algorithms have been developed recently, most of which follow the criterion of the maximum margin learning principle [17, 21, 6,
40, 34]. For example, the Passive-Aggressive algorithm [6] proposes to update a classifier that stays near the previous function while suffering less loss on the current instance. The PA algorithm is limited in that it only exploits the first-order information; this limitation has been addressed by the recently proposed confidence-weighted online learning algorithms that exploit the second-order information [14, 7, 8]. Despite the extensive investigation, most studies of online learning exploit the full set of features. In contrast, we consider an online learning problem where the number of active features is fixed, a significantly more challenging problem than the conventional setup of online learning.

Feature Selection (FS) has been actively studied. The goal of FS is to select the most relevant features from the whole feature space in order to improve the prediction performance of the predictors. Various FS methods have been proposed. Based on the choice of selection criterion, these methods can be roughly divided into three categories: filter methods, wrapper methods, and embedded methods. Filter methods [38, 9, 1] rely on characteristics of the data such as correlation, distance, and information, without involving any learning algorithm; wrapper methods [22] require one predetermined learning algorithm for evaluating the performance of the selected feature set. Generally, wrapper methods search for the features suited to the predetermined learning algorithm to improve its performance, but are more computationally expensive; embedded methods [2, 5, 37] aim to integrate the feature selection process into the model training process. Embedded methods are faster than wrapper methods and still provide a suitable feature subset for the learning algorithm, but the resultant features may not be suitable for other learning algorithms.

It is important to distinguish online feature selection, the problem addressed in this work, from the Online Feature Selection or Streaming Feature Selection studied in [29, 18, 36]. In those works, features are assumed to arrive one at a time, while all the training instances are available before the learning process starts. Their goal is to select a subset of features and return an appropriate model at each time step given the features observed so far. In contrast, we focus on online learning where training instances arrive sequentially.

The proposed work is closely related to sparse online learning [15, 24], whose goal is to learn a sparse linear classifier. Our work differs from these studies in that we impose a hard constraint on the number of non-zero elements in the classifier w, while the studies of sparse online learning only impose soft constraints on the sparsity of the classifier. Our work is also related to budget online learning [3, 11, 27, 42], where the number of support vectors is bounded by a predefined number. A common strategy behind many budget online learning algorithms is to remove the "oldest" support vector when the maximum number of support vectors is reached. This simple strategy, however, is not applicable to online feature selection. Finally, our work is closely related to the online learning algorithm that aims to learn a classifier that performs as well as the best subset of experts [35], which is in contrast to most online learning work on prediction with expert advice, which only compares to the best single expert in the ensemble [4]. Unlike the work [35], where only positive weights are assigned to individual experts, in this study the weights assigned to individual features can be both negative and positive, making our approach more flexible.

3. ONLINE FEATURE SELECTION

3.1 Problem Setting

In this paper, we consider the problem of online feature selection for binary classification. Let {(x_t, y_t) | t = 1, ..., T} be a sequence of input patterns received over the trials, where each x_t ∈ R^d is a d-dimensional vector and y_t ∈ {−1, +1}. In our study, we assume that d is a large number, and for computational efficiency we need to select a relatively small number of features for linear classification. More specifically, in each trial t, the learner presents a classifier w_t ∈ R^d that will be used to classify the instance x_t by the linear function sgn(w_t^T x_t). Instead of using all the features for classification, we require the classifier w_t to have at most B non-zero elements, i.e., ‖w_t‖_0 ≤ B, where B > 0 is a predefined constant; consequently at most B features of x_t will be used for classification. We refer to this problem as online feature selection. Our goal is to design an effective strategy for online feature selection that makes a small number of mistakes. Throughout the paper, we assume ‖x_t‖_2 ≤ 1 for t = 1, ..., T.

3.2 Algorithms

In this problem, we assume the learner is provided with the full inputs of every training instance (i.e., x_1, ..., x_T).

3.2.1 A Simple Truncation Approach

A straightforward approach to online feature selection is to modify the Perceptron algorithm by applying truncation. Specifically, in the t-th trial, when asked to make a prediction, we truncate the classifier w_t by setting all but the B largest (in absolute value) elements of w_t to zero. This truncated classifier, denoted by w_t^B, is then used to classify the received instance x_t. Similar to the Perceptron algorithm, when the instance is misclassified, we update the classifier by adding the vector y_t x_t, where (x_t, y_t) is the misclassified training example. Algorithm 1 shows the steps of this approach.

Unfortunately, this simple approach does not work: it cannot guarantee a small number of mistakes. To see this, consider the case where the input pattern x can only take two possible patterns, either x_a or x_b. For x_a, we set its first B elements to be one and the remaining elements to be zero. For x_b, we set its first B elements to be zero and the remaining elements to be ones. An instance x is assigned to the positive class (i.e., y = +1) when it is x_a, and assigned to the negative class (i.e., y = −1) when it is x_b.

4. EXPERIMENTAL RESULTS

In this section, we conduct an extensive set of experiments to evaluate the performance of the proposed online feature selection algorithms. We first evaluate the online predictive performance on the OFS task on several benchmark datasets from the UCI machine learning repository, and then demonstrate the proposed online feature selection technique on real-world applications by comparing it with some well-known and successful batch feature selection techniques from the literature [28]. Finally, we examine the scalability of the proposed technique for mining large-scale datasets.

4.1 Experiment I: Evaluation of Online Learning Performance on UCI Data Sets

4.1.1 Experimental Testbed on UCI Datasets

To extensively examine the performance, we test the algorithms on a number of public datasets from web machine learning repositories. All of the datasets can be downloaded either from the LIBSVM website¹ or the UCI machine learning repository². Table 1 shows a list of the six binary datasets used in our experiments, which were chosen randomly.

¹ .tw/~cjlin/libsvmtools/
² /~mlearn/MLRepository.html

4.1.2 Experimental Setup and Compared Algorithms

We compare the proposed OFS algorithm with two baselines: (i) the modified Perceptron with simple truncation in Algorithm 1, denoted "PEtrun", and (ii) a randomized feature selection algorithm, which randomly selects a fixed number of active features in an online learning task, denoted "RAND" for short. To make a fair comparison, all algorithms adopt the same experimental settings. For every dataset, we set the number of selected features to round(0.1 * dimensionality), the regularization parameter λ to 0.01, and the learning rate η to 0.2. All parameters were chosen by the same approach. After that, all the experiments were conducted over 20 random permutations of each dataset. All the experimental results are reported by averaging over these 20 runs.

4.1.3 Evaluation of Online Predictive Performance

Table 8 summarizes the online predictive performance of the compared algorithms with a fixed fraction of selected features (10% of all dimensions) on these datasets. Several observations can be drawn from the results. First of all, we find that among all the compared algorithms, the RAND algorithm has the highest mistake rate in all cases. This shows that it is important to learn the active features in an OFS task. Second, we find that the simple "PEtrun" algorithm outperforms the RAND algorithm considerably, which indicates that the algorithm is effective at choosing informative features for online learning tasks. Finally, among the three algorithms, the OFS algorithm achieves the smallest mistake rate, significantly smaller than that of the other two algorithms. This shows that the proposed algorithm is able to considerably boost the performance of the simple "PEtrun" approach.

To further examine the online predictive performance, Figure 1 shows the online average mistake rates over the entire OFS process on three randomly chosen datasets (similar observations can be made on the other three datasets; we omit them due to space limitations). Similar to the previous observations, we can see that the proposed OFS algorithm consistently surpasses the other two algorithms in all situations. This again verifies the efficacy of the proposed OFS algorithm. Finally, Figure 2 shows the details of the online performance of the compared online feature selection algorithms with varied fractions of selected features. The proposed OFS algorithm outperforms the other two baselines in most cases. This encouraging result further verifies the efficacy of the proposed technique.

4.2 Experiment II: Online vs. Batch Comparison for Real-world Applications

In this experiment, we apply the proposed OFS technique to tackle feature selection tasks of real-world applications in computer vision and bioinformatics.

4.2.1 Experimental Datasets and Setup

The first application is feature selection for image classification. We adopt the CIFAR-10 image dataset [23]³, which consists of 10 classes of images and is a subset of the well-known 80-million images. In this experiment, we randomly choose two classes, "airplane" and "bird", to form a binary classification task. The resulting dataset consists of a total of 3,992 images, and each image is represented by a 3073-dimensional feature vector.

The second application is feature selection on microarray gene expression data in bioinformatics. We adopt the Colon dataset, a microarray gene expression dataset of tumor and normal colon tissues [33]⁴. The dataset has 62 samples, and each sample contains 2000 gene features. The parameter settings are the same as in the previous section. All the experiments were conducted over 20 random permutations of each of the two datasets. All the results are reported by averaging over these 20 runs.

³ /~kriz/cifar.html
⁴ /oncology/affydata/index.html

4.2.2 Evaluation of Online Prediction Performance

The experimental results are shown in Table 3 and Table 4. Note that the average mistake rates of the original online algorithm with the full set of features on the CIFAR-10 and Colon datasets are 0.2266 ± 0.0030 and 0.3548 ± 0.0541, respectively. Several observations can be drawn. First, on both datasets our OFS algorithm performs significantly better than the random feature selection method and the simple truncation approach, which demonstrates the effectiveness of our algorithms. Second, on the CIFAR-10 dataset, our OFS algorithm performs better as the fraction of selected features increases, and matches the performance of OGD when the fraction reaches 64%. On the Colon dataset, as the fraction of features used increases, the performance first improves and then drops; it achieves its best when selecting 2% of the features, which is superior to using all the features. This may be because the Colon dataset contains some noisy features that affect the performance.

4.2.3 Online vs. Batch Feature Selection Methods

All the above experiments are conducted in an online learning setting. It is also useful to compare the proposed algorithm against some existing batch feature selection methods. To this purpose, we compare our OFS algorithm with a state-of-the-art batch feature selection method: minimum Redundancy Maximum Relevance Feature Selection (mRMR) [28, 12]. We divide the datasets equally into two parts: the first part is used for running the feature selection algorithms (OFS and mRMR), and the second part is used to test the performance of the feature selection algorithms. To examine whether the efficacy of the selected features is invariant to the choice of classifier, we adopt two types of widely used classifiers: (i) online gradient descent (OGD), an online learning classifier, and (ii) the K-nearest neighbor classifier (KNN), a batch learning classifier. In this experiment, we simply fix K = 5 for the parameter K in the KNN classifier. We evaluate the performance in terms of both the classification error rates and the computational time efficiency of the two different feature selection algorithms.

The experimental results are shown in Table 5 and Table 6, respectively, and the time efficiency comparison is shown in Figure 4. From the experimental results, we can see that the proposed OFS algorithm in general outperforms mRMR in most cases in terms of classification efficacy for both classifiers. In terms of time efficiency and scalability, we observe that the OFS algorithm has a significant advantage over the batch feature selection algorithm, especially when the number of features to be selected is large. For example, when choosing 32% of the features on the CIFAR-10 dataset, the mRMR algorithm spent about 1045.83 seconds on learning, while the proposed OFS algorithm took only 1.08 seconds, almost 1000 times faster.

4.3 Experiment III: Evaluation on Big Data

In this section, we evaluate the performance of the proposed algorithms for mining big datasets, each of which contains at least 100,000 instances. The statistics of these datasets are shown in Table 7. Table 8 shows the experimental results in terms of the average numbers of mistakes and the time costs of the three different algorithms; the experiments were implemented in MATLAB and run on a regular PC with a dual-core CPU. From the results, it is clear that the proposed OFS algorithm significantly outperforms the other two baselines. In addition, by examining the time costs, we find that all three online algorithms are very efficient, generally requiring only a few minutes to learn the feature selection on these large-scale datasets.

5. CONCLUSIONS

In this paper, we investigated a new research problem, Online Feature Selection (OFS), which aims to select a fixed number of features for prediction in an online learning fashion. We presented a novel OFS algorithm to solve the learning task and offered a theoretical analysis of the mistake bound of the proposed OFS algorithm. We extensively examined its empirical performance on both regular UCI datasets and large-scale datasets. We also compared the proposed online feature selection technique with a regular state-of-the-art batch feature selection algorithm on real-world applications: image classification in computer vision and microarray gene expression analysis in bioinformatics. Encouraging results show that the proposed algorithms are fairly effective for feature selection tasks in online applications, and significantly more efficient and scalable than some state-of-the-art batch feature selection techniques.

Acknowledgements

This work was in part supported by MOE tier 1 grant (RG33/11) and Microsoft Research grant (M4060936).

6. REFERENCES

[1] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. Distributional word clusters vs. words for text categorization. Journal of Machine
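The simple truncation approach of Section 3.2.1 (Algorithm 1) and its failure mode can be made concrete with a short sketch. This is an illustrative reconstruction, not the authors' code: the function names and NumPy's tie-breaking in the sort are our own choices; only the update rule (truncate to the B largest-magnitude weights, plain Perceptron update on a mistake) follows the text.

```python
import numpy as np

def truncate(w, B):
    """Zero all but the B largest-magnitude entries of w (Section 3.2.1)."""
    if np.count_nonzero(w) <= B:
        return w.copy()
    kept = np.argsort(np.abs(w))[-B:]   # indices of the B largest |w_i|
    w_B = np.zeros_like(w)
    w_B[kept] = w[kept]
    return w_B

def truncated_perceptron(stream, d, B):
    """Sketch of Algorithm 1: predict with the truncated classifier,
    apply the plain Perceptron update on every mistake."""
    w = np.zeros(d)
    mistakes = 0
    for x, y in stream:
        w_B = truncate(w, B)
        if y * np.sign(w_B @ x) <= 0:   # a zero score counts as a mistake
            mistakes += 1
            w = w + y * x               # Perceptron update on the full vector
    return truncate(w, B), mistakes
```

On the two-pattern counter-example described above (x_a supported on the first B coordinates, x_b on the remaining ones), the truncated classifier scores at least one of the two patterns as exactly zero in every round, so each (x_a, x_b) pair incurs at least one mistake and the mistake rate never falls below roughly one half, illustrating why plain truncation cannot guarantee a small number of mistakes.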
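The excerpt states that the proposed OFS algorithm combines sparsity regularization with truncation, and Section 4.1.2 fixes λ = 0.01 and η = 0.2, but the full pseudocode falls on pages missing from this excerpt. The following is therefore only a plausible sketch of that recipe, not the authors' Algorithm: an ℓ2-regularized online gradient step on the hinge loss, a projection onto an L2 ball (radius 1/√λ is our assumption), and then the same hard truncation to B non-zero weights.

```python
import numpy as np

def truncate(w, B):
    """Zero all but the B largest-magnitude entries of w."""
    w_B = np.zeros_like(w)
    kept = np.argsort(np.abs(w))[-B:]
    w_B[kept] = w[kept]
    return w_B

def ofs_sketch(stream, d, B, eta=0.2, lam=0.01):
    """Hypothetical OFS update: gradient step on the regularized hinge
    loss when the margin is violated, L2-ball projection, then truncation
    so that at most B weights stay non-zero."""
    w = np.zeros(d)
    radius = 1.0 / np.sqrt(lam)               # assumed projection radius
    for x, y in stream:
        if y * (w @ x) < 1:                   # margin violated: update
            w = (1.0 - lam * eta) * w + eta * y * x
            norm = np.linalg.norm(w)
            if norm > radius:                 # project back onto the L2 ball
                w *= radius / norm
            w = truncate(w, B)                # enforce the hard sparsity budget
    return w
```

By construction, the returned classifier always satisfies the paper's hard constraint ‖w‖_0 ≤ B and stays inside the assumed projection ball; the regularization shrinks stale weights so that the truncation step retains the coordinates that are consistently useful, which is the behavior the counter-example shows plain truncation lacks.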
