Professional English Translation
Name: Wang Guanghua (王光华)  2011310687  Group 3

4.4 Indexing

An indexing scheme organizes data efficiently for quick retrieval in large databases. Most of the solutions presented involve a dimensionality reduction in order to index this representation using a spatial access method. Several studies suggest that the various representations differ only slightly in terms of indexing power (Keogh and Kasetty 2003). However, wider differences arise concerning the quality of results and the speed of querying. There are two main issues when designing an indexing scheme: completeness (no false dismissals) and soundness (no false alarms).

[Page footer: ACM Computing Surveys, Vol. 45, No. 1, Article 12, Publication date: November 2012. Running head: Time Series Data Mining, 12:21.]

Table I. Comparison of the Distance Measures Surveyed in This Article with the [...] of Robustness (table omitted here)

Each distance measure is thus distinguished as scale (amplitude), warp (time), noise, or outlier robust. The next column shows whether the proposed distance is a metric. The cost is given as a simplified factor of computational complexity. The last column gives the minimum number of parameter settings required by the distance measure.

In an early paper, Faloutsos et al. [1994] list the properties required for indexing schemes:
1. It should be much faster than sequential scanning.
2. The method should require little space overhead.
3. The method should be able to handle queries of various lengths.
4. The method should allow insertions and deletions without rebuilding the index.
5. It should be correct, that is, there should be no false dismissals.

As noted by Keogh et al. [2001b], there are two additional desirable properties:
1. It should be possible to build the index within "reasonable" time.
2. The index should be able to handle different distance measures.

A time series X can be considered as a point in an n-dimensional space.
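The properties listed above can be made concrete with a small sketch. It treats each length-n series as a point in n-dimensional space, answers a range query by sequential scan, and then answers the same query by filtering in a reduced space before refining with the true distance. The reduction used here, piecewise aggregate approximation (PAA), is an assumption chosen for illustration; the text does not prescribe a particular reduction. All function names are hypothetical. Because the scaled PAA distance never exceeds the Euclidean distance, the filter step produces no false dismissals (completeness), and the refine step removes any false alarms (soundness).

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length series viewed as points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa(x, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame.

    Assumes len(x) is divisible by `segments` (an illustrative simplification).
    """
    width = len(x) // segments
    return [sum(x[i * width:(i + 1) * width]) / width for i in range(segments)]


def paa_lower_bound(px, py, width):
    """Distance in the reduced space, scaled by sqrt(width) so that it lower-bounds the Euclidean distance."""
    return math.sqrt(width) * euclidean(px, py)


def sequential_scan(database, query, radius):
    """Baseline: compare the query against every series at full dimensionality."""
    return [i for i, s in enumerate(database) if euclidean(s, query) <= radius]


def filter_and_refine(database, reduced, query, radius, segments):
    """Range query using precomputed PAA vectors as a cheap filter.

    Filter: keep series whose lower bound is within the radius (no false dismissals).
    Refine: check the true distance on the survivors (removes false alarms).
    """
    width = len(query) // segments
    reduced_query = paa(query, segments)
    candidates = [
        i for i, p in enumerate(reduced)
        if paa_lower_bound(p, reduced_query, width) <= radius
    ]
    return [i for i in candidates if euclidean(database[i], query) <= radius]


if __name__ == "__main__":
    random.seed(0)
    database = [[random.random() for _ in range(32)] for _ in range(200)]
    query = [random.random() for _ in range(32)]
    reduced = [paa(s, 4) for s in database]  # 32 dimensions reduced to 4
    exact = sequential_scan(database, query, 2.2)
    indexed = filter_and_refine(database, reduced, query, 2.2, 4)
    print(exact == indexed)
```

In this sketch the reduced vectors are still scanned linearly; in an actual index they would be the low-dimensional points stored in a spatial access structure, which is where the speed-up over sequential scanning would come from.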
This point-in-space view immediately suggests that time series could be indexed by Spatial Access Methods (SAMs). These partition space into regions along a hierarchical structure for efficient retrieval. B-trees (Bayer and McCreight 1972), on which most hierarchical indexing structures are based, were originally developed for one-dimensional data. They use prefix separators, so no overlap is guaranteed for unique data objects. Multidimensional indexing structures, such as the R-tree (Beckmann et al. 1990), use data organized in Minimum Bounding Rectangles (MBRs). However, when summarizing data in minimum bounding regions, the sequential nature of time series cannot be captured. Their main shortcoming is that a wide MBR produces large overlap with a majority of empty space. Queries therefore intersect with many of these MBRs.

Typical time series contain over a thousand datapoints, and most SAM approaches are known to degrade quickly at dimensionality greater than 8 to 12 (Chakrabarti and Mehrotra 1999). The degeneration with high dimensions caused by overlapping can result in having to access almost the entire dataset by random I/O. Therefore, any benefit gained from indexing is lost. As R-trees and their variants are victims of the phenomenon known as the "dimensionality curse" (Böhm et al. 2001), a solution for their usage is to first perform dimensionality reduction. The X-tree (extended node tree), for example, uses a different split strategy to reduce overlap (Berchtold et al. 2002). The A-tree (approximation tree) uses VA-file-style (vector approximation file) quantization of the data space to store both MBR and VBR (Virtual Bounding Rectangle) lower and upper bounds (Sakurai et al. 2000). The TV-tree (telescopic vector tree) is an extension of the R-tree. It uses minimum bounding regions (spheres, rectangles, or diamonds, depending on the type of Lp norm used) restricted to a subset of active dimensions.

However, not all methods rely on SAMs to provide efficient indexing. Park et al. [2000] proposed the use of suffix trees (Gusfield 1997) to index time series. The idea is that distance computation relies on comparing prefixes first, so it is possible to store every series with identical prefixes in the same nodes. The subtrees will therefore only contain the suffixes of the series. However, this approach seems hardly scalable for longer time series or more subtle notions of similarity.

In Faloutsos et al. [1994], the authors introduced the GEneric Multimedia INdexIng method (GEMINI), which can apply any dimensionality reduction method to produce efficient indexing. Yi and Faloutsos [2000] studied the problem of multimodal similarity search, in which users can choose between multiple similarity models depending on their needs. They introduced an indexing scheme for time series where the distance function can be any Lp norm. Only one index structure is needed for all Lp norms. To analyze the efficiency of indexing schemes, Hellerstein et al. [1997] considered the general problem of database indexing workloads (combinations of datasets and sets of potential queries). They defined a framework to measure the efficiency of an indexing scheme based on two characterizations: storage redundancy (how many times each item in the dataset is stored) and access overhead (how many unnecessary blocks are retrieved for a query).

For indexing purposes, envelope-style upper and lower bounds for DTW have been proposed (Keogh and Ratanamahatana 2005); the indexing procedure for short time series is efficient, but similarity search typically entails more page reads. This framework has been extended (Vlachos et al. 2006) in order to index multidimensional time series with DTW as well as LCSS. Assent et al. [2008] proposed the TS-tree, an indexing method offering efficient similarity search on time series. It avoids overlap and provides compact metadata information on the subtrees, thus reducing the search space. In Kontaki et al. [2007], the use of an Incremental DFT Computation index (IDC-Index) has been proposed to handle streams based on a deferred update policy and an incremental computation of the DFT at different update speeds. However, the maintenance of the R-tree for the whole streaming series might cause a constantly growing overhead, which could result in performance loss.

It is also possible to use indexing methods to speed up DTW calculation; however, this induces a trade-off between efficiency and I/O cost. Moreover, Shieh and Keogh [2008] recently showed that for datasets that are large enough, the benefits of using DTW instead of Euclidean distance are almost null: the larger the dataset, the higher the probability of finding an exact match for any time series. They proposed an extension of the SAX representation, called indexable SAX (iSAX), which allows time series to be indexed with zero overlap at leaf nodes.

5. RESEARCH TRENDS AND ISSUES

Time-series data mining has been an ever-growing and stimulating field of study that has continuously raised challenges and research issues over the past decade. In the following, we discuss open research issues and trends in time-series data mining for the next decade.

Stream analysis. Recent years of hardware and network research have witnessed an explosion of streaming technologies with the continuous advances of bandwidth capabilities. Streams are seen as continuously generated measurements that have to be processed at massive and fluctuating data rates. Analyzing and mining such data flows are computationally extreme tasks. Several papers review research issues for data stream mining (Gaber et al. 2005) or management (Golab and Özsu 2003). Algorithms designed for static datasets have usually not been sufficiently optimized to be capable of handling such continuous volumes of data. Many models have already been extended to control data streams, such as clustering (Domingos and Hulten 2000