OracleText使用小结.doc_第1页
OracleText使用小结.doc_第2页
OracleText使用小结.doc_第3页
OracleText使用小结.doc_第4页
OracleText使用小结.doc_第5页
已阅读5页,还剩11页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

Oracle Text使用小结一、Oracle Text介绍Oracle从7.3开始支持全文检索,即用户可以使用Oracle服务器的上下文(ConText)完成基于文本的查询(具体可采用通配符查找、模糊匹配、相关分类、近似查找、条件加权和词意扩充等方法);在Oracle 8.0.x中称为ConText ;在Oracle 8i中称为interMedia Text ; Oracle9i中称为Oracle Text。Oracle Text是9i标准版和企业版的一部分,Oracle9i将全文检索功能做为内置功能提供给用户,使得用户在创建数据库实例时自动安装全文检索。Oracle Text使Oracle 9i具备了强大的文本检索能力和智能化的文本管理能力。使用Oracle Text,可以方便而有效地利用标准的SQL工具来构建基于文本的新的开发工具或对现有应用程序进行扩展。应用程序开发人员可以在任何使用文本的Oracle数据库应用程序中充分利用Oracle Text搜索,应用范围可以是现有应用程序中可搜索的注释字段,也可是实现涉及多种文档格式(包括doc,excel,txt,pdf等)和复杂搜索标准的大型文档管理系统,还可是来自Internet和文件系统的文本数据搜索XML应用程序。Oracle Text支持Oracle数据库所支持的大多数语言的基本全文搜索功能。要使用Oracle Text,必须具有CTXAPP角色或者是CTXSYS用户。Oracle Text 为系统管理员提供CTXSYS用户,为应用程序开发人员提供CTXAPP角色。CTXSYS 用户可执行以下任务:启动Oracle Text服务器,执行CTXAPP角色的所有任务。具有CTXAPP角色的用户可执行以下任务:创建索引,管理Oracle Text数据字典,包括创建和删除首选项,进行Oracle Text查询,使用Oracle Text PL/SQL程序包。 二、Oracle Text索引a、Index简介:索引就是将文本打碎分成很多标记(token),这些标记通常是用空格分开的一个个单词。Oracle Text应用的实现可以理解为就是一个“装载数据 配置索引 索引数据执行检索 维护索引”的过程。Index的索引类型有:CONTEXT,CTXCAT,CTXRULE。下面就对索引做简单的描述:Index TypeApplication TypeQuery OperatorCONTEXTUse this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML, XML, or plain text. With a context index, you can customize your index in a variety of ways.CONTAINSCTXCATUse this index type to index small text fragments such as item names, prices and descriptions that are stored across columns. With this index, query performance is improved for mixed queries.CATSEARCHCTXRULEUse a CTXRULE index to build a document classification application. The CTXRULE index is an index created on a table of queries, where each query has a classification. Single documents (plain text, HTML, or XML) can be classified using the MATCHES operator.MATCHES最常用的就是 CONTEXT索引,使用最通用的CONTAINS操作符进行查询。b、CONTEXT 索引Oracle Text CONTEXT 索引是反向索引(inverted index),每个标记 (token)都映射着包含它自己的文本位置。在索引建好后,可以查到Oracle自动产生的表(假设索引名为myindex):DR$myindex$I、DR$myindex$K、DR$myindex$R、DR$myindex$N,其中以I表最重要,该表保存的是Oracle 分析文档后生成的token记录,包括token出现的位置、次数、hash值等。包括一下参数:Datastore Types,Filter Types,Lexer Types,Wordlist Type,Storage Types,Section Group Types,Stoplists,System-Defined Preferences,System Parameters,每个参数设置的目的是:Preference ClassAnswers the QuestionDatastoreHow are your documents stored?FilterHow can the documents be converted to plain text?LexerWhat language is being indexed?WordlistHow should stem and fuzzy queries be expanded?StorageHow should the index tables be stored?Stop ListWhat words or themes are not to be indexed?Section GroupIs querying within sections enabled, and how are the document sections defined?下面就对每个参数包含的值、值的意义及目的做简单描述:1、Datastore TypesDatastoreType Use WhenDIRECT_DATASTOREData is stored internally in the text column. Each row is indexed as a single document.MULTI_COLUMN_DATASTOREData is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one per row.DETAIL_DATASTOREData is stored internally in the text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.FILE_DATASTOREData is stored externally in operating system files. Filenames are stored in the text column, one per row.NESTED_DATASTOREData is stored in a nested table.URL_DATASTOREData is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.USER_DATASTOREDocuments are synthesized at index time by auser-defined stored procedure.2、Filter TypesFilter Preference typeDescriptionCHARSET_FILTERCharacter set converting filterINSO_FILTERInso filter for filtering formatted documentsNULL_FILTERNo filtering required. Use for indexing plain text, HTML, or XML documentsUSER_FILTERUser-defined external filter to be used for custom filteringPROCEDURE_FILTERUser-defined stored procedure filter to be used for custom filtering.3、Lexer TypestypeDescriptionBASIC_LEXERLexer for extracting tokens from text in languages, such as English and most western European languages that use white space delimited words.MULTI_LEXERLexer for indexing tables containing documents of different languagesCHINESE_VGRAM_LEXERLexer for extracting tokens from Chinese textJAPANESE_VGRAM_LEXERLexer for extracting tokens from Japanese text.JAPANESE_LEXERLexer for extracting tokens from Japanese text.KOREAN_LEXERLexer for extracting tokens from Korean text.KOREAN_MORPH_LEXERLexer for extracting tokens from Korean text (recommended).basic_lexer,针对英语;chinese_vgram_lexer,专门的汉语分析器,支持所有汉字字符集;chinese_lexer,这是一个新的汉语分析器,只支持utf8字符集(也支持zhs16gbk字符集)。4、Wordlist TypeUse the wordlist preference to enable the query options such as stemming, fuzzy matching for your language. You can also use the wordlist preference to enable substring and prefix indexing which improves performance for wildcard queries with CONTAINS and CATSEARCH.To create a wordlist preference, you must use BASIC_WORDLIST, which is the only type available.AttributeAttribute ValuesstemmerSpecify which language stemmer to use. You can specify one of the following: NULL (no stemming), ENGLISH (English inflectional), DERIVATIONAL (English derivational), DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, AUTO (automatic language-detection for stemming)fuzzy_matchSpecify which fuzzy matching cluster to use. You can specify one of the following: GENERIC, JAPANESE_VGRAM, KOREAN, CHINESE_VGRAM, ENGLISHDUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, OCRAUTO (automatic language detection for stemming) fuzzy_scoreSpecify a default lower limit of fuzzy score. Specify a number between 0 and 80. Text with scores below this number is not returned. Default is 60.fuzzy_numresultsSpecify the maximum number of fuzzy expansions. Use a number between 0 and 5,000. Default is 100.substring_indexSpecify TRUE for Oracle to create a substring index. A substring index improves left-truncated and double-truncated wildcard queries such as %ing or %benz%. Default is FALSE. prefix_indexSpecify YES to enable prefix indexing. Prefix indexing improves performance for right truncated wildcard searches such as TO%. Defaults to NO.prefix_length_minSpecify the minimum length of indexed prefixes. Defaults to 1. prefix_length_maxSpecify the maximum length of indexed prefixes. Defaults to 64. wlidcard_maxtermsSpecify the maximum number of terms in a wildcard expansion. Use a number between 1 and 15,000. Default is 5,000. 5、Storage TypesUse the storage preference to specify tablespace and creation parameters for tables associated with a Text index. The system provides a single storage type called BASIC_STORAGE:typeDescriptionBASIC_STORAGEIndexing type used to specify the tablespace and creation parameters for the database tables and indexes that constitute a Text index.BASIC_STORAGE has the following attributes:AttributeAttribute Valuei_table_clauseParameter clause for dr$indexname$I table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.The I table is the index data table.k_table_clauseParameter clause for dr$indexname$K table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.The K table is the keymap table.r_table_clauseParameter clause for dr$indexname$R table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.The R table is the rowid table.The default clause is: LOB(DATA) STORE AS (CACHE)n_table_clauseParameter clause for dr$indexname$N table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.The N table is the negative list table.i_index_clauseParameter clause for dr$indexname$X index creation. Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The default clause is: COMPRESS ep_table_clauseParameter clause for the substring index if you have enabled SUBSTRING_INDEX in the BASIC_WORDLIST.Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement. The P table is an index-organized table so the storage clause you specify must be appropriate to this type of table.6、Section Group TypesSection Group PreferenceDescriptionNULL_SECTION_GROUPUse this group type when you define no sections or when you define only SENTENCE or PARAGRAPH sections. This is the default.BASIC_SECTION_GROUPUse this group type for defining sections where the start and end tags are of the form and .HTML_SECTION_GROUPUse this group type for indexing HTML documents and for defining sections in HTML documents.XML_SECTION_GROUPUse this group type for indexing XML documents and for defining sections in XML documents.AUTO_SECTION_GROUPUse this group type to automatically create a zone section for each start-tag/end-tag pair in an XML document. The section names derived from XML tags are case sensitive as in XML.Attribute sections are created automatically for XML tags that have attributes. Attribute sections are named in the form attributetag.Stop sections, empty tags, processing instructions, and comments are not indexed.The following limitations apply to automatic section groups: You cannot add zone, field, or special sections to an automatic section group. Automatic sectioning does not index XML document types (root elements.) However, you can define stop sections with document type. The length of the indexed tags, including prefix and namespace, cannot exceed 64 characters. Tagslonger than this are not indexed.PATH_SECTION_GROUPUse this group type to index XML documents. Behaves like the AUTO_SECTION_GROUP. The difference is that with this section group you can do path searching with the INPATH and HASPATH operators. Queries are also case-sensitive for tag and attribute names. Stop sections are not allowed.NEWS_SECTION_GROUPUse this group for defining sections in newsgroup formatted documents according to RFC 1036.7、StoplistsStoplists identify the words in your language that are not to be indexed. In English, you can also identify stopthemes that are not to be indexed. By default, the system indexes text using the system-supplied stoplist that corresponds to your database language.Oracle Text provides default stoplists for most languages including English, French, German, Spanish, Dutch, and Danish. These default stoplists contain only stopwords.c、Query OperatorsOperatorDescriptionABOUTABOUT query increases the number of relevant documents returned from the same query.ACCUMulate (,)Search for documents that contain at least one occurrence of any of the query terms.AND (&)Search for documents that contain at least one occurrence of each of the query terms.Broader Term (BT, BTG, BTP,BTI)Expand a query to include the term that has been defined in a thesaurus as a broader or higher level term.EQUIValence (=)Specifies an acceptable substitution for a word in a query.FuzzyExpands queries to include words that are spelled similarity to the specified term.HASPATHFind all XML documents that contain a specified section path.INPATHPath searching in XML documents.MINUS (-)Search for documents that contain one query term and you want the presence of a second query term to cause the document to be ranked lower.Narrow Term (NT, NTG, NTP,NTI)Expands a query to include all the terms that have been defined in a thesaurus as the narrower or lower level terms for a specified term.NEAR (;)Returns a score based on the proximity of two or more query terms.NOT ()Search for documents that contain one query term and not another.OR (|)Search for documents that contain at least one occurrence of any of the query terms.Preferred Term (PT)Replaces a term in a query with the preferred term that has been defined in a thesaurus for the term.Related Term (RT)Expands a query to include all related terms that have been defined in a thesaurus for the term.Soundex (!)Expands queries to include words that have similar sounds; that is, words that sound like other words.Stem ($)Search for terms that have the same linguistic root as the query term.Stored Query Expression (SQE)Calls a stored query expression created with the CTX_QUERY.STORE_SQE procedure.SYNonym (SYN)Expands a query to include all the terms that have been defined in a thesaurus as synonyms for the specified term.Threshold ()This operator at the expression level eliminates documents in the result set that score below a threshold number. This operator at the query term level selects a document based on how a term scores in the document.Translation Term (TR)Expands a query to include all defined foreign language equivalent terms.Translation Term Synonym(TRSYN)Expand a query to include all the defined foreign equivalents of the query terms, the synonyms of query term, and the foreign equivalents of the synonyms.Top Term (TT)Replaces a term in a query with the top term that has been defined for the term in the standard hierarchy in a thesaurus.Weight (*)Multiplies the score by the given factor, topping out at 100 when the score exceeds 100.三、Oracle Text示例对Oracle Text技术主要验证了文本存储在数据库中(对varchar2、blob字段),本地文件的检索(txt文件、pdf文件、doc文件、xls文件、ppt文件),网页的文本检索(网页、pdf文件的网页)。a、对于文本存储在数据库中:可以是CLOB, BLOB, BFILE, VARCHAR2, or CHAR类型的文本数据。1、查看Oracle支持的中文字符集:select * from V$NLS_PARAMETERS; / 或者:select * from nls_database_parameters; /2、创建中文语义的定义:Begin ctx_ddl.create_preference(lexer_cn,chinese_lexer); end; /删除定义:beginctx_ddl.drop_preference(chinese_lexer); end;/3、创建中文语义的索引:create index idx_ctx_fainfo on fainfo(faname) indextype is ctxsys.context parameters(lexer lexer_cn); /临时表空间要求建立Oracle Text索引需要消耗CTXSYS用户默认的临时表空间。如果空间不够的话将导致ORA-01652错误。你可以扩展CTXSYS的临时表空间,而不是发出命令的用户默认的临时表空间。对于索引全英文的文本列来说,需要临时表空间大小通常是其文本数据量的50%-200%不等。而对索引包含中文文本列来说需要的表空间会更多。索引重建:ALTER INDEX idx_ctx_fainfo REBUILD/若用新的参数重新建立索引:ALTER INDEX newsindex REBUILD PARAMETERS(replace lexer my_lexer);/删除索引:drop index idx_ctx_fainfo; /强制删除索引:DROP INDEX idx_ctx_fainfo FORCE;/4、索引的同步优化:全文检索中用于更新DR$ idx_ctx_fainfo$I、DR$ idx_ctx_fainfo$K、DR$ idx_ctx_fainfo$R和DR$ idx_ctx_fainfo$N表中的记录:begin ctx_ddl.sync_index(idx_ctx_fainfo);end synchronize_index;/优化索引:begin ctx_ddl.optimize_index(idx_ctx_fainfo, FULL);end;/ 可以单独优化经常被检索或经常被更新的标记,这样可以提高查询这个token的查询效率:BEGIN ctx_ddl.optimize_index (idx_ctx_fainfo, token, token = 计算机);END;/ 可以采用FAST MODE方法使碎片行紧凑,但是旧的数据并不从索引中删除:BEGIN ctx_ddl.optimize_index (idx_ctx_fainfo, fast);END;/ 5、查看索引错误:SELECT err_timestamp, err_text FROM ctx_user_index_errorsORDER BY err_timestamp DESC;/ 清除错误视图:DELETE FROM ctx_user_index_errors;/6、设置job,定时更新和同步优化全文检索-synccreate or replace procedure synchronize_indexisbegin ctx_ddl.sync_index(idx_ctx_fainfo);end synchronize_index;/-optimizecreate or replace procedure optimize_indexisbegin ctx_ddl.optimize_index(idx_ctx_fainfo, FULL);end optimize_index; /declare job number;begin dbms_job.submit(job=job, what=synchronize_index;, next_date=sysdate, interval=SYSDATE+1/48); commit; dbms_output.put_line(job |job| has been submitted.); dbms_job.submit(job=job, what=optimize_index;, next_date=sysdate, interval=SYSDATE+1/12); commit; dbms_output.put_line(job |job| has been submitted.);end;7、索引查询:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑) 0;/Or | 如果任一搜索项得分超过阀值:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 or 计算机) 0;/And & 如果多个搜索项得分都超过阀值:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 and 计算机) 0;/Accum , 如果多个搜索项得分累加超过阀值:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 accum 计算机) 9;/Minus - 如果第一个搜索项得分减去第二个搜索项得分的值超过阀值:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 minus 计算机) 6;/* 给搜索得分指定不同的权重:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 * 3 - 计算机 * 2) 12;/Near ; 得分值依赖于搜索项在被搜索文本中互相接近的程度:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 near 计算机) 12;/Equiv = 在搜索记分中将两个词同等对待:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 电脑 equiv 计算机) 12;/ 保留字是搜索项的组成部分的时候:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, as) 0;/_ 多字通配符:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, P4%) 0;/_ 单字通配符:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, 计算机_) 0;/$ 词根扩展搜索的符号:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, $计算机) 0;/? 模糊匹配搜索的符号:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, ?计算机) 0;/! 发音相近搜索的符号:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, !计算机) 0;/About 根据查询的概念返回文档,不是仅依据指定的精确单词或短语:SELECT noid,facode,fanameFROM fainfoWHERE contains (faname, about(计算机) 0;b、对于本地文件的检索:Oracle支持对文本文件Txt,Word文档,Excel表格,PowerPoint,PDF的文本检索。1、建立存储选项参数:BEGINctx_ddl.create_preference (file_pref, FILE_DATASTORE);ctx_ddl.set_attribute (file_pref, path, c:TEMP);END;/删除自定义存储选项参数:BEG

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论