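Analysis and Extension of the Lucene Full-Text Search System

【Abstract (Chinese version)】 Full-text search is one of the most common information-retrieval applications. Every day people use search engines such as Google and Baidu to find the information they need, and full-text search is one of the core technologies of these engines. Lucene is a member project of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit that makes it easy to add full-text search to a target system, or to build a complete full-text search system on top of it. Lucene supports searching only in two Western languages, English and German, and has no Chinese search capability, so a Chinese search module is indispensable when developing a Lucene-based full-text search system.

To segment words more accurately while avoiding ambiguity, this paper improves on the currently popular statistics-based segmentation method, using dictionary training to handle some ambiguous words and to segment out-of-vocabulary words. The algorithm is built on a custom dictionary that is not a dictionary in the traditional sense of mechanical segmentation. Within a text, the more often two characters occur in a given order, the more likely they are to form a word. We therefore define a statistical dictionary built from the statistics and analysis of a large corpus. Its entries are not words in the usual sense but the “bonding degree” between two adjacent characters: the higher the bonding degree, the higher the probability that the two characters form a word.

Lucene's core is designed to be very small, and it processes only plain-text data. This paper therefore establishes a common interface and develops a unified processing framework that can index documents in multiple formats. Through this framework the content of various documents is indexed and added to the index database, giving the full-text search system a unified way to handle multiple document formats.

【Abstract (English version)】 In the early days of the Internet there were few sites and information was easy to find. As the Internet developed, the number of sites grew and finding information became more difficult. Search engines were created to meet this need for information retrieval.

Full-text search is one of the most widespread information applications in everyday use. People find the information they need through Google, Baidu and other search engines, and full-text search is one of the core technologies of these engines. In this paper, full-text search refers to retrieving electronic data of various kinds, such as text, sound and images, according to the content of the data rather than its external characteristics.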
By building search conditions from the user's queries, it helps people organize and manage large numbers of documents, so that they can find the information they need quickly and easily.

Full-text search software is relatively mature and widely used abroad, but Western text is very different from Chinese text, so foreign full-text search software is not suitable for Chinese users. Some Chinese databases do offer full-text search, but in essence they search structured data held in a relational database, such as title, author, keywords and abstract, and then reach the full text through a link. Genuine Chinese full-text search engines are rare.

Lucene is a project of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit. That is, it is not a complete full-text search engine but an open-source framework written in Java that provides data access and management through simple interfaces. It can easily be embedded in applications to give them full-text search. Lucene aims to provide a simple, easy-to-use toolkit for adding full-text search to a target system, or for serving as the basis of a complete full-text search system.

Lucene supports searching only in English and German, two Western languages. Chinese search is not included.
Therefore, a Chinese search module is necessary when developing a Lucene-based full-text search system.

The main problems in Chinese word segmentation are the following. The segmentation standard: a normative definition of what counts as a word unit. The segmentation algorithm: how to cut text so that the word boundaries reflect the actual meaning. Segmentation ambiguity: which methods to use to eliminate ambiguous readings. Unknown-word recognition: how to handle words that are not in the dictionary, such as place names, personal names and transliterated terms. There are currently three main approaches to Chinese word segmentation: mechanical methods, statistical methods and understanding-based methods.

The mechanical method is not prone to ambiguity, but it depends on a “sufficiently large” dictionary. New things keep emerging in modern society and bring new vocabulary with them, and foreign words are continually introduced through translation and transliteration, so keeping the dictionary up to date is an expensive undertaking. Because of the complexity and difficulty of capturing general knowledge of the Chinese language, understanding-based segmentation systems are still at the experimental stage.

To achieve more accurate segmentation while avoiding ambiguity, I improve the popular statistics-based segmentation method, using dictionary training to handle some ambiguous words and to segment unknown words. The algorithm is built on a custom dictionary, which is not a dictionary in the traditional sense of mechanical segmentation.
Within a text, the more frequently two Chinese characters appear together, the more likely they are to form a word, so we define a dictionary based on the “bonding” between two adjacent characters.

To make it convenient to embed, the Lucene core has been designed to be very small, and it is limited to processing plain text. With the development of computer applications and networks, plain text is no longer the mainstream format. A variety of file formats are used in all walks of life, such as Microsoft's Word, Excel and PowerPoint formats, and a full-text search engine has to deal with documents saved in these different formats.

I therefore establish a common interface through which a variety of file formats can be indexed. Through this interface, documents in different file formats can be added to the index database, so that the system hides the differences between document file formats from its users.

【Keywords】 full-text search engine; Lucene; Chinese search; document format handling

【Contents】
Summary of “Analysis and Extension of the Lucene Full-Text Search System”
Chapter 1 Introduction
1.1 Research background
1.2 Full-text search technology and its research significance
1.3 Research and application status of full-text search technology
1.4 Work of this paper
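
The “bonding” dictionary described in the abstract can be illustrated with a small sketch. The abstract does not give the actual bonding formula, the training procedure or the cut rule, so everything below is an assumption made for illustration: bonding is taken to be the raw count of an ordered pair of adjacent characters in a toy corpus, and a boundary is placed wherever that count falls below a fixed threshold. The names `train_bonding`, `segment` and `threshold` are hypothetical, and the sketch is in Python for brevity even though Lucene itself is written in Java.

```python
from collections import Counter


def train_bonding(corpus):
    """Build a statistical dictionary: ordered adjacent character pair -> count.

    Assumption: "bonding" is approximated by the raw co-occurrence count.
    """
    pair_counts = Counter()
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            pair_counts[left + right] += 1
    return pair_counts


def segment(text, pair_counts, threshold=2):
    """Split text between adjacent characters whose bonding is below threshold.

    Assumption: a fixed threshold decides whether two characters stay together.
    """
    if not text:
        return []
    words = [text[0]]
    for left, right in zip(text, text[1:]):
        if pair_counts[left + right] >= threshold:
            words[-1] += right      # strong bonding: extend the current word
        else:
            words.append(right)     # weak bonding: start a new word
    return words


# Toy training corpus (hypothetical, far smaller than a real one).
corpus = ["全文检索", "检索系统", "全文数据", "系统检索", "数据检索"]
bonding = train_bonding(corpus)

print(segment("全文检索系统", bonding))  # ['全文', '检索', '系统']
```

Because the dictionary stores character pairs rather than whole words, a pair that was never listed as a word can still be kept together if it occurs often enough in the training corpus, which is how a scheme of this kind can pick up unknown words.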

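The common interface for indexing several file formats can be sketched in the same hedged way. The abstract does not describe the framework's actual classes, so `DocumentHandler`, `PlainTextHandler`, `HtmlHandler` and `IndexingFramework` are hypothetical names, none of them Lucene APIs, and a plain dictionary stands in for the index database. The idea shown is only that each format gets a handler that reduces it to plain text, and that the framework chooses the handler from the file extension.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from pathlib import PurePath


class DocumentHandler(ABC):
    """Hypothetical common interface: reduce one file format to plain text."""

    @abstractmethod
    def extract_text(self, data: bytes) -> str:
        ...


class PlainTextHandler(DocumentHandler):
    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8")


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class HtmlHandler(DocumentHandler):
    def extract_text(self, data: bytes) -> str:
        collector = _TextCollector()
        collector.feed(data.decode("utf-8"))
        collector.close()
        return " ".join(c.strip() for c in collector.chunks if c.strip())


class IndexingFramework:
    """Dispatches each document to the handler registered for its extension."""

    def __init__(self):
        self._handlers = {}
        self.index = {}  # stand-in for the index database: name -> plain text

    def register(self, extension, handler):
        self._handlers[extension.lower()] = handler

    def add_document(self, name, data):
        extension = PurePath(name).suffix.lower()
        handler = self._handlers.get(extension)
        if handler is None:
            raise ValueError(f"no handler registered for {extension!r}")
        self.index[name] = handler.extract_text(data)


framework = IndexingFramework()
framework.register(".txt", PlainTextHandler())
framework.register(".html", HtmlHandler())
framework.add_document("note.txt", "全文检索".encode("utf-8"))
framework.add_document("page.html", b"<p>full-text <b>search</b></p>")
print(framework.index)
```

Supporting a further format, such as Word or Excel, would then mean registering one more handler, while the indexing code and the index database stay unchanged.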