Information Retrieval (Semi-Open-Book Exam): Compiled Notes

Chapter 1

Key terms: information retrieval, IR (信息检索); query (查询); data retrieval (数据检索); literature (文献); information (信息); knowledge (知识); data (数据); relevance (相关性); push (推送); hyperspace (超空间); user task (用户任务); logical view of the document (文献逻辑表示/视图); retrieval task (检索任务); browsing (浏览); retrieval (检索); pulling (拉出); filtering (过滤); full text (全文本); stop word (停用词); stemming (词干提取); text operation (文本操作); indexing term (标引词); index (索引); retrieval strategy (信息检索策略); optical character recognition, OCR (光学字符识别); scanning (扫描); user need (用户需求); cross-language (跨语言); inverted file (倒排文档); query operation (查询操作); likelihood (相关度); retrieved document (检出文献); user feedback (用户反馈); human-computer interaction, HCI (信息检索的人机交互界面); textual images (文本图像); retrieval model and evaluation (检索模型与评价); interface and visualization (界面与可视化); multimedia modeling and searching (多媒体建模与检索); bibliographic system (书目系统); digital library (数字图书馆); modeling (建模); retrieval evaluation (检索评价); query language (查询语言); Standard Generalized Markup Language, SGML (标准通用标记语言); text language (文本语言); indexing and searching (标引和检索); parallel and distributed IR (并行和分布式信息检索); user interface (用户界面); navigation (导航); visualization (可视化); model and query language (模型与查询语言); efficient indexing and searching (有效标引与检索).

1. Definition. Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. The focus is on the user's information need.
2. What is the primary goal of an IR system? To retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.

1. What is the definition of retrieval performance evaluation? Information retrieval systems require an evaluation of how precise the answer set is. This type of evaluation is referred to as retrieval performance evaluation.
2. What is retrieval performance evaluation usually based on? A test reference collection (测试参考集) and an evaluation measure (评价测度).
3. What are the most used retrieval evaluation measures? Recall (查全率) and precision (查准率).
4. What does the test reference collection consist of? A collection of documents (文献集), a set of example information requests (信息查询实例), and a set of relevant documents (provided by specialists) for each example information request.
5. Calculate R (recall ratio), P (precision ratio), O (omission ratio), and M (miss ratio):
   R = (relevant items retrieved / total relevant items in the system) × 100%
   O = 100% - R
   P = (relevant items retrieved / total items retrieved) × 100%
   M = 100% - P

1. The most common measures of system performance are time and space, especially in systems for data retrieval.
2. The retrieval task could also comprise a combination of these two strategies: batch and interactive (批处理和交互查询).
3. Recall is the fraction of the relevant documents (the set R) that has been retrieved, i.e., $\mathrm{Recall} = |R_a| / |R|$. Precision is the fraction of the retrieved documents (the set A) that is relevant, i.e., $\mathrm{Precision} = |R_a| / |A|$.
4. Alternative measures (可供选择的测度方法), where $r(j)$ and $p(j)$ are the recall and precision for the j-th document in the ranking and $b$ is a user-specified parameter: (1) the harmonic mean (调和平均值), $F(j) = \dfrac{2}{\frac{1}{r(j)} + \frac{1}{p(j)}}$; (2) the E measure (E指标), $E(j) = 1 - \dfrac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{p(j)}}$; [...]

[...] information need.
3. Many models (such as the vector model) are completely structured on the concept of words, and words are the only type of queries allowed.
4. A phrase is a sequence of single-word queries. A more relaxed version of the phrase query is the proximity query.
5. A Boolean query has a syntax composed of atoms (i.e., basic queries) that retrieve documents, and of Boolean operators that work on their operands (操作数), which are sets of documents, and deliver sets of documents.
6. Precedence of Boolean operators: NOT, AND, OR, BUT.
7. A fuzzy Boolean (模糊布尔查询) set of operators has been proposed.
The idea is that the meaning of AND and OR can be relaxed.
8. Advantage of weighting: the algorithms for this model are totally different from those based on searching patterns (it is even possible that not every text word needs to be searched, but only a small set of hopefully representative keywords extracted from each document).
9. The most used types of patterns are: words, prefixes, suffixes, substrings, ranges, allowing errors, and regular expressions (正则表达式).
10. Regular expressions are built from union (并集), (e1 | e2); concatenation (连接), (e1 e2); and repetition (重复), (e*).
11. Fixed structure (固定式结构). Advantage: this model is reasonable when the text collection has a fixed structure. Disadvantage: the model is inadequate to represent hierarchical structure, which the models that follow can represent.
12. A hypertext is a directed graph (有向图) where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes.
13. An intermediate structuring model that lies between fixed structure and hypertext is the hierarchical structure (层次结构). It represents a recursive decomposition (递归分解) [...]

Key terms: Dublin Core Metadata Element Set (都柏林核心元数据集); MARC, Machine Readable Cataloging Record (机读目录记录); RDF, Resource Description Framework (资源描述框架); XML, eXtensible Markup Language (可扩展标记语言); HTML, HyperText Markup Language (超文本标记语言); entropy (熵).

1. What is a document? Give some examples. Definition: a single unit of information. A document is loosely defined; it may be a complete logical unit (research paper, book, manual), a part of a larger text (paragraph, passage, an entry in a dictionary), or a physical unit (file, email, Web page).
2. What characteristics does a document have? First, a document has a given syntax and structure (语法和结构), which is usually dictated by the application or by the person who created it.
Second, it also has a semantics, specified by the author of the document. Third, a document may have a presentation style associated with it, which specifies how it should be displayed or printed. Such a style is usually given by the document syntax and structure and is related to a specific application (for example, a Web browser).
3. What is the definition of metadata? Metadata is information on the organization of the data, the various data domains, and the relationships between them. In short, metadata is data about the data.
4. Which types does metadata include? The classification of metadata, the Dublin Core metadata, the Machine Readable Cataloging Record, Web metadata, and other uses of metadata.
5. Say something about the application of metadata on the Web. Metadata information on Web documents [...]

17. Each instance of SGML includes a description of the document structure called a document type definition, DTD (文献类型定义). Role of the DTD: it is used to describe and name the pieces that a document is composed of and to define how those pieces relate to each other.
18. Multimedia includes images, audio, and video, as well as other binary data (二进制数据).
19. To improve compression ratios for higher resolutions, lossy compression (有损压缩法) was developed.
20. Formats for images: XBM, BMP, PCX, GIF, JPEG (lossy compression), TIFF (Tagged Image File Format), TGA, and PNG.
21. A particular class of images that is very important in office systems, multimedia retrieval, and digital libraries is images of documents that contain mainly typed or typeset text. These are called textual images.
22. The Virtual Reality Modeling Language, VRML (虚拟现实建模语言; ISO/IEC 14772-1), is a file format for describing interactive 3D objects and worlds.

Chapter 7

Key terms: separator (分隔符); hyphen (连字符); list of stopwords (排除表); stemming (词干提取); treasury of words (词库); controlled vocabulary (受控词汇表); text compression (文本压缩); noise (噪声); lexical analysis (词汇分析); elimination of stopwords (排除停用词).

1. What are text operations? Various text transformation techniques, which we simply call text operations.
2. What is the function of text compression? The gain obtained from compressing text is that it requires less storage space and takes less time to be transmitted over a communication [...]
[...] collection to index its documents generates too much noise (噪声) for the retrieval task. One way to reduce this noise is to reduce the set of words that can be used to refer to (i.e., to index) documents.
3. Document preprocessing comprises five text operations: (1) lexical analysis of the text; (2) elimination of stopwords; (3) stemming; (4) selection of index terms; (5) construction of term categorization structures.
4. The main components of a thesaurus (叙词表) are its index terms, the relationships among the terms, and a layout design for these term relationships.
5. The set of terms related to a given thesaurus term is mostly composed of synonyms (同义词) and near-synonyms (近义词).
In addition to these, relationships can be induced by patterns of co-occurrence within documents.
6. The nature of a thesaurus: a thesaurus is a classification scheme composed of words and phrases whose organization aims at facilitating the expression of ideas in written text.
7. Text normalization (文本规范化) and the building of a thesaurus (叙词表的建立) are strategies aimed at improving the precision (查准率) of the documents retrieved.
8. Indexing all the words in the text (全文标引): the idea is that, despite a noisier index, the retrieval task is simpler (it can be interpreted as a full-text search) and more intuitive to a common user.
9. In a global clustering strategy, the documents are grouped according to their occurrence in the whole collection. In a local clustering strategy, the grouping of documents is affected by the context defined by the current query and its local set of retrieved documents.

[...] viewing along with other kinds of surrogate information (such as document title and abstract).
1. Benefits: pictures and graphics can be captivating and appealing, especially if well designed. A visual representation can communicate some kinds of information much more rapidly and effectively than any other method.
2. How should a user interface be evaluated? By precision and recall? Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems.
3. Search interfaces must provide users with good ways to get started. There are four main types of starting points: lists, overviews, examples, and automated source selection.
4. An overview can show the topic domains represented within the collections, to help users select or eliminate sources from consideration.
An overview can help users get started, directing them into general neighborhoods, after which they can navigate using more detailed descriptions.
5. Another way to help users get started is to start them off with an example of interaction with the system. This technique is also known as retrieval by reformulation.
6. Automated source selection, three approaches: An ambitious approach is to build a model of the source and of the information need of the user and try to determine which fit together best. A simpler alternative is to create a representation of the contents of information sources and match this representation against the query specification. The flip side of automatically selecting the best source for a query is to automatically send a query to multiple sources and then combine the results from the various systems in some way. Many metasearch engines exist on the Web.

[...] as possible.
3. What is the difference between IR and DR?
   Data retrieval: which documents contain a set of keywords; well-defined semantics (语义学); a single erroneous object implies failure.
   Information retrieval: information about a subject or topic (主题); semantics is frequently loose; small errors are tolerated.
4. What affects the effective retrieval of relevant information? The user task and the logical view of the documents.
5. What is the relation between retrieval and browsing? Retrieval (检索): information or data; purposeful. Browsing (浏览): glancing around.

1. The notion of relevance is at the center of information retrieval.
2. Practical issues: security is not the only concern; privacy; copyright and patent rights.

Chapter 2

Key terms: index term (标引项或词); weight (权重); flat model (扁平式模型); structure-guided model (结构导向模型); hypertext model (超文本模型); ad hoc (特殊检索); filtering (过滤); user profile (用户档/预设).

1. What are the three classic models in information retrieval? The Boolean, vector, and probabilistic models.
2. What is a taxonomy of information retrieval models? The categories listed are: Proximal Nodes; Ad hoc; Filtering; Browsing; Vector; Probabilistic; Extended Boolean; Belief Network; Latent Semantic Indexing; Neural Networks; Structure Guided; Hypertext.
3. What are the definitions and characteristics of ad hoc retrieval and filtering? Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system. Filtering: the queries remain relatively static while new documents come into the system.

[...] (3) user-oriented measures (面向用户的测度方法); (4) other measures.
5. The coverage ratio is defined as the fraction of the documents known (to the user) to be relevant that has actually been retrieved, i.e., $\mathrm{Coverage} = |R_k| / |U|$. The novelty ratio is defined as the fraction of the relevant documents retrieved that was unknown to the user, i.e., $\mathrm{Novelty} = |R_u| / (|R_u| + |R_k|)$.
6. Four test reference collections: TIPSTER/TREC (Text REtrieval Conference), CACM, CISI, and Cystic Fibrosis.

Chapter 4

Key terms: query (查询); retrieval unit (检索单元); protocol (协议); separator (分隔符); Boolean query (布尔查询); Boolean operator (布尔运算符); query syntax tree (查询语法树); fuzzy Boolean (模糊布尔); pattern matching (模式匹配); SQL, Structured Query Language (结构化查询语言); WAIS, Wide Area Information Service (广域信息服务系统); relevance feedback (相关反馈); CD-RDx, Compact Disk Read only Data exchange (只读磁盘数据交换); SFQL, Structured Full-text Query Language (结构化全文查询语言); extended patterns (扩展模式); searching allowing errors (容错查询); visual query languages (可视化).

1. What does the type of query the user might formulate largely depend on? The underlying (潜在的) information retrieval model.
2. What is the retrieval unit? The retrieval unit (检索单元) is the basic element that can be retrieved as an answer to a query (normally a set of such basic elements is retrieved, sometimes ranked by relevance or another criterion). The retrieval unit can be a file, a document, a Web page, a paragraph, or some other structural unit that contains an answer to the search query. We will simply call those retrieval units documents.
3. What is the advantage of keyword-based queries? Keyword-based queries are popular because they are intuitive (直观), easy to express (易于表达), and [...]
[...] of the text and is a natural model for many text collections.
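The evaluation measures defined in these notes (recall, precision, the harmonic mean F, the E measure, coverage, and novelty) can be sketched in Python as follows. This is only a minimal illustration: the function names, the sample document IDs, and the convention of returning 0 for F (and 1 for E) when recall or precision is zero are assumptions made here, not part of the notes.

```python
def recall(relevant: set, answer: set) -> float:
    """Recall = |Ra| / |R|, with Ra = relevant documents that were retrieved."""
    return len(relevant & answer) / len(relevant)


def precision(relevant: set, answer: set) -> float:
    """Precision = |Ra| / |A|, with A = the retrieved (answer) set."""
    return len(relevant & answer) / len(answer)


def harmonic_mean(r: float, p: float) -> float:
    """F = 2 / (1/r + 1/p). Assumption: F is taken as 0 if r or p is 0."""
    if r == 0 or p == 0:
        return 0.0
    return 2 / (1 / r + 1 / p)


def e_measure(r: float, p: float, b: float = 1.0) -> float:
    """E = 1 - (1 + b^2) / (b^2/r + 1/p). Assumption: E is 1 if r or p is 0."""
    if r == 0 or p == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / r + 1 / p)


def coverage(known_relevant: set, answer: set) -> float:
    """Coverage = |Rk| / |U|: U = relevant docs known to the user, Rk = U retrieved."""
    return len(known_relevant & answer) / len(known_relevant)


def novelty(relevant: set, known_relevant: set, answer: set) -> float:
    """Novelty = |Ru| / (|Ru| + |Rk|): Ru = retrieved relevant docs unknown to the user."""
    rk = known_relevant & answer
    ru = (relevant & answer) - known_relevant
    return len(ru) / (len(ru) + len(rk))


# Hypothetical example data: document IDs are made up for illustration.
relevant = {1, 2, 3, 4, 5}   # R: all relevant documents
answer = {1, 2, 6, 7}        # A: documents the system retrieved
known = {1, 3}               # U: relevant documents the user already knew

r = recall(relevant, answer)
p = precision(relevant, answer)
f = harmonic_mean(r, p)
e = e_measure(r, p, b=1.0)
cov = coverage(known, answer)
nov = novelty(relevant, known, answer)
```

With b = 1 the E measure is the complement of the harmonic mean (E = 1 - F), which gives a quick sanity check on both functions.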
Chapter 5

Key terms: query reformulation (查询重构); query expansion (查询扩展); term reweighting (语词重新加权); user relevance feedback (用户相关反馈); cluster (聚族、聚类); local context analysis (局部上下文分析); automatic global analysis (自动全局分析); similarity thesaurus (相似性叙词表).

1. What are the basic steps of query reformulation? Query reformulation involves two basic steps: expanding the original query with new terms and reweighting the terms in the expanded query.
2. What approaches to query reformulation are there? They are grouped in three categories: (a) approaches based on feedback information from the user; (b) approaches based on information derived from the set of documents initially retrieved (called the local set of documents); (c) approaches based on global information derived from the document collection.
3. Which steps does relevance feedback have? Marking the relevant retrieved documents, then selecting terms or changing the weights of the words.
4. What are the two basic techniques of relevance feedback? Query expansion (addition of new terms from relevant documents) and term reweighting (modification of term weights based on the user's relevance judgement).
5. What is automatic global analysis? In automatic global analysis, all documents in the collection are used to determine a global thesaurus-like structure that defines term relationships.
6. What is automatic local analysis? [...]

[...] cataloging, content rating, property rights, digital signatures, applications to electronic commerce. New standard: Resource Description Framework, a description of Web resources to facilitate automated processing of information.
6. Give some examples of text formats. Formats for document interchange (RTF); formats for displaying (PDF, PostScript); formats for encoding email (MIME).
7. Give some examples of multimedia formats. Tagged Image File Format (TIFF); Joint Photographic Experts Group (JPEG); Portable Network Graphics (PNG), a newer bitmap image format; MPEG (Moving Pictures Expert Group).
8. What is a markup language? Markup is defined as extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.

1. Text is the main form of communicating knowledge.
2. The syntax of a document can express structure, presentation style, semantics, or even external actions.
3. Languages in current use: the Standard Generalized Markup Language (SGML), which is covered later on in this chapter, tries to balance all the issues above. Metadata, markup, and semantic encoding represent different levels of formalization of the document contents.
4. The current trend is to use languages that provide information on the document structure (文献结构), format (格式), and semantics (语义) while being readable by humans as well as computers.
5. Common forms of metadata associated with text include the author, the date of publication, the source of the publication, the document length (in pages, words, bytes, etc.), and the document genre (book, article, memo, etc.).
6. The classification of metadata: descriptive metadata and semantic metadata.
7. DC propo
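As a rough illustration of the two relevance feedback techniques listed under Chapter 5 (query expansion and term reweighting), the sketch below updates a weighted query from documents the user marked as relevant. The update rule is a simplified, positive-feedback-only Rocchio-style formula, which the notes themselves do not name; the function name `reformulate`, the parameters `alpha` and `beta`, their default values, and the sample data are all assumptions made for this example.

```python
from collections import Counter


def reformulate(query: dict, relevant_docs: list, alpha: float = 1.0,
                beta: float = 0.75) -> dict:
    """Return a reformulated query as a term -> weight mapping.

    Term reweighting: each original weight is scaled by alpha.
    Query expansion: every term seen in a relevant document is added,
    weighted by beta times its mean frequency over the relevant documents.
    """
    new_query = {term: alpha * weight for term, weight in query.items()}
    if not relevant_docs:
        return new_query  # no feedback given, only the alpha scaling applies

    # Sum term frequencies over all documents marked relevant.
    totals = Counter()
    for tokens in relevant_docs:
        totals.update(tokens)

    for term, freq in totals.items():
        mean_freq = freq / len(relevant_docs)
        new_query[term] = new_query.get(term, 0.0) + beta * mean_freq
    return new_query


# Hypothetical feedback round: the user marks two retrieved documents relevant.
original = {"retrieval": 1.0}
marked_relevant = [
    ["information", "retrieval"],
    ["retrieval", "evaluation"],
]
updated = reformulate(original, marked_relevant)
```

Here "retrieval" is reweighted because it already appears in the query, while "information" and "evaluation" enter the query as expansion terms with smaller weights.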
