
Lecture 2: Evaluation in Document Retrieval (Web Information Retrieval)
Reference: James Allan, University of Massachusetts Amherst

Slide 2: Review
- We had proposed a very simple model of document retrieval
  (recall that IR is much more than document retrieval!)
- Statistical language model:
  - Rank documents by P(Q|Di)
  - Assume term independence, so P(Q|D) is the product of the per-term probabilities P(qi|D)
  - Estimate P(qi|D) by tf(qi, D) / len(D)
- Highly artificial examples suggested the model is "okay"
- Is it really any good? How can we find out?
  (Do you think Google is better than Baidu, or vice versa? How can you prove that?)
- How can we know if changes make it better?
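Below is a minimal Python sketch of the query-likelihood scoring the Review slide outlines; the toy documents, the query, and the function name are illustrative, and no smoothing is applied, exactly as in the simple tf/len estimate above.

```python
from collections import Counter

def score_query_likelihood(query_terms, doc_terms):
    """Score one document by P(Q|D) = product of P(q|D),
    with P(q|D) estimated as tf(q, D) / len(D)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 1.0
    for q in query_terms:
        score *= tf[q] / doc_len   # zero if q never occurs in D (no smoothing)
    return score

# Toy example: rank two documents for a two-term query.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
query = "cat mat".split()
ranking = sorted(docs, key=lambda d: score_query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # d1 ranks first: it contains both query terms
```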

Slide 3: Evaluation in document retrieval: outline
- History
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 4: History
- Experimental methodology has been a prominent issue in IR since the 1960s
  - Not sufficient to develop formalisms or approaches
  - Mandatory to demonstrate effectiveness empirically
- Early work compared manual vs. automatic indexing
  - The question was whether automatic indexing could approach manual "quality" (old "Yahoo" vs. "Google")
  - Assumes that manual indexing was the "correct" approach
- Methods evolved to compare overall system performance
  - Batch-mode retrieval
  - Interactive information seeking

Slides 5-6: TREC Conference
- Established in 1992 to evaluate large-scale IR
  - Retrieving documents from a gigabyte collection
- Has run continuously since then
  - TREC 2006 conference: Nov 14-17, at NIST (National Institute of Standards and Technology) in Gaithersburg, MD, USA
- Run by NIST's Information Access Division
- Initially sponsored by DARPA as part of the Tipster program
- Now supported by many, including DARPA, ARDA, and NIST
- Probably the best-known IR evaluation setting
  - Started with 25 participating organizations in the 1992 evaluation
  - In 2007, there were about 87 groups from all over the world
- Proceedings available on-line

Slide 7: TREC general format
- TREC consists of IR research tracks:
  - Ad-hoc retrieval, routing, cross-language, scanned documents, speech recognition, query, video, filtering, Spanish, question answering, novelty, Chinese, high precision, interactive, Web, database merging, NLP, ...
- Each track works on roughly the same model:
  - November: track approved by the TREC community
  - Winter: track members finalize the format for the track
  - Spring: researchers train systems based on the specification
  - Summer: researchers carry out the formal evaluation
    - Usually a "blind" evaluation: researchers do not know the answers
  - Fall: NIST carries out the evaluation
  - November: group meeting (TREC) to find out how well your site did and how others tackled the problem
- Many tracks are run by volunteers outside of NIST (e.g., Web)
- "Coopetition" model of evaluation
  - Successful approaches are generally adopted in the next cycle

Slide 8: TREC: pros and cons
- Widely recognized, premier annual IR evaluation
- What is good:
  - Brings together a wide range of active researchers
  - Huge distributed resources applied to a common task
  - Substantial gains on tasks, rapidly
  - Valuable evaluation corpora usually available after a track completes
- What is less good:
  - Annual evaluation can divert resources from research
  - Evaluations often require significant engineering effort
    - Some tracks are moving to biennial evaluation as a result
  - Recently, an explosion of tracks
    - Means less energy applied to individual tasks
    - The TREC program committee keeps a tight rein on the number of tracks
- On balance? Depends on your prejudices

Slide 9: Types of Evaluation
- An IR system is often a component of a larger system
- Might evaluate several aspects:
  - Assistance in formulating queries
  - Speed of retrieval
  - Resources required
  - Presentation of documents
  - Ability to find relevant documents
  - Appeal to users (market evaluation)
- Evaluation is generally comparative
  - System A vs. B, or A vs. A' (an improved version of A)
  - Cost-benefit analysis is possible
- Most common evaluation: retrieval effectiveness

Slide 10: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 11: Relevance
- Relevance is difficult to define satisfactorily
- A relevant document is one judged useful in the context of a query
  - Who judges? What is useful?
  - Humans are not very consistent
  - Judgments depend on more than the document and the query
- With real collections, we never know the full set of relevant documents
- The retrieval model incorporates a notion of relevance:
  - Satisfiability of a FOL (First Order Logic) expression
  - P(relevance | query, document)
  - Similarity to the query
  - What does the language model use?
- Example: input "深圳社会保险" (Shenzhen social insurance) and get two result sets. Which one is more relevant to the query?

Slide 12: Test Collections
- Compare retrieval performance using a test collection:
  - a set of documents
  - a set of queries
  - a set of relevance judgments (which documents are relevant to each query)
- To compare the performance of two techniques:
  - each technique is used to evaluate the test queries
  - results (a set or a ranked list) are compared using some performance measure
  - most common measures: precision and recall
- Usually use multiple measures to get different views of performance
- Usually test with multiple collections; performance is collection dependent
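A small sketch of the comparison setup Slide 12 describes, assuming a tiny in-memory test collection; the document/query ids, the judgments, and the two toy runs are all invented for illustration.

```python
from typing import Callable, Dict, List, Set

# A test collection: documents, queries, and relevance judgments (qrels).
documents: Dict[str, str] = {"d1": "...", "d2": "...", "d3": "..."}
queries:   Dict[str, str] = {"q1": "shenzhen social insurance", "q2": "trec overview"}
qrels:     Dict[str, Set[str]] = {"q1": {"d1", "d3"}, "q2": {"d2"}}

def compare(run_a: Dict[str, List[str]],
            run_b: Dict[str, List[str]],
            measure: Callable[[List[str], Set[str]], float]) -> None:
    """Apply the same effectiveness measure to two systems' results, query by query."""
    for qid in queries:
        a = measure(run_a.get(qid, []), qrels[qid])
        b = measure(run_b.get(qid, []), qrels[qid])
        print(f"{qid}: A={a:.2f}  B={b:.2f}")

# Example: compare two invented runs with set precision as the measure.
precision = lambda retrieved, relevant: (len(set(retrieved) & relevant) / len(retrieved)) if retrieved else 0.0
compare({"q1": ["d1", "d2"], "q2": ["d2"]}, {"q1": ["d3"], "q2": ["d1"]}, precision)
```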

Slide 13: Sample Test Collections
- TREC includes five disks, so it has numerous subsets
- The TDT (Topic Detection and Tracking) corpora are also well-known (used through the 2004 evaluation)
  - In English, Arabic, and Chinese
  - Text, television audio, and radio audio

Slide 14: Finding Relevant Documents
- Question: did the system find all relevant material?
- To answer accurately, the collection needs complete judgments
  - i.e., "yes," "no," or some score for every query-document pair
- For small test collections, assessors can review all documents for all queries
  - Done for the TDT collection of 60K documents as recently as 1998
- Not practical for large or medium-sized collections
  - TREC collections have millions of documents
  - (Even TDT had to abandon complete judging)
- Other approaches that can be used: pooling, sampling, search-based

Slide 15: Finding relevant documents (2)
- Search-based
  - Rather than read every document, use manually-guided search
  - Read retrieved documents until convinced all relevant ones have been found
- Pooling (a small sketch follows this slide)
  - Retrieve documents using several (usually automatic) techniques
  - Judge the top n documents for each technique
  - The relevant set is the union of the judged-relevant documents
  - A subset of the true relevant set
- Sampling
  - Possible to estimate the size of the true relevant set by sampling
- All are incomplete, so when testing:
  - How should unjudged documents be treated?
  - How might this affect results?
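A minimal sketch of the pooling procedure from Slide 15, assuming each system contributes one ranked run per query; the run contents and the pool depth are invented.

```python
def build_pool(runs, depth=100):
    """Pooling: take the top `depth` documents from each system's ranked run
    and form the union; only these pooled documents are judged by assessors."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy runs from three hypothetical systems (document ids are invented).
runs = {
    "sys1": ["d4", "d1", "d7", "d2"],
    "sys2": ["d1", "d9", "d4", "d5"],
    "sys3": ["d3", "d1", "d8", "d6"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd3', 'd4', 'd9'] -- the union of each run's top 2
```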

Slide 16: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
  - Case study: known item search
- Significance tests
- Other issues and problems

Slide 17: Precision and Recall
- Precision: the proportion of the retrieved set that is relevant
  - Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall: the proportion of all relevant documents in the collection that are included in the retrieved set
  - Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
- Precision and recall are well defined for sets
- For ranked retrieval, how do we compute P/R values?
  - Compute a P/R point for each relevant document
  - Compute the value at fixed recall points (e.g., precision at 20% recall)
  - Compute the value at fixed rank cutoffs (e.g., precision at rank 20)

Slide 18: Another common representation

                   Relevant    Non-relevant
  Retrieved            A            B
  Not retrieved        C            D

- Relevant = A + C; Retrieved = A + B; Collection size = A + B + C + D
- Precision = A / (A + B)
- Recall = A / (A + C)
- Miss = C / (A + C)
- False alarm (fallout) = B / (B + D)
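A short sketch that computes the four quantities on Slide 18 from a retrieved set and a relevant set; the document ids and the collection size are invented.

```python
def contingency_measures(retrieved, relevant, collection_size):
    """Precision, recall, miss rate, and fallout from the A/B/C/D contingency counts."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)                  # relevant and retrieved
    b = len(retrieved - relevant)                  # non-relevant but retrieved
    c = len(relevant - retrieved)                  # relevant but missed
    d = collection_size - a - b - c                # non-relevant and not retrieved
    return {
        "precision": a / (a + b) if a + b else 0.0,
        "recall":    a / (a + c) if a + c else 0.0,
        "miss":      c / (a + c) if a + c else 0.0,
        "fallout":   b / (b + d) if b + d else 0.0,
    }

# Toy example: 10-document collection, 3 retrieved, 4 relevant.
print(contingency_measures({"d1", "d2", "d3"}, {"d1", "d3", "d5", "d7"}, collection_size=10))
```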

Slide 19: Precision and Recall example
- Computing precision and recall based on a ranking (worked example shown on the slide)

Slide 20: Average precision of a query
- We often want a single-number effectiveness measure
  - e.g., for a machine-learning algorithm to detect improvement
- Average precision is widely used in IR
- Calculated by averaging the precision at each point where recall increases (i.e., at each rank where a relevant document is retrieved)

Slide 21: Precision and Recall example 2
- (worked example shown on the slide)

Slide 22: Recall/precision graphs
- Average precision hides information
- Sometimes it is better to show the tradeoff in a table or graph

Slide 23: Averaging across queries
- It is very hard to compare P/R graphs or tables for individual queries (too much data)
- Need to average over many queries
- Two main types of averaging:
  - Micro-average: each relevant document is a point in the average
  - Macro-average: each query is a point in the average (most common)
- What does each tell someone evaluating a system? Why use one over the other?
- Also done with the average precision value
  - The average of many queries' average precision values is called mean average precision (MAP)
  - ("Average average precision" sounds weird)
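A minimal sketch of average precision (Slide 20) and its macro-average, MAP (Slide 23), assuming binary judgments and one ranked list per query; the rankings and judgments are invented.

```python
def average_precision(ranking, relevant):
    """Average the precision at each rank where a relevant document appears
    (i.e., at each point where recall increases)."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """Macro-average: one AP value per query, then the mean over queries."""
    aps = [average_precision(runs[q], qrels[q]) for q in qrels]
    return sum(aps) / len(aps)

# Toy data: two queries with invented rankings and judgments.
runs  = {"q1": ["d3", "d1", "d5", "d2"], "q2": ["d9", "d4", "d8"]}
qrels = {"q1": {"d1", "d2"},             "q2": {"d4"}}
print(mean_average_precision(runs, qrels))  # (AP_q1 + AP_q2) / 2
```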

Slide 24: Averaging graphs: a false start
- How can graphs be averaged?
  - Different queries have different meaningful recall values
  - The recall/precision graph also has an odd saw-tooth shape if plotted directly
- Sample graphs (from example 2): what is the precision at 25% recall?
- Need to interpolate. But how?

Slide 25: Interpolation of graphs
- Need recall/precision pairs between the observed pairs
  - For averaging (as seen)
  - Also, to better understand the tradeoff between the measures
- The plot on the slide shows actual R/P data: we just have points, with no observed data anywhere else
- How should we estimate values between the points? (Varying recall, mostly)

Slide 26: Possible interpolation approaches
- No interpolation (not very useful)
- Connect the dots
- Connect the maxima, the minima, or the averages
- How to deal with 0% recall? Assume 0? Assume the best value? A constant start?

Slide 27: How to choose?
- It is an empirical fact that, on average, as recall increases, precision decreases
  - Verified time and time again (on average)
- So it seems reasonable to aim for an interpolation that makes the function monotonically non-increasing
- One approach: take the interpolated precision at recall r to be the maximum precision observed at any recall level at or above r:
  P_interp(r) = max { P : (R, P) ∈ S, R ≥ r }, where S is the set of observed (R, P) points
- This results in a step function
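A short sketch of the max-interpolation rule on Slide 27, assuming the observed (recall, precision) points for one query are available; the sample points are invented.

```python
def interpolated_precision(r, observed):
    """Interpolated precision at recall level r: the maximum precision over all
    observed (recall, precision) points whose recall is at least r."""
    candidates = [p for (rec, p) in observed if rec >= r]
    return max(candidates) if candidates else 0.0

# Invented observed (recall, precision) points from one ranking.
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.4)]
print(interpolated_precision(0.25, points))  # 0.67: best precision at recall >= 0.25
print(interpolated_precision(0.0, points))   # 1.0: 0% recall handled smoothly
```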

Slide 28: Our example, interpolated this way
- The interpolated curve drops monotonically
- The average will also fall monotonically
- Handles 0% recall smoothly

Slide 29: Our example (cont'd)
- Given the data from Ranking #1, what is the precision at 0.25 recall?
- Select the maximum precision among the points with recall at or above 0.25

Slide 30: Averaging graphs: using interpolation
- How can graphs be averaged?
  - Different queries have different meaningful recall values
  - The recall/precision graph also has an odd saw-tooth shape if plotted directly
- Sample graphs (example 2): what is the precision at 25% recall?
- Interpolate the values, then average them

Slide 31: Interpolation and averaging
- van Rijsbergen, p. 118 (1979)

Slide 32: Interpolated average precision
- Average precision at standard recall points
- For a given query, compute a P/R point for every relevant document
- Interpolate precision at standard recall levels
  - 11-point is usually 100%, 90%, 80%, ..., 10%, 0% (yes, 0% recall)
  - 3-point is usually 75%, 50%, 25%
- Average over all queries to get the average precision at each recall level
- Average the interpolated recall levels to get a single result
- This is called "interpolated average precision"
  - Not used much anymore; "mean average precision" is more common
  - Values at specific interpolated points are still commonly used
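A sketch of the 11-point procedure on Slide 32; the interpolation helper is the same max rule as above, and the per-query observed points are invented.

```python
RECALL_LEVELS = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

def interp(r, observed):
    """Max precision over observed (recall, precision) points with recall >= r."""
    vals = [p for rec, p in observed if rec >= r]
    return max(vals) if vals else 0.0

def eleven_point_average(per_query_points):
    """Interpolate each query at the 11 standard recall levels, average across
    queries at each level, then average those 11 values into a single number."""
    per_level = []
    for r in RECALL_LEVELS:
        per_level.append(sum(interp(r, pts) for pts in per_query_points) / len(per_query_points))
    return per_level, sum(per_level) / len(per_level)

# Two queries' invented observed (recall, precision) points.
q1 = [(0.5, 1.0), (1.0, 0.67)]
q2 = [(0.33, 0.5), (0.67, 0.4), (1.0, 0.38)]
levels, single = eleven_point_average([q1, q2])
print(levels, single)
```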

Slide 33: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 34: More Single-Valued Measures
- E measure (van Rijsbergen)
  - Used to emphasize precision (or recall); essentially a weighted combination of precision and recall:
    E = 1 - 1 / (α/P + (1 - α)/R)
  - A large α increases the importance of precision
  - Can transform via α = 1 / (β² + 1), where β expresses the relative weight placed on recall versus precision
  - When β = 1 (α = 1/2), precision and recall have equal importance
  - E is the normalized symmetric difference of the retrieved and relevant sets

Slide 35: Symmetric Difference and E
- A is the retrieved set of documents, B is the relevant set of documents
- P = |A ∩ B| / |A|, R = |A ∩ B| / |B|
- A Δ B (the symmetric difference) is the shaded area on the slide
- |A Δ B| = |A ∪ B| - |A ∩ B| = |A| + |B| - 2|A ∩ B|
- For β = 1:
  E = 1 - 2PR / (P + R) = (P + R - 2PR) / (P + R) = |A Δ B| / (|A| + |B|)

Slide 36: F measure
- F = 1 - E is often used
- Good results mean larger values of F
- The "F1" measure is popular: F with β = 1, i.e., F1 = 2PR / (P + R)
- Particularly popular with classification researchers

Slide 37: F measure as an average
- F1 is the harmonic mean of P and R: the inverse of the average of their inverses,
  F1 = 1 / ((1/P + 1/R) / 2) = 2PR / (P + R)
- It heavily penalizes low values of P or R, compared to the standard (arithmetic) average
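A small sketch of the E and F measures from Slides 34-37, parameterized by β as described; the precision/recall values at the end are illustrative.

```python
def e_measure(precision, recall, beta=1.0):
    """van Rijsbergen's E: 1 - 1 / (alpha/P + (1-alpha)/R), with alpha = 1/(beta^2 + 1)."""
    if precision == 0.0 or recall == 0.0:
        return 1.0                      # worst case: no overlap between the sets
    alpha = 1.0 / (beta ** 2 + 1.0)
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f_measure(precision, recall, beta=1.0):
    """F = 1 - E; for beta = 1 this is the harmonic mean 2PR / (P + R)."""
    return 1.0 - e_measure(precision, recall, beta)

p, r = 0.25, 0.75
print(f_measure(p, r))            # 0.375: harmonic mean, pulled toward the low value
print((p + r) / 2)                # 0.5: arithmetic mean, for comparison
print(f_measure(p, r, beta=2.0))  # weights recall more heavily than precision
```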

Slide 38: F measure, geometric interpretation
- A is the retrieved set of documents, B is the relevant set of documents
- P = |A ∩ B| / |A|, R = |A ∩ B| / |B|

Slide 39: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 40: Other Single-Valued Measures
- Expected search length
- Normalized precision or recall
  - Measures the area between the actual and ideal curves
- Breakeven point
  - The point at which precision = recall
  - Popular in classification tasks, though not clear what it means
- Many others

Slide 41: Expected Search Length
- Evaluation is based on the type of information need:
  1. only one relevant document is required
  2. some arbitrary number n of relevant documents
  3. all relevant documents
  4. a given proportion of the relevant documents
- Two types of ordering:
  - Simple ordering: never two or more documents at the same level of the ordering
  - Weak ordering: ties are allowed, so several documents can share a level
- The search length in a simple ordering is the number of non-relevant documents a user must scan before the information need is satisfied
- A search strategy's output is assumed to be a weak ordering
- Expected search length is the appropriate measure for a weak ordering

Slide 42: Expected Search Length
- (Illustration on the slide: the case of a weak ordering; what about a simple ordering?)

Slide 43: Expected Search Length
- ESL(q) = P_nonrel + (F_nonrel * F_needed) / (F_rel + 1), where:
  - q is the query
  - P_nonrel is the number of documents non-relevant to q in all levels preceding the final level
  - F_rel is the number of relevant documents in the final level
  - F_nonrel is the number of non-relevant documents in the final level
  - F_needed is the number of relevant documents required from the final level to satisfy the need
- Use the mean expected search length over a set of queries
- The measure is criticized for ignoring recall
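A minimal sketch of the expected search length formula on Slide 43, assuming the system output is a weak ordering given as a list of levels; the levels, judgments, and the number of documents needed are invented.

```python
def expected_search_length(levels, relevant, needed):
    """ESL = P_nonrel + (F_nonrel * F_needed) / (F_rel + 1).

    `levels` is a weak ordering: a list of levels, each a list of doc ids.
    `relevant` is the set of relevant doc ids; `needed` is how many relevant
    documents the user wants in total."""
    preceding_nonrel = 0
    found = 0
    for level in levels:
        rel_here = sum(1 for d in level if d in relevant)
        nonrel_here = len(level) - rel_here
        if found + rel_here >= needed:                 # need is satisfied within this level
            still_needed = needed - found
            return preceding_nonrel + (nonrel_here * still_needed) / (rel_here + 1)
        found += rel_here
        preceding_nonrel += nonrel_here
    return float(preceding_nonrel)                     # need never satisfied: scanned everything

# Weak ordering with three levels; the user wants 2 relevant documents.
levels = [["d1", "d2"], ["d3", "d4", "d5"], ["d6"]]
relevant = {"d2", "d4", "d6"}
print(expected_search_length(levels, relevant, needed=2))  # 1 + (2 * 1) / (1 + 1) = 2.0
```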

Slide 44: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 45: Homework 1: "Manual evaluation of popular search engines"
Requirements:
- These four search engines must be covered in the evaluation: Google, Baidu, Yahoo, and MSN.
- Each student should design at least 2 queries.
- Each query should contain 2-4 words, or one or more multi-word phrases, and should be chosen so that each search engine returns more than 100 items.
- Review the top 50 items returned by each search engine and give each item a relevance grade from "D" to "A":
  - "D": not relevant
  - "C": a little relevant
  - "B": almost relevant
  - "A": relevant
- Build the relevant document set as the union of all items, from any of the search engines, that are marked "B" or above.
- Compute the recall and precision values of each search engine for each query, and draw the recall-precision curves as shown on Slide 22.
- Compare the performance of the search engines using any averaging approach or any single measure you are interested in, and write up the comparison report.
- The original queries and ranking results must be stored in files in the format given in the attached file named "evaluate.xml".

Slide 46: Backup

Slide 47: Why significance tests?
- System A beats System B on one query
  - Is it just a lucky query for System A?
  - Maybe System B does better on some other query
  - Need as many queries as possible
    - Empirical research suggests 25 is the minimum needed
    - TREC tracks generally aim for at least 50 queries
- System A and B are identical on all but one query
  - If System A beats System B by enough on that one query, the average will make A look better than B
  - As above, this could just be a lucky break for System A
  - Need A to beat B frequently to believe it is really better
- E.g., System A is only 0.00001% better than System B
  - Even if that is true on every query, does it mean much?
- Significance tests consider these issues

Slide 48: Significance Tests
- Are the observed differences statistically significant?
- Generally we cannot make assumptions about the underlying distribution
  - Most significance tests do make such assumptions
- Single-valued measures are easier to use, but R/P is possible
- The sign test and the Wilcoxon signed-ranks test are typical
  - They do not require that the data be normally distributed
  - The sign test answers "how often"; Wilcoxon answers "how much"
  - The sign test is the crudest but the most convincing
- Are the observed differences detectable by users?

Slide 49: Sign Test Example
- For techniques A and B, compare the average precision for each pair of results generated by the queries in the test collection
- If the difference is large enough, count it as + or -; otherwise ignore the query
- Use the number of +'s and the total number of significant differences to determine the significance level
- Example, for 40 queries:
  - Technique A produced a better result than B 12 times, B was better than A 3 times, and 25 were "the same": the sign test gives p ≈ 0.035, so A is significantly better than B at the 5% level
  - If instead A beats B 18 times and B beats A 9 times, p ≈ 0.122, and A is not significantly better than B at the 5% level
- Here n+ is the number of times A performs better than B, n- is the number of times B performs better than A, and the p value is looked up in the test's table. (See the attached file "x2检验.mht" for more information.)
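A small sketch of the sign test from Slide 49, computed here as an exact two-sided binomial p-value rather than a table lookup; the counts are the two scenarios discussed on the slide.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Exact two-sided sign test: under the null hypothesis each non-tied query is
    equally likely to favor A or B, so the count favoring A is Binomial(n, 0.5)."""
    n, k = n_plus + n_minus, max(n_plus, n_minus)
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# Slide 49's two scenarios (tied queries are ignored by the test).
print(round(sign_test_p(12, 3), 3))   # ~0.035: A significantly better at the 5% level
print(round(sign_test_p(18, 9), 3))   # ~0.122: not significant at the 5% level
```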

Slide 50: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 51: Feedback Evaluation
- Relevance feedback is covered later
  - Two-pass approaches: create a better query out of the results of the original query
- How should documents that have already been seen be treated?
- Rank freezing
  - Ranks of the already-judged relevant documents are fixed for subsequent iterations
  - Compare the new ranking with the original ranking
  - Performance cannot get worse
- Residual collection
  - All previously seen documents (e.g., the top n) are removed from the collection
  - Compare the reranking with the original ranking over the residual collection
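A rough sketch of residual-collection evaluation as described above; the rankings, the judged set, and the precision-at-k measure used for comparison are all invented for illustration.

```python
def residual_ranking(ranking, seen):
    """Remove previously seen documents (e.g., the top n judged in pass one);
    the ranking is then evaluated on what remains of the collection."""
    return [d for d in ranking if d not in seen]

def precision_at_k(ranking, relevant, k):
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k if top else 0.0

# Pass one showed the user the top 2 documents; feedback produced a new ranking.
original = ["d1", "d2", "d3", "d4", "d5"]
feedback = ["d2", "d4", "d1", "d5", "d3"]
seen     = set(original[:2])
relevant = {"d2", "d4", "d5"}

# Compare both rankings on the residual collection (seen documents removed).
print(precision_at_k(residual_ranking(original, seen), relevant, k=2))  # 0.5
print(precision_at_k(residual_ranking(feedback, seen), relevant, k=2))  # 1.0
```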
