Slide 1: Lecture 2: Evaluation in Document Retrieval (文档检索系统评价)
- Reference: James Allan, University of Massachusetts Amherst

Slide 2: Review
- We had proposed a very simple model of document retrieval
  - Recall that IR is much more than document retrieval!
- Statistical language model (统计语言模型)
  - Rank documents by P(Q|Di)
  - Assume term independence, so P(Q|D) = Πi P(qi|D)
  - Estimate P(qi|D) by tf(qi, D) / len(D)
- Highly artificial examples suggested the model is "okay"
- Is it really any good? How can we find out? (Do you think Google is better than Baidu, or vice versa? How can you prove that?)
- How can we know if changes make it better?
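A minimal Python sketch of the query-likelihood ranking described above, assuming the simple maximum-likelihood estimate tf/len with no smoothing; the function name, documents, and query are invented for illustration.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """Score a document by P(Q|D) = prod_i tf(q_i, D) / len(D)."""
    tf = Counter(doc_terms)
    length = len(doc_terms)
    score = 1.0
    for q in query_terms:
        score *= tf[q] / length  # zero if q never occurs in D (no smoothing here)
    return score

# Invented toy collection and query
docs = {
    "D1": "shenzhen social insurance online service".split(),
    "D2": "social security policy update news".split(),
}
query = "social insurance".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # documents ordered by P(Q|D), best first
```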
Slide 3: Evaluation in document retrieval: outline
- History
- Types of evaluation
- Relevance (相关性) and test collections (测试集)
- Effectiveness measures (有效性度量)
  - Recall and precision (召回率与精度)
  - E and F
  - Expected search length (期望搜索长度)
- Significance tests (显著性检验)
- Other issues and problems

Slide 4: History
- Experimental methodology has been a prominent (突出) issue in IR since the 1960s
  - Not sufficient to develop formalisms or approaches
  - Mandatory to demonstrate effectiveness empirically
- Early work compared manual vs. automatic indexing
  - The question was whether automatic indexing could approach manual "quality" (an old version of "Yahoo" vs. "Google")
  - Assumes that manual indexing was the "correct" approach
- Methods evolved to compare overall system performance
  - Batch mode retrieval (批处理模式信息检索)
  - Interactive information seeking (交互式信息查找)

Slide 5: TREC Conference

Slide 6: TREC Conference (Contd)
- Established in 1992 to evaluate large-scale IR
  - Retrieving documents from a gigabyte collection
- Has run continuously since then
  - TREC 2006 conference: Nov 14-17, at NIST (National Institute of Standards and Technology) in Gaithersburg, MD, USA
- Run by NIST's Information Access Division
- Initially sponsored by DARPA as part of the Tipster program
- Now supported by many, including DARPA, ARDA, and NIST
- Probably the most well-known IR evaluation setting
  - Started with 25 participating organizations in the 1992 evaluation
  - In 2007, there were about 87 groups from all over the world
- Proceedings are available on-line
Slide 7: TREC general format
- TREC consists of IR research tracks
  - Ad-hoc retrieval, routing, cross-language, scanned documents, speech recognition, query, video, filtering, Spanish, question answering, novelty, Chinese, high precision, interactive, Web, database merging, NLP, ...
- Each track works on roughly the same model
  - November: track approved by the TREC community
  - Winter: the track's members finalize the format for the track
  - Spring: researchers train systems based on the specification
  - Summer: researchers carry out the formal evaluation
    - Usually a "blind" evaluation: researchers do not know the answers
  - Fall: NIST carries out the evaluation
  - November: group meeting (TREC) to find out how well your site did and how others tackled the problem
- Many tracks are run by volunteers outside of NIST (e.g., Web)
- A "coopetition (竞争中的合作)" model of evaluation
  - Successful approaches are generally adopted in the next cycle

Slide 8: TREC: pros and cons
- Widely recognized, premier annual IR evaluation
- What is good
  - Brings together a wide range of active researchers
  - Huge distributed resources applied to a common task
  - Substantial gains on tasks rapidly
  - Valuable evaluation corpora (语料库) usually become available after a track completes
- What is less good
  - The annual evaluation can divert resources from research
    - Evaluations often require significant engineering effort
    - Some tracks are moving to biennial evaluation as a result
  - Recently, an explosion of tracks
    - Means less energy applied to individual tasks
    - The TREC program committee keeps a tight rein on the number of tracks
- On balance? Depends on your prejudices
Slide 9: Types of Evaluation
- An IR system is often a component of a larger system
- Might evaluate several aspects
  - Assistance in formulating queries (帮助设定检索请求)
  - Speed of retrieval
  - Resources required
  - Presentation of documents (文档显示)
  - Ability to find relevant documents
  - Appeal to users (market evaluation) (吸引用户)
- Evaluation is generally comparative
  - System A vs. B, or A vs. A' (an improved version of A)
- Cost-benefit analysis is possible
- Most common evaluation: retrieval effectiveness

Slide 10: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 11: Relevance
- Relevance is difficult to define satisfactorily
  - A relevant document is one judged useful in the context of a query
  - Who judges? What is useful?
  - Humans are not very consistent
  - Judgments depend on more than the document and the query
- With real collections, we never know the full set of relevant documents
- The retrieval model incorporates a notion of relevance
  - Satisfiability of a FOL (First Order Logic) expression
  - P(relevance | query, document)
  - Similarity to the query
  - What does the language model use?
- Input "深圳社会保险" and get 2 result sets: which one is more relevant to the query?

Slide 12: Test Collections
- Compare retrieval performance using a test collection
  - a set of documents
  - a set of queries
  - a set of relevance judgments (which docs are relevant to each query)
- To compare the performance of two techniques:
  - each technique is used to evaluate the test queries
  - results (a set or a ranked list) are compared using some performance measure
  - the most common measures are precision and recall
- Usually use multiple measures to get different views of performance
- Usually test with multiple collections, since performance is collection dependent
Slide 13: Sample Test Collections
- TREC includes five disks, so it has numerous subsets
- The TDT (Topic Detection and Tracking) corpora are also well-known (e.g., the corpus used for the 2004 evaluation)
  - In English, Arabic, and Chinese
  - Text, television audio, and radio audio

Slide 14: Finding Relevant Documents
- Question: did the system find all relevant material?
- To answer accurately, the collection needs complete judgments
  - i.e., "yes," "no," or some score for every query-document pair
- For small test collections, assessors can review all documents for all queries
  - Done for a TDT collection of 60K documents as recently as 1998
- Not practical for large or medium-sized collections
  - TREC collections have millions of documents
  - (Even TDT had to abandon complete judging)
- Other approaches that can be used
  - Pooling
  - Sampling
  - Search-based

Slide 15: Finding relevant documents (2)
- Search-based
  - Rather than read every document, use manually-guided search
  - Read retrieved documents until convinced all relevant documents have been found
- Pooling (see the sketch after this list)
  - Retrieve documents using several (usually automatic) techniques
  - Judge the top n documents for each technique
  - The relevant set is the union of the relevant documents found in the pools
  - A subset of the true relevant set
- Sampling
  - Possible to estimate the size of the true relevant set by sampling
- All are incomplete, so when testing:
  - How should unjudged documents be treated?
  - How might this affect the results?
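A minimal sketch of the pooling idea from slide 15: take the union of the top-n documents from several runs to form the set that assessors judge. The run data and the cutoff are invented for illustration; TREC's actual pooling procedure involves more bookkeeping than this.

```python
def build_pool(runs, n=100):
    """Union of the top-n documents from each ranked run.

    runs: dict mapping run name -> ranked list of doc ids (best first).
    Returns the set of doc ids that assessors would judge.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool

# Invented example with tiny runs and n=2
runs = {
    "system_A": ["d3", "d1", "d7", "d2"],
    "system_B": ["d1", "d5", "d3", "d9"],
}
print(sorted(build_pool(runs, n=2)))  # ['d1', 'd3', 'd5'] get judged; the rest stay unjudged
```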
Slide 16: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
  - Case study: known item search
- Significance tests
- Other issues and problems

Slide 17: Precision and Recall
- Precision: the proportion of the retrieved set that is relevant
  - Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved)
- Recall: the proportion of all relevant documents in the collection that are included in the retrieved set
  - Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
- Precision and recall are well-defined for sets
- For ranked retrieval, how do we compute P/R values?
  - Compute a P/R point for each relevant document
  - Compute the value at fixed recall points (e.g., precision at 20% recall)
  - Compute the value at fixed rank cutoffs (e.g., precision at rank 20)

Slide 18: Another common representation
- Contingency table of retrieval outcomes:

                  Relevant   Non-relevant
  Retrieved          A            B
  Not retrieved      C            D

- Relevant = A + C, Retrieved = A + B, Collection size = A + B + C + D
- Precision = A / (A + B)
- Recall = A / (A + C)
- Miss = C / (A + C)
- False alarm (fallout) = B / (B + D)

Slide 19: Precision and Recall example
- Computing the precision and recall based on a ranking (worked example in the slides)

Slide 20: Average precision of a query
- Often want a single-number effectiveness measure
  - e.g., for a machine-learning algorithm to detect improvement
- Average precision is widely used in IR
- Calculated by averaging the precision values obtained each time recall increases, i.e., at each relevant document retrieved (see the sketch below)
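The following Python sketch makes slides 17-20 concrete: it walks down a ranked list, records a precision/recall point at each relevant document, and averages those precisions to obtain the (non-interpolated) average precision of the query. The function name, ranking, and relevance judgments are invented; dividing by the total number of relevant documents is the usual convention.

```python
def pr_points_and_ap(ranking, relevant):
    """Return the (recall, precision) point at each relevant document,
    plus the average precision, for one query."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    ap = sum(p for _, p in points) / len(relevant) if relevant else 0.0
    return points, ap

# Invented example: a 10-document ranking with 4 relevant documents
ranking = ["d2", "d5", "d1", "d9", "d3", "d8", "d4", "d7", "d6", "d0"]
relevant = {"d2", "d3", "d4", "d6"}
points, ap = pr_points_and_ap(ranking, relevant)
print(points)        # first point is (0.25, 1.0): one relevant doc found at rank 1
print(round(ap, 3))  # average precision of this query
```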
Slide 21: Precision and Recall example 2

Slide 22: Recall/precision graphs
- Average precision hides information
- Sometimes it is better to show the tradeoff in a table or graph

Slide 23: Averaging across queries
- It is very hard to compare P/R graphs or tables for individual queries (too much data)
- Need to average over many queries
- Two main types of averaging
  - Micro-average: each relevant document is a point in the average
  - Macro-average: each query is a point in the average (most common)
  - What does each tell someone evaluating a system? Why use one over the other? (see the sketch below)
- Also done with the average precision value
  - The average of many queries' average precision values
  - Called mean average precision (MAP), because "average average precision" sounds weird
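To make the micro/macro distinction concrete, here is a small sketch with invented per-query numbers: MAP macro-averages per-query average precision (one point per query), while a micro-average of precision pools the document counts so that every retrieved document counts equally across queries. Function names and figures are illustrative only.

```python
def macro_map(per_query_ap):
    """Mean average precision: each query contributes one point."""
    return sum(per_query_ap) / len(per_query_ap)

def micro_precision(per_query_counts):
    """Each retrieved document is a point: pool counts across queries."""
    rel_ret = sum(r for r, _ in per_query_counts)
    retrieved = sum(n for _, n in per_query_counts)
    return rel_ret / retrieved

# Invented numbers for 3 queries
per_query_ap = [0.62, 0.35, 0.90]                # average precision per query
per_query_counts = [(8, 10), (2, 10), (9, 10)]   # (relevant retrieved, retrieved) per query
print(round(macro_map(per_query_ap), 3))             # 0.623
print(round(micro_precision(per_query_counts), 3))   # 19/30 = 0.633
```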
Slide 24: Averaging graphs: a false start
- How can graphs be averaged?
  - Different queries have different meaningful recall values
  - The recall/precision graph also has an odd saw-tooth shape (锯齿状) if plotted directly
- Sample graphs (in example 2): what is the precision at 25% recall?
- Need to interpolate, but how?

Slide 25: Interpolation of graphs
- Need recall/precision pairs between the observed pairs
  - For averaging (as seen)
  - Also, to better understand the tradeoff between the measures
- The plot on this slide shows actual R/P data: we just have points, with no observed data anywhere else
- How should we estimate values between the points (varying recall, mostly)?

Slide 26: Possible interpolation approaches
- No interpolation (not very useful)
- Connect the dots
- Connect the max
- Connect the min
- Connect the average
- How to deal with 0% recall? Assume 0? Assume the best? A constant start?

Slide 27: How to choose?
- It is an empirical fact that, on average, as recall increases, precision decreases
  - Verified time and time again (on average)
- It seems reasonable to aim for an interpolation that makes the function monotonically decreasing (单调递减)
- One approach: P_interp(r) = max{ P : (R, P) ∈ S and R ≥ r }, where S is the set of observed (R, P) points
- This results in a step function (see the sketch below)
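A small Python sketch of that interpolation rule, assuming the precision at any recall level r is the maximum precision over observed points with recall at least r; this yields the monotonically decreasing step function described above. The observed points are invented.

```python
def interpolated_precision(r, observed):
    """Max precision over observed (recall, precision) points with recall >= r.

    Returns 0.0 if no observed point has recall >= r.
    """
    return max((p for rec, p in observed if rec >= r), default=0.0)

# Invented observed (recall, precision) points for one query
observed = [(0.25, 1.0), (0.5, 0.4), (0.75, 0.43), (1.0, 0.44)]
for r in (0.0, 0.25, 0.3, 0.8):
    print(r, interpolated_precision(r, observed))
# 0.0 -> 1.0 (handles 0% recall smoothly), 0.25 -> 1.0, 0.3 -> 0.44, 0.8 -> 0.44
```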
Slide 28: Our example, interpolated this way
- Monotonically drops
- The average will also fall monotonically
- Handles 0% recall smoothly

Slide 29: Our example (Contd)
- Given the data from Ranking #1, what is the precision at 0.25 recall?
- Select the maximum value (over the points with recall ≥ 0.25)

Slide 30: Averaging graphs: using interpolation
- How can graphs be averaged?
  - Different queries have different meaningful recall values
  - The recall/precision graph also has an odd saw-tooth shape if plotted directly
- Sample graphs (example 2): what is the precision at 25% recall?
- Interpolate the values, then average

Slide 31: Interpolation and averaging
- van Rijsbergen, p. 118 (1979)

Slide 32: Interpolated average precision
- Average precision at standard recall points
  - For a given query, compute a P/R point for every relevant document
  - Interpolate precision at standard recall levels
    - 11-pt is usually 100%, 90%, 80%, ..., 10%, 0% (yes, 0% recall)
    - 3-pt is usually 75%, 50%, 25%
  - Average over all queries to get the average precision at each recall level
  - Average the interpolated recall levels to get a single result
- Called "interpolated average precision"
  - Not used much anymore; "mean average precision" is more common
  - Values at specific interpolated points are still commonly used (see the sketch below)
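Building on the previous sketch, this hypothetical helper computes 11-point interpolated average precision for a single query by interpolating precision at the recall levels 0.0, 0.1, ..., 1.0 and averaging them; averaging the same levels across many queries gives the per-level curve described on slide 32. The observed points are invented.

```python
def eleven_point_iap(observed):
    """11-point interpolated average precision for one query.

    observed: list of (recall, precision) points, one per relevant document.
    """
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interp = [max((p for r, p in observed if r >= level), default=0.0)
              for level in levels]
    return sum(interp) / len(levels)

observed = [(0.25, 1.0), (0.5, 0.4), (0.75, 0.43), (1.0, 0.44)]
print(round(eleven_point_iap(observed), 3))  # about 0.593 for these invented points
```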
Slide 33: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 34: More Single-Valued Measures
- E measure (van Rijsbergen)
  - Used to emphasize precision (or recall)
  - Essentially a weighted average of precision and recall
  - A large α increases the importance of precision
  - Can transform via α = 1/(β² + 1), where β is the P/R ratio at which the user will trade precision for recall
  - When β = 1 (α = 1/2), precision and recall have equal importance
  - E is then the normalized symmetric difference of the retrieved and relevant sets
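Written out, the E measure and its α/β parameterization take the usual van Rijsbergen form (reconstructed here from the definitions above):

```latex
E \;=\; 1 - \frac{1}{\alpha\,\frac{1}{P} + (1-\alpha)\,\frac{1}{R}},
\qquad \alpha = \frac{1}{\beta^{2}+1}
\quad\Longrightarrow\quad
E \;=\; 1 - \frac{(1+\beta^{2})\,P\,R}{\beta^{2}P + R}.
```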
Slide 35: Symmetric Difference and E
- A is the retrieved set of documents; B is the relevant set of documents
- P = |A ∩ B| / |A|, R = |A ∩ B| / |B|
- A △ B (the symmetric difference) is the shaded area in the Venn diagram
- |A △ B| = |A ∪ B| - |A ∩ B| = |A| + |B| - 2|A ∩ B|
- E = 1 - 2PR / (P + R) = (P + R - 2PR) / (P + R) = |A △ B| / (|A| + |B|)

Slide 36: F measure
- F = 1 - E is often used
  - Good results mean larger values of F
- The "F1" measure is popular: F with β = 1
  - Particularly popular with classification researchers

Slide 37: F measure as an average
- The harmonic mean (调和平均) of P and R
  - The inverse of the average of their inverses (written out below)
- Heavily penalizes low values of P or R, compared to the standard (arithmetic) average
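Spelled out, the harmonic-mean form of F1 and the general Fβ that follow from F = 1 - E are:

```latex
F_{1} \;=\; \frac{2}{\frac{1}{P} + \frac{1}{R}} \;=\; \frac{2PR}{P+R},
\qquad
F_{\beta} \;=\; \frac{(1+\beta^{2})\,P\,R}{\beta^{2}P + R}.
```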
Slide 38: F measure, geometric interpretation
- A is the retrieved set of documents; B is the relevant set of documents
- P = |A ∩ B| / |A|, R = |A ∩ B| / |B|

Slide 39: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 40: Other Single-Valued Measures
- Expected search length
- Normalized precision or recall
  - Measures the area between the actual and ideal curves
- Breakeven point
  - The point at which precision = recall
  - Popular in classification tasks, though it is not clear what it means
- Many others

Slide 41: Expected Search Length
- Evaluation is based on the type of information need:
  1. only one relevant document is required
  2. some arbitrary number n
  3. all relevant documents
  4. a given proportion of the relevant documents
- Two types of ordering
  - Simple ordering: never have two or more documents at the same level of the ordering
  - Weak ordering: ties are allowed
- The search length in a simple ordering is the number of non-relevant documents a user must scan before the information need is satisfied
- Search strategy output is assumed to be a weak ordering
  - Expected search length is appropriate for a weak ordering

Slide 42: Expected Search Length
- For a weak ordering
- For a simple ordering?

Slide 43: Expected Search Length
- ESL(q) = Pnonrel + (Fnonrel × Fneeded) / (Frel + 1)
  - q is the query
  - Pnonrel is the number of documents non-relevant to q in all levels preceding the final level
  - Frel is the number of relevant documents in the final level
  - Fnonrel is the number of non-relevant documents in the final level
  - Fneeded is the number of relevant documents required from the final level to satisfy the need
- Use the mean expected search length for a set of queries (see the sketch below)
- The measure is criticized for ignoring recall
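A small Python sketch of the ESL formula above for a weak ordering, where the ranking is given as levels (ties) and the user needs a certain number of relevant documents; the function name and example levels are invented.

```python
def expected_search_length(levels, wanted):
    """Expected search length for a weak ordering.

    levels: list of (relevant, nonrelevant) counts per level, best level first.
    wanted: number of relevant documents the user needs.
    Implements ESL = Pnonrel + (Fnonrel * Fneeded) / (Frel + 1).
    """
    preceding_nonrel = 0   # Pnonrel accumulated over levels before the final one
    found = 0
    for rel, nonrel in levels:
        if found + rel >= wanted:          # this is the final level
            needed = wanted - found        # Fneeded
            return preceding_nonrel + (nonrel * needed) / (rel + 1)
        found += rel
        preceding_nonrel += nonrel
    return None  # the need cannot be satisfied by this ranking

# Invented weak ordering: level 1 has 1 relevant + 2 non-relevant, level 2 has 2 + 3
levels = [(1, 2), (2, 3)]
print(expected_search_length(levels, wanted=2))  # 2 + (3 * 1) / (2 + 1) = 3.0
```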
Slide 44: Evaluation in document retrieval: outline
- Types of evaluation
- Relevance and test collections
- Effectiveness measures
  - Recall and precision
  - E and F
  - Expected search length
- Significance tests
- Other issues and problems

Slide 45: Homework 1
- "Manual evaluation of popular SEs"
- Requirements:
  - These four SEs must be covered in the evaluation: Google, Baidu, Yahoo and MSN
  - Each student should design at least 2 queries
  - Each query contains 2-4 words, or 1 or more multi-word phrases, and should make sure that more than 100 items will be returned by each SE
  - Review the top 50 items returned by each search engine, and give each item a relevance degree from "D" to "A", where "D" = not relevant, "C" = a little relevant, "B" = almost relevant, "A" = relevant
  - Build a relevant document set as the union of all SE-returned items that are marked "B" or above
  - Compute the recall and precision values of each SE for each query, and draw the recall-precision lines as shown in slide 22
  - Compare the performance of the SEs based on any averaging approach or any single measure you are interested in, and write a comparison report
  - The original queries and ranking results must be stored in files in the format given in the attached file named "evaluate.xml"

Slide 46: Backup

Slide 47: Why significance tests?
- System A beats System B on one query
  - Is it just a lucky query for System A?
  - Maybe System B does better on some other query
  - Need as many queries as possible
    - Empirical research suggests 25 is the minimum needed
    - TREC tracks generally aim for at least 50 queries
- System A and B are identical on all but one query
  - If System A beats System B by enough on that one query, the average will make A look better than B
  - As above, it could just be a lucky break for System A
  - Need A to beat B frequently to believe it is really better
- E.g., System A is only 0.00001% better than System B
  - Even if that is true on every query, does it mean much?
- Significance tests consider those issues

Slide 48: Significance Tests
- Are observed differences statistically different?
  - Generally we can't make assumptions about the underlying distribution
  - Most significance tests do make such assumptions
  - Single-valued measures are easier to use, but R/P is possible
- The sign test and the Wilcoxon signed-ranks test are typical
  - They do not require that the data be normally distributed
  - The sign test answers "how often"; Wilcoxon answers "how much"
  - The sign test is the crudest but most convincing
- Are observed differences detectable by users?

Slide 49: Sign Test Example
- For techniques A and B, compare the average precision for each pair of results generated by the queries in the test collection
- If the difference is large enough, count it as + or -; otherwise ignore it
- Use the number of +'s and the number of significant differences to determine the significance level
- For example, for 40 queries: technique A produced a better result than B 12 times, B was better than A 3 times, and 25 were "the same"
  - p ≈ 0.035, so A is significantly better than B at the 5% level
- Another example: A beats B 18 times and B beats A 9 times
  - p ≈ 0.122, and A is not significantly better than B at the 5% level
- Here n+ is the number of times A performs better than B and n- is the number of times B performs better than A; the p value is looked up from the test table (see the attached file "x2检验.mht" (chi-square test) for more information, and the sketch below)
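The p-values above are consistent with a two-sided exact sign test, which treats each informative query as a fair coin flip under the null hypothesis that A and B are equivalent. This Python sketch computes that p-value from n+ and n-.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-sided exact sign test: probability of a split at least this extreme
    if A and B were really equivalent (each informative query a fair coin)."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_p(12, 3), 3))  # 0.035 -> significant at the 5% level
print(round(sign_test_p(18, 9), 3))  # 0.122 -> not significant at the 5% level
```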
59、neTypes of evaluationRelevance and test collectionsEffectiveness measuresRecall and precisionE and FExpected search lengthSignificance testsOther issues and problemsLecture 2 Evaluation51Feedback EvaluationRelevance feedback covered laterTwo-pass approachesCreate better query out of results from ori
60、ginal queryHow to treat documents that have been seen before?Rank freezingRanks of relevant documents fixed for subsequent iterationsCompare ranking with original rankingPerformance cant get worseResidual collectionAll previously seen documents (e.g. top n) removed from collectionCompare reranking w