Research on Open-Domain Automatic Question Answering Systems (Master's Thesis)

MASTER THESIS OF CHONGQING UNIVERSITY

ABSTRACT

The development of information technology and the Internet has dramatically increased the quantity of information available in digital form, and people increasingly rely on it. Nowadays, one can use search engines such as Google (http://www.google.com) and Baidu (http://www.baidu.com) to find useful information on the Web easily and rapidly. These search engines, also called information retrieval (IR) systems, take one or more keywords as input, search a large text corpus (such as the World Wide Web), and return a list of snippets and links to relevant documents as output. Normally, most users are satisfied with these keyword-based search engines. However, they have some shortcomings. First, the quality of the returned documents relies heavily on the input keywords. This is a big challenge for inexperienced users, because it is difficult for them to describe their information needs in one or two keywords. Second, search engines return a list of relevant documents rather than exact answers, so one often still has to read a large amount of text to find the answer.

To overcome these shortcomings, more and more research organizations and companies have recently been working on a new generation of information retrieval systems. One of the most important directions is the question answering (QA) system. A QA system takes a natural language question (e.g., "Which is the longest river in the world?") as input instead of keywords. Like an IR system, a QA system searches for the answer in a large text collection (such as the World Wide Web or a local collection). It then returns one exact answer to the input question instead of a list of relevant documents. QA research attempts to deal with a wide range of question types, including factoid, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval, such as document retrieval.

This thesis focuses on open-domain QA. Specifically, it focuses on answering factoid and definition open-domain questions. Answers to factoid questions are typically named entities, e.g., a number, a person name, or an organization name. Questions like "Who is Colin Powell?" or "What is mold?" are definitional questions [1, 2].

For the factoid QA subtask, this thesis proposes a novel answer reranking approach, in which candidate answers are reranked by exploiting pseudo-relevance feedback on the Web. Pseudo-relevance feedback (also known as local feedback or blind feedback) is a technique commonly used to improve information retrieval performance. To select the best answer, a pseudo-document is first constructed from the Web for each pair of question and candidate answer. The candidate answers are then reranked according to the TF-IDF-based similarity between the question and the pseudo-document of each candidate answer. A series of experiments on the standard Text REtrieval Conference (TREC) 2003 collection was conducted to evaluate the proposed approach. The results show that the approach is promising and improves on the baseline system by 14% in accuracy.

For the definition QA subtask, this thesis also proposes a new reranking method. Statistical ranking methods based on a centroid vector (profile) extracted from external knowledge have become widely adopted in the top definitional QA systems at TREC 2003 and 2004. In these approaches, the terms in the centroid vector are treated as a bag of words under an independence assumption. To relax this assumption, this thesis proposes a novel language-model-based answer reranking method that improves the existing bag-of-words approach by considering the dependence among the words in the centroid vector. Experiments were conducted to evaluate different dependence models. The results on the TREC 2003 test set show that the reranking approach with the biterm language model significantly outperforms the ones with the bag-of-words model and the unigram language model, by 14.9% and 12.5% respectively in F-Measure(5).

Keywords: factoid QA, definition QA, question reformulation, pseudo-relevance feedback, language model

DECLARATION OF ORIGINALITY

I declare that the thesis submitted is the result of research work I carried out under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis does not contain research results that have been published or written by others, nor material used to obtain a degree or certificate from Chongqing University or any other educational institution. Any contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

(Signed and dated by the thesis author.)

AUTHORIZATION FOR USE OF THESIS COPYRIGHT

The author of this thesis fully understands the regulations of Chongqing University on the retention and use of theses. The university has the right to retain the thesis and to submit copies and disks of it to the relevant state departments or institutions, and it allows the thesis to be consulted and borrowed. I authorize Chongqing University to include all or part of the thesis in relevant databases for retrieval, and to preserve and compile the thesis by photocopying, reduced-size printing, scanning, or other means of reproduction.

This thesis is: confidential ( ), with this authorization applying after declassification; not confidential (√). (Mark only one of the brackets above.)

(Signed and dated.)

1 INTRODUCTION

1.1 Motivation and Background

Part of this thesis was finished while the author was visiting Microsoft Research Asia (MSRA) from March 2005 to March 2006, as a component of the AskBill Chatbot project led by Dr. Ming Zhou. The work on definition QA proposed in this thesis has been published in the Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (COLING-ACL 2006).

1.2 Overview of Open-Domain Question Answering

The quantity of information available online increases day by day, and users' information needs increase with it. Search engines such as Google and Baidu, also called information retrieval systems, can help a user search for an answer. Everyone who uses the Internet is likely to be familiar with these tools: the user inputs one or more query terms, and a list of snippets and links to relevant documents is returned. These systems often satisfy users, but they have some shortcomings. First, the quality of the returned documents relies on the input keywords. This is a big challenge for new users, because it is difficult for them to describe their information needs in one or more keywords. Second, search engines return a list of relevant documents, not the exact answer, so one often still has to read a large amount of text to find the answer.

QA systems offer an alternative. Such systems take a natural language question (e.g., "What is the population of China?") as input instead of keywords. They analyze the input question and look for an answer in a large text collection. The exact answer is then returned instead of a list of documents. The questions discussed here are usually factoid questions, whose answers are typically named entities, e.g., a number, a person's name, an organization name, or the like. For instance, "Who is the president of the USA?" asks for a person's name, and "How many rivers are there in China?" asks for a number. QA systems overcome the shortcomings of traditional IR systems, and they could be the next generation of search engines. One of the most famous online QA systems is START (http://start.csail.mit.edu/), which has been online and continuously operating since December 1993. It has been developed by Boris Katz and his associates of the InfoLab Group at the MIT Computer Science and Artificial Intelligence Laboratory.

Early question answering systems were built to answer questions in a specified closed domain. For example, one of the first and best-known systems was BASEBALL [4], which answers questions about the dates, locations, teams, and scores of baseball games. More details about this system are given in the following sections. Nowadays interest is particularly focused on open-domain question answering, in which questions are no longer restricted to a specified domain. This thesis focuses on answering open-domain questions.

Besides factoid questions, current research also addresses more complex types of questions, such as definitional questions (e.g., "Who is Bill Gates?") and list questions (e.g., "List 20 cities whose populations are more than 10 million"). Those questions are harder to answer than simple factoid questions, because information has to be extracted from several documents and combined to form one answer. In general, the difficulty of the task depends not only on the type of question but also, and mostly, on the text collection. Human language is complex and flexible: the same information request can be expressed in different sentence forms, i.e., with different words and structures. If a question sentence and an answer sentence share similar vocabulary and structure, it is easy to extract the correct answer. If the answer is formulated in other words and its structure does not match that of the question, it is much harder to find the answer.

1.2.1 General Background

As mentioned above, a question answering system takes a natural language question as input and returns the exact answer to the user. The task of question answering consists of several steps. First, the question is analyzed: what is the focus of the question, and what should the answer type be? Then the system looks through the text collection to search for an answer. More than one candidate answer might be found, so a ranking procedure is usually performed to get the best answer. Finally, the top-ranked answer is returned to the user as the exact answer. For a definitional or list question, redundancy detection and answer fusion should be done before the answer is returned.

The first question answering systems were built on databases, meaning that the data in which the answer had to be found were structured. To find an answer in structured data, the natural language question must be transformed into a query in some query language. Such a system is more like an interface to a structured database. In addition, these database-based systems are usually restricted to closed domains.

One of the first and best-known question answering systems was BASEBALL [4]. It answers questions about the dates, locations, teams, and scores of baseball games. Only simple questions can be asked, without connectives such as "and" or "or", and without superlatives such as "most" and "highest". The data are stored in a database in the form of lists of attributes with their values, and questions are automatically put into the same format. First, they are partially parsed to distinguish the different phrases. Then, using a dictionary, certain phrases are mapped to attribute-value pairs. For example, "Yankees" is mapped to team = Yankees. Once the question structure is established, the system searches the database for an answer by comparing the structure of the question with the structure of each candidate answer. The one that best matches the question is selected as the final answer.

LUNAR is another well-known system based on structured data. It was developed at Bolt Beranek and Newman (BBN) to make it easier for lunar geologists to obtain, compare, and evaluate chemical analysis data of lunar rock and soil. During a demonstration at a Lunar Science Convention in 1971, it was able to answer 90% of the questions posed by lunar geologists who had received no prior instruction.

Systems such as BASEBALL and LUNAR work only for very limited domains. Moreover, transferring these systems to other domains appears very cumbersome.

Text-based question answering systems do not assume that the data are structured. Both the question and the data must be analyzed to find an answer. One of the first text-based systems was the ORACLE system [6]. This system carried out a syntactic analysis of the question and of the text, identifying the subject, object, verb, and possible place and time determinations. The analysis works only on simple short sentences; as soon as a sentence becomes more complex, for example if it contains two objects, it goes wrong. This was the main restriction of ORACLE. The system first assigns each word a part of speech (POS). POS tagging, also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context. The analysis of the text corpus can happen offline, but the analysis of the question must be carried out at query time. Next, the question is transcribed into declarative form. For example, the question "What is the population of China?" is transformed to "The population of China is what". In the answer-searching stage, the system compares the transposed question with a possible answer sentence, looking for identical words in the same word order. For the question above, the answer "1.3 billion" would be found if the corpus contains the sentence "The population of China is 1.3 billion".

This approach has high precision, but it also has a shortcoming. People can express the same information with different words and structures, and with this method even a small difference between the question and the answer sentence makes the search fail. The sentence "China's population is 1.3 billion" will never be selected as the answer sentence.

With the development of the Internet, more and more information can be found on the Web, which can be viewed as a very large text corpus containing a huge amount of information. Researchers have therefore become interested again in this type of system, because it offers a simple and effective approach to answering questions with a very large text corpus (e.g., the World Wide Web). The bigger the corpus, the bigger the chance that it contains an answer in the same vocabulary as the question. In that case, further syntactic or semantic analysis is not necessary.

Protosynthex is another text-based system. It tries to answer questions by means of an encyclopedia. An index is made for each content word in the text, and words with the same stem are combined; thus "govern", "governor", "government", etc. are all reduced to "govern". Each index word gets a list of volume, article, paragraph, and sentence numbers, which indicate where in the encyclopedia the word occurs. Given a question, the system first looks up all of the question's content words in the index to select relevant text snippets. It then uses a dependency structure parser to parse the question and the selected text snippets. Only the candidate answers whose dependency graph is similar in structure to that of the question continue to the next step. The largest difference between Protosynthex and other systems of that early period is the use of dependency graphs. This is similar to the syntactic analysis that is frequently used in today's question answering systems.

1.2.2 TREC and CLEF Question Answering Tracks

The Text REtrieval Conference (TREC)

TREC is an ongoing series of workshops focusing on a list of different information retrieval research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Advanced Research and Development Activity (ARDA) center of the US Department of Defense, and began in 1992 as part of the TIPSTER Text Program. The purpose of TREC is to support and encourage research within the information retrieval community. Each track has a challenge in which NIST provides participating groups with data sets and test problems. Depending on the track, test problems might be questions, topics, or target extractable features. Uniform scoring is performed so that the systems can be fairly evaluated. After evaluation of the results, a workshop provides a place for participants to collect thoughts and ideas and to present current and future research work.

The QA track was started in 1999 (TREC-8). The goal of the track is to foster research on systems that retrieve answers rather than documents in response to a question, with an emphasis on systems that can function in unrestricted (open) domains. Systems are given a large corpus of newspaper and newswire articles and a set of closed-class questions such as "Who invented the paper clip?" They are required to return a short (50-byte) text snippet and a document as a response to a question, where the snippet contains an answer to the question and the document supports that answer. TREC only supports evaluations for English. Up to now, seven QA tracks (the latest being TREC-2005) have been held, and more than 100 systems have been tested in the evaluations. Table 1.1 gives the QA track details about the document collection, test question set, and participants.

Table 1.1 Data statistics for the TREC QA tracks

                        TREC-8   TREC-9   TREC-10  TREC-11  TREC-12  TREC-13  TREC-14
Year                    1999     2000     2001     2002     2003     2004     2005
# of documents          528000   979000   979000   1033000  1033000  1033000  1033000
# of megabytes of text  1904     3033     3033     3072     3072     3072     3072
# of test questions     198      682      500      500      500      349      530
# of participants       20       28       36       34       25       28       30

The task in the first two QA tracks (TREC-8 and TREC-9) was practically the same. For each question in the question set, systems retrieved a ranked list of up to five text snippets that contained an answer to the question, plus a document that supported the answer. The questions were restricted to factoid questions such as "Who was the 16th president of the United States?" and "Where is Rider College located?" Each question was guaranteed to have at least one document in the collection that explicitly answered it. The maximum length of the text snippets was either 50 or 250 bytes, depending on the run type. Human assessors read each returned snippet and decided whether the string actually contained an answer to the question in the context provided by the document. Given a set of judgments for the strings, the score computed for a submission was the mean reciprocal rank (MRR). An individual question received a score equal to the reciprocal of the rank at which the first correct response was returned, or zero if none of the five responses contained the correct answer. The final score for a participant was the mean of the individual questions' reciprocal ranks:

\[
\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i} \tag{1.1}
\]

where N is the number of test questions and rank_i is the rank of the first correct answer to question i (the term is zero if no correct answer was returned).

Naturally, allowing 250 bytes in a response was an easier task than limiting responses to 50 bytes: for every participant that submitted runs of both lengths, the 250-byte run had a higher MRR. For 50-byte runs, the best-performing systems were able to answer about 70% of the test questions in TREC-8 and about 65% of the questions in TREC-9. Although the 65% score was slightly worse than the TREC-8 scores in absolute terms, it reflects a significant improvement in QA technology. The TREC-9 task was considerably harder than the TREC-8 task because TREC-9 used real questions (some of them from an MS Encarta log), while TREC-8 used questions constructed specifically for the track. This is the main difference between TREC-8 and TREC-9. Another difference is that both the document set and the test set of questions of TREC-9 were larger than those of TREC-8 (see Table 1.1).

As mentioned above, participants returned a ranked list of five [document-id, answer-string] pairs per question such that each answer string was believed to contain an answer to the question. A [document-id, answer-string] pair was judged correct if, in the opinion of the NIST assessor, the answer string contained an answer to the question, the answer string was responsive to the question, and the document supported
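
The mean reciprocal rank used to score the TREC-8 and TREC-9 runs can be illustrated with a short Python sketch. It assumes that the assessor judgments for each question are available as a list of booleans in rank order (at most five per question in those tracks). The function names and the input format are illustrative assumptions, not part of the thesis or of any TREC tooling.

```python
from typing import Sequence


def reciprocal_rank(judgments: Sequence[bool]) -> float:
    """Score one question: 1/rank of the first correct response, else 0.0.

    `judgments` is a hypothetical input format: judgments[k] is True when
    the response returned at rank k + 1 was judged correct.
    """
    for rank, is_correct in enumerate(judgments, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0  # none of the returned responses was correct


def mean_reciprocal_rank(runs: Sequence[Sequence[bool]]) -> float:
    """Average the per-question reciprocal ranks over all test questions."""
    if not runs:
        raise ValueError("at least one question is required")
    return sum(reciprocal_rank(j) for j in runs) / len(runs)


if __name__ == "__main__":
    # Three questions: correct at rank 1, correct at rank 2, never correct.
    example = [
        [True, False, False, False, False],
        [False, True, False, False, False],
        [False, False, False, False, False],
    ]
    print(mean_reciprocal_rank(example))  # (1 + 1/2 + 0) / 3
```

A question with no correct response still counts toward N, so unanswered questions pull the score down instead of being skipped.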
