Cross-Language Information Retrieval Techniques
Cross Language Information Retrieval

Road Map
- Cross-lingual IR
- Motivation
- Definition
- General issues with CLIR
- Basic approaches to CLIR
- CLIR evaluation
- CLIR applications

2021-7-13

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
[Figure: documents and query]

Growth in Internet language use by world region, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture
World Internet Users and 2015 Population Stats

World Regions       Population       Internet Users   Penetration (% population)   Users % of Table   Growth 2000-2015
Africa              1,158,355,663    313,257,074      27.0%                        9.6%               6,839%
Asia                4,032,466,882    1,563,208,143    38.8%                        47.8%              1,268%
Europe              821,555,904      604,122,380      73.5%                        18.5%              475%
Middle East         236,137,235      115,823,882      49.0%                        3.5%               3,426%
North America       357,172,209      313,862,863      87.9%                        9.6%               191%
Latin America       617,776,105      333,115,908      53.9%                        10.2%              1,743%
Oceania/Australia   37,157,120       27,100,334       72.9%                        0.8%               256%
World Total         7,260,621,118    3,270,490,584    45%                          100%               806%

Usage of content languages for websites

2002               2015
English     72%    English     54.5%
German      7%     Russian     5.9%
Japanese    6%     German      5.7%
Spanish     3%     Japanese    5.0%
French      3%     Spanish     4.7%
Italian     2%     French      4.1%
Dutch       2%     Portuguese  2.6%
Chinese     2%     Chinese     2.2%
Korean      1%     Italian     2.1%
Russian     1%     Polish      1.9%
Portuguese  1%     Turkish     1.6%

Source: http:/ /research/activities/wcp/stats/intnl.html

Cross Language IR
- Motivation
  - Information unavailability in some languages
  - Language barrier

- Definition:
  - Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
- Example:
  - A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- The goal is to find, retrieve, and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages: stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  - (1) Cognate matching
- Translation
  - (2) Query translation
  - (3) Document translation
  - (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.
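
The slides do not say which similarity measure is used between a transliteration and its original word. As a minimal sketch, assuming a normalized edit (Levenshtein) distance and a hypothetical threshold, cognate matching could look like this; the function names and the toy vocabulary are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def match_cognates(term, vocabulary, threshold=0.8):
    """Return target-language words similar enough to the (transliterated) term.

    The threshold is an assumption and would need tuning per language pair.
    """
    scored = sorted(((similarity(term, w), w) for w in vocabulary), reverse=True)
    return [w for score, w in scored if score >= threshold]


# Hypothetical transliterated query term matched against a toy index vocabulary.
matches = match_cognates("komputer", ["computer", "commuter", "table"])
```

With this toy vocabulary only "computer" passes the 0.8 threshold; lowering the threshold trades precision for recall.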

Query translation
[Figure: Chinese query, translation system, French query, search engine, French document collection, French documents, results, select, browse]
Process:
- Translate the Chinese query into French
- Search the French document collection
- Translate the retrieved results into Chinese

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted index of terms in any way to handle queries in any language.
- Translating a query is less computationally costly than translating a large set of documents.
- Challenge: term ambiguity
  - Queries are often short, and short queries provide little context for disambiguation.
  - Term disambiguation will be discussed later.

Pros and cons of query translation
- Pros
  - Simple
  - Easy to carry out
  - Flexible
  - Saves time and space; efficient
- Cons
  - Lacks context
  - For short queries, translation ambiguity is high

Document translation
[Figure: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select, browse]
Process:
- Translate all of the French documents into Chinese
- Search the Chinese documents directly

Document translation
- Document translation has the opposite advantages and disadvantages to query translation.
- In CLIR experiments this approach is not usually used; query translation is dominant.
- However, some researchers have used it to translate large sets of documents, since the more varied context available within each document can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a German-to-English CLIR experiment.

Pros and cons of document translation
- Pros
  - Translation is done only once
  - Documents provide richer context
  - Documents can be translated offline in advance
- Cons
  - Translation is slow
  - Consumes a lot of space and time; inefficient
  - Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language-specific resources available
- Query translation is generally more widely used
- Both approaches pose challenges for interaction

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.

- One type of interlingual approach uses the synsets provided by WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.
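
The linking step can be illustrated with a toy concept table. This is a sketch only: the synset identifiers and word lists below are invented for illustration and are not real WordNet data, and the function names are hypothetical.

```python
# Hypothetical language-independent concept table: synset id -> words per language.
SYNSETS = {
    "c001": {"en": {"dog", "hound"}, "fr": {"chien"}},
    "c002": {"en": {"bank"}, "fr": {"banque"}},
    "c003": {"en": {"bank", "shore"}, "fr": {"rive", "berge"}},
}


def to_synsets(term, lang):
    """Map a word to every synset id whose word list for `lang` contains it."""
    return {sid for sid, words in SYNSETS.items() if term in words.get(lang, set())}


def translate_via_synsets(term, src_lang, tgt_lang):
    """Link a source term to target-language words through shared synset ids."""
    targets = set()
    for sid in to_synsets(term, src_lang):
        targets |= SYNSETS[sid].get(tgt_lang, set())
    return targets


french_for_bank = translate_via_synsets("bank", "en", "fr")
```

An ambiguous term such as "bank" maps to more than one synset, so the interlingual representation still leaves the disambiguation problem to be solved separately.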

Translation techniques
- Dictionary-based methods
- Parallel corpora-based methods
- Use of WWW resources

Dictionary-based methods
- These methods use a bilingual machine-readable dictionary (MRD).
- Most retrieval systems are still based on so-called bag-of-words architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
- Thus a query can be translated easily by replacing each query term with its translation equivalents appearing in a bilingual dictionary or a bilingual term list.
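
The term-by-term replacement described above can be sketched as follows. The tiny English-French dictionary is a hypothetical stand-in for an MRD, and leaving unknown terms unchanged is an assumption borrowed from the cognate matching idea rather than something the slides prescribe.

```python
# Hypothetical bilingual term list standing in for a machine-readable dictionary.
EN_FR = {
    "oil": ["huile", "pétrole"],
    "survey": ["enquête", "sondage"],
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary (e.g. proper nouns) are kept as-is.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


target_query = translate_query("Oil survey Paris", EN_FR)
```

Because every listed translation is kept, the target query grows with each ambiguous term, which is why the disambiguation techniques discussed later matter.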

Bilingual dictionaries
- Manually built bilingual dictionaries
  - Printed
    - Merriam-Webster's dictionaries
    - Longman dictionaries
  - Electronic
    - Freedict at http:/
    - Travlang at http:/
  - Problems
    - Must be processed to be readable by machine
    - Limited vocabulary
    - Dictionary translations are inherently ambiguous and add extraneous information
- Automatically built dictionaries
  - Known as machine-readable dictionaries (MRD)

Term translation
[Figure: candidate translations such as "oil / petroleum", "probe / survey / take samples", and "restrain / cymbidium goeringii", annotated with three problems: which translation to choose, no translation available, and segmentation errors]

Some issues in term translation
- Compound words, for example in German
  - Decomposition
- No boundaries between words, e.g., in Chinese
  - Segmentation
- Specialized vocabulary not contained in the dictionary, e.g., named entities

Examples
- Compound decomposition
- Chinese word segmentation
  - 新西兰花
    - 新西兰 花: New Zealand flowers
    - 新 西兰花: fresh broccoli
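
The segmentation ambiguity in the example above can be reproduced with a small exhaustive segmenter. This is a minimal sketch assuming a four-word lexicon chosen for illustration; practical segmenters rely on statistical models rather than plain enumeration.

```python
# Hypothetical lexicon containing only the words needed for the slide's example.
LEXICON = {"新", "新西兰", "西兰花", "花"}


def segmentations(text, lexicon):
    """Enumerate every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


candidates = segmentations("新西兰花", LEXICON)
```

Both readings from the slide are returned, so a query translation system has to pick one segmentation before any dictionary lookup can happen.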

Corpora-based method
- Parallel corpora or comparable corpora are useful resources from which information beneficial for CLIR can be extracted.
- For example, to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with the English documents that had been retrieved using an English (source) query.

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that the documents in each subset are translations of the documents in every other subset.
- Very high accuracy

[Figure: hieroglyphic script, ancient Egyptian script, Greek script]

The Rosetta Stone
- The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC and inscribed with a decree of the Egyptian king Ptolemy V. The same content is carved on it in Greek script, ancient Egyptian script, and the Demotic script of the time. Because the stone carries three versions of the same text, modern archaeologists were able to compare the versions and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. This made it an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News:
  - DE-News (German-English)
  - Hong Kong News, Xinhua News (Chinese-English)
- Government documents:
  - Canadian Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, ...)
- Bible (many, many languages)

Examples (English / German)
- Diverging opinions about planned tax reform
  Unterschiedliche Meinungen zur geplanten Steuerreform
- The discussion around the envisaged major tax reform continues.
  Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
- The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
  Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages that come from the same domain.
  - They discuss the same topics
- Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
- Some researchers extract phrase pairs from comparable corpora using a classifier approach.
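
The slides do not specify how translation knowledge is extracted from aligned text. As one common illustration, and not necessarily the method used by the cited authors, the sketch below scores word pairs from a sentence-aligned parallel corpus with the Dice coefficient. The three toy sentence pairs and the function names are assumptions.

```python
from collections import Counter
from itertools import product


def dice_table(aligned_pairs):
    """Score (source word, target word) pairs by the Dice coefficient.

    dice(s, t) = 2 * pairs containing both / (pairs with s + pairs with t)
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {(s, t): 2 * n / (src_count[s] + tgt_count[t])
            for (s, t), n in joint.items()}


def best_translation(term, table):
    """Pick the target word with the highest Dice score for `term`."""
    scores = {t: score for (s, t), score in table.items() if s == term}
    return max(scores, key=scores.get) if scores else None


# Hypothetical sentence-aligned toy corpus (English / French).
PAIRS = [
    ("the black cat", "le chat noir"),
    ("the black dog", "le chien noir"),
    ("the white cat", "le chat blanc"),
]
TABLE = dice_table(PAIRS)
```

On real corpora a frequency cut-off is usually needed, since rare words produce unreliable scores from a handful of sentence pairs.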

[Figure: example]

Use of WWW resources
- The WWW can provide rich and ubiquitous machine-readable resources, from which information useful for CLIR may be extracted automatically.
- For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques
- Disambiguation among multiple alternative term translations: how should one choose among several translations? E.g., "Apple", "Bank"
- Use of part-of-speech (POS) tags
- Use of a parallel corpus
- Use of co-occurrence statistics in the target corpus
- Use of the query expansion technique

Use of part-of-speech tags
- The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
- This method requires that POS tagging software be available for both languages.

Parallel corpus-based disambiguation
- Davis (1997, 1998) used a parallel corpus to determine the best translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the results of searching a parallel corpus.

Translation probability
[Figure: the term "survey" with multiple candidate translations (探测, 试探, 样品, 测量) and their translation probabilities (p = 0.4, 0.3, 0.25, 0.05)]

Disambiguation based on co-occurrence statistics
- The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
- First, the two most related terms in the query are determined from co-occurrence statistics in the source-language corpus; then the best translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
- Note that these two corpora do not have to be parallel or comparable.
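
The second step, choosing among all pairs of candidate translations, can be sketched as below. Raw document co-occurrence counts are used here as a simplifying assumption; the slides do not name the statistic, and measures such as mutual information are a common alternative. The candidate lists and target-language documents are invented toy data.

```python
from itertools import product


def cooccurrence(a, b, documents):
    """Count the target-language documents that contain both words."""
    return sum(1 for doc in documents if a in doc and b in doc)


def best_translation_pair(candidates_a, candidates_b, documents):
    """Choose the pair of translations that co-occurs most often."""
    return max(product(candidates_a, candidates_b),
               key=lambda pair: cooccurrence(pair[0], pair[1], documents))


# Hypothetical candidate translations for the query terms "bank" and "money".
BANK = ["banque", "rive"]
MONEY = ["argent", "monnaie"]

# Hypothetical target-language corpus, each document reduced to a set of words.
TARGET_DOCS = [
    {"banque", "argent", "compte"},
    {"banque", "argent", "pret"},
    {"rive", "fleuve"},
    {"monnaie", "piece"},
]

chosen = best_translation_pair(BANK, MONEY, TARGET_DOCS)
```

Only a monolingual target-language corpus is needed for this step, which matches the note that the corpora need not be parallel or comparable.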

Query expansion for disambiguation
- Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing the performance of information retrieval. PRF also works effectively for CLIR tasks.
- In the case of CLIR, two kinds of PRF are feasible:
  - Pre-translation feedback
  - Post-translation feedback

Pre-translation feedback
- If a source-language corpus is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
- Pre-translation feedback may help improve precision, because PRF is basically performed using the entire query rather than each source term separately. That is, synonyms or related terms corresponding to the correct meaning of each source term within the context of the query are expected to be added automatically through the PRF process.

Post-translation feedback
- After translation, standard PRF can be applied using the target document collection (post-translation feedback).
- Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
- In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select good expansion terms: the Rocchio method and the probabilistic method.
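
As a rough sketch of the Rocchio idea applied to pseudo relevance feedback, the example below adds the centroid of the top-ranked documents to the query vector and keeps the highest-weighted new terms. It assumes raw term frequencies, positive feedback only, and arbitrary values for `alpha`, `beta`, and `n_new`; none of these settings come from the slides.

```python
from collections import Counter


def rocchio_expand(query_terms, top_docs, alpha=1.0, beta=0.75, n_new=2):
    """Positive-feedback Rocchio: q' = alpha * q + beta * centroid(top_docs).

    `top_docs` are the top-ranked documents from an initial retrieval run,
    each given as a list of tokens and assumed (blindly) to be relevant.
    """
    query = Counter(query_terms)
    if not top_docs:
        return {t: alpha * w for t, w in query.items()}
    centroid = Counter()
    for doc in top_docs:
        for term, tf in Counter(doc).items():
            centroid[term] += tf / len(top_docs)
    weights = {t: alpha * query.get(t, 0) + beta * centroid.get(t, 0)
               for t in set(query) | set(centroid)}
    # Keep the original terms plus the n_new best-weighted new terms.
    new_terms = sorted((t for t in weights if t not in query),
                       key=lambda t: (-weights[t], t))[:n_new]
    return {t: weights[t] for t in list(query) + new_terms}


# Hypothetical top-ranked documents for the query "tax reform".
TOP_DOCS = [
    ["tax", "reform", "budget", "budget"],
    ["tax", "budget", "parliament"],
    ["reform", "parliament", "vote"],
]
expanded = rocchio_expand(["tax", "reform"], TOP_DOCS)
```

The same function can serve either variant: run it on a source-language collection before translation, or on the target collection after translation.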

Bi-directional translation
- Boughanem et al. (2002) explored a bi-directional translation technique in which a form of backward translation is used for ranking translation candidates.
- Suppose that English query terms need to be translated into French. In bi-directional translation, a set of French equivalents for an English term is first found in an English-French dictionary. Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms. Basically, if that set includes the original source term, the French equivalent is chosen as a preferred translation.
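
The back-translation check can be sketched with two toy dictionaries. The entries below are hypothetical, and the fallback of returning all candidates when none survives the check is an assumption added for illustration rather than part of the described technique.

```python
# Hypothetical English-French and French-English dictionaries.
EN_FR = {"bank": ["banque", "rive", "banc"]}
FR_EN = {
    "banque": ["bank"],
    "rive": ["shore", "bank"],
    "banc": ["bench"],
}


def preferred_translations(term, forward, backward):
    """Keep the candidates whose back-translations include the original term."""
    candidates = forward.get(term, [])
    preferred = [c for c in candidates if term in backward.get(c, [])]
    # Assumption: fall back to every candidate if none passes the check.
    return preferred or candidates


choices = preferred_translations("bank", EN_FR, FR_EN)
```

Here "banc" is filtered out because its back-translation ("bench") does not contain the source term.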

CLIR evaluation
- Information retrieval evaluation
  - Given: a search topic, a document collection, and a set of manually judged relevant documents
  - The search results returned by the system are judged against them
- TREC CLIR (96-02): English to other languages
- CLEF (00-): between European languages
- NTCIR (99-): Asian languages and English

CLIR evaluation model
[Figure: CLIR evaluation model]

Applications of CLIR

2.1 Cross-language search engines
- April 25, 2006: European search engine "Quaero"
  - The French President announced 90 million euros of support.
- May 16, 2007: Google Translate
  - Provides CLIR for 12 languages
  - Goal: take all the Web

Challenges for interaction
- CLIR poses some unique challenges for interaction:
  - How do you help users select translated query terms?
  - How do you help users select document terms for query refinement?
  - How do you compensate for poor translation quality?

Multilingual information access: Cross-Language Information Access (CLIA)
[Figure: CLIR system, result processing, result presentation, query formulation, question analysis, request gener]
