




Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries: OCLC (Online Computer Library Center) serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web: around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Language Speakers in the Real World
( /faq.htm)
[Chart; labels: Chinese, English, Hindi, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

(Statistics from Euro-Marketing Associates, 1998)
[Chart; labels: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean]

/globstats/
(Statistics from Euro-Marketing Associates, 1999)
- Share of Chinese speakers: 6.1%; share of French speakers: 8.8% (1998)

Language Users on the Internet

Internet Content
(Network Wizards Jan 99 Internet Domain Survey)
[Chart; labels: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish; values as extracted: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546]
- 40% of Internet users do not understand English, but 80% of Internet content is in English.

(Source: )

What is Cross-Language Information Retrieval?
- Definition: select information in one language based on queries in another.
- Terminology
  - Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  - Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- and Cross-Lingual Information Access

MLIR Applications
- Multilingual information access in a multilingual country, organization, enterprise, etc.
- Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
- Monolingual users may retrieve images by taking advantage of multilingual captions.
- Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
- More information workers with less time require fast access to global resources:
  - global B2B interactions (virtual enterprises)
  - global B2C interactions (online trading, travelling)
  - time-critical information (translation comes too late)

History
- 1970: Salton runs retrieval experiments with a small English/German dictionary.
- 1972: Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation.
- 1978: ISO Standard 5964 for developing multilingual thesauri (revised in 1985).
- 1990: Latent Semantic Indexing (LSI) applied to CLIR.

History (Continued)
- 1994: First PhD thesis on CLIR, by Khaled Radwan.
- 1996: Similarity thesaurus applied to CLIR (ETH Zurich).
- 1996: Dictionary-based retrieval applied to CLIR (UMass and XEROX Grenoble).
- 1997: Generalized Vector Space Model (GVSM) applied to CLIR (CMU).

History (Continued)
- 1997: CLIR (Cross-Language Information Retrieval) track starts within TREC.
- 1998: NTCIR starts in Japan.
- 1999: TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
- 2000: CLEF starts in Europe.

An Architecture of Multilingual Information Access

Major Problems of CLIR
- Queries and documents are in different languages.
  => translation
- Words in a query may be ambiguous.
  => disambiguation
- Queries are usually short.
  => expansion

Major Problems of CLIR (Continued)
- Queries may have to be segmented.
  => segmentation
- A document may be written in several languages.
  => language identification

Enhancing Traditional Information Retrieval Systems
- Which part(s) should be modified for CLIR?
[Diagram; boxes: Documents, Queries, Document Representation, Query Representation, Comparison; candidate modification points labeled (1) to (4)]

Enhancing Traditional Information Retrieval Systems (Continued)
- (1): text translation
- (2): vector translation
- (3): query translation
- (4): term vector translation
- (1) and (2), (3) and (4): interlingual form

What are the Problems?
- Ambiguous terms (e.g., performance)
- Multiword phrases may correspond to single-word phrases (e.g., South Africa = 南非, Südafrika)
- Coverage of the vocabulary
- There is no one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval

Query Translation Based CLIR
English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC.
- Or, 1 month on 3,600 PCs.

Knowledge-Based
Examples
- Subject Thesaurus: hierarchical and associative relations; a unique term is assigned to each node.
- Concept List: term space partitioned into concept spaces.
- Term List: list of cross-language synonyms.
- Lexicon: machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 1996
- Take the concatenation of all term translations.
  - E: politically motivated civil disturbances
  - F: troubles civils à caractère politique
  - trouble: turmoil, discord, trouble, unrest, disturbance, disorder
  - civil: civil, civilian, courteous
  - caractère: character, nature
  - politique: political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original-query performance.
- Errors: multi-word expressions and ambiguity.

Selection Strategy: Select All (Continued)
- Davis 1997 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual performance.

Evaluation Method
- Average precision (5-, 9-, 11-point)
- Model
[Diagram; three retrieval runs against the TREC Spanish Corpus: (a) Spanish Query -> Mono IR Engine; (b) English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine; (c) English Query -> POS Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine]

Selection Strategy: Select N
- Simple word-by-word translation
  - Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  - 50-60% drop in performance (average precision).

Selection Strategy: Select N (Continued)
- Word/phrase translation
  - Take at most three translations of each word, one from each of the first three senses.
  - Take the phrase translation if it appears in the dictionary.
  - 30-50% worse than a good translation.
- Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
- Word-by-word, WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 1997
  - Search for a dictionary entry corresponding to the longest sequence of words, from left to right.
  - Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
  - No results reported for this query translation approach.
- Davis 1997 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): nearly 67.3% of monolingual performance.

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 1996
- Term alignment is a challenging problem.
[Diagram; boxes: Parallel Bilingual Corpus, Co-occurrence Statistics, Translation Tables, Machine Translation System, English Query, Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 1996 (TREC4)
- High-frequency Terms

Brief Summary
- Dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- Parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - They achieve 50% on their own.
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - Pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 1997
[Diagram; labels: Original Spanish TREC Queries, human translation, English (BASE) Queries, automatic dictionary translation, Spanish Queries, query expansion (before and after translation), INQUERY]

Hybrid Methods (Continued)
Performance Evaluation (MRD: machine-readable dictionary translation alone; LF: local feedback; LCA: local context analysis)
- Pre-translation: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
- Post-translation: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
- Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
- 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
- Extension of the CLIR track at TREC (1997-1999)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media: selecting suitable media to represent contents
- Multi-linguality: decreasing the language barriers
- Multi-culture: integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization
  - A standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia
  - A suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- an enamel, a calligraphy, a painting, or an illustration
- MICI-DC: Metadata Interchange for Chinese Information
- Fields accessible to users
  - Short descriptions vs. full texts
  - Bilingual versions vs. Chinese only
- Fields for maintenance only

Search Modes
- Free search: users describe their information need using natural languages (Chinese or English)
- Specific topic search: users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need: retrieve “Travelers Among Mountains and Streams, Fan Kuan” (“范寬谿山行旅圖”)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM

Specific Topic Search
- Proper names are important query terms
  - Creators such as “林逋” (Lin Pu), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  - Emperors such as “康熙” (Kang-hsi), “乾隆” (Chien-lung), “徽宗” (Hui-tsung), etc.
  - Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ching), etc.

Name Transliteration
- The writing systems of Chinese and English are totally different.
- Wade-Giles (WG) and Pinyin are two famous systems used to romanize Chinese in libraries.
- Backward transliteration: transliterate target language terms back to source language ones.
  - Chen, Huang, and Tsai (COLING, 1998)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes.
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name.
- Example: “林逋” -> “ ” -> “Lin Pu” (WG)

Name Similarity
- Extract the named entity from the query.
- Select the most similar named entity from the name mapping table.
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- “谿山行旅圖”: “Travelers among Mountains and Streams”
- “travelers”, “mountains”, and “streams” are basic components
- Users can express their informat
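
The lookup described under Name Mapping Table and Name Similarity can be sketched roughly as follows. This is a minimal illustration, not the method of the cited papers: the table layout, the function names, and the use of character-level string similarity (Python's `difflib`) as a stand-in for phonetic similarity are all assumptions. Only the example names come from the slides.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Hypothetical name mapping table: canonical Wade-Giles form -> Chinese name.
# The entries are taken from the slides; the table layout is an assumption.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
    "Lin Pu": "林逋",
    "Fan Kuan": "范寬",
}

def syllables(name):
    """Lower-case a romanized name and split it into syllables."""
    return name.lower().replace("-", " ").replace(",", " ").split()

def similarity(query, canonical):
    """Best character-level similarity over all syllable orders of the query.

    Trying every order covers the 'any order' naming schemes; the
    character-level ratio is only a crude proxy for phonetic closeness.
    """
    target = " ".join(syllables(canonical))
    return max(
        SequenceMatcher(None, " ".join(order), target).ratio()
        for order in permutations(syllables(query))
    )

def best_match(query):
    """Return (canonical form, Chinese name, score) of the most similar entry."""
    canonical = max(NAME_TABLE, key=lambda c: similarity(query, c))
    return canonical, NAME_TABLE[canonical], similarity(query, canonical)

if __name__ == "__main__":
    for q in ["Hsi Chu", "Tao Ning Hsu", "Kuan, Fan", "Ju Shi"]:
        print(q, "->", best_match(q))
```

Trying every syllable order is affordable here because personal names have only two or three syllables. A real system would more plausibly compare phoneme sequences than raw letters, so that variant romanizations are scored by sound.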