Outline of Today
- Introduction
- Lexicon construction
- Topic Detection and Tracking
- Summarization
- Question Answering

Data Mining: Market Basket Analysis
- 80% of the people who buy milk also buy bread.
- On Fridays, 70% of the men who bought diapers also bought beer. What is the relationship between diapers and beer? Walmart could trace the reason after doing a small survey!

The Business Opportunity in Text Mining
- Corporate knowledge "ore": email, insurance claims, news articles, web pages, patent portfolios, IRC, scientific articles, customer complaint letters, contracts, transcripts of phone calls with customers, technical documents.
- This is stuff not very accessible via standard data mining.

Text Knowledge Extraction Tasks
- Small stuff: useful nuggets of information that a user wants. Question answering, information extraction (DB filling), thesaurus generation.
- Big stuff: overviews. Summary extraction (documents or collections), categorization (documents), clustering (collections).
- Text data mining: interesting unknown correlations that one can discover.

Text Mining
- The foundation of most commercial "text mining" products is material we have already covered: information retrieval engines, web spiders/search, text classification, text clustering, named entity recognition, information extraction (only sometimes).
- Is this text mining? What else is needed?

One Tool: Question Answering
- Goal: use an encyclopedia or other source to answer "Trivial Pursuit-style" factoid questions.
- Example: "What famed English site is found on Salisbury Plain?"

Another Tool: Summarizing
- High-level summary or survey of all main points?
- How to summarize a collection?
- Example: sentence extraction from a single document.

IBM Text Miner Terminology: Example of Vocabulary Found
Certificate of deposit, CMOs, Commercial bank, Commercial paper,
Commercial Union Assurance, Commodity Futures Trading Commission, Consul Restaurant, Convertible bond, Credit facility, Credit line, Debt security, Debtor country, Detroit Edison, Digital Equipment, Dollars of debt, End-March, Enserch, Equity warrant, Eurodollar

What is Text Data Mining?
- People's first thought: make it easier to find things on the Web. But this is information retrieval!
- The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile, but it does not reflect notions of DM in practice.
- Rather: finding patterns across large collections; discovering heretofore unknown information.

Definitions of Text Mining
- Text mining is mainly about somehow extracting information and knowledge from text. Two definitions:
- Any operation related to gathering and analyzing text from external sources for business intelligence purposes.
- Discovery of knowledge previously unknown to the user in text.
- More fully: text mining is the process of compiling, organizing, and analyzing large document collections to support the delivery of targeted types of information to analysts and decision makers, and to discover relationships between related facts that span wide domains of inquiry.

True Text Data Mining: Don Swanson's Medical Work
- Given: medical titles and abstracts, a problem (an incurable rare disease), and some medical expertise.
- Find: causal links among titles, symptoms, drugs, and results.
- E.g.: magnesium deficiency is related to migraine.
- This was found by extracting features from the medical literature on migraines and nutrition.

Swanson Example (1991)
- Problem: migraine headaches (M).
- Stress is associated with migraines; stress can lead to a loss of magnesium.
- Calcium channel blockers prevent some migraines; magnesium is a natural calcium channel blocker.
- Spreading cortical depression (SCD) is implicated in some migraines; high levels of magnesium inhibit SCD.
- Migraine patients have high platelet aggregability; magnesium can suppress platelet aggregability.
- All extracted from medical journal titles.

Gathering Evidence
(Figure: links from migraine to magnesium via stress, CCB, PA, and SCD, drawn across all nutrition research and all migraine research.)

Swanson's TDM
- Two of his hypotheses have received some experimental verification.
- His technique was only partially automated and required medical expertise.
- Few people are working on this kind of information aggregation problem.
- Or maybe it was already known?

Lexicon Construction

What is a Lexicon?
- A database of the vocabulary of a particular domain (or a language).
- More than a list of words/phrases: usually some linguistic information.
- Morphology (manage/manages/managing/managed → the stem manage); syntactic patterns (transitivity etc.).
- Often some semantic information: is-a hierarchy, synonymy.
- Numbers converted to a normal form: "four" → 4. Dates converted to a normal form.
- Alternative names converted to explicit form: "Mr. Carr", "Tyler", "Presenter" → Tyler Carr.

WordNet Example

Lexica in Text Mining
- Many text mining tasks require named entity recognition, and named entity recognition requires a lexicon in most cases.
- Example 1: question answering. "Where is Mount Everest?" A list of geographic locations increases accuracy.
- Example 2: information extraction. Consider scraping book data from a web source; the template contains the field "publisher". A list of publishers increases accuracy.
- Manual construction is expensive: thousands of person-hours!
- Sometimes an unstructured inventory is sufficient; often you need more structure, e.g., a hierarchy.

Lexicon Construction (Riloff)
Attempt 1: iterative expansion of a phrase list.
- Start with a large text corpus and a list of seed words.
- Identify "good" seed word contexts.
- Collect close nouns in those contexts.
- Compute confidence scores for the nouns.
- Iteratively add high-confidence nouns to the seed word list; go to step 2.
- Output: a ranked list of candidates.

Lexicon Construction: Example
- Category: weapon. Seed words: bomb, dynamite, explosives.
- Iterate over contexts such as "They use TNT and other explosives." Add word: TNT.
- Other words added by the algorithm: rockets, bombs, missile, arms, bullets.

Lexicon Construction: Attempt 2
- Multilevel bootstrapping (Riloff and Jones 1999).
- Generate two data structures in parallel: the lexicon and a list of extraction patterns.
- Input as before: a corpus (not annotated) and a list of seed words.

Multilevel Bootstrapping
- Initial lexicon: the seed words.
- Level 1 (mutual bootstrapping): extraction patterns are learned from lexicon entries; new lexicon entries are learned from extraction patterns; iterate.
- Level 2 (filter lexicon): retain only the most reliable lexicon entries, then go back to level 1.
- The two-level scheme performs better than level 1 alone.

Scoring of Patterns
- Example: concept "company", pattern "owned by <x>".
- Patterns are scored as score(pattern) = F/N * log(F), where F is the number of unique lexicon entries produced by the pattern and N is the total number of unique phrases produced by the pattern.
- This selects for patterns that are selective (the F/N part) and have a high yield (the log(F) part).

Scoring of Noun Phrases
- Noun phrases are scored as score(NP) = sum_k (1 + 0.01 * score(pattern_k)), where we sum over all patterns that fire for the NP.
- The main criterion is the number of independent patterns that fire for the NP; NPs found by high-confidence patterns get a higher score.
- Example: new candidate phrase "boeing", occurring in "owned by <x>", "sold to <x>", "offices of <x>".

Shallow Parsing
- Shallow parsing is needed for identifying noun phrases and their heads and for generating extraction patterns.
- For scoring: when are two noun phrases the same? Head phrase matching: X matches Y if X is the rightmost substring of Y.
- "New Zealand" matches "Eastern New Zealand"; "New Zealand cheese" does not match "New Zealand".

Seed Words

Extraction Patterns

Level 1: Mutual Bootstrapping
- Drift can occur: it only takes one bad apple to spoil the barrel. Example: "head".
- Introduce level-2 bootstrapping to prevent drift.

Level 2: Meta-Bootstrapping

Evaluation

Discussion
- Partial resources are often available. E.g., you have a gazetteer and want to extend it to a new geographic area.
- Some manual post-editing is necessary for high quality.
- Semi-automated approaches offer good coverage with much reduced human effort.
- Drift is not a problem in practice if there is a human in the loop anyway.
- An approach that can deal with diverse evidence is preferable.
- Hand-crafted features (e.g., the period in "N.Y.") help a lot.

References
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer.
- Julian Kupiec. Murax: A Robust Linguistic Approach for Question Answering Using an On-line Encyclopedia. Proceedings of the 16th SIGIR Conference, Pittsburgh, PA.
- Don R. Swanson. Analysis of Unintended Connections Between Disjoint Science Literatures. SIGIR 1991: 280-289.
- Tim Berners-Lee on the semantic web: 2001/0501issue/0501berners-lee.html
- Ellen Riloff and Rosie Jones. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence (1999).
- Michael Collins and Yoram Singer. Unsupervised Models for Named Entity Classification (1999).
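The pattern and noun-phrase scores from the bootstrapping discussion above can be sketched in a few lines. The slides do not give the log base, so base 2 is assumed here, following the usual form of the RlogF metric:

```python
import math

def score_pattern(f, n):
    """RlogF-style pattern score: (F/N) * log(F).
    f: number of unique lexicon entries produced by the pattern (F)
    n: total number of unique phrases produced by the pattern (N)
    Rewards patterns that are selective (F/N) and high-yield (log F)."""
    if f == 0:
        return 0.0
    return (f / n) * math.log2(f)

def score_np(firing_pattern_scores):
    """Noun-phrase score: sum over all patterns that fire for the NP of
    (1 + 0.01 * pattern score). The count of independent firing patterns
    dominates; high-confidence patterns add a small bonus."""
    return sum(1 + 0.01 * s for s in firing_pattern_scores)
```

For instance, a pattern whose extractions include 4 known lexicon entries out of 8 unique phrases scores (4/8) * log2(4) = 1.0, and an NP extracted by two patterns with scores 1.0 and 2.0 scores 2.03.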
First Story Detection
- Automatically identify the first story on a new event from a stream of text.
- Applications: finance (be the first to trade a stock), breaking news for policy makers, intelligence services.
- Other technologies don't work for this: information retrieval, text classification. Why?

Definitions
- Event: a reported occurrence at a specific time and place, and the unavoidable consequences. Specific elections, accidents, crimes, natural disasters.
- Activity: a connected set of actions that have a common focus or purpose, e.g., campaigns, investigations, disaster relief efforts.
- Topic: a seminal event or activity, along with all directly related events and activities.
- Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.

Examples
- Thai Airbus crash (11.12.98). On topic: stories reporting details of the crash, injuries and deaths; reports on the investigation following the crash; policy changes due to the crash (new runway lights were installed at airports).
- Euro introduced (1.1.1999). On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations); the official introduction of the Euro; economic details of the shared currency; reactions within the EU and around the world.

TDT: The Corpus
- Topic Detection and Tracking: a "bake-off" sponsored by US government agencies.
- TDT evaluation corpora consist of text and transcribed news from the 1990s.
- A set of target events (e.g., 119 in TDT2) is used for evaluation; the corpus is tagged for these events (including the first story).
- TDT2 consists of 60,000 news stories from January-June 1998; about 3,000 are "on topic" for one of the 119 topics.
- Stories are arranged in chronological order.
TDT2000: Tasks in News Detection

The First-Story Detection Task
- Detect the first story that discusses a topic, for all topics.

First Story Detection
- New event detection is an unsupervised learning task.
- Detection may consist of discovering previously unidentified events in an accumulated collection (retrospective), or of flagging the onset of new events from live news feeds in an on-line fashion.
- There is no advance knowledge of new events, but there is access to unlabeled historical data as a contrast set.
- The input to on-line detection is the stream of TDT stories in chronological order, simulating real-time incoming documents; the output is a YES/NO decision per document.

Approach 1: KNN
- On-line processing of each incoming story: compute its similarity to all previous stories.
- Similarity measures: cosine similarity, language models, prominent terms, extracted entities.
- If similarity is below a threshold: new story. If similarity is above the threshold for a previous story s: assign to the topic of s.
- The threshold can be trained on a training set; it is not topic-specific!

Approach 2: Single-Pass Clustering
- Assign each incoming document to one of a set of topic clusters.
- A topic cluster is represented by its centroid (the vector average of its members).
- For each incoming story, compute similarity with the centroid.

Patterns in Event Distributions
- News stories discussing the same event tend to be temporally proximate.
- A time gap between bursts of topically similar stories is often an indication of different events: different earthquakes, airplane accidents.
- A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns.
- Events are typically reported in a relatively brief time window of 1-4 weeks.

Similar Events over Time

Approach 3: KNN + Time
- Only consider documents in a (short) time window.
- Compute similarity in a time-weighted fashion, where m is the number of documents in the window and d_i is the i-th document in the window.
- Time weighting significantly increases performance.

FSD: Results

Discussion
- A hard problem, which becomes harder the more topics need to be tracked.
- Second-story detection is much easier than first-story detection.
- Example: retrospective detection of the first 9/11 story is easy; on-line detection is hard.

References
- Papka and Allan. Online New Event Detection Using Single-Pass Clustering. University of Massachusetts, 1997.
- Yang, Pierce, and Carbonell. A Study on Retrospective and On-Line Event Detection. Carnegie Mellon University, 1998.
- Allan, Lavrenko, Frey, and Khandelwal. UMass at TDT2000. UMass, 2000.
- Statistical Models for Tracking and Detection. Dragon Systems, 1999.

Summarization

What is a Summary?
- Informative summary. Purpose: replace the original document. Example: an executive summary.
- Indicative summary. Purpose: support a decision, "do I want to read the original document, yes/no?" Examples: a headline, a scientific abstract.

Why Automatic Summarization?
- The algorithm for reading in many domains is: read the summary; decide whether it is relevant; if relevant, read the whole document.
- The summary is a gate-keeper for a large number of documents (information overload).
- Often the summary is all that is read.
- Human-generated summaries are expensive.

Summary Length (Reuters)
- Goldstein et al. 1999.

Summarization Algorithms
- Keyword summaries: display the most significant keywords. Easy to do, but hard to read and a poor representation of content.
- Sentence extraction: extract key sentences. Medium hard; the summaries often don't read well, but they give a good representation of content.
- Natural language understanding/generation: build a knowledge representation of the text and generate sentences summarizing the content. Hard to do well.
- Something between the last two methods?

Sentence Extraction
- Represent each sentence as a feature vector; compute a score based on the features; select the n highest-ranking sentences.
- Present them in the order in which they occur in the text.
- Postprocessing to make the summary more readable/concise: eliminate redundant sentences; handle anaphors/pronouns; delete subordinate clauses and parentheticals.

Sentence Extraction: Example
- SIGIR '95 paper on summarization by Kupiec, Pedersen, and Chen: trainable sentence extraction.
- The proposed algorithm is applied to its own description (the paper).

Feature Representation
- Fixed-phrase feature: certain phrases indicate a summary, e.g., "in summary", "in conclusion", etc.
- Paragraph feature: paragraph-initial/final sentences are more likely to be important.
- Thematic word feature: repetition is an indicator of importance. Do any of the most frequent content words occur?
- Uppercase word feature: uppercase often indicates named entities (e.g., Taylor). Is an uppercase thematic word introduced?
- Sentence length cut-off: summary sentences have a minimum length (more than 5 words).

Training
- Hand-label sentences in a training set (good/bad summary sentences).
- Train a classifier to distinguish good from bad summary sentences. Model used: Naïve Bayes.
- Sentences can then be ranked according to score and the top n shown to the user.

Evaluation
- Compare extracted sentences with sentences in the abstracts.
- Baseline (choose the first n sentences): 24%. Overall performance (42-44%) is not very good. However, there is more than one good summary.
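The feature-based extractor just described can be sketched as a simple scoring loop. This is an illustrative sketch, not Kupiec et al.'s trained Naïve Bayes model: the features follow the slides, but the weights and cue-phrase list are made-up placeholders.

```python
# Illustrative sentence-extraction sketch. Features follow the lecture
# (fixed phrase, position, thematic words, length cut-off); the weights
# are hand-set placeholders, not a trained model.
CUE_PHRASES = ("in summary", "in conclusion")
MIN_LENGTH = 5  # sentence-length cut-off from the slides

def extract_summary(sentences, frequent_terms, n=3):
    weights = {"cue": 2.0, "position": 1.0, "thematic": 1.5}  # assumed
    scored = []
    for i, sent in enumerate(sentences):
        words = sent.lower().split()
        if len(words) < MIN_LENGTH:          # too short to be a summary sentence
            continue
        score = 0.0
        if any(p in sent.lower() for p in CUE_PHRASES):
            score += weights["cue"]          # fixed-phrase feature
        if i == 0 or i == len(sentences) - 1:
            score += weights["position"]     # initial/final position feature
        if any(w in frequent_terms for w in words):
            score += weights["thematic"]     # thematic-word feature
        scored.append((score, i, sent))
    # keep the n highest-scoring sentences, then restore document order
    top = sorted(scored, key=lambda t: -t[0])[:n]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]
```

A trained classifier would replace the hand-set weights with per-feature probabilities estimated from labeled good/bad summary sentences, but the ranking-and-reordering skeleton stays the same.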
DUC
- DUC: a government-sponsored bake-off to further progress in summarization and enable researchers to participate in large-scale experiments.

Multi-Document Summarization
- Newsblaster (Columbia).

Query-Specific Summarization
- So far, we've looked at generic summaries. A generic summary makes no assumption about the reader's interests.
- Query-specific summaries are specialized for a single information need: the query.
- Summarization is much easier if we have a description of what the user wants.
- Recall from last quarter: Google-type excerpts simply show keywords in context.

Discussion
- Correct parsing of the document format is critical: need to know headings, sequence, etc.
- Limits of current technology: some good summaries require natural language understanding.
- Example: President Bush's nominees for ambassadorships (contributors to Bush's campaign, veteran diplomats, others).

References
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer (1995). Research and Development in Information Retrieval.
- K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, and S. Sigelman. The Columbia Multi-Document Summarizer for DUC 2002. Columbia University.

QA Systems

Question Answering from Text
- An idea originating from the IR community.
- With massive collections of full-text documents, simply finding relevant documents is of limited use: we want answers from textbases.
- QA: give the user a (short) answer to their question, perhaps supported by evidence.

People Want to Ask Questions
- Examples from an AltaVista query log: who invented surf music? how to make stink bombs; where are the snowdens of yesteryear? which english translation of the bible is used in official catholic liturgies? how to do clayart; how to copy psx; how tall is the sears tower?
- Examples from an Excite query log (12/1999): how can i find someone in texas; where can i find information on puritan religion? what are the 7 wonders of the world; how can i eliminate stress; What vacuum cleaner does Consumers Guide recommend?
- Questions make up around 12-15% of query logs.

The Google Answer #1
- Include question words etc. in your stop-list, then do standard IR.
- Sometimes this (sort of) works. Question: "What famed English site is found on Salisbury Plain?"

The Google Answer #2
- Take the question and try to find it as a string on the web.
- Return the next sentence on that web page as the answer.
- Works…
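The "Google answer #2" trick can be sketched over a local collection of pages standing in for fetched web search results; the regex sentence splitter and the idea of matching a key phrase rather than the whole question are simplifying assumptions of this sketch.

```python
import re

def answer_by_next_sentence(question_phrase, pages):
    """Find the question (or a key phrase from it) as a literal string in
    a page and return the sentence that follows it. 'pages' stands in for
    web search results in this sketch."""
    for page in pages:
        # naive split: sentence ends at ., !, or ? followed by whitespace
        sentences = re.split(r"(?<=[.!?])\s+", page)
        for i in range(len(sentences) - 1):
            if question_phrase.lower() in sentences[i].lower():
                return sentences[i + 1]  # the next sentence as the answer
    return None  # no page contained the phrase
```

This works only when some page happens to state the question's content and then immediately answer it, which is exactly the fragility the slides hint at.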