From Data Fusion to Knowledge Fusion

Introduction
The task of data fusion is to identify the true values of data items (e.g., the true date of birth of Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying and unknown reliability. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources.

Some concepts
Extractor: to build a knowledge base, we employ multiple knowledge extractors to extract knowledge for the base. This involves three key steps:
- Identifying which parts of the data indicate a data item and its value.
- Linking any entities that are mentioned to the corresponding entity identifier.
- Linking any relations that are mentioned to the corresponding knowledge base schema.

Some concepts
Subject-predicate-object triples: we can define the subject, predicate, and object as a linkage of entities and relations, e.g., (Tom Cruise, date_of_birth, 7/3/1962).

Contribution
- We define the knowledge fusion problem and adapt existing data fusion techniques to solve it.
- We suggest some simple improvements to existing methods that substantially improve their quality.
- We make a detailed error analysis of our methods and suggest directions for future research to address some of the new problems raised by knowledge fusion.

System architecture
Our goal is to build a high-quality Web-scale knowledge base. The figure depicts the architecture of our system.

Data fusion methods
- Voting: among conflicting values, each value gets one vote from each data source, and we take the value with the highest vote count (a minimal sketch follows this list).
- Quality-based: quality-based methods evaluate the trustworthiness of data sources and accordingly compute a higher vote count for a high-quality source.
- Relation-based: relation-based methods extend quality-based methods by additionally considering the relationships between the sources.
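To make the voting baseline concrete, here is a minimal Python sketch; the toy observations are hypothetical, and this is an illustration rather than the paper's implementation.

from collections import Counter

def vote(observations):
    """Return the value with the most supporting sources for one data item.

    observations: list of (source, value) pairs for a single data item.
    """
    counts = Counter(value for _source, value in observations)
    value, _n_votes = counts.most_common(1)[0]
    return value

# Hypothetical observations for the data item (Tom Cruise, date_of_birth):
obs = [("site_a", "7/3/1962"), ("site_b", "7/3/1962"), ("site_c", "1/1/1970")]
print(vote(obs))  # -> "7/3/1962"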

Knowledge base
We follow the data format and ontology of Freebase and store the knowledge as (subject, predicate, object) triples (data in Freebase is structured, but not as triples). Note that in each triple the (subject, predicate) pair corresponds to a "data item" in data fusion, and the object can be considered a "value" provided for that data item, much like a (key, value) pair. So our goal is to extract, using the extractors, new facts about each (subject, predicate) pair.
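The (data item, value) view can be illustrated with a short sketch (hypothetical triples) that groups extracted triples by their (subject, predicate) pair:

from collections import defaultdict

# Hypothetical extracted triples: (subject, predicate, object).
triples = [
    ("Tom Cruise", "date_of_birth", "7/3/1962"),
    ("Tom Cruise", "date_of_birth", "1/1/1970"),
    ("Tom Cruise", "profession", "Actor"),
]

# Each (subject, predicate) pair is a data item; its objects are candidate values.
data_items = defaultdict(list)
for s, p, o in triples:
    data_items[(s, p)].append(o)

print(data_items[("Tom Cruise", "date_of_birth")])  # ['7/3/1962', '1/1/1970']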

Web sources
We crawl a large set of Web pages and extract knowledge from four types of Web content; the source types are TXT, DOM, TBL, and ANO. Contributions from Web sources are highly skewed: the largest Web pages each contribute 50K triples, while half of the Web pages each contribute a single triple.

Extractors
There are three tasks in knowledge extraction:
- Triple identification: deciding which words or phrases describe a triple.
- Entity linkage: deciding which Freebase entity a word or phrase refers to.
- Predicate linkage: deciding which Freebase predicate is expressed in a given piece of text (necessary because predicates are implicit).

Quality of extracted knowledge
Evaluating the quality of the extracted triples requires a gold standard that contains true triples and false triples; Freebase uses the closed-world assumption to build one. However, this assumption is not always valid, because facts can be missing. Instead, we use the local closed-world assumption (LCWA). Some of the erroneous triples are due to wrong information provided by Web sources, whereas others are due to mistakes in extraction; extractors are responsible for the majority of the errors (more than 96% of the errors are introduced by extractors).
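A minimal sketch of LCWA-style labeling, assuming a toy in-memory stand-in for Freebase (the real pipeline, of course, labels against the actual knowledge base):

# Toy knowledge base: data item (subject, predicate) -> set of known true objects.
kb = {("Tom Cruise", "date_of_birth"): {"7/3/1962"}}

def lcwa_label(s, p, o):
    """Label an extracted triple under the local closed-world assumption:
    true if the KB contains it; false if the KB knows other values for the
    same data item; unknown if the KB says nothing about the data item."""
    known = kb.get((s, p))
    if known is None:
        return "unknown"   # open world: the KB is silent on this data item
    return "true" if o in known else "false"

print(lcwa_label("Tom Cruise", "date_of_birth", "7/3/1962"))  # true
print(lcwa_label("Tom Cruise", "date_of_birth", "1/1/1970"))  # false
print(lcwa_label("Tom Cruise", "spouse", "Katie Holmes"))     # unknown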

Quality of extracted knowledge
The more Web sources from which we extract a triple, or the more extractors that extract it, the more likely the triple is true. But there can be exceptions.

Knowledge fusion
Given a set of extracted knowledge triples, each associated with provenance information such as the extractor and the Web source, knowledge fusion computes for each unique triple the probability that it is true. There are three challenges:
- The input of knowledge fusion is three-dimensional: the third dimension is the extractors.
- The output of knowledge fusion is a truthfulness probability for each triple.
- The scale of the knowledge is typically huge.

Adapting data fusion techniques
- VOTE: for each data item, VOTE counts the sources for each value and trusts the value with the largest number of sources.
- ACCU: for each source S that provides a set of values V(S), the accuracy of S is computed as the average probability of the values in V(S). For each data item D and the set of values V(D) provided for D, the probability of a value v ∈ V(D) is computed as its posterior probability conditioned on the observed data, using Bayesian analysis.

Adapting data fusion techniques
ACCU makes three assumptions: (1) for each data item D there is a single true value; (2) there are N uniformly distributed false values; (3) the sources are independent of each other. By default we set N = 100 and the initial accuracy A = 0.8. POPACCU (more robust than ACCU) extends ACCU by removing the assumption that wrong values are uniformly distributed; instead, it computes the value distribution from the real data and plugs it into the Bayesian analysis.
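A condensed sketch of an ACCU-style Bayesian update under the assumptions above. The per-source log vote ln(N*A/(1-A)) follows the standard accuracy-based fusion formulation; normalizing over observed values only is a simplification of ours, and the toy data is hypothetical.

import math

N = 100   # assumed number of uniformly distributed false values
A = 0.8   # default source accuracy

def accu_probabilities(observations, accuracy=None):
    """Simplified ACCU posterior over the observed values of one data item.

    observations: list of (source, value) pairs for a single data item.
    accuracy: optional dict source -> accuracy; defaults to A for every source.
    Each source supporting a value adds a log vote ln(N*a/(1-a)); the posterior
    is the softmax of the accumulated votes over the observed values.
    """
    accuracy = accuracy or {}
    votes = {}
    for source, value in observations:
        a = accuracy.get(source, A)
        votes[value] = votes.get(value, 0.0) + math.log(N * a / (1 - a))
    z = sum(math.exp(c) for c in votes.values())
    return {value: math.exp(c) / z for value, c in votes.items()}

obs = [("site_a", "7/3/1962"), ("site_b", "7/3/1962"), ("site_c", "1/1/1970")]
print(accu_probabilities(obs))  # the better-supported value gets probability near 1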

Adapting data fusion techniques
Adaptations:
- We reduce the dimensionality of the knowledge fusion problem by considering each (Extractor, URL) pair as a data source, which we call a provenance.
- For ACCU and POPACCU, we simply take the probability computed by the Bayesian analysis. For VOTE, we assign a probability as follows: if a data item D = (s, p) has n provenances in total and a triple T = (s, p, o) has m provenances, the probability of T is p(T) = m/n (sketched below).
- We scale up the three methods using a MapReduce-based framework.
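A small sketch of the VOTE adaptation's p(T) = m/n assignment over (Extractor, URL) provenances; the data is toy, and the real system computes this per data item inside a MapReduce pipeline.

from collections import defaultdict

# Hypothetical extractions: ((extractor, url) provenance, triple).
extractions = [
    (("E1", "a.com/1"), ("Tom Cruise", "date_of_birth", "7/3/1962")),
    (("E2", "a.com/1"), ("Tom Cruise", "date_of_birth", "7/3/1962")),
    (("E1", "b.com/2"), ("Tom Cruise", "date_of_birth", "1/1/1970")),
]

per_triple = defaultdict(set)   # triple T -> provenances supporting it (m)
per_item = defaultdict(set)     # data item (s, p) -> all its provenances (n)
for prov, (s, p, o) in extractions:
    per_triple[(s, p, o)].add(prov)
    per_item[(s, p)].add(prov)

# p(T) = m / n: supporting provenances over all provenances of the data item.
for (s, p, o), provs in per_triple.items():
    print((s, p, o), len(provs) / len(per_item[(s, p)]))  # 2/3 and 1/3 here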

Experimental evaluation
Calibration curve: we plot the predicted probability versus the real probability, dividing the triples into l + 1 buckets (we use l = 20 when reporting our results). We compute the real probability for each bucket as the percentage of true triples in the bucket, compared with our gold standard. We summarize the calibration using two measures (sketched after this list):
- The deviation computes the average square loss between predicted probabilities and real probabilities.
- The weighted deviation is the same, except that it weighs each bucket by the number of triples in the bucket.
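A sketch of the bucketed deviation measures described above; the exact bucketing rule and the toy inputs are our own assumptions.

def calibration(pred_and_truth, l=20):
    """Deviation and weighted deviation of predicted probabilities.

    pred_and_truth: list of (predicted_probability, is_true) pairs,
    where is_true comes from the gold standard.
    """
    buckets = [[] for _ in range(l + 1)]          # l + 1 buckets over [0, 1]
    for prob, is_true in pred_and_truth:
        buckets[min(int(prob * l), l)].append((prob, is_true))

    total = len(pred_and_truth)
    dev = wdev = 0.0
    nonempty = 0
    for bucket in buckets:
        if not bucket:
            continue
        nonempty += 1
        avg_pred = sum(p for p, _ in bucket) / len(bucket)
        real = sum(1 for _, t in bucket if t) / len(bucket)  # % true in bucket
        dev += (avg_pred - real) ** 2                        # square loss
        wdev += (avg_pred - real) ** 2 * len(bucket) / total # weighted by size
    return dev / nonempty, wdev

triples = [(0.9, True), (0.8, True), (0.7, False), (0.2, False), (0.1, True)]
print(calibration(triples, l=10))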

Granularity of provenances
The basic models consider an (Extractor, URL) pair as a provenance. We can vary this granularity:
- page level vs. site level;
- per-predicate vs. all triples;
- pattern level vs. extractor level.

Provenance selection
We consider filtering provenances by two criteria: coverage and accuracy. We compute triple probabilities for data items where at least one triple is extracted more than once, and then re-evaluate the accuracy of each provenance, ignoring provenances for which we would still use the default accuracy. We use a threshold on accuracy to ignore some provenances, although this can cause a problem: we may lose all provenances for a data item and then cannot predict the probability of any of its triples.

Putting them all together
To improve the existing methods, we leverage the gold standard (from Freebase) as follows (a combined sketch follows this list):
1. Filter provenances by coverage.
2. Change the provenance granularity to (Extractor, pattern, site, predicate).
3. Filter provenances by accuracy (below 0.5).
4. Initialize source accuracy from the gold standard.
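A rough sketch of the combined provenance selection; the coverage threshold, field names, and record layout are illustrative assumptions (only the 0.5 accuracy cutoff comes from the text above), not the paper's exact settings.

def select_provenances(provenances, min_coverage=3, min_accuracy=0.5):
    """Combined improvements: coarser granularity plus coverage/accuracy filters.

    provenances: list of dicts with keys
      'extractor', 'pattern', 'site', 'predicate'  (granularity fields),
      'n_triples'  (coverage: how many triples it contributed),
      'accuracy'   (re-evaluated accuracy, or None if only the default applies).
    Returns surviving provenance keys at (extractor, pattern, site, predicate)
    granularity.
    """
    selected = set()
    for p in provenances:
        if p["n_triples"] < min_coverage:
            continue                       # filter by coverage
        if p["accuracy"] is not None and p["accuracy"] < min_accuracy:
            continue                       # filter by accuracy
        selected.add((p["extractor"], p["pattern"], p["site"], p["predicate"]))
    return selected

provs = [
    {"extractor": "E1", "pattern": "pat1", "site": "a.com",
     "predicate": "date_of_birth", "n_triples": 40, "accuracy": 0.9},
    {"extractor": "E2", "pattern": "pat7", "site": "b.com",
     "predicate": "date_of_birth", "n_triples": 2, "accuracy": None},
]
print(select_provenances(provs))  # only the high-coverage, accurate provenance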

Error Analysis
We choose 20 false positives and 20 false negatives and check their provenances.

Future Directions
- Distinguishing mistakes made by extractors from mistakes made by sources.
- Considering hierarchical value spaces.
- Leveraging the confidence of extractions.
- Improving the closed-world assumption.

THANKS. Q & A.

Ontology
An ontology is a kind of methodology: it uses a precise and machine-processable language to define self-existent concepts (objects or entities). This methodology is similar to a fundamental modeling process, and the output of this process is a set of models, which we call ontologies.
