Developing Crowdsourced Ontology Engineering Tasks: An iterative process

Jonathan M. Mortensen, Mark A. Musen, Natalya F. Noy
Stanford Center for Biomedical Informatics Research
Stanford University, Stanford CA 94305, USA

Abstract. It is increasingly evident that the realization of the Semantic Web will require not only computation, but also human contribution. Crowdsourcing is becoming a popular method to inject this human element. Researchers have shown how crowdsourcing can contribute to managing semantic data. One particular area that requires significant human curation is ontology engineering. Verifying large and complex ontologies is a challenging and expensive task. Recently, we have demonstrated that online, crowdsourced workers can assist with ontology verification. Specifically, in our work we sought to answer the following driving questions: (1) Is crowdsourcing ontology verification feasible? (2) What is the optimal formulation of the verification task? (3) How does this crowdsourcing method perform in an application? In this work, we summarize the experiments we developed to answer these questions and the results of each experiment. Through iterative task design, we found that workers could reach an accuracy of 88% when verifying SNOMED CT. We then discuss the practical knowledge we have gained from these experiments. This work shows the potential that crowdsourcing has to offer other ontology engineering tasks and provides a template one might follow when developing such methods.

1 Background

Research communities have begun using crowdsourcing to assist with managing the massive scale of data we have today. Indeed, certain tasks are better solved by humans than by computers. In the life sciences, Zooniverse, a platform wherein citizen scientists contribute to large-scale studies, asks users to perform tasks such as classifying millions of galaxies or identifying cancer cells in an image [8]. In related work, Von Ahn and colleagues developed games with a purpose, a type of crowdsourcing where participants play a game and, as a result, help complete some meaningful task. For example, in Fold.it, gamers assist with folding a protein, a computationally challenging task [4]. Further demonstrating the power of the crowd, Bernstein et al. developed a system that uses the crowd to quickly and accurately edit documents [1]. With crowdsourcing's popularity rising, many developer resources are now available, such as Amazon's Mechanical Turk, CrowdFlower, oDesk, Houdini, etc. Finally, as evidenced by this workshop, CrowdSem, the Semantic Web community is beginning to leverage crowdsourcing. Systems such as CrowdMap, OntoGame, and ZenCrowd demonstrate how crowdsourcing can contribute to the Semantic Web [11, 2, 10]. Crowdsourcing enables the completion of tasks at a massive scale that cannot be done computationally or by a single human.

One area amenable to crowdsourcing is ontology engineering. Ontologies are complex, large, and traditionally require human curation, making their development an ideal candidate task for crowdsourcing. In our previous work, we developed a method for crowdsourcing ontology verification. Specifically, we sought to answer the following driving questions:

(1) Is crowdsourcing ontology verification feasible?
(2) What is the optimal formulation of the verification task?
(3) How does this crowdsourcing method perform in an application?

In this work, we briefly highlight each of the experiments we developed to answer our questions and, with their results in mind, then discuss how one might approach designing crowdsourcing tasks for the Semantic Web. In previous work, we have published papers that explore each driving question in depth. The main contribution of this work is a unified framework that presents all of the experiments. This framework will enable us to reflect on current work and to ask new questions for crowdsourcing ontology engineering.

2 Ontology Verification Task

We have begun to reduce portions of ontology engineering into microtasks that can be solved through crowdsourcing. We devised a microtask method of ontology verification based on a study by Evermann and Fang [3] wherein participants answer computer-generated questions about ontology axioms. A participant verifies whether a sentence about two concepts that are in a parent-child relationship is correct or incorrect. For example, the following question is a hierarchy-verification microtask for an ontology that contains the classes Heart and Organ:

Is every Heart an Organ?

A worker then answers the question with a binary response of "Yes" or "No."

This task is particularly useful in verifying ontologies because the class hierarchy is the main type of relationship found in many ontologies. For example, of 296 public ontologies in the BioPortal repository, 54% contained only SubClassOf relationships between classes. In 68% of the ontologies, SubClassOf relationships accounted for more than 80% of all relationships. Thus, verifying how well the class hierarchy corresponds to the domain will enable the verification of a large fraction of the relations in ontologies.
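As a concrete illustration of how such microtasks can be generated, the following minimal sketch extracts asserted SubClassOf axioms from an ontology and renders each as a verification question. It assumes an rdflib-parseable RDF/XML file; the file name, label fallback, and article choice are illustrative and not part of the original experimental tooling.

from rdflib import Graph, RDFS, BNode

g = Graph()
g.parse("ontology.owl", format="xml")  # hypothetical local copy, RDF/XML serialization assumed

def label(term):
    # Prefer rdfs:label; fall back to the tail of the IRI.
    lbl = g.value(term, RDFS.label)
    return str(lbl) if lbl is not None else str(term).rsplit("/", 1)[-1].rsplit("#", 1)[-1]

questions = []
for child, parent in g.subject_objects(RDFS.subClassOf):
    # Skip anonymous superclasses (e.g., OWL restrictions); the microtask
    # only covers named parent-child pairs.
    if isinstance(child, BNode) or isinstance(parent, BNode):
        continue
    # Article ("an") is not inflected here; a real task generator would do so.
    questions.append(f"Is every {label(child)} an {label(parent)}?")

for q in questions[:5]:
    print(q)  # e.g., "Is every Heart an Organ?"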
3 Protocol & Experimental Design

We developed various experiments that use the hierarchy-verification task to answer our driving questions. Generally, each of these experiments follows the same basic procedure. First, we selected the ontology and axioms to verify. Next, we created a hierarchy-verification task formatted as HTML from these entities and submitted the task to Amazon Mechanical Turk. Finally, we obtained worker responses, removed spam, and compared the remaining responses to a gold standard using some analysis metric. Thus, in each experiment we used a standard set of basic components outlined in Table 1. Typically, we manipulated one of these components in each experiment. Figure 1 presents an example task as it appears to a worker on Amazon Mechanical Turk.

Table 1. Dimensions in which our crowdsourcing experiments vary.
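The submission and collection steps of this protocol map naturally onto the Mechanical Turk requester API. The following is a minimal sketch, assuming boto3, AWS credentials, and the requester sandbox endpoint; the HIT text, reward, and other parameters are illustrative stand-ins rather than the exact values used in these experiments.

import boto3

SANDBOX = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
mturk = boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX)

def hit_question_xml(child_label, parent_label):
    # HTMLQuestion wrapper; a production form would also post the standard
    # assignmentId field back to MTurk's externalSubmit endpoint.
    html = f"<p>{child_label} is a kind of {parent_label}.</p><p>True or False?</p>"
    return f"""<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[<!DOCTYPE html><html><body>{html}</body></html>]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Verify a statement about two concepts",       # illustrative wording
    Description="Decide whether a simple is-a statement is true or false.",
    Keywords="ontology, verification, categorization",
    Reward="0.02",                      # US dollars, passed as a string
    MaxAssignments=15,                  # number of distinct workers per question
    AssignmentDurationInSeconds=300,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=hit_question_xml("Heart", "Organ"),
)
hit_id = hit["HIT"]["HITId"]

# Later: collect submitted answers for spam filtering and gold-standard scoring.
assignments = mturk.list_assignments_for_hit(
    HITId=hit_id, AssignmentStatuses=["Submitted", "Approved"]
)["Assignments"]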
4 Experiments

To answer the driving questions, we performed a series of experiments using the basic protocol. We began with the most basic question about the feasibility of the method. Having shown that, we tested various parameters in order to optimize the method. Finally, we used the optimal method in verifying SNOMED CT. Table 2 summarizes these experiments and their parameters. In the following, we describe the specifics of each experiment and our conclusions for each.

4.1 Is crowdsourcing ontology verification feasible? [7]

In this first driving question, we wished to understand whether it was possible for Amazon Mechanical Turk workers (turkers) to perform on par with other groups also performing the hierarchy-verification task.

Figure 1. Example task that a Mechanical Turk worker sees in a browser. In this example, workers are provided concept definitions and a subsumption relation from the Common Anatomy Reference Ontology to verify.

Table 2. Experiments we performed in developing a method to crowdsource ontology verification.

Experiment 1: Students and the Crowd

Methods. We determined whether turkers could recapitulate results from a study by Evermann and Fang [3]. In this study, after completing a training session, 32 students performed the hierarchy-verification task with 28 statements from the Bunge-Wand-Weber ontology (BWW) and 28 statements from the Suggested Upper Merged Ontology (SUMO), where half of the statements were true and half false in each set. As an incentive to perform well, students were offered a reward for the best performance.

Knowing the results of that experiment, we asked turkers to verify the same statements. As in the initial study, we required turkers to complete a 12-question training qualification test. We asked for 32 turkers to answer each 28-question set and paid $0.10/set. Furthermore, we offered a bonus for good performance. After turkers completed the tasks, we removed spam responses from workers who responded with more than 23 identical answers. Finally, we compared the performance of the students with that of the turkers using a paired t-test.

Results. In both experiments, the average accuracy of the students was 3-4% higher than the accuracy of the turkers. However, the difference was not statistically significant.

Conclusion. Turkers recapitulated previous hierarchy-verification results and performed on par with students in the hierarchy-verification task.
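The spam filter and the group comparison used in Experiment 1 are simple to express in code. Below is a minimal sketch, assuming responses are keyed by worker and that per-statement accuracies are compared with SciPy; the data layout is an assumption made for illustration.

from collections import Counter
from scipy import stats

N_STATEMENTS = 28     # per statement set (BWW or SUMO)
SPAM_THRESHOLD = 23   # more than 23 identical answers suggests the worker is not reading

def remove_spam(responses_by_worker):
    # responses_by_worker: {worker_id: ["Yes"/"No", ...] with 28 answers}
    kept = {}
    for worker, answers in responses_by_worker.items():
        if Counter(answers).most_common(1)[0][1] <= SPAM_THRESHOLD:
            kept[worker] = answers
    return kept

def per_statement_accuracy(responses_by_worker, gold):
    # Fraction of (non-spam) workers answering each statement correctly.
    return [
        sum(answers[i] == gold[i] for answers in responses_by_worker.values())
        / len(responses_by_worker)
        for i in range(N_STATEMENTS)
    ]

# student_acc and turker_acc would be per-statement accuracies on the same
# 28 statements, which is what makes a paired test appropriate:
# t_stat, p_value = stats.ttest_rel(student_acc, turker_acc)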
Experiment 2: Verifying the Common Anatomy Reference Ontology (CARO)

Methods. Verifying a domain ontology was the second component in showing the feasibility of our verification method. For this verification task, we used CARO, a well-curated biomedical ontology. We selected 14 parent-child relations from the ontology as correct relations. As with WordNet, we paired children with parents that were not in the same hierarchy to simulate incorrect relations. We then asked workers to verify these relations following the earlier experimental setup. In this situation, we had no qualification test. As a comparison, we asked experts on the obo-anatomy and National Center for Biomedical Ontology mailing lists to perform the same verification. Finally, we measured worker and expert performance, and compared the groups using a t-test.

Results. With the proper task design of context and qualifications (addressed later), turkers performed 5% less accurately than experts, but the difference was not statistically significant.

Conclusions. Workers performed nearly as well as experts in verifying a domain ontology. These results are quite encouraging. In addition, the results of this experiment led us to hypothesize that worker performance significantly depends on the task formulation. We address this next.

4.2 What is the optimal formulation of the hierarchy-verification task? [5]

With the feasibility of crowdsourcing ontology verification established, we focused on formulating the task in an optimal fashion. There were four main parameters that we hypothesized would affect the method's performance: Ontology Type (i.e., the domain of the ontology being verified), Question Formulation (i.e., how should we ask a worker to verify a relationship?), Worker Qualification (i.e., how does the accuracy of a worker vary based on a certain qualification?), and Context (i.e., what information should be provided to assist a worker in answering the question?).

Experiment 3: WordNet and Upper Ontologies

Methods. Having shown that turkers perform similarly to students and domain experts, we then analyzed how turker performance varied based on ontology selection. To do so, we compared worker performance in verifying BWW and SUMO to verifying WordNet. We created a set of WordNet statements to verify by extracting parent-child relationships in WordNet and also generating incorrect statements from incorrectly paired concepts (i.e., pairing concepts in parent-child relationships that are not actually hierarchically related). We then asked workers to verify the 28 WordNet, SUMO, and BWW statements following the same setup as the first experiment (including the training qualification), paying workers $0.10/set, giving a bonus, and removing spam.

Results. Echoing the first experiment, workers performed only slightly better than random on BWW and SUMO. However, workers had an average accuracy of 89% when verifying WordNet statements. There was a clear difference between worker performance on upper ontologies and WordNet.

Conclusion. While workers struggle with verifying conceptually difficult relationships, such as those contained in upper-level ontologies, they perform reasonably well in tasks related to common-sense knowledge.
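A minimal sketch of the statement construction used in Experiment 3, assuming WordNet access through NLTK rather than the original extraction tooling; the 14/14 split, sample size, and label handling are illustrative assumptions.

import random
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

random.seed(0)
nouns = [s for s in wn.all_synsets("n") if s.hypernyms()]

def name(synset):
    return synset.lemma_names()[0].replace("_", " ")

def hierarchically_related(a, b):
    # True if one synset is an ancestor of the other in the hypernym graph.
    return a in b.closure(lambda s: s.hypernyms()) or b in a.closure(lambda s: s.hypernyms())

true_pairs, false_pairs = [], []
for child in random.sample(nouns, 14):   # assumed 14 correct + 14 re-paired incorrect (28 total)
    true_pairs.append((name(child), name(child.hypernyms()[0])))
    while True:                          # re-pair the child with an unrelated "parent"
        fake_parent = random.choice(nouns)
        if fake_parent is not child and not hierarchically_related(child, fake_parent):
            false_pairs.append((name(child), name(fake_parent)))
            break

for child, parent in true_pairs[:3] + false_pairs[:3]:
    print(f"{child} is a kind of {parent}. True or False?")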
Experiment 4: Question Formulation

Methods. We repeated the task of verifying the 28 WordNet statements but varied the polarity and mood of the verification question we asked the workers. In this case, we did not require qualifications, as we did in the earlier experiments. Table 3 shows the 6 different question styles through examples.

Results. Worker performance varied from 77% on negatively phrased statements to 91% with the positive, indicative mood (i.e., a True/False statement asserting the relationship). In addition, workers responded faster to positively phrased questions.

Table 3. Example question types we presented to users on Mechanical Turk.

Conclusion. Generally for crowdsourcing, one should create tasks in the most cognitively simple format possible. In this situation, that means asking the verification question as simply as possible (e.g., "Dog is a kind of Mammal. True or False?").

Experiment 5: Worker Qualification

Methods. Having determined the optimal way to ask the verification question, we theorized that workers who could pass a domain-specific qualification test would perform better than a random worker on tasks related to that domain. We developed a 12-question, high-school-level biology qualification test. For turkers to access our tasks, they had to pass this test. We assume that the ability to pass this test serves as a reasonable predictor of biology domain knowledge. We asked workers to complete the CARO verification (Experiment 2), but required them to first pass the qualification test, answering at least 50% of it correctly.

Results. With qualifications, turkers improved their accuracy to 67% (from random without qualifications) when verifying CARO.

Conclusion. When crowdsourcing, some method to select workers with expertise in the domain of the task is necessary to achieve reasonable performance. However, such low accuracy was not satisfying to the authors.

Experiment 6: Task Context

Methods. With the increases in performance from proper question formulation and qualification requirements, we next proposed that concept definitions would assist workers in verifying a relation. In this experiment, we used CARO because the ontology has a complete set of definitions. We repeated Experiment 2, with qualifications and simply stated verification questions, varying whether workers and experts were shown definitions.

Results. With definitions, workers performed with an average accuracy of 82%. Experts performed with an average accuracy of 86%. So, when providing workers and experts with definitions, there was no statistically significant difference.

Conclusion. In crowdsourcing, context is essential, especially for non-domain experts. While workers might not have very specific domain knowledge, with proper context or training they can complete the task. This experiment revealed that, in some situations, a properly designed microtask can indeed provide results on par with experts.

4.3 How does this crowdsourcing method perform in an application? [6]

The previous experiments were all synthetic: turkers only found errors that we introduced. With the optimal task formulation in hand, we shifted our focus to a true ontology verification task of verifying a portion of SNOMED CT. We selected SNOMED CT because it is a heavily studied, large, and complex ontology, making it an ideal candidate for our work.

Experiment 7: Verifying SNOMED CT

Methods. In 2011, Alan Rector and colleagues identified entailed SubClass axioms in SNOMED CT that were in error [9]. In our final experiment, we evaluated whether our method could recapitulate their findings. To do so, we asked workers to perform the hierarchy-verification task on these 7 relations along with 7 related relations we already knew were correct. We used the optimal task formulation we determined in the earlier experiments and provided definitions from the Unified Medical Language System. In addition, we posted the task with 4 different qualification tests: biology, medicine, ontology, and none. Of note, instead of asking workers to complete the task of verifying all 14 relations in one go, as in the earlier experiments, we broke the task up into smaller units, creating one task per axiom, and paid unqualified workers and qualified workers $0.02 and $0.03 per verification, respectively. We then compared workers' average performance to their aggregate performance (i.e., when we combined all workers' responses into one final response through majority voting [6]).

Results. The aggregate worker response was 88% accurate in differentiating correct versus incorrect SNOMED CT relations. On average, any single worker performed 17% less accurately than the aggregate response. Furthermore, there was no significant difference in performance for tasks with differing qualification tests.

Conclusion. Individually, workers did not perform well in identifying errors in SNOMED CT. However, as a group, they performed quite well. The stark difference between average worker performance and aggregate performance reinforces the fact that the power of the crowd lies in its combined response, not in any worker alone.
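The difference between individual and aggregate accuracy comes entirely from the voting step. Below is a minimal sketch of majority-vote aggregation and the two accuracy measures, assuming a simple in-memory layout of worker responses; the toy data are purely illustrative.

from collections import Counter

def majority_vote(answers):
    # answers: list of "yes"/"no" strings collected for one axiom.
    return Counter(answers).most_common(1)[0][0]

def evaluate(responses, gold):
    # responses: {axiom_id: [(worker_id, answer), ...]}, gold: {axiom_id: answer}
    aggregate_correct = 0
    individual = []   # per-response correctness, for the average single worker
    for axiom, worker_answers in responses.items():
        answers = [a for _, a in worker_answers]
        aggregate_correct += majority_vote(answers) == gold[axiom]
        individual.extend(a == gold[axiom] for a in answers)
    aggregate_acc = aggregate_correct / len(responses)
    average_acc = sum(individual) / len(individual)
    return aggregate_acc, average_acc

# Hypothetical toy input: two axioms, five workers each.
toy = {
    "ax1": [("w1", "yes"), ("w2", "yes"), ("w3", "no"), ("w4", "yes"), ("w5", "no")],
    "ax2": [("w1", "no"), ("w2", "no"), ("w3", "no"), ("w4", "yes"), ("w5", "no")],
}
print(evaluate(toy, {"ax1": "yes", "ax2": "no"}))  # aggregate accuracy exceeds the average worker's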
5 Discussion

Each of the experiments we performed highlighted various lessons we learned in developing a method for crowdsourcing ontology verification. A few lessons are particularly useful for the Semantic Web community. First, many of our experiments focused on changing small components of the task. Even so, through this process we greatly improved crowd worker performance. It is clear that each task will be unique, but in most cases, extensive controlled trials will assist in identifying the best way to crowdsource a task. Following this strategy, we verified a complex ontology with relatively high accuracy. In addition, our current results only serve as a baseline; through additional iteration, we expect the increases in accuracy to continue.

Second, using the refined tasks, we showed that crowd workers, in aggregate, can perform on par with experts on domain-specific tasks when provided with simple tasks and the proper context. The addition of context was the single biggest factor in improving performance. In the Semantic Web, a trove of structured data is available, all of which may provide such needed context (and maybe other elements, such as qualification tests). For example, when using the crowd to classify instance-level data, the class hierarchy, definitions, or other instance examples may all assist the crowd in their task.

Our results suggest that crowdsourcing might serve as a method to improve other ontology engineering tasks such as typing instances, adding definitions, creating ont
