Can the Internet grade math? Crowdsourcing a complex scoring task and picking the optimal crowd size

Nathan VanHoudnos
Heinz First Paper / Advanced Data Analysis Paper / Integrative Interdisciplinary Project
September 20, 2011

Abstract

This paper presents crowdsourcing as a novel approach to reducing the grading burden of constructed response assessments. We find that the average rating of 10 workers from a commercial crowdsourcing service can grade student work cheaply ($0.35 per student response) and reliably (Pearson correlation with the teachers' gold-standard scores, ρ = 0.86 ± 0.04 for Grade 3 to ρ = 0.72 ± 0.07 for Grade 8). The specific context of our proof-of-concept dataset is 3rd-8th grade constructed response math questions. A secondary contribution of the paper is the development of a novel subsampling procedure, which allows a large data-collection experiment to be split into many smaller pseudo-experiments in such a way as to respect within-worker and between-worker variance. The subsampling procedure plays a key role in our calculation that the average of 10 workers' scores suffices to produce reliable scores.

1 Introduction

Teachers are busy. Time use studies suggest the average teacher needs over 50 hours per week to complete his or her work (Drago, Caplan, Constanza; NEA Research, 2010). Teachers say spending this additional time beyond their contractually obligated 37 hours per week is stressful. When asked

    "What in your present position as a teacher HINDERS YOU MOST from providing the best service of which you are capable?"

and given a blank space with which to respond, teachers consistently answer

    "Heavy work-load."

Since 1976, this response was either the top (1981, 1986, 1996, 2001, and 2006) or second (1976 and 1991) most common answer (NEA Research, 2010).

Assessing students is part of what keeps teachers busy; they devote between one-third and one-half of their workweek to assessment-related activities. Specifically, just scoring student work and recording it consumes 10% to 15% of their workweek (Stiggins; Stoddard; NEA Research, 2010). [...] (2) focuses on the tasks the students perform (Sanford; Roth, Brooks-Gunn, Linver; Young). [...] they ask someone or something else to score the student work. We categorize these offloading approaches into four types depending on who or what is scoring the student work. The categories are computer automated scoring, peer-scoring, self-scoring, and hiring teaching assistants.

Computer automated scoring comes in two kinds. The first kind consists of systems that score either multiple choice questions or fixed answer questions. The system knows which response is correct because the student made a particular kind of mark with a pencil, or clicked a particular radio button in a web browser, or entered a specific number (15.3 m/s) in a text box. As discussed, this kind of system cannot score high-level open format assessments, so we do not consider it further. The second kind of computer automated scoring builds models to predict the score that an expert human rater would assign. For example, a computer can "read" an essay, extract the low-level syntax features, fit a statistical model, and output a prediction of the score an expert human rater would have assigned. The computer does not understand what it is doing; it is simply returning the most likely result from the model (Bennett, 2011).
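To make this second kind of automated scoring concrete, the sketch below extracts a few shallow features from raw essay text, fits a regression model to expert-assigned scores, and returns the model's prediction for a new essay. The feature choices, the ridge regression, and all function names are illustrative assumptions only; they are not the features or model of e-Rater or any other operational system.

    # A minimal sketch of model-based automated scoring, under the assumptions
    # stated above: the model does not "understand" the essay, it only predicts
    # the score an expert rater would likely assign from shallow features.
    import numpy as np
    from sklearn.linear_model import Ridge

    def shallow_features(essay):
        """Extract a few low-level syntax features from raw essay text."""
        words = essay.split()
        n_words = len(words)
        n_sentences = max(essay.count(".") + essay.count("!") + essay.count("?"), 1)
        avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
        return [n_words, n_words / n_sentences, avg_word_len]

    def fit_scoring_model(essays, expert_scores):
        """Fit a model that predicts an expert rater's score from shallow features."""
        X = np.array([shallow_features(e) for e in essays])
        return Ridge(alpha=1.0).fit(X, expert_scores)

    def predict_score(model, essay):
        """Return the most likely score under the fitted model."""
        return float(model.predict(np.array([shallow_features(essay)]))[0])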
In practice, this works fairly well for structured essays such as GMAT scoring by e-Rater (Burstein, Leacock, Attali). [... Cho, Schunn, Sadler; Kaufman; Paolacci, Chandler ...] The following terms describe work on Amazon Mechanical Turk (AMT):

- Requesters are the people or organizations who ask the Turkers to perform work;
- a HIT, or Human Intelligence Task, is the basic unit of work that a Requester offers to the pool of Turkers; and
- an Assignment is a collection of HITs that the Requester assembles together to accomplish his or her specific task.

To use these terms in a simple example, if Amazon wants to use AMT to verify that all of the books in their current-event category are, in fact, books about current events, then Amazon would be the Requester. An individual HIT might be defined as asking a Turker to check if a single book describes a current event, and the whole Assignment would be a list of all the books currently classified as current-event books.

To explain the common use of AMT, we give a more involved example. Consider the case of a Requester who has a large number of photographs (10,000) taken by tourists that may or may not contain a corporate logo. The goal is two-part: first, to determine the subset of photos that contain logos and, second, to assign short descriptive labels to all of the images. Since the two goals do not depend on each other, the Requester chooses to split the work into two separate Assignments, each with their own type of HIT. The first type of HIT asks the Turkers to answer a Yes/No question ("Does this image contain a corporate logo?") for ten photographs. The second HIT would be very similar to the first, but instead of asking a Yes/No question, a blank text box would be displayed under each image so that the Turker could type in an appropriate label. Given the more open nature of the second HIT, the Requester must pay close attention to the instructions given to the Turkers. Otherwise, one subset of Turkers may interpret the directions in one way while a different subset may interpret the directions in a different way (Welinder, Branson, Belongie; Hullman, 2011).

The Requester should give careful thought to the grain-size of the HIT, the amount of work a given HIT contains. Depending on the dimensions of the photos and the typical resolution of a Turker's computer display, the Requester could choose between a single photograph or tens of photographs to comprise a single HIT. Too many photographs, and the Turkers may see the task as overly burdensome; too few, and the Turkers may find that the overhead delay to get the next set of photographs negatively impacts their efficiency (Chilton, Horton, ...).
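To make the notion of grain-size concrete, the sketch below splits the hypothetical 10,000-photograph Assignment into HITs of ten photographs each. The HIT data structure, the function names, and the placeholder photo URLs are assumptions for illustration only; this is not the AMT API.

    # A sketch of splitting an Assignment into HITs of a chosen grain-size,
    # using the logo-detection example above. All names are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class HIT:
        question: str                                   # the single question shown to the Turker
        photo_urls: list = field(default_factory=list)  # the photos included in this HIT

    def build_assignment(photo_urls, question, grain_size=10):
        """Split the full list of photos into HITs of `grain_size` photos each."""
        return [
            HIT(question=question, photo_urls=photo_urls[i:i + grain_size])
            for i in range(0, len(photo_urls), grain_size)
        ]

    # The first of the two Assignments described above: 10,000 photos at
    # ten photos per HIT yields 1,000 HITs.
    logo_assignment = build_assignment(
        [f"photo_{k}.jpg" for k in range(10_000)],
        question="Does this image contain a corporate logo?",
        grain_size=10,
    )
    assert len(logo_assignment) == 1000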
[...] Propel EAST has only been in operation since 2005, and the majority of its teachers are at the beginning of their careers. Although it primarily serves students from low-income families, the school has made Adequate Yearly Progress in each year of its operation. This distinction is not shared by many other Pennsylvania schools that serve this demographic.

Propel EAST is convenient for researchers. The school collects a lot of data on its students. A large part of this data arises from third-party exams that predict the school's performance on the PSSA, Pennsylvania's high-stakes end-of-year assessment. Propel EAST uses both computer-based adaptive assessments, like the Measures of Academic Progress (MAP) exam from the Northwest Evaluation Association, and more traditional pencil-and-paper assessments that mimic the format and structure of the PSSA, like the 4Sight exams from the Success for All Foundation.

Although machines score the entirety of the MAP exam, only the multiple choice section of the 4Sight exams can be scored by machine. Each 4Sight exam contains two constructed response questions that teachers must score by hand. This is an inconvenience both to the teachers who volunteer to score (the teachers could be spending this time on other tasks) and to the teacher whose students are being tested (unlike the MAP exam, the results are not available until the volunteers can gather to process the scores). Propel accepts these inconveniences because the open-ended nature of the constructed response questions is a valuable component of predicting how the students will perform on the PSSA.

Our collaboration with Propel EAST revolved around designing and pilot-testing a drop-in replacement for their current, inconvenient practice. To motivate the issues involved, we give a short overview of the grading task the expert teachers perform.

4.2.1 Rubrics: The tool the experts use

The Success for All Foundation provided a rubric specific to each question. Figure 1 gave an example of student work for Grade 6 Question 30; Figure 2 displays the rubric used to grade that student work. It is the most complex rubric in the dataset provided by Propel EAST. Its proper use requires careful training and attention to detail.

The rubric in Figure 2 is designed for experts to use. First, it contains embedded exceptions that the volunteer teachers must catch. The scoring of parts A, B, and D is very simple: the volunteer teacher must only answer a simple Yes/No question, "Was the answer provided by the student correct?" For example, the student work given in Figure 1 received 1 point for part A, 1 point for part B, and 0.5 points for part D.

The rubric becomes more complicated for part C. The volunteer teacher must first determine whether the student followed the direction to not use multiplication. Only if the student used a different operation (i.e., addition, subtraction, or division) is the volunteer teacher to determine the quality of the student response on a scale with three levels (i.e., 1.5 points, 1.0 point, or 0.5 points). For the example response in Figure 1, note that although the student constructed a valid function table, the student ignored the specific instruction not to use multiplication. Consequently, the student received zero points for part C. The raw score of this example student was then 2.5 out of 4 points. The last step the volunteer teacher took, which is not shown in the example rubric, was to convert the raw score to a Likert-type scale using Table 1. This final step also required attention to detail because the table uses a non-standard rounding scheme.

    Likert Scale   Raw Score         Level of Student Understanding
    4              4 only            Thorough
    3              3 or 3.5          General
    2              2 or 2.5          Partial
    1              0.5 or 1 or 1.5   Minimal
    0              0 only            None

Table 1: Likert-type scale used for the 4Sight rubrics.
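As a minimal sketch of the scoring logic just described, the code below assigns points to parts A through D of Grade 6 Question 30 and converts the raw score to the Likert scale of Table 1. The function names and argument structure are assumptions; the point values and the conversion follow the rubric and Table 1 as described above.

    # A sketch of the expert scoring procedure for Grade 6 Question 30,
    # under the assumptions stated above.

    def part_c_points(used_multiplication, quality_points):
        """Part C: zero if the student used multiplication, otherwise the
        quality rating on the three-level scale (1.5, 1.0, or 0.5 points)."""
        return 0.0 if used_multiplication else quality_points

    def raw_score(a_correct, b_correct, used_multiplication, c_quality, d_correct):
        """Parts A and B are worth 1 point each; part D is worth 0.5 points."""
        return (1.0 * a_correct
                + 1.0 * b_correct
                + part_c_points(used_multiplication, c_quality)
                + 0.5 * d_correct)

    def likert_from_raw(raw):
        """Table 1: the non-standard rounding from raw points to the Likert scale."""
        if raw == 4.0:
            return 4                      # Thorough
        if raw in (3.0, 3.5):
            return 3                      # General
        if raw in (2.0, 2.5):
            return 2                      # Partial
        if raw in (0.5, 1.0, 1.5):
            return 1                      # Minimal
        return 0                          # None (raw score of 0 only)

    # The example response in Figure 1: A and B correct, multiplication used
    # in part C (so the quality rating is ignored), and 0.5 points for part D.
    example = raw_score(True, True, True, 1.5, True)
    assert example == 2.5 and likert_from_raw(example) == 2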
The rubric required the volunteer teacher to pay careful attention to its instructions in order to use it reliably. First, the Yes/No questions in parts A and B were worth 1 point, while the Yes/No question in part D was worth 0.5 points. Even though the student from Figure 1 did not follow the directions for part C, the student still earned full credit for the explanation of that incorrect answer in part D. Given that the volunteer teacher was invested in the activity of scoring student work, this is not a source of concern for this traditional method of scoring. It is, however, something that must be addressed when crowdsourcing the scoring.

Figure 2: The 4Sight rubric for Grade 6 Question 30. An example of student work for this question was given in Figure 1.

4.3 Interface Design: the key to crowdsourcing

Crowdsourcing takes a complex task commonly performed by an expert, breaks it up into smaller pieces, and then distributes those pieces to a crowd of workers. How to split the Assignment into HITs and display those HITs is key to the success of crowdsourcing. It is the focus of this section.

We define an interface as the implementation of the design decisions of splitting the Assignment into HITs and building a way to display them to the Turkers. We begin with a description of a bad interface that motivates two general lessons. We then describe the lessons so that the resulting interface successfully splits an Assignment into individual HITs and displays the HITs intuitively so that Turkers will successfully complete the Assignment.

4.3.1 An example of a bad interface to learn from

We present the first interface we offered to the Turkers as an example of a bad design to learn from. It failed. Not enough Turkers were willing to complete the HITs, even when offered the comparatively high wage of $0.15 per HIT, so the Assignment never completed.

This first interface closely mimicked what the volunteer teachers did. First, the Turkers read the same rubric the volunteer teachers used. After reading the rubric, the Turkers saw an example of student work for the first part of the first question. Since the Turkers had computer monitors instead of physical paper rubrics and physical paper copies of student work, the interface design kept the relevant pieces of information on the screen at the same time. This display pattern gave a piece of student work, then a relevant piece of the paper rubric, and then finally a multiple choice question that implemented the question asked by the rubric. Three student responses were shown in a single HIT so that the Turkers who invested the time to understand the rubric could quickly rate the work of several students at once.

Figure 3 displays a zoomed-out screenshot to give a sense of the interface's sheer size. The text is much too small to read because each black arrow represents the height of a standard computer display, approximately 750 pixels. The first three arrows span a complete copy of the rubric and the informed consent. The next four and a half arrows span the response from the first student, which is interleaved with the rubric. The last half of the last arrow spans the beginning of the second student's response. Although not shown in the figure, the interface also displayed a third student response. The Turker had to press the page down key approximately 17 times to travel from the top of the interface to the submit button at the bottom.

Figure 3: A zoomed-out view of the first rubric implementation we attempted. Each of the black arrows represents approximately 750 pixels, roughly the amount of information a computer screen can display.

This interface failed in at least three ways. First, it gave the Turkers the expert-level task without breaking it down for them or offering scaffolding.
This was an error because the crowd from Amazon Mechanical Turk is not a crowd of teaching assistants. They do not have the expertise to grade with the rubrics the teachers used. Second, this interface was ugly and hard to navigate. This was an error because Turkers choose HITs from a large pool. If a HIT is not attractive, Turkers are less likely to choose it, and the Assignment may not complete.

4.3.2 Lesson 1: Split the expert-task and scaffold non-experts

The rubrics the volunteer teachers used give minimal direction because they assume that the people who will use them are experts with long attention spans who will receive training in their use. For example, in Figure 2 the part C portion of the rubric merely reminds the teacher, with a note at the bottom of the instructions, that no points are to be awarded if the student used multiplication. That exception could be missed by a non-expert who is overwhelmed with other information.

The easiest way to scaffold the non-expert Turkers is to reduce the grain-size, the amount of work they perform in a given HIT. If, for example, Turkers were only asked to grade part C in Figure 2, then the exception of using multiplication is unlikely to be missed. It is not, in principle, necessary to keep part A, part B, and part C from the same student together in the same HIT. A Requester can construct HITs of arbitrarily small grain-size as long as the specification of the Assignment still covers all of the parts of all of the questions the students answered.

However, HITs that are too small cannot be easily compared with the teachers' ratings. The volunteer teachers rated the student work at the question level, so we can only directly compare individual Turkers with the teachers when the same Turker rates the whole question. Since this proof-of-concept study requires this direct comparison to identify Turker quality, it sets a limit on the allowable grain-size of the HIT. For our case, a whole question must be a single HIT. Generally, the Requester must choose the grain-size to balance the needs of the Turkers with the needs of the problem the Assignment solves.

Since our grain-size for the Turkers is the same as the grain-size for the expert teachers, we must scaffold the Turkers throughout the task. We chose to do this by collapsing the rubric the teachers used into a series of judgments, which are the smallest answerable pieces of a rubric, and then combining the judgments within a HIT to calculate the score the Turker indirectly assigned. For example,

    In Part A, did the student give the correct answer (144)?
    Yes / No

is a simple judgment. We use multiple judgments to scaffold Turkers through rubric exceptions by hiding the scoring consequences of the exceptions and instead only asking the Turker to find the exceptions. For example, the multiplication exception in part C of Figure 2 can be implemented by asking:

    In part C, did the student use an operation other than multiplication to construct their table?
    Yes / No

    For part C, rate the quality of the student's work.
    - Everything is correct.
    - Minor errors are present.
    - There is only some correct work.
    - The answer is wrong.

and then using the answers to these two judgments to calculate the appropriate score for the HIT.
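As a sketch of how the answers to these two part C judgments might be combined behind the scenes, the code below hides the scoring consequence of the multiplication exception from the Turker and computes the part C points the Turker indirectly assigned. The mapping from the quality choices to point values is an assumption consistent with the three-level scale described in Section 4.2.1.

    # A sketch of combining the two part C judgments into a score, under the
    # assumptions stated above. The Turker only answers the judgments; the
    # scoring consequence of the multiplication exception is applied here.

    QUALITY_POINTS = {
        "Everything is correct.": 1.5,
        "Minor errors are present.": 1.0,
        "There is only some correct work.": 0.5,
        "The answer is wrong.": 0.0,
    }

    def score_part_c(used_other_operation, quality_choice):
        """Combine the two judgments into the part C score.

        used_other_operation: "Yes" or "No" answer to the first judgment.
        quality_choice: the option selected in the second judgment."""
        if used_other_operation == "No":
            # The student used multiplication, so the exception overrides
            # whatever quality rating the Turker selected.
            return 0.0
        return QUALITY_POINTS[quality_choice]

    # The Figure 1 example: the student used multiplication, so part C
    # scores zero regardless of the quality rating.
    assert score_part_c("No", "Everything is correct.") == 0.0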
In summary, the first design lesson is to break the Assignment into the smallest possible HITs and scaffold the Turkers through the remaining within-HIT judgments. This will give a task-level design that Turkers will be able to complete. The next lesson addresses attracting the Turkers to these split and scaffolded Assignments.

4.3.3 Lesson 2: Make the HIT attractive and easy for the Turkers to use

Turkers find Assignments through text-only advertisements on the AMT Marketplace. If a Turker is interested in an Assignment, AMT gives the Turker a preview of a randomly selected HIT from the Assignment. The Turker may then choose to accept the HIT or return to browsing for other Assignments. Consequently, the preview of a HIT is part of the advertisement that draws Turkers to the Assignment.

Therefore, if the HIT does not immediately engage the Turker with clear and interesting work, the Turker is likely to move on to other Assignments. Careful choice of HIT grain-size and scaffolding for within-HIT judgments will bring the HIT close to the expectations of the Turkers, but the HIT must also be [...]
