An Introduction to Criterion-Referenced Item Analysis and Item Response Theory

Abstract: Item analysis (IA) is a term used, more specifically in the context of classical measurement, to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. Its purpose is to select which items will remain on future revised and improved versions of the test. Sometimes item analysis is also performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. Item response theory (IRT), or latent trait theory as it has been variously termed, is a general measurement theory developed independently by Birnbaum in the United States and by Rasch in Denmark. It refers primarily, but not entirely, to three families of analytical procedures, identified as the one-parameter, the two-parameter, and the three-parameter logistic models. What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and person response patterns given the overall pattern of responses in a set of test data. They also offer new and improved ways of estimating item difficulty and person ability. This article illustrates why item analysis and item response theory (IRT) are useful for teachers when they construct tests and examine measurement equivalence. The introduction contains three parts: 1. criterion- or domain-referenced vs. norm-referenced or standardized tests; 2. item analysis; 3. item response theory. It shows the difference between criterion-referenced test (CRT) IA and norm-referenced test (NRT) IA; because of the complexity of IRT, only its three-parameter model is introduced.

Key words: criterion-referenced test; item analysis; item response theory; test

1. Introduction

Item analysis (IA) is an aspect of test analysis which involves examination of the characteristics of test items (Alan Davies et al., 1999).

The term is used more specifically in the context of classical measurement to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. Its purpose is to select which items will remain on future revised and improved versions of the test. Sometimes item analysis is also performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. Item response theory (IRT), or latent trait theory as it has been variously termed, is a general measurement theory developed independently by Birnbaum in the United States and by Rasch in Denmark (Grant Henning, 2001); it refers primarily, but not entirely, to three families of analytical procedures, identified as the one-parameter, the two-parameter, and the three-parameter logistic models. What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and person response patterns given the overall pattern of responses in a set of test data. They also offer new and improved ways of estimating item difficulty and person ability.
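Although the three-parameter model is only taken up later in the paper, the standard form of the three-parameter logistic function may help fix ideas here. The following minimal sketch (in Python; the function name and parameter values are ours, chosen purely for illustration) computes the probability of a correct response given a person's ability theta and the item's discrimination (a), difficulty (b), and guessing (c) parameters:

    import math

    # Standard three-parameter logistic (3PL) function: the probability
    # that a person of ability theta answers the item correctly, given
    # discrimination a, difficulty b, and guessing parameter c.
    def p_correct(theta, a, b, c):
        return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

    # Illustrative item: moderate discrimination (a=1.0), average
    # difficulty (b=0.0), guessing floor 0.2 (a five-option item).
    print(round(p_correct(0.0, 1.0, 0.0, 0.2), 2))   # 0.6
    print(round(p_correct(2.0, 1.0, 0.0, 0.2), 2))   # 0.9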

2. Criterion- or Domain-Referenced vs. Norm-Referenced or Standardized Tests

From the definition of IA we see that IA, belonging more specifically to classical measurement, is concerned with item difficulty or item facility (IF) and item discrimination (ID), which are used in traditional norm-referenced test (NRT) item analysis (James D. Brown and Thom Hudson, 2002).

Before we study item analysis, we should first understand what CRTs and NRTs are.

2.1 Criterion-Referenced Tests

A general definition for a criterion-referenced test (CRT) was first provided by Glaser in 1963 (Rui Huang, a: 2004). He first defined criterion-referenced measures as indicating the content of the behavioral repertory, and the correspondence between what an individual does and the underlying continuum of achievement: measures which assess students' achievement in terms of a certain criterion standard thus provide information as to the degree of competence attained by a particular student, independent of reference to the performance of others (p. 519). Later, in 1971, he and Nitko gave a clear and simple definition of a CRT: "A criterion-referenced test is one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards. Performance standards are generally specified by defining a class or domain of tasks that should be performed by the individual." (p. 653) Criterion-referenced tests are thus useful for teachers both in clarifying teaching objectives and in determining the degree to which those objectives have been met. CRTs are also often used for professional accreditation purposes, i.e. the test represents the types of behaviors considered critical for participation in the profession in question (Alan Davies: 38). Test scores on CRTs report a candidate's ability in relation to the criterion, i.e. what the candidate can and cannot do, rather than comparing his or her performance with that of other candidates in the relevant population, as happens in norm-referenced tests. Test results are often reported using descriptive scales (i.e. percentages) rather than a numerical score. In contrast to norm-referenced tests, the criterion, or cut-score, is set in advance (Rui Huang, b: 2004).

2.2 Norm-Referenced Tests

A norm-referenced test is a type of test whereby a candidate's scores are interpreted with reference to the performance of the other candidates.

Thus the quality of each performance is judged not in its own right, or with reference to some external criterion, but according to the standard of the group as a whole. In other words, norm-referenced tests are more concerned with spreading individuals along an ability continuum, the normal curve, than with the nature of the task to be attained, which is the focus of criterion-referenced tests (Alan Davies, 1999). Where an alternate version of a norm-referenced test is being developed, interpretation of raw scores on the new version may be made in the light of normative performance (i.e. the mean and standard deviation) on the previous version, as is the case for widely administered tests such as TOEFL. For norm-referencing to be effective, it is important that there be a large number of subjects and a wide range of normally distributed scores. The ranking capacity of norm-referenced tests is sometimes used to set cut-off scores, so that, for example, only those examinees who score at least 60% on the test are allowed to pass (Huizhong Yang, 2001).

2.3 Distinctions between Criterion-Referenced and Norm-Referenced Testing

As many educators and members of the public fail to grasp the distinctions between CRTs and NRTs, we may draw a chart comparing the two types of test in terms of purpose, content, item characteristics, and score interpretation. Much confusion can be eliminated if the basic differences are understood. The following chart, adapted from Popham, J.W. (1975), clearly distinguishes CRTs from NRTs.

Dimension: Purpose
  Criterion-referenced tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
  Norm-referenced tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
  Criterion-referenced tests: Measure specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
  Norm-referenced tests: Measure broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item characteristics
  Criterion-referenced tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
  Norm-referenced tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score interpretation
  Criterion-referenced tests: Each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage (Rui Huang, b).
  Norm-referenced tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine.

3. Item Analysis

In most language testing situations we are concerned with the writing, administration, and analysis of appropriate items. A test is considered to be no better than the items that go into its composition. Weak items should be identified and removed from the test.

Thus there are certain principles we can follow in writing items that may ensure greater success when the items undergo formal analysis. There are several ways to define item analysis (IA). Jack C. Richards et al. (1992) defined IA as the analysis of the responses to the items in a test, in order to find out how effective the test items are and whether they indicate differences between good and weak students. As Alan Davies notes in his Dictionary of Language Testing (1999: 92), IA is an aspect of test analysis which involves examination of the characteristics of test items; the term is used more specifically in the context of classical measurement to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. On the whole, IA will be defined in this paper as the systematic statistical evaluation of the effectiveness of individual test items. It is usually done for the purpose of selecting which items will remain on future revised and improved versions of the test. Sometimes, however, item analysis is performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. IA can take numerous forms, but when testing for norm-referenced purposes there are two traditional item statistics that are typically applied: item facility and item discrimination. In developing CRTs, other indices are used: the difference index, the B-index, the agreement statistic, and item phi (Φ).

3.1 Traditional Item Analysis

Traditional NRT item analysis has been used for many years. It is applied, almost always, to multiple-choice tests (Robert Wood, 1993).

3.1.1 Item Facility

Item facility goes by many other names: item difficulty, item easiness, p-value, or simply IF (James D. Brown: 114). Whatever it is called, it is a measure of the ease of a test item: the proportion of the students who answered the item correctly. It may be determined by the formula:

IF = R / N

where R = the number of correct answers and N = the number of students taking the test. The higher the ratio of R to N, the easier the item (Jack C. Richards: 240). It is important to note that this formula assumes that items left blank by examinees are wrong (James D. Brown et al.: 114). Calculating IF results in values ranging from 0 to 1.00 for each item. For instance, an IF index of 0.21 (item 23 in Table 1) would indicate that 21% of the examinees answered the item correctly; this would seem to be a very difficult item, because 79% missed it. An IF of 0.94 (item 20 in Table 1) would indicate that 94% of the examinees answered correctly, a very easy item, because almost all of the examinees got it right.
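As a minimal sketch of this computation (in Python; the function name and data are ours, not the paper's), IF can be computed directly from a list of scored responses, with blanks counted as wrong per the assumption above:

    # Item facility: proportion of examinees who answered correctly.
    # Blank responses (None) count as wrong, as noted above.
    def item_facility(responses):
        correct = sum(1 for r in responses if r == 1)
        return correct / len(responses)   # R / N

    # 21 of 100 examinees answer correctly -> IF = 0.21, a difficult item.
    scores = [1] * 21 + [0] * 74 + [None] * 5
    print(item_facility(scores))          # 0.21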

The apparently simple information provided by the item facility statistic can prove very useful. Consider the pattern of right and wrong answers shown in Table 2 (taken from Rui Huang, b: 2004). The examinees' responses are recorded as 1s for correct answers and 0s for incorrect answers. Notice that item 4 was answered correctly by every examinee (as indicated by the 1s straight down that column). It is equally easy to identify item 9 as the most difficult, because no examinee answered it correctly. According to the IF formula we can calculate IF for all the items, as shown in Table 3; items 9 and 4 may therefore be revised or rejected.

Table 2: Item analysis data (first 10 items only)

Student no.   Items: 1  2  3  4  5  6  7  8  9  10  etc.   Total (%)
36                   1  1  1  1  1  0  1  1  0  1   ...    96
37                   0  1  1  1  1  0  1  0  0  1   ...    95
38                   0  0  1  1  1  0  1  0  0  1   ...    92
39                   1  0  1  1  1  0  0  0  0  1   ...    91
40                   1  1  1  1  0  0  1  0  0  1   ...    90
41                   1  0  1  1  0  0  1  1  0  1   ...    90
42                   0  1  1  1  1  0  1  0  0  1   ...    88
43                   1  0  1  1  0  0  1  0  0  1   ...    80
44                   0  1  1  1  0  1  1  0  0  1   ...    79
45                   0  1  1  1  0  1  0  0  0  1   ...    72
46                   1  0  0  1  1  1  1  1  0  1   ...    67
47                   1  0  0  1  0  1  0  1  0  1   ...    66
48                   1  1  0  1  0  1  0  0  0  1   ...    64
49                   0  0  0  1  0  1  0  1  0  1   ...    64
50                   0  0  0  1  0  1  1  0  0  0   ...    61
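Applying the formula column by column to a response matrix reproduces the pattern just described. As an illustrative sketch (Python; variable names are ours), the rows below are students 36-40 from Table 2, so the resulting values are those of the "IF upper" row of Table 3, with item 4 at 1.00 and item 9 at 0.00:

    # Rows are examinees (students 36-40 of Table 2), columns are items 1-10.
    matrix = [
        [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],   # student 36
        [0, 1, 1, 1, 1, 0, 1, 0, 0, 1],   # student 37
        [0, 0, 1, 1, 1, 0, 1, 0, 0, 1],   # student 38
        [1, 0, 1, 1, 1, 0, 0, 0, 0, 1],   # student 39
        [1, 1, 1, 1, 0, 0, 1, 0, 0, 1],   # student 40
    ]
    n = len(matrix)
    # IF for each item = column sum / number of examinees.
    ifs = [sum(row[i] for row in matrix) / n for i in range(10)]
    print([round(v, 2) for v in ifs])
    # [0.6, 0.6, 1.0, 1.0, 0.8, 0.0, 0.8, 0.2, 0.0, 1.0]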

3.1.2 Item Discrimination

Another important characteristic of a test item is how well it discriminates between weak and strong examinees in the ability being tested. Difficulty alone is not sufficient information upon which to base the decision ultimately to accept or reject a given item. Consider, for example, item 15 (IF = 0.5) in Table 1, which half of the examinees pass and half fail. Using difficulty as the sole criterion, we would adopt this as an ideal item. But what if we discovered that the persons who passed the item were the weaker half of the examinees, and the persons who failed it were the stronger examinees in the ability being measured? This would certainly cause us to have second thoughts about the suitability of such an item. If our test were composed entirely of such items, a high score would be an indication of inability and a low score an indication of comparative ability. What we need at this point is a method of computing item discrimination. Item discrimination (ID) is an entirely different statistic, which shows the degree to which an item separates the "upper" examinees from the "lower" ones.

These groups may also be called the "high" and "low" scorers, or the "upper" and "lower" proficiency groups. Usually the upper and lower groups are defined as the upper and lower third (33%), or the upper and lower 27% (James: 116).

Statistic   Item: 1    2     3     4     5     6     7     8     9    10
IF total         .53  .47   .67  1.00   .40   .53   .67   .33   .00   .93
IF upper         .60  .60  1.00  1.00   .80   .00   .80   .20   .00  1.00
IF lower         .60  .20   .00  1.00   .20  1.00   .40   .60   .00   .80
ID               .00  .40  1.00   .00   .60 -1.00   .40  -.40   .00   .20

Table 3: Item facility and discrimination statistics (taken from James, 2002)

For Table 2, for instance, we may divide the fifteen examinees into top, middle, and bottom groups of five (33% each).

Once the data are separated into upper and lower groups, discrimination indices can be calculated. The item facility for the upper and lower groups should be calculated separately: the IF for the upper group is found by dividing the number of examinees answering correctly in the upper group by the total number of examinees in that group, and similarly, the IF for the lower group by dividing the number who answered correctly in the lower group by the total number of examinees in the lower group. Therefore:

ID = IFupper - IFlower

where ID = item discrimination, IFupper = item facility for the upper group on the whole test, and IFlower = item facility for the lower group on the whole test. From Table 3 we get, for item 3, IFupper = R/N = 5/5 = 1.00 and IFlower = 0, so its ID = 1.00, which would be considered a very good discrimination index for an NRT: all strong examinees (upper group) answered correctly and all weak examinees (lower group) answered incorrectly. ID indices can range from 1.00 (if all of the upper group answer correctly and all of the lower group answer incorrectly) to -1.00 (if all of the lower group answer correctly and all of the upper group answer incorrectly, as with item 6 in the tables; such items should be revised).
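A minimal sketch of the ID computation (Python; function names are ours), using items 3 and 6 from Table 2, whose upper and lower groups are students 36-40 and 46-50 respectively:

    # Item discrimination: IF of the upper group minus IF of the lower group.
    def item_facility(responses):
        return sum(responses) / len(responses)

    def item_discrimination(upper, lower):
        return item_facility(upper) - item_facility(lower)

    # Item 3: all five upper examinees right, all five lower examinees wrong.
    print(item_discrimination([1, 1, 1, 1, 1], [0, 0, 0, 0, 0]))   # 1.0
    # Item 6: the reverse pattern, so ID = -1.0; such an item needs revision.
    print(item_discrimination([0, 0, 0, 0, 0], [1, 1, 1, 1, 1]))   # -1.0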

ID statistics can be used effectively to revise and improve norm-referenced tests. The calculations represented by the formula above are not difficult; however, because they must be done for each item, they can become tedious. Another statistic that is often used in place of the ID index described above is the correlation between item responses and total scores on the test. Since item scores in this instance are binary, i.e. zero or one, while total scores are continuous, the resulting correlation coefficient is called a point-biserial correlation coefficient. ID is shown here because it is easier to understand conceptually and to calculate.
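Since the point-biserial coefficient is simply the Pearson correlation between the binary item scores and the continuous total scores, it can be sketched as follows (Python; the data are invented for illustration, not taken from the paper's tables):

    import math

    # Point-biserial correlation: Pearson r between binary item scores
    # (0/1) and continuous total test scores.
    def point_biserial(item, totals):
        n = len(item)
        mx = sum(item) / n
        my = sum(totals) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(item, totals))
        sx = math.sqrt(sum((x - mx) ** 2 for x in item))
        sy = math.sqrt(sum((y - my) ** 2 for y in totals))
        return cov / (sx * sy)

    # Examinees who got the item right tend to have higher totals,
    # so the item discriminates well and the coefficient is high.
    item   = [1, 1, 1, 0, 0, 0]
    totals = [96, 95, 88, 72, 66, 61]
    print(round(point_biserial(item, totals), 2))   # 0.96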

In developing an NRT, the test revision process is accomplished by keeping the "best" items and discarding the weak ones. Items are selected if they have an IF value of between 0.40 and 0.70 and an ID index of generally 0.40 or above. As Cziko (1983) points out, these items are selected because they maximize variance (individual differences), and this, in turn, produces higher estimates of reliability and validity when traditional correlational statistics are used.

3.2 Criterion-Referenced Item Analysis

While it is possible to employ th
