Final Report for the IEA R&D 2023 Research Project entitled:
Using Large Language Models for Automatic Item Generation: Development and Validation for TIMSS Grade 4
PI: Marek Muszyński
Co-Is: Hubert Plisiecki, Artur Pokropek, Tomasz Żółtak
Institute of Philosophy and Sociology of the Polish Academy of Sciences (IFiS PAN)
Table of contents
Executive Summary
1. Introduction and study aims
2. Study preparation
2.1 Item extraction
2.2 Item Generation Design
2.3 Item generation and initial revision process
3. Subject matter experts panel
3.1 Subject matter experts panel design
3.2 Subject matter expert panel composition
3.3 SMEs panel results
3.3.1 Agreement Analyses
3.3.2 Raters' evaluations
Mathematics
Science
Additional Analyses
3.3.3 Raters' comments
3.4 SMEs Panel Summary and Discussion
3.4.1 Limitations
3.4.2 Future directions
3.4.3 Conclusions
4. Validation study
4.1 Item generation
4.1.1 Overall Approach
4.1.2 The Simple Setup
4.1.3 The Complex Setup
4.1.4 The Final Choice of Items
4.1.5 Generating Images
4.2 Test preparation
4.2.1 Test plan (incomplete block design)
4.2.2 Assessment contents
4.2.3 Paradata collection
4.3 Fieldwork preparation, procedures and timeline
4.3.1 School recruitment and remuneration
4.3.2 Testing room preparation
4.3.3 Assessment session procedures
4.3.4 Pilot administration
4.3.5 Data collection timeline
4.3.6 Sample composition
4.3.7 Booklet Coverage
4.3.8 Item coding
4.4 Psychometric modeling
4.4.1 Excluded items
4.4.2 Model estimation and model fit
4.4.3 Item parameters of TIMSS and LLM-generated items
4.4.4 Differential Item Functioning
4.4.5 Testing Local Item Independence
4.4.6 Summary of psychometric analyses
5. General discussion
5.1 Results summary and conclusion
5.2 Future directions
6. References
7. Supplement
7.1 Duplicates removed from the generated item set
7.2 Kruskal-Wallis Test Results
7.3 Additional results for DIF analysis
7.4 List of Annexes
Executive Summary
The study aimed to validate the quality of assessment items generated by Large Language Models for use in mathematics and science assessment, using TIMSS Grade 4 as an example. The validation process included expert ratings and qualitative assessment, as well as using the generated items in a field test that enabled the assessment of the substantive and psychometric properties of the generated items. Since LLM-generated items were mixed with original TIMSS items, we were able to compare the two item types directly.
The study's objective was to assess the current possibilities of widely accessible models without heavy reliance on human editing and selection. To this end, we neither edited the generated items nor selected them based on any criteria, apart from correcting minor and obvious errors (e.g., spelling) and removing repeated items.
The gathered empirical evidence showed that many of the LLM-generated items were of high quality. However, overall, the subject matter experts rated the generated items as less suitable for assessments and less attractive compared to original TIMSS items. Psychometrically, the generated items also performed less well than real TIMSS items. The main psychometric issues included: difficulty level not aligned with the target group, presence of items with a negative or low discrimination parameter, and violations of the local independence assumption. No differences were found between TIMSS and generated items in terms of differential item functioning, item fit, or distractor functionality.
The study also analysed the most common problems observed in the generated items and their underlying causes. The generated items were often too easy (especially the multiple-choice mathematics items) or too difficult (especially the constructed-response science items). LLMs frequently produced very similar and repetitive items. The response options generated also displayed several issues, such as implausible or unjustified distractors, repeated distractors, absence of a correct answer, or the presence of multiple correct answers.
The LLM also demonstrated difficulty in using precise terminology and appropriate style. Occasional problems with style and grammar may have stemmed from the use of Polish as the assessment language, which has relatively less digital content available for LLM training compared to English (a common issue for "resource-poor" languages). Overall, the mathematics items appeared to perform better than the science items.
We recommend continuing this line of research using domain-specific fine-tuned models run locally, as they currently hold the most promise for producing high-quality LLM-generated items. Human intervention and judgment remain crucial for selecting the best items.
1. Introduction and study aims
Items constitute a fundamental aspect of any assessment. A comprehensive measurement relies on psychometrically sound, linguistically correct, and theoretically aligned items. The growing demand for measurement programs has resulted in an increased need for such items. However, developing excellent items requires daunting amounts of precious resources. Typically, it involves teams of subject matter experts, psychometricians, and fieldworkers, who are responsible for developing, evaluating, piloting, and validating items before they enter fieldwork. The process is long, costly, and labor-intensive.
For a long time, researchers have aimed at streamlining the process of item development. This line of research was named automatic item generation (AIG) and uses different methods to shortcut the process of item "production". First, means such as item cloning were used (e.g., Glas & van der Linden, 2003). Here, very close duplicates of a model item are generated with essentially the same content but small changes to the item's stem and/or distractors. This framework was later developed to generate item clones on the basis of predefined item models and computer algorithms that helped to alter key item elements to get new items (Gierl et al., 2016). The number and quality of these "automatically" generated items depended on the number and quality of prepared item models and item elements that were set to vary between item versions. The whole process was semi-automatic rather than fully automatic and required a lot of human work (Attali et al., 2022).
Next, approaches based on deep learning started to emerge, where computer algorithms specialized in natural language processing were utilized to create elements of new items (stems, vignettes, distractors, etc.). At this stage, neural networks were mostly employed for AIG, but the process remained long and difficult, as early models needed time-consuming and often costly pre-training/fine-tuning that required significant technical knowledge. The outcomes were often far from perfect, as the created items were very basic and necessitated further human polishing. Examples of these early works are Gao et al. (2019), Liang et al. (2018), and Shin et al. (2019). Susanti et al. (2018) and Zu et al. (2023) investigated methods of automatic distractor creation for item stems. Mitkov et al. (2023) showed a method of creating whole items (stem + distractors) based on text passages used to train neural networks. Von Davier (2023) and Bezirhan and von Davier (2023) presented a method to create text passages for further item creation based on GPT-2 and GPT-3 models (early OpenAI models, see cited papers for more details).
The rapid development of large language models (LLMs) led to the emergence of new tools that are now capable of excellent processing and generation of natural language texts, as well as other tasks (e.g., Bubeck et al., 2023). The increase in the abilities of these models calls for a revisit of previous research on this topic. Recently, Drori et al. (2022) and Attali et al. (2022) proposed methods to use large language models to construct whole items in the domains of mathematics and English, respectively. Several studies succeeded in automatically developing self-report items, e.g., personality scales (Götz et al., 2023; Hommel et al., 2022; von Davier, 2018). What is more, initial research shows that it is possible to automatically generate items of pre-set difficulty and complexity (Raina & Gales, 2022; Settles et al., 2020). For example, Bezirhan and von Davier (2023) demonstrated that LLMs require only short prompts to generate items of varying difficulty (as assessed by the experts).
Another research area opened by the use of LLMs is prompt engineering, that is, research on developing and optimizing prompts to use models efficiently (Reynolds & McDonell, 2021). The outcome of AIG differs depending on the quantity and quality of the context used to prompt LLMs and the exact question (instruction) posed. Wang et al. (2022) compared the items generated with prompts differing in the number and complexity of examples, as well as in the length of the overall context provided. In this study, items of higher quality were generated with more content-specific examples, with five being the optimal number of examples provided. Shorter or medium-length context information (15-25 words long) yielded "better" items than long context (40 words; Wang et al., 2022). On the other hand, zero-shot prompts (no examples) were as effective as one-shot prompts (one example) in the context of GPT-3 and creating PIRLS passages (Bezirhan & von Davier, 2023).
The research on modelling task difficulty through prompts and on the most effective ways of prompting is just at its beginning (Attali et al., 2022). However, there is already evidence that LLMs can incorporate ("learn") knowledge provided "as contextual information within prompts" (Lampinen et al., 2022; Wang, 2023) as an alternative to fine-tuning them with an updated dataset (Drori et al., 2022). Prompting is naturally more time- and cost-effective, especially since newer LLMs allow for longer and more complex prompts with a lot of context provided.
This project aims to broaden the findings of early studies and show that AIG can be employed in the rigorous context of international large-scale assessments, such as TIMSS. Moreover, we intend to go further than previous studies in the context of IEA's assessments. For example, Bezirhan and von Davier (2023) presented a method to create text passages for further creation of PIRLS Grade 4 items and had the created passages evaluated by human raters. Here, we aim to generate complete items (stem and response options) and then not only subject them to human experts' evaluation but also validate them in a field test mimicking a real IEA assessment setting. We will also focus on developing items closely aligned to the TIMSS assessment framework, as well as generating items of pre-defined difficulty. Graphical item elements, common for TIMSS Grade 4, will be generated by graphical generative artificial intelligence programs, such as DALL-E 3, or with the use of graphical plug-ins for GPT-4. Recently, GPT-4 was equipped with new plug-ins that link with Python to produce tables and charts, which should be very handy for generating TIMSS Grade 4 items. In short, the main research aims are:
1. Develop methods of automatic item generation (AIG) using large language models (LLMs) for the maths & science assessment TIMSS Grade 4 (in Polish).
2. Validate these methods using a subject matter experts panel and a school-based assessment closely following the original TIMSS Grade 4 methodology and design.
3. Provide evidence-based knowledge on the possibilities of employing AIG for international large-scale assessments based on the examples generated for TIMSS Grade 4.
This is a fast-developing domain and, as such, the results of the study (Summer 2024) might not be fully generalizable to the current (September 2025) capabilities of LLMs.
2. Study preparation
2.1 Item extraction
TIMSS Grade 4 restricted use items in English and Polish were obtained from the IEA. The documents with Polish TIMSS items were passed to OpenAI's "gpt-4o" image API, page after page, with the following query:
Extract exercises from this TIMSS exam page image in a json format.
Whenever tables appear, encode them in the question variable using markdown alongside the exercise text. Whenever images appear, describe what they portray in Polish in the format [OBRAZEK: image_description] and add the description to the question variable.
The important variables are:
"exercise_id" (example formats: M_01_07; S_02_05)
"question": exercise text
"answers": answers
"contains_image" (boolean)
Make sure to produce valid jsons, escape characters properly, extract all exercises, and do not write anything else!
This allowed us to extract 174 TIMSS science items and 177 TIMSS maths items (both Grade 4). The extracted items were then manually checked for hallucinations and formatting problems by the members of the research team. While checking the items, the research team members also supplemented the database with the international IDs of the items by adding this field to every item in the JSON.
The metadata on the content and cognitive domains of the items were manually copied from the relevant TIMSS documents.
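The structural part of the checks described above can be partly automated before the manual review. The following is a minimal sketch, assuming each page reply is a JSON list (or a single object) carrying the four variables named in the query. The function names, the `international_id` field name, and the sample reply are hypothetical illustrations, not taken from the project code, and the sample is an invented placeholder rather than a real TIMSS item.

```python
import json

# Variables requested in the extraction query.
REQUIRED_KEYS = {"exercise_id", "question", "answers", "contains_image"}


def parse_extracted_page(raw_reply):
    """Parse one model reply and verify each exercise has the requested variables."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on invalid JSON
    exercises = data if isinstance(data, list) else [data]
    for exercise in exercises:
        if not isinstance(exercise, dict):
            raise ValueError("each exercise must be a JSON object")
        missing = REQUIRED_KEYS - exercise.keys()
        if missing:
            raise ValueError(f"missing variables: {sorted(missing)}")
        if not isinstance(exercise["contains_image"], bool):
            raise ValueError("contains_image must be a boolean")
    return exercises


def add_international_id(exercise, international_id):
    """Return a copy of the exercise with a (hypothetical) international_id field."""
    return {**exercise, "international_id": international_id}


# Invented placeholder reply, not a real TIMSS item.
reply = json.dumps([{
    "exercise_id": "M_01_07",
    "question": "Placeholder text [OBRAZEK: placeholder description]",
    "answers": ["A", "B", "C", "D"],
    "contains_image": True,
}])
page = parse_extracted_page(reply)
tagged = add_international_id(page[0], "PLACEHOLDER_ID")
```

A check of this kind only catches structural problems (invalid JSON, missing variables); hallucinated content still has to be found by comparing each item with the source page, as the research team did.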
A short comment on data safety and research reproducibility is in order. According to the company's policy (/api/), the API does not use the data to train AI models and does not process it in any other way. Moreover, the data is removed from the platform automatically after an indicated amount of time (/docs/guides/your-data). The use of the OpenAI API was discussed with the IEA before any operations took place. Hence, using the API, along with other "business" options to access OpenAI models, seems safe from the data protection point of view.
With regard to research reproducibility and open science, we cannot provide the IEA restricted use items employed to prompt the models, but any interested researcher can apply to use them for research purposes. Having IEA's consent and our prompts, anyone can reproduce the process. However, we cannot guarantee full reproduction, as the OpenAI models constantly change, which is beyond our control. This issue will be raised again in the discussion.
2.2 Item Generation Design
For the pilot study, the items were generated using the "GPT-4o" API from OpenAI. For the hyperparameters, the only non-default parameter was the temperature¹, which was set to 0.8 based on a short generation pilot study, in which we iteratively generated items until we had tuned this parameter to allow for maximum variability without a perceived negative impact on item intelligibility. We decided against testing this and other parameters in a systematic manner because of the time- and resource-consuming iterative process that it would have required. The number of completions (n) was left at its default of 1, and both presence_penalty and frequency_penalty remained at 0, so no additional bias for novelty or against repetition was imposed on the outputs. Because max_tokens was not set, the model was free to generate up to its full 4096-token completion limit within GPT-4o's 128,000-token context window, effectively letting the answer run until the model decided to stop. Different multiple-shot setups were used, where different numbers of example TIMSS items were submitted to the LLM alongside the request to generate new items.
¹ Temperature is the parameter used to adjust the outputs of large language models: the higher the temperature, the higher the randomness and the more diverse the outputs. However, the parameter cannot be directly linked to "creativity" (Peeperkorn et al., 2024).
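To illustrate why a higher temperature yields more diverse outputs, the sketch below applies the textbook softmax-with-temperature formula to three made-up token scores. It is an illustration under stated assumptions, not a description of how the OpenAI service implements sampling internally. The logits and the comparison temperatures 0.2 and 1.0 are invented for the example, while 0.8 and the entries of `REQUEST_SETTINGS` mirror the values reported above.

```python
import math

# Sampling settings as reported in the text (max_tokens deliberately not set).
REQUEST_SETTINGS = {
    "temperature": 0.8,
    "n": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; larger temperatures flatten the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher values mean more spread-out sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)
used = softmax_with_temperature(logits, REQUEST_SETTINGS["temperature"])
warm = softmax_with_temperature(logits, 1.0)

for label, probs in (("T=0.2", cold), ("T=0.8", used), ("T=1.0", warm)):
    print(label, [round(p, 3) for p in probs], "entropy:", round(entropy(probs), 3))
```

In this toy case the probability of the top-scoring token falls and the entropy rises as the temperature increases, which is the sense in which the footnote speaks of higher randomness.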
The setups were:
● zero-shot - 0 examples,
● one-shot - 1 example,
● three-shot - 3 examples,
● and five-shot - 5 examples.
The choice of 0, 1, 3, and 5 examples was made to systematically explore model behavior under different levels of supervision, in a way that is appropriate for a new task where no established few-shot benchmarks yet exist.
Given the novelty of the task, it was important to treat the evaluation setup as exploratory rather than to rely on conventions that may have been developed for substantially different problems. Predefined standards from other tasks may not necessarily translate well to this context, and could introduce unjustified assumptions about what levels of supervision are informative. We therefore adopted a progression of 0, 1, 3, and 5 examples.
Our choice was also based on the literature. Laverghetta Jr. & Licato (2023) used three examples, while Drori et al. (2022) used a wider array of examples, starting from zero-shot and ending with up to five examples. Wang et al. (2022) used an even larger number of examples, reaching as many as seven, but concluded that five examples led to the best results. Other studies used fewer examples, e.g., Bezirhan & von Davier (2023) used zero and one, and Omopekunola & Yu Kardanova (2024) used from one to three. Hence, our prompting scheme used the numbers of examples most commonly used in the literature.
This incremental design ensures that we can capture meaningful differences in model behavior as limited supervision is introduced, while avoiding the risks of overfitting to an arbitrary setup. In the absence of preexisting standards for this task, we believe this approach is both reasonable and scientifically justified.
For each setup, 80 items were generated by the LLM: 40 in science and 40 in mathematics. Each query specified both the content domain and the cognitive domain, which were randomly drawn based on the metadata of real TIMSS items. This ensured that the distribution of domains among the generated items reflected the actual TIMSS proportions. Example items were drawn from the TIMSS dataset, limited to those matching the selected content and cognitive domains.
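The sampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project code: it assumes the content and cognitive domains are drawn jointly by picking a random real item and copying its domain pair (the report does not say whether the two were drawn jointly or independently), and the function names and the toy metadata are hypothetical. The domain labels follow the TIMSS Grade 4 framework, but the toy pool is balanced and does not reproduce the real TIMSS proportions.

```python
import random

SETUPS = {"zero-shot": 0, "one-shot": 1, "three-shot": 3, "five-shot": 5}
ITEMS_PER_SUBJECT = 40  # 40 science + 40 mathematics = 80 items per setup


def draw_spec(pool, n_shots, rng):
    """Draw a domain pair in proportion to the real items, plus matching examples."""
    anchor = rng.choice(pool)  # picking a real item reproduces the empirical proportions
    content, cognitive = anchor["content_domain"], anchor["cognitive_domain"]
    matching = [it for it in pool
                if it["content_domain"] == content
                and it["cognitive_domain"] == cognitive]
    examples = rng.sample(matching, min(n_shots, len(matching)))
    return {"content_domain": content, "cognitive_domain": cognitive,
            "examples": examples}


def build_plan(items, rng):
    """Enumerate one generation request per item to be generated."""
    plan = []
    for setup, n_shots in SETUPS.items():
        for subject in ("science", "mathematics"):
            pool = [it for it in items if it["subject"] == subject]
            for _ in range(ITEMS_PER_SUBJECT):
                plan.append({"setup": setup, "subject": subject,
                             **draw_spec(pool, n_shots, rng)})
    return plan


# Toy metadata: five placeholder items per domain pair, no real item content.
CONTENT = {
    "mathematics": ["Number", "Measurement and Geometry", "Data"],
    "science": ["Life Science", "Physical Science", "Earth Science"],
}
COGNITIVE = ["Knowing", "Applying", "Reasoning"]
toy_items = [
    {"id": f"{subject}-{c}-{g}-{k}", "subject": subject,
     "content_domain": c, "cognitive_domain": g}
    for subject, contents in CONTENT.items()
    for c in contents for g in COGNITIVE for k in range(5)
]

plan = build_plan(toy_items, random.Random(2024))
print(len(plan))  # 4 setups x 2 subjects x 40 items
```

With real metadata, domain pairs that are rare among TIMSS items may have fewer matching items than the requested number of shots; the sketch then returns as many examples as exist, which is one possible way to handle that case.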
The prompt used to generate the items was created in an iterative manner, where the members of the team inspected some example generated items and refined the prompt. Its final form is presented below; please mind that the words in curly brackets were automatically swapped with the specific values chosen per specific generation instance:
"Create one item for the Polish TIMSS assessment. Do not write anything else than the content of the item. The item has to be in Polish.
Subject: {subject}
Content Domain: {content_domain}
Definition: {content_domain_definitions[content_domain]}
Cognitive Domain: {cognitive_domain}
Definition: {cognitive_domain_definitions[cognitive_domain]}
Age appropriateness: Grade 4 of the elementary school
Item format: Constructed response OR multiple-choice. Constructed response items usually require students to give a numerical result, provide a short explanation or description given in one or two phrases or sentences, create a list or complete a table. Multiple choice items have four response options, ONE OF THEM IS ALWAYS CORRECT AND THREE ARE INCORRECTt. The incorrect options need to be plausible choices, demonstrating typical errors or misunderstandings. They need to be similar to the correct option in terms of length and format, but be undoubtedly incorrect.
Additional comments: DO NOT GENERATE LINKS TO TABLES AND IMAGES, BUT PRESENT A DETAILED DESCRIPTION OF TABLES AND IMAGES IF YOU CHOOSE TO USE THEM IN YOUR ITEM (YOU DONT HAVE TO)
"
The prompt included the definitions of the sampled content and cognitive domains. These were prepared on the basis of the TIMSS assessment framework (Mullis et al., 2023, pp. 7-10 & 14-18 for maths; pp. 20-28 & 40-44 for science) and are shown below:
Cognitive Domains:
Knowing
Facility in applying mathematics, or reasoning about mathematical situations, depends on familiarity with mathematical concepts and fluency in mathematical skills. The more relevant knowledge a student is able to recall and the wider the range of concepts he or she understands, the greater the potential for engaging with a wide range of problem situations. It contains skills like Recall: Recall definitions, terminology, number properties, units of measurement, geometric properties, and notation (e.g., a × b = ab, a + a + a = 3a). Identify: Identify numbers, expressions, quantities, and shapes. Recognize when entities are mathematically equivalent. Read information from graphs, tables, texts, or other sources. Order: Order and classify numbers, expressions, quantities, and shapes by common properties. Compute: Compute arithmetic operations with whole numbers, fractions, decimals, and integers using algorithmic procedures. Carry out straightforward algebraic manipulation.
Applying
The applying domain involves the application of mathematics in a range of situations. Problem-solving is central to this domain. Students will need to select suitable operations, strategies, and tools for solving problems. Formulate: Determine efficient/appropriate operations, strategies, and tools for solving problems. Implem