2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-

上传人：加*** IP属地：北京上传时间：2025-11-22 格式：DOCX 页数：255 大小：8.24MB 积分：15 举报 版权申诉

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-_第2页

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-_第3页

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-_第4页

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-_第5页

已阅读5页，还剩250页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

FinalReportfortheIEAR&D2023

ResearchProjectentitled:

UsingLargeLanguageModelsforAutomaticItem

Generation:DevelopmentandValidationfor

TIMSSGrade4

PI:MarekMuszyński

Co-Is:HubertPlisiecki,ArturPokropek,TomaszŻółtak

InstituteofPhilosophyandSociologyofthePolishAcademyofSciences

(IFiSPAN)

Tableofcontents

ExecutiveSummary4

1.Introductionandstudyaims5

2.Studypreparation8

2.1Itemextraction8

2.2ItemGenerationDesign9

2.3Itemgenerationandinitialrevisionprocess15

3.Subjectmatterexpertspanel16

3.1Subjectmatterexpertspaneldesign16

3.2Subjectmatterexpertpanelcomposition19

3.3SMEspanelresults22

3.3.1AgreementAnalyses22

3.3.2Raters'evaluations26

Mathematics26

Science34

AdditionalAnalyses42

3.3.3Raters’comments43

3.4SMEsPanelSummaryandDiscussion47

3.4.1Limitations48

3.4.2Futuredirections49

3.4.3Conclusions49

4.Validationstudy50

4.1Itemgeneration50

4.1.1OverallApproach50

4.1.2TheSimpleSetup50

4.1.3TheComplexSetup58

4.1.4TheFinalChoiceofItems59

4.1.5GeneratingImages60

4.2Testpreparation60

4.2.1Testplan(incompleteblockdesign)60

4.2.2Assessmentcontents62

4.2.3Paradatacollection64

4.3Fieldworkpreparation,proceduresandtimeline64

4.3.1Schoolrecruitmentandremuneration64

4.3.2Testingroompreparation65

4.3.3Assessmentsessionprocedures66

4.3.4Pilotadministration68

4.3.5Datacollectiontimeline68

4.3.6Samplecomposition72

4.3.7BookletCoverage73

4.3.8Itemcoding74

4.4Psychometricmodeling77

4.4.1Excludeditems77

4.4.2Modelestimationandmodelfit78

4.4.3ItemparametersofTIMSSandLLM-generateditems82

4.4.4DifferentialItemFunctioning102

4.4.5TestingLocalItemIndependence105

4.4.6Summaryofpsychometricanalyses107

5.Generaldiscussion112

5.1Resultssummaryandconclusion112

5.2Futuredirections115

6.References120

7.Supplement125

7.1Duplicatesremovedfromthegenerateditemset125

7.2Kruskal-WallisTestResults128

7.3AdditionalresultsforDIFanalysis134

7.4ListofAnnexes144

ExecutiveSummary

ThestudyaimedtovalidatethequalityofassessmentitemsgeneratedbyLargeLanguageModelsforuseinmathematicsandscienceassessmentontheexampleoftheTIMSSGrade4.Thevalidationprocessincludedexpertratingsandqualitativeassessment,aswellasusingthegenerateditemsinafieldtestthatenabledtheassessmentofthesubstantialandpsychometricpropertiesofthegenerateditems.SinceLLM-generateditemsweremixedwithoriginalTIMSSitems,wewereabletocomparethetwoitemtypesdirectly.

Thestudy’sobjectivewastoassessthecurrentpossibilitiesofthewidelyaccessiblemodelswithoutaheavyrelianceonhumaneditingandselecting.Tothisend,wedidnoteditthegenerateditemsnorselectedthembasedonanycriteriaapartfromcorrectingminorandobviouserrors(e.g.,spelling)andremovingrepeateditems.

ThegatheredempiricalevidenceshowedthatmanyoftheLLM-generateditemswereofhighquality.However,overall,thesubjectmatterexpertsratedthegenerateditemsaslesssuitableforassessmentsandlessattractivecomparedtooriginalTIMSSitems.Psychometrically,thegenerateditemsalsoperformedlesswellthanrealTIMSSitems.Themainpsychometricissuesincluded:difficultylevelnotalignedwiththetargetgroup,presenceofitemswithnegativeorlowdiscriminationparameter,andviolationsofthelocalindependenceassumption.NodifferenceswerefoundbetweenTIMSSandgenerateditemsintermsofdifferentialitemfunctioning,itemfit,ordistractorfunctionality.

Thestudyalsoanalysedthemostcommonproblemsobservedinthegenerateditemsandtheirunderlyingcauses.Thegenerateditemswereoftentooeasy(especiallythemultiple-choicemathematicsitems)ortoodifficult(especiallytheconstructed-responsescienceitems).LLMsfrequentlyproducedverysimilarandrepetitiveitems.Theresponseoptionsgeneratedalsodisplayedseveralissues,suchasimplausibleorunjustifieddistractors,repeateddistractors,absenceofacorrectanswer,orthepresenceofmultiplecorrectanswers.

TheLLMalsodemonstrateddifficultyinusingpreciseterminologyandappropriatestyle.OccasionalproblemswithstyleandgrammarmayhavestemmedfromtheuseofPolishastheassessmentlanguage,whichhasrelativelylessdigitalcontentavailableforLLMtrainingcomparedtoEnglish(acommonissuefor“resource-poor”languages).Overall,themathematicsitemsappearedtoperformbetterthanthescienceitems.

Werecommendcontinuingthislineofresearchusingdomain-specificfine-tunedmodelsrunlocally,astheycurrentlyholdthemostpromiseforproducinghigh-qualityLLM-generateditems.Humaninterventionandjudgmentremaincrucialforselectingthebestitems.

1.Introductionandstudyaims

Itemsconstituteafundamentalaspectofanyassessment.Acomprehensivemeasurementreliesonpsychometricallysound,linguisticallycorrect,andtheoreticallyaligneditems.Thegrowingdemandformeasurementprogramsresultedinanincreasedneedforsuchitems.However,developingexcellentitemsrequiresdauntingamountsofpreciousresources.Typically,itinvolvesteamsofsubjectmatterexperts,

psychometricians,andfieldworkers,whoareresponsiblefordeveloping,evaluating,piloting,andvalidatingitemsbeforetheyenterfieldwork.Theprocessislong,costly,andlabor-intensive.

Foralongtime,researchersaimedatstreamliningtheprocessofitemdevelopment.Thislineofresearchwasnamedautomaticitemgeneration(AIG)andusesdifferentmethodstoshortcuttheprocessofitem“production”.First,meanssuchasitemcloningwereused(e.g.,Glas&vanderLinden,2003).Here,verycloseduplicatesofamodelitemaregeneratedwithessentiallythesamecontentbutsmallchangestotheitem’sstemand/ordistractors.Thisframeworkwaslaterdevelopedtogenerateitemclonesonthebasisofpredefineditemmodelsandcomputeralgorithmsthathelpedtoalterkeyitemelementstogetnewitems(Gierletal.,2016).Thenumberandqualityofthese“automatically”generateditemsdependedonthenumberandqualityofprepareditemmodelsanditemelementsthatweresettovarybetweenitemversions.Thewholeprocesswasrathersemiratherthanfullyautomaticandrequiredalotofhumanwork(Attalietal.,2022).

Next,approachesbasedondeeplearningstartedtoemerge,wherecomputeralgorithmsspecializedinnaturallanguageprocessingwereutilizedtocreateelementsofnewitems(stems,vignettes,distractors,etc.).Atthisstage,neuralnetworksweremostlyemployedforAIG,theprocessremainedlonganddifficult,asearlymodelsneededtime-consumingandoftencostlypre-training/fine-tuningthatrequiredsignificanttechnicalknowledge.Theoutcomeswereoftenfarfromperfect,asthecreateditemswereverybasicandnecessitatedfurtherhumanpolishingup.ExamplesoftheseearlyworksareGaoetal.(2019),Liangetal.(2018),Shinetal.(2019).Susantietal.(2018)andZuetal.(2023)investigatedmethodsofautomaticdistractorcreationforitemstems.Mitkovet

al.(2023)showedamethodofcreatingwholeitems(stem+distractors)basedontextpassagesusedtotrainneuralnetworks.VonDavier(2023)andBezirhanandvonDavier(2023)presentedamethodtocreatetextpassagesforfurtheritemcreationbasedonGPT-2andGPT-3models(earlyOpenAImodels,seecitedpapersformoredetails).

Therapiddevelopmentoflargelanguagemodels(LLMs)leadtotheemergenceofnewtoolsthatarenowcapableofexcellentprocessingandgeneratingofnaturallanguagetextsandothertasks(e.g.Bubecketal.,2023).Theincreaseinabilitiesofthesemodelscallsforarevisitofpreviousresearchonthistopic.Recently,Drorietal.(2022)orAttalietal.(2022)cameupwithmethodstouselargelanguagemodelstoconstructwholeitemsinthedomainofmathematicsandEnglish,respectively.Severalstudiessucceededinautomaticallydevelopingself-reportitems,e.g.personalityscales(Götzetal.,2023;Hommeletal.2022;vonDavier,2018).Whatismore,theinitialresearchshowsthatitispossibletoautomaticallygenerateitemsofpre-setdifficultyandcomplexity(Raina&Gales,2022;Settlesetal.,2020).Forexample,BezirhanandvonDavier(2023)demonstratedthatLLMsrequireonlyshortpromptstogenerateitemsofvaryingdifficulty(asassessedbytheexperts).

AnotherresearchareaopenedbytheuseofLLMsispromptengineering,researchondevelopingandoptimizingpromptstousemodelsefficiently(Reynolds&McDonell,2021).TheoutcomeofAIGdiffersdependingonthequantityandqualityofthecontextusedtopromptLLMsandtheexactquestion(instruction)posed.Wangetal.(2022)comparedtheitemsgeneratedwiththeuseofpromptsdifferinginnumberandcomplexityofexamples,aswellaslengthoftheoverallcontextprovided.Inthisstudy,itemsofhigherqualityweregeneratedwithmorecontent-specificexamples,withfive,beingtheoptimalnumberofexamplesprovided.Shorterormedium-lengthcontextinformation(15-25word-long)yieldedgenerationof“better”itemsthanlongcontext(40words;Wangetal.,2022).Ontheotherhand,zero-shotprompts(noexamples)wereaseffectiveasone-shotprompts(oneexample)inthecontextofGPT-3andcreatingPIRLSpassages(Bezirhan&vonDavier,2023).

Theresearchonmodellingtaskdifficultybypromptsandmosteffectivewayofpromptingisjustatitsbeginning(Attalietal.,2022).However,itisalreadyevidencedthatLLMscanincorporate(“learn”)knowledgeprovided“ascontextualinformationwithin

prompts”(Lampinenetal.,2022;Wang,2023)asanalternativetofine-tuningthemwithanupdateddataset(Drorietal.,2022).Promptingisnaturallymoretime-andcost-effective,especiallythatnewerLLMsallowforlongerandmorecomplexpromptswithalotofcontextprovided.

ThisprojectaimstobroadenthefindingsofearlystudiesandshowthatAIGcanbeemployedintherigorouscontextofinternationallarge-scaleassessments,suchasTIMSS.Moreover,weintendtogofurtherthanpreviousstudiesinthecontextofIEA’sassessments,e.g.BezirhanandvonDavier(2023)presentedamethodtocreatetextpassagesforfurthercreationofPIRLSGrade4itemsandgotthecreatedpassagesevaluatedbyhumanraters.Here,weaimtogeneratecompleteitems(stemandresponseoptions)andthensubjectthemtohumanexperts’evaluationbutalsotovalidatetheminafieldtestmimickingrealIEAassessmentsetting.WewillalsofocusondevelopingitemscloselyalignedtotheTIMSSassessmentframework,aswellasgeneratingitemsofpre-defineddifficulty.Graphicalitemelements,commonforTIMSSGrade4,willbegeneratedbygraphicalgenerativeartificialintelligenceprograms,suchasDALL-E3orwiththeuseofgraphicalplug-insforGPT-4.Recently,GPT-4wasequippedwithnewplug-insthatlinkwithPythontoproducetablesandcharts,whichshouldbeveryhandyforgeneratingTIMSS4items.Inshort,themainresearchitemsare:

1.Developmethodsofautomaticitemgeneration(AIG)usinglargelanguagemodels(LLMs)formaths&scienceassessmentTIMSSGrade4(inPolish).

2.Validatethesemethodsusingasubjectmatterexpertspanelandaschool-basedassessmentfollowingcloselytheoriginalTIMSSGrade4methodologyanddesign.

3.Provideevidence-basedknowledgeonthepossibilitiesofemployingAIGforinternationallarge-scaleassessmentsbasedontheexamplesgeneratedforTIMSSGrade4.Thisisafastdevelopingdomainandassuchtheresultsofthestudy(Summer2024)mightnotbefullygeneralizabletothecurrent(September2025)capabilitiesofLLMmodels.

2.Studypreparation

2.1Itemextraction

TIMSSGrade4restricteduseitemsinEnglishandPolishwereobtainedfromtheIEA.ThedocumentswithPolishTIMSSitemswerepassedtotheOpenAI’s“gpt4-o”imageAPI,pageafterpage,withthefollowingquery:

ExtractexercisesfromthisTIMSSexampageimageinajsonformat.

Whenevertablesappear,encodetheminthequestionvariableusingmarkdownalongsidetheexercisetext.Wheneverimagesappear,describewhattheyportrayinPolishintheformat[OBRAZEK:image_description]andaddthedescriptiontothequestionvariable.

Theimportantvariablesare:

"exercise_id"(exampleformats:M_01_07;S_02_05)

"question":exercisetext

"answers":answers

"contains_image"(boolean)

Makesuretoproducevalidjsons,escapecharactersproperly,extractallexercises,anddonotwriteanythingelse!

Thisallowedustoextract174TIMSSscienceitemsTIMSS,and177TIMSSmathsitems(bothGrade4).Theextracteditemswerethenmanuallycheckedforhallucinations,andformattingproblemsbythemembersoftheresearchteam.Whilecheckingtheitems,theresearchteammembersalsosupplementedthedatabasewiththeinternationalIDsoftheitemsbyaddingthisargumenttoeveryiteminthejson.

Themetadatarelatedtocontent,andcognitivedomainfortheitemswasmanuallycopiedfromtherelevantTIMSSdocuments.

Ashortcommentappliestodatasafetyandresearchreproducibility.Accordingtothecompany’spolicy

(/api/

),theAPIdoesnotusethedatatotrainAImodelsanddoesnotprocessitinanyotherkind.Moreover,thedataisremovedfromtheplatformautomaticallyafteranindicatedamountoftime

(/docs/guides/your-data

).TheOpenAIAPIusewasconsultedwiththeIEAbeforeanyoperationstookplace.Hence,usingtheAPI,alongwithother“business”optionstoaccessOpenAImodelsseemsafefromthedataprotectionpointofview.

Withregardtotheresearchreproducibilityandopenscience,wecannotprovidetheIEArestricteduseitemsemployedtopromptthemodels,butanyinterestedresearchercanapplytousethemforresearchpurposes.HavingIEA’sconsentandourprompts,anyonecanreproducetheprocess.However,wecannotguaranteefullreproduction,astheOpenAImodelsconstantlychange-andwecandonothingwiththat.Thisissuewillberaisedagaininthediscussion.

2.2ItemGenerationDesign

Forthepilotstudy,theitemsweregeneratedusingthe“GTP-4o”APIfromOpenAI.Forthehyperparameters,theonlynon-defaultparameterwasthetemperature

,whichwassetto0.8basedonashortgenerationpilotstudy,whereweiterativelygenerateditemsuntilwehavefinetunedthisparametertoallowformaximumvariability,withoutperceivednegativeimpactonitemintelligibility.Testingthisandotherparametersinasystematicmannerwasdecidedagainstduetothetimeandresourceconsumingiterativeprocessthatitwouldhaverequired.Thenumberofcompletions(n)wasleftatitsdefaultof1,andbothpresence_penaltyandfrequency_penaltyremainedat0,sonoadditionalbiasfornoveltyoragainstrepetitionwasimposedontheoutputs.Becausemax_tokenswasnotset,themodelwasfreetogenerateuptoitsfull4096-tokencompletionlimitwithinGPT-4o’s128000-tokencontextwindow,effectivelylettingtheanswerrununtilthemodeldecidedtostop.Differentmultiple-shotsetupswereused,wheredifferentnumbersofexampleTIMSSitemsweresubmittedtotheLLMalongsidetherequesttogeneratenewitems.

1Temperatureistheparametertoadjusttheoutputsoflargelanguagemodels,thehigherthetemperature,thehighertherandomnessandmorediversetheoutputs.However,theparametercannotbedirectlylinkedto“creativity”(Peeperkornetal.,2024).

Thesetupswere:

●zero-shot-0examples,

●one-shot-1example,

●three-shot-3examples,●andfive-shot-5examples.

Thechoiceof0,1,3,and5exampleswasmadetosystematicallyexploremodelbehaviorunderdifferentlevelsofsupervision,inawaythatisappropriateforanewtaskwherenoestablishedfew-shotbenchmarksyetexist.

Giventhenoveltyofthetask,itwasimportanttotreattheevaluationsetupexploratorilyratherthanrelyingonconventionsthatmayhavebeendevelopedforsubstantiallydifferentproblems.Predefinedstandardsfromothertasksmaynotnecessarilytranslatewelltothiscontext,andcouldintroduceunjustifiedassumptionsaboutwhatlevelsofsupervisionareinformative.Wethereforeadoptedaprogressionof0,1,3,and5examples.

Ourchoicewasalsobasedonliterature.LaverghettaJr.&Licato(2023)usedthreeexamples,whileDrorietal.(2022)usedawiderarrayofexamples,startingfromzero-shotandendingwithuptofiveexamples.Wangetal.(2022)usedanevenlargernumberofexamples,reachingasmanyasseven,butconcludedthatfiveexamplesledtobestresults.Otherstudiesusedfewerexamples,e.g.Bezirhan&vonDavier(2023)-zeroandone,Omopekunola&YuKardanova(2024)-fromonetothree.Hence,ourpromptingschemeusedthemostcommonnumberofexamplesusedintheliterature.

Thisincrementaldesignensuresthatwecancapturemeaningfuldifferencesinmodelbehavioraslimitedsupervisionisintroduced,whileavoidingtherisksofoverfittingtoanarbitrarysetup.Intheabsenceofpreexistingstandardsforthistask,webelievethisapproachisbothreasonableandscientificallyjustified.

Foreachsetup,80itemsweregeneratedbytheLLM—40inscienceand40inmathematics.Eachqueryspecifiedboththecontentdomainandthecognitivedomain,whichwererandomlydrawnbasedonthemetadataofrealTIMSSitems.ThisensuredthatthedistributionofdomainsamongthegenerateditemsreflectedtheactualTIMSS

proportions.ExampleitemsweredrawnfromtheTIMSSdataset,limitedtothosematchingtheselectedcontentandcognitivedomains.

Thepromptusedtogeneratetheitemswascreatedinaniterativemanner,wherethemembersoftheteaminspectedsomeexample-generateditemsandrefinedtheprompt.Itsfinalformispresentedbelow;pleasemindthatthewordsincurlybracketswereautomaticallyswappedwiththespecificvalueschosenperspecificgenerationinstance:

"CreateoneitemforthePolishTIMSSassessment.Donotwriteanythingelse

thanthecontentoftheitem.TheitemhastobeinPolish.

Subject:{subject}

ContentDomain:{content_domain}

Definition:{content_domain_definitions[content_domain]}

CognitiveDomain:{cognitive_domain}

Definition:{cognitive_domain_definitions[cognitive_domain]}

Ageappropriateness:Grade4oftheelementaryschool

Itemformat:ConstructedresponseORmultiple-choice.Constructed

responseitemsusuallyrequirestudentstogiveanumericalresult,provideashortexplanationordescriptiongiveninoneortwophrasesorsentences,createalistorcompleteatable.Multiplechoiceitemshavefourresponseoptions,ONEOFTHEMISALWAYSCORRECTANDTHREEAREINCORRECTt.Theincorrectoptionsneedtobeplausiblechoices,demonstratingtypicalerrorsormisunderstandings.Theyneedtobesimilartothecorrectoptionintermsoflengthandformat,butbeundoubtedlyincorrect.

Additionalcomments:DONOTGENERATELINKSTOTABLESANDIMAGES,BUT

PRESENTADETAILEDDESCRIPTIONOFTABLESANDIMAGESIFYOUCHOOSETOUSETHEM

INYOURITEM(YOUDONTHAVETO)

Thepromptincludedthedefinitionsofthesampledcontentandcognitivedomain.ThesewerepreparedonthebasisoftheTIMSSassessmentframework(Mullisetal.,2023,pp.7-10&14-18formaths;pp.20-28&40-44)andareshownbelow:

CognitiveDomains:

Knowing

Facilityinapplyingmathematics,orreasoningaboutmathematicalsituations,dependsonfamiliaritywithmathematicalconceptsandfluencyinmathematicalskills.Themorerelevantknowledgeastudentisabletorecallandthewidertherangeofconceptsheorsheunderstands,thegreaterthepotentialforengagingwithawiderangeofproblemsituations.ItcontainsskillslikeRecall:Recalldefinitions,terminology,numberproperties,unitsofmeasurement,geometricproperties,andnotation(e.g.,a×b=ab,a+a+a=3a).Identify:Identifynumbers,expressions,quantities,andshapes.Recognizewhenentitiesaremathematicallyequivalent.Readinformationfromgraphs,tables,texts,orothersources.Order:Orderandclassifynumbers,expressions,quantities,andshapesbycommonproperties.Compute:Computearithmeticoperationswithwholenumbers,fractions,decimals,andintegersusingalgorithmicprocedures.Carryoutstraightforwardalgebraicmanipulation.

Applying

Theapplyingdomaininvolvestheapplicationofmathematicsinarangeofsituations.Problem-solvingiscentraltothisdomain.Studentswillneedtoselectsuitableoperations,strategies,andtoolsforsolvingproblems.Formulate:Determineefficient/appropriateoperations,strategies,andtoolsforsolvingproblems.Implem

人人文库> 全部分类> 行业资料 > 各类标准

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-

文档简介

温馨提示

最新文档

评论

2023年使用大语言模型进行自动题库生成：TIMSS四年级的发展与验证报告-

文档简介

温馨提示

最新文档

评论

相关文档