
Benchmarking LLMs on Advanced Mathematical Reasoning

Jonathan Yue
Daniel Klein

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-121

/Pubs/TechRpts/2025/EECS-2025-121.html

May 16, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

A thesis submitted in partial satisfaction of the
requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Gireeja Ranade

Spring 2025


Abstract

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Dan Klein, Chair

Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, the existing short-answer benchmarks can suffer from limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits the scope of the problems. A potential reason for this significant gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework to judge model-generated proofs and evaluate its efficacy. Then, we propose a benchmark of 77 PhD-level proof questions, drawn from Roman Vershynin's "High-Dimensional Probability: An Introduction with Applications in Data Science", and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.

Contents

1 Introduction
2 Related works
  2.1 LLMs for mathematical reasoning
  2.2 Benchmarking Mathematical Reasoning in Large Language Models
  2.3 LLM-as-a-Judge for Evaluating Mathematical Reasoning
3 Data collection
  3.1 Ground truth human proof evaluation
  3.2 Mathematical text and ground truth solutions
4 LLM-as-a-judge for mathematical proofs
  4.1 Evaluation
5 LLM natural language proof-writing
  5.1 Methodology
  5.2 Results
6 Discussion
  6.1 Limitations
7 Conclusion and Future Directions
Bibliography

Acknowledgments

I am deeply grateful to my advisor, Dan Klein, for his support and mentorship over the past two years, and to Professor Gireeja Ranade for being my second reader and offering valuable conversations and feedback. I also wish to thank my mentor, Nicholas Tomlin, for his guidance and thoughtful input throughout this process.

I would like to thank all my classmates in the CompSci 294-271 course, AI in Education, taught by Professor Gireeja Ranade and Professor Narges Norouzi, where I engaged in discussions that inspired the conception of this project. I especially thank my team members in the course, Nora Guo and Sveinung Myhre, with whom I collaborated to obtain the IRB approval and conduct a related study, for helping shape the ideas for this work.

Finally, I extend my heartfelt thanks to my family and friends for their constant encouragement and support, which has been invaluable in every facet of my life.


Chapter 1

Introduction

The advent of Large Language Models (LLMs) has resulted in remarkable advancements in artificial intelligence, with mathematical reasoning emerging as a significant area of interest. LLMs have evolved from performing rudimentary arithmetic to assisting with mathematical discovery (Romera-Paredes et al., 2024), driven by increased model size and sophisticated prompting techniques such as Chain-of-Thought (CoT) prompting (Wei et al., 2023) and Tool-Integrated Reasoning (TIR) (Ahn et al., 2024).

Despite these strides, a considerable gap persists in evaluating the deeper reasoning capabilities of LLMs. Existing benchmarks predominantly focus on short-answer or multiple-choice questions, such as MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). While useful, these benchmarks often emphasize the final numerical output, potentially neglecting the rigor of the intermediate reasoning steps and suffering from issues such as benchmark saturation and data contamination (Hong et al., 2025; Petrov et al., 2025). Furthermore, their limited scope may not sufficiently assess complex conceptual understanding or the ability to construct intricate arguments (K. Huang et al., 2025). The poor performance of LLMs on problems that require rigorous proof generation, as opposed to numerical answers, suggests a potential "reasoning illusion," where success in some tasks might stem from pattern matching or tool assistance rather than genuine mathematical insight (Petrov et al., 2025).

On the other hand, benchmarks centered on formal proofs, such as miniF2F (K. Zheng et al., 2022) and PutnamBench (Tsoukalas et al., 2024), operate within the strict confines of symbolic systems like Lean or Isabelle. While valuable for assessing mechanically verifiable reasoning, the necessity of translating problem statements into these formal languages (autoformalization) can be a challenging task itself, which restricts the breadth of problems that can be addressed (Gulati et al., 2024; J. Zhang et al., 2025). This leaves a significant gap in evaluating the generation of novel, natural language mathematical proofs, which are more akin to human mathematical practice and crucial for interpretability. A primary obstacle to developing benchmarks for such proofs is the inherent difficulty in grading them consistently and at scale, as natural language proofs lack the immediate verifiability of formal proofs and often require expert human evaluation (Frieder et al., 2023).


To address these limitations and bridge the gap in evaluating advanced mathematical reasoning in natural language, this paper makes two primary contributions. First, we propose and evaluate an LLM-as-a-judge framework specifically designed for assessing the correctness and coherence of mathematical proofs. This approach offers a scalable alternative to manual expert evaluation. Second, leveraging this evaluation framework, we introduce a new benchmark comprising 77 PhD-level proof-based questions, designed to rigorously test the advanced reasoning capabilities of LLMs on problems requiring novel and intricate natural language proofs.
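To make the judging setup concrete, the following is a minimal sketch of what an LLM-as-a-judge call can look like, assuming an OpenAI-style chat API; the judge model name, rubric wording, and 0-10 scale are illustrative placeholders rather than the exact configuration used in this work.

```python
# A minimal sketch of LLM-as-a-judge grading for a natural language proof.
# The judge model, rubric, and score scale are illustrative assumptions,
# not the exact prompt or configuration used in this thesis.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a mathematical proof.

Problem: {problem}
Reference solution: {reference}
Candidate proof: {candidate}

Judge whether the candidate proof is correct and complete. Reply with a
score from 0 to 10 on the first line, followed by a short justification."""

def judge_proof(problem: str, reference: str, candidate: str) -> str:
    """Return the judge model's raw verdict for one candidate proof."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            problem=problem, reference=reference, candidate=candidate)}],
    )
    return response.choices[0].message.content
```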

We find that, despite recent progress, current leading LLMs are generally still unable to adequately complete these complex proof-based tasks. This underscores the necessity for more challenging benchmarks that probe deeper levels of mathematical understanding and proof generation, thereby providing a more accurate measure of LLM reasoning abilities and guiding future research towards developing more capable and robust AI systems for mathematics.


Chapter 2

Related works

2.1 LLMs for mathematical reasoning

The introduction of Large Language Models (LLMs) has marked a significant turning point in artificial intelligence. Among their many applications, mathematical reasoning has emerged as a key area of exploration and advancement. This section surveys the development of LLMs in tackling mathematical tasks.

Advances and Core Techniques in LLM Mathematical Reasoning

LLMs have rapidly improved at mathematical reasoning, progressing from simple arithmetic to solving more complex problems (Liu et al., 2025). A key driver of this progress is the increase in model size. Larger models follow instructions better and reason more effectively. Research shows that LLMs with over 100 billion parameters can solve difficult math problems with the right prompts (Forootani, 2025).

Several techniques have augmented these improvements due to model scale. A major cornerstone in this area is Chain-of-Thought (CoT) prompting, which guides LLMs to list the steps in their reasoning before giving a final answer. This improves both performance on complex tasks and the interpretability of the reasoning process. Wei et al., 2023 showed that CoT prompting allows LLMs to deconstruct problems they would usually fail to solve. Researchers have built upon basic CoT with more advanced variants and new types of "thinking" models. Examples include SoftCoT (Y. Xu et al., 2025), Continuous CoT (Cheng & Durme, 2024), and Coconut (Hao et al., 2024). SoftCoT, for instance, utilizes a smaller assistant model to generate "soft thought tokens," which are then projected into the primary LLM's representation space through a trainable module. This fine-tuning approach is parameter-efficient and has demonstrated enhanced performance on mathematical reasoning benchmarks.
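As a concrete illustration, the following is a minimal sketch of zero-shot CoT prompting in the spirit of Wei et al., 2023, assuming an OpenAI-style chat API; the model name and problem text are placeholders, not tied to any experiment in this chapter.

```python
# A minimal sketch of zero-shot Chain-of-Thought prompting. The model name
# and the problem are placeholder choices, not tied to any experiment here.
from openai import OpenAI

client = OpenAI()

problem = "If 3 apples cost $1.20, how much do 7 apples cost?"

# The CoT trigger asks the model to lay out intermediate steps before
# committing to a final answer.
cot_prompt = f"{problem}\nLet's think step by step, then state the final answer."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```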

The emergence of models explicitly designed for "thinking," such as OpenAI's o1 and DeepSeek-R1, represents a notable progression. These models often take more time to reason during inference and are trained with Reinforcement Learning (RL) to build advanced cognitive skills like self-checking and reflection. DeepSeek-R1, for example, is reported to develop CoT reasoning capabilities autonomously through a pure RL training paradigm (DeepSeek-AI et al., 2025). Multi-round Thinking, where a model uses previous answers to try again and improve, has also led to better results, as seen in models like QwQ-32B and DeepSeek-R1 on difficult benchmarks such as AIME 2024 (Tian et al., 2025). RL-based reasoning methods are also being used in multimodal LLMs. Vision-R1, for instance, uses Progressive Thinking Suppression Training (PTST) to improve performance in solving multimodal math problems (W. Huang et al., 2025).

To address LLMs' weakness in precise numerical calculation and rigorous symbolic manipulation, Tool-Integrated Reasoning (TIR) has gained prominence (Ahn et al., 2024). TIR lets LLMs hand off subparts of a problem to tools like Python for computation or symbolic solvers for algebra. This improves accuracy, especially when high precision or structured math is needed. For example, the winning solution in the AIMO-2 competition used CoT fine-tuning first, then fine-tuned again on a TIR dataset. This helped the model combine natural language reasoning with structured computation (Moshkov et al., 2025). TIR frameworks like TATA (Teaching LLMs According to Their Aptitude) (X. Xu et al., 2025) help models choose between CoT and TIR depending on the problem.
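As a toy illustration of this hand-off, the sketch below delegates an equation-solving subproblem to SymPy; in a full TIR loop the LLM would emit such a call inside a code block, and the harness would execute it and feed the result back into the context. The dispatch here is a simplification, not any specific published framework.

```python
# A toy illustration of the TIR hand-off: symbolic algebra is delegated to
# SymPy instead of being manipulated in natural language by the model.
import sympy as sp

def solve_with_tool(expression: str, variable: str = "x"):
    """Solve `expression == 0` with a symbolic solver on the model's behalf."""
    x = sp.Symbol(variable)
    equation = sp.sympify(expression)  # parse the model-emitted expression
    return sp.solve(equation, x)

# In a full TIR loop, the result would be appended to the LLM's context so
# it can continue reasoning with an exact, verified intermediate value.
print(solve_with_tool("x**2 - 5*x + 6"))  # -> [2, 3]
```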

Still, LLMs struggle with highly advanced math problems, especially ones needing full proofs like those in math olympiads. A recent work evaluated state-of-the-art LLMs on the 2025 USAMO problems, and the results show significant limits (Petrov et al., 2025). Even a leading model like Gemini-2.5-Pro achieved an average score of only 25%, with other prominent models scoring below 5%. Common failure modes included flawed logical steps, introduction of unjustified assumptions, a lack of creative problem-solving strategies, and a tendency to falsely claim that a problem had been solved.

The observed discrepancy between LLM performance on benchmarks requiring numerical answers and those requiring rigorous proof generation suggests a potential "reasoning illusion." While techniques like CoT and TIR demonstrate improvements on certain types of mathematical problems, the poor performance on proof-based tasks suggests that these improvements might not always translate to genuine, deep mathematical understanding. LLMs may be leveraging pattern matching and tool-assisted computation to succeed in numerical answer-oriented tasks, without necessarily developing deep logical deductive capabilities. This illustrates the critical need for benchmarks that specifically evaluate these deeper reasoning and proof-generation abilities.

Formal vs. Informal Mathematical Reasoning by LLMs

Mathematical reasoning by LLMs can be broadly categorized into two domains: formal mathematical reasoning, which operates under the rigorous syntax of symbolic systems and proof assistants, and informal mathematical reasoning, which expresses mathematics in natural language.

Formal Mathematical Reasoning

This sub-field is characterized by its emphasis on mechanical verifiability and logical rigor.

LLMs for Formal Proof Generation (Lean, Isabelle, etc.): Much research is dedicated to training LLMs to generate proofs in formal languages such as Lean 4 (J. Zhang et al., 2025) and Isabelle (X. Zhao et al., 2024). The goal of this line of research is to develop systems that can autonomously produce machine-verifiable mathematical proofs.

Several notable models and systems have demonstrated progress. AlphaGeometry made headlines by solving Olympiad-level geometry problems. It combined a language model, which suggests potentially useful geometric constructions, with a symbolic deduction engine that formally verifies these steps. A key component of its development was the generation of 100 million synthetic data examples for training (Trinh et al., 2024). An improved version, AlphaGeometry2, extended its formal language to handle a wider range of geometric problems and leveraged the Gemini architecture. AlphaGeometry2 reportedly surpassed the average performance of human gold medalists on a benchmark of IMO geometry problems (Chervonyi et al., 2025).

The Self-play Theorem Prover (STP) addresses the critical issue of data scarcity in formal mathematics by employing a dual-agent system: a conjecturer that proposes new theorems and a prover that attempts to prove them. This self-play loop allows the system to iteratively generate its own training data, which results in significant performance improvements on various reasoning benchmarks (Dong & Ma, 2025). Goedel-Prover is an open-source LLM designed for formal proof generation in Lean 4. It achieved state-of-the-art results by first formalizing a large dataset of natural language math problems into Lean 4 statements and then employing an expert iteration strategy to train the prover (Lin et al., 2025). Kimina-Prover Preview developed a reasoning-driven exploration paradigm, utilizing large-scale RL based on Qwen2.5-72B to generate proofs in Lean 4. This approach has also set new SOTA performance on the miniF2F benchmark (H. Wang et al., 2025).
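The loop below is a minimal sketch of the conjecturer/prover interaction described above, under the assumption that a formal checker acts as the verifier; `conjecturer`, `prover`, and `verifier` are hypothetical stand-ins, not STP's actual components.

```python
# A minimal sketch of a conjecturer/prover self-play loop in the spirit of
# STP (Dong & Ma, 2025). All three callables are hypothetical stand-ins; in
# the real system the verifier would be a formal proof checker such as Lean.
from typing import Callable

def self_play_round(
    conjecturer: Callable[[], str],        # proposes a new theorem statement
    prover: Callable[[str], str],          # attempts a proof of the statement
    verifier: Callable[[str, str], bool],  # mechanically checks the proof
    num_attempts: int = 100,               # arbitrary round size
) -> list[tuple[str, str]]:
    """Collect verified (statement, proof) pairs as fresh training data."""
    training_data = []
    for _ in range(num_attempts):
        statement = conjecturer()
        proof = prover(statement)
        if verifier(statement, proof):  # keep only machine-checked proofs
            training_data.append((statement, proof))
    return training_data
```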

Despite these advancements, a key challenge remains the scarcity of large-scale, high-quality datasets of formalized mathematical statements and proofs. The rigorous syntax of formal languages often makes proof generation in these systems more difficult for LLMs compared to natural language (J. Zhang et al., 2025).

Autoformalization: This critical sub-area focuses on the automatic translation of informal mathematical language into formal languages that can be processed by proof assistants. LLMs like GPT-4 have considerably advanced autoformalization capabilities through in-context learning. Techniques such as back-translation, where existing formal statements are automatically informalized into natural language to produce synthetic pairs of formal and informal mathematical statements, have been explored to augment training data (Yang et al., 2024).
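As a small illustration of back-translation, the sketch below generates (informal, formal) training pairs from existing formal statements; `informalize` stands in for an LLM call, and both the canned English rendering and the Lean statement are invented for illustration.

```python
# A minimal sketch of back-translation for autoformalization data: existing
# formal statements are rendered in natural language to yield synthetic
# (informal, formal) pairs. `informalize` stands in for an LLM call.
def informalize(formal_statement: str) -> str:
    canned = {  # canned output purely for illustration
        "theorem add_comm (a b : Nat) : a + b = b + a":
            "Show that addition of natural numbers is commutative.",
    }
    return canned[formal_statement]  # a real system would query an LLM here

# e.g., statements drawn from a library such as Mathlib
formal_corpus = ["theorem add_comm (a b : Nat) : a + b = b + a"]

# Each pair becomes a synthetic training example for the forward task:
# natural language -> formal statement.
synthetic_pairs = [(informalize(stmt), stmt) for stmt in formal_corpus]
print(synthetic_pairs)
```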

However, autoformalization remains a difficult task, especially for complex or abstract mathematical concepts (Gulati et al., 2024). LLMs can struggle with selecting the correct preambles or library imports from vast formal mathematics libraries and may even generate references to non-existent preambles. Evaluating autoformalization faithfulness, i.e., whether the formal statement accurately captures the semantics of the original informal input, remains an area of ongoing research.

Informal (Natural Language) Mathematical Proofs

This domain concerns the generation and comprehension of mathematical proofs in the form most familiar to humans: natural language.

Current Landscape: While CoT techniques enable LLMs to generate natural language step-by-step reasoning, there has been relatively little research into the generation of rigorous and verifiable natural language mathematical proofs for complex problems. Much of the existing work on mathematical reasoning in LLMs focuses on solving problems that yield numerical or short-form answers.

The generation and evaluation of natural language proofs by LLMs present a unique set of challenges:

• Verifiability: A fundamental difficulty lies in the automated verification of natural language proofs. Unlike formal proofs, which can be mechanically checked by proof assistants, natural language proofs typically require expert human evaluation or highly sophisticated AI-driven verifiers. Even minor errors in logic or calculation can invalidate an entire proof, and these can be difficult for current systems to reliably detect in a natural language context.

• Ambiguity: Natural language is inherently ambiguous and often relies on implicit contextual knowledge and common sense. For instance, whether omitting certain steps in a mathematical argument is acceptable depends on the assumed level of familiarity of both the author and the audience. LLMs might not fully grasp these ambiguities and maintain the necessary level of precision throughout a proof. Another concern is that an LLM may inadvertently exploit these ambiguities and pretend that it has completed a proof.

• Data Scarcity for Complex Proofs: High-quality, large-scale datasets of complex natural language proofs, particularly those with detailed step-by-step annotations or explanations of reasoning, are scarce. This is especially true for niche or advanced mathematical topics.

The comparatively smaller volume of research focused on generating and verifying natural language mathematical proofs by LLMs, as opposed to formal proofs or problem-solving with numerical answers, may be attributed to several factors. Formal systems and proof assistants make formal proof generation an attractive target for AI research, as they provide a clear, objective measure of correctness. This reliable feedback loop for model training and evaluation stands in contrast to the inherent ambiguity of natural language. Additionally, formal mathematics libraries (e.g., Mathlib for Lean) offer a growing, structured, and verified corpus of mathematical knowledge. The field of Automated Theorem Proving (ATP) also has a long-standing tradition in AI, historically focusing on formal logic.

Despite these challenges, we believe that the generation and evaluation of complex informal mathematical proofs is vitally important. Natural language proofs are often more interpretable, which makes them more desirable in many settings. For instance, in an education setting, capabilities in complex mathematical reasoning may allow for more efficient proof grading or student hint generation. The space of mathematical reasoning can also act as a testbed for AI reasoning in less-defined real-world environments, where formalization would be nearly impossible and where interpretability would be ever more important.

2.2 Benchmarking Mathematical Reasoning in Large Language Models

The evaluation of mathematical reasoning capabilities in LLMs relies heavily on benchmarks. This section reviews existing benchmarks, categorizing them by their focus and problem types, and critically examines their limitations.

Short-Answer Benchmarks

These benchmarks typically assess LLMs on mathematical problems where the expected output is a final numerical answer or a concise textual response. Prominent examples include:

• MathQA (Amini et al., 2019): Derived from the AQuA dataset, MathQA presents math word problems with multiple-choice answers. It was designed with an emphasis on interpretability by mapping problems to sequences of predefined mathematical operations (Amini et al., 2019).

• MATH (Hendrycks et al., 2021): This dataset comprises challenging problems from mathematics competitions such as the American Mathematics Competitions (AMC 10/12) and the American Invitational Mathematics Examination (AIME). While solutions often require multiple reasoning steps, evaluation typically focuses on the correctness of the final numerical answer (Hendrycks et al., 2021).

• GSM8K (Cobbe et al., 2021): Consisting of grade school mathematics word problems, GSM8K is designed to test multi-step arithmetic reasoning. Models are expected to produce a numerical answer derived through a sequence of calculations (Cobbe et al., 2021).

Despite their widespread use, these short-answer benchmarks face several criticisms:

• Emphasis on Final Answers: A primary limitation is their predominant focus on the correctness of the final answer, often neglecting the intermediate reasoning steps. This evaluation approach can inadvertently reward models that arrive at the correct solution through flawed reasoning processes (Petrov et al., 2025).

• Benchmark Saturation: Many of these benchmarks are rapidly approaching saturation, with state-of-the-art LLMs achieving very high accuracy scores (e.g., over 97% on GSM8K, as reported in Budagam et al., 2024). This saturation diminishes their efficacy in differentiating the capabilities of advanced models.

• Data Contamination and Overfitting: A significant concern is the potential for benchmark data to have been included in the vast training corpora of LLMs. This data contamination can lead to inflated performance metrics that reflect memorization rather than true reasoning ability. Studies that have created perturbed versions of these benchmarks, such as GSM1k (a contamination-free version of GSM8K) (H. Zhang et al., 2024) and RV-Bench (which introduces random variables into MATH problems) (Hong et al., 2025), have often observed notable performance drops, suggesting that models may be overfitting to the original benchmark distributions (a minimal sketch of this value-perturbation idea appears after this list).

• Limited Scope for Complex Reasoning: These benchmarks, often centered on arithmetic or basic algebraic manipulation, may not adequately test deeper conceptual understanding, abstract reasoning, or the ability to construct complex proofs (K. Huang et al., 2025). MathQA, for example, explicitly omits problems involving higher-order polynomials or entirely non-numeric solutions.

• Sensitivity to Input Perturbations: LLM performance on these benchmarks can be surprisingly fragile, exhibiting significant degradation in response to minor alterations in problem phrasing, numerical values, or the introduction of irrelevant information. This sensitivity suggests a lack of robust understanding of underlying mathematical principles. For instance, MATH-P-Hard introduces hard perturbations to MATH problems so that the original solution steps do not apply, resulting in significant performance drops. The researchers found that the models blindly apply learned problem-solving tactics even to unsuitable problems (K. Huang et al., 2025).

• Lack of Diversity in Problem Types and Assessed Reasoning Skills: Existing benchmarks may not encompass a sufficiently broad range of mathematical topics or the diverse cognitive skills required for advanced mathematical thought. The MaTT benchmark, for example, found that LLM performance can vary significantly even across closely related subtopics within the same general mathematical area (Davoodi et al., 2025).
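To illustrate the value-perturbation idea referenced in the data contamination point above, the sketch below re-samples the numeric values of a templated problem in the spirit of GSM1k and RV-Bench; the template and value ranges are invented for illustration.

```python
# A minimal sketch of value perturbation for contamination checks: re-sample
# the numbers in a templated problem so a memorized answer no longer matches.
# The template and value ranges are invented for illustration.
import random

TEMPLATE = "A rectangle has width {w} and height {h}. What is its area?"

def perturbed_instance(rng: random.Random) -> tuple[str, int]:
    """Return a fresh (question, answer) pair with randomized values."""
    w, h = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(w=w, h=h), w * h

rng = random.Random(0)
question, answer = perturbed_instance(rng)
# A model that merely memorized the original instance fails once the surface
# values change, while genuine multi-step reasoning is unaffected.
```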

Olympiad-Level and Proof-Focused Benchmarks

To address the limitations of short-answer benchmarks and to probe more advanced mathematical reasoning, several benchmarks focusing on Olympiad-level problems and proof generation have been developed.

• miniF2F (K. Zheng et al., 2022): This benchmark provides a collection of 488 formal problem statements derived from Olympiad-level mathematics (AIME, AMC, IMO) and undergraduate courses, formalized in multiple interactive theorem proving systems such as Lean, Isabelle, and HOL Light. It primarily covers algebra, number theory, and inequalities, with formalizations done manually.

• PutnamBench (Tsoukalas et al., 2024): Sourced from the challenging William Lowell Putnam Mathematical Competition, PutnamBench features 1692 hand-constructed formalizations of 640 theorems in Lean 4, Isabelle, and Coq. Current models have demonstrated very limited success on this benchmark, solving only a handful of the problems.

• Omni-MATH (Gao et al., 2025): This benchmark aims to be a universal Olympiad-level mathematics evaluation suite, comprising 4,428 problems spanning 33 sub-domains and 10 difficulty levels. It utilizes GPT-4o as a judge (Omni-Judge) for evaluating answers, which are typically numerical or SymPy objects.

• FrontierMath (Glazer et al., 2024): This benchmark introduces hundreds of original, exceptionally challenging mathematics problems at the advanced undergraduate, graduate, and research levels, crafted and vetted by expert mathematicians.
