
Benchmarking LLMs on Advanced Mathematical Reasoning

Jonathan Yue
Daniel Klein

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-121

/Pubs/TechRpts/2025/EECS-2025-121.html

May 16, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

A thesis submitted in partial satisfaction of the
requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Gireeja Ranade

Spring 2025


Abstract

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Dan Klein, Chair

Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, the existing short-answer benchmarks can suffer from limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits the scope of the problems. A potential reason for this significant gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework to judge model-generated proofs and evaluate its efficacy. Then, we propose a benchmark of 77 PhD-level proof questions, drawn from Roman Vershynin's "High-Dimensional Probability: An Introduction with Applications in Data Science", and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.

Contents

1 Introduction
2 Related works
  2.1 LLMs for mathematical reasoning
  2.2 Benchmarking Mathematical Reasoning in Large Language Models
  2.3 LLM-as-a-Judge for Evaluating Mathematical Reasoning
3 Data collection
  3.1 Ground truth human proof evaluation
  3.2 Mathematical text and ground truth solutions
4 LLM-as-a-judge for mathematical proofs
  4.1 Evaluation
5 LLM natural language proof-writing
  5.1 Methodology
  5.2 Results
6 Discussion
  6.1 Limitations
7 Conclusion and Future Directions
Bibliography

Acknowledgments

I am deeply grateful to my advisor, Dan Klein, for his support and mentorship over the past two years, and to Professor Gireeja Ranade for being my second reader and offering valuable conversations and feedback. I also wish to thank my mentor, Nicholas Tomlin, for his guidance and thoughtful input throughout this process.

I would like to thank all my classmates in the CompSci 294-271 course, AI in Education, taught by Professor Gireeja Ranade and Professor Narges Norouzi, where I engaged in discussions that inspired the conception of this project. I especially thank my team members in the course, Nora Guo and Sveinung Myhre, with whom I collaborated to obtain the IRB approval and conduct a related study, for helping shape the ideas for this work.

Finally, I extend my heartfelt thanks to my family and friends for their constant encouragement and support, which has been invaluable in every facet of my life.


Chapter 1

Introduction

The advent of Large Language Models (LLMs) has resulted in remarkable advancements in artificial intelligence, with mathematical reasoning emerging as a significant area of interest. LLMs have evolved from performing rudimentary arithmetic to assisting with mathematical discovery (Romera-Paredes et al., 2024), driven by increased model size and sophisticated prompting techniques such as Chain-of-Thought (CoT) prompting (Wei et al., 2023) and Tool-Integrated Reasoning (TIR) (Ahn et al., 2024).

Despite these strides, a considerable gap persists in evaluating the deeper reasoning capabilities of LLMs. Existing benchmarks predominantly focus on short-answer or multiple-choice questions, such as MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). While useful, these benchmarks often emphasize the final numerical output, potentially neglecting the rigor of the intermediate reasoning steps and suffering from issues such as benchmark saturation and data contamination (Hong et al., 2025; Petrov et al., 2025). Furthermore, their limited scope may not sufficiently assess complex conceptual understanding or the ability to construct intricate arguments (K. Huang et al., 2025). The poor performance of LLMs on problems that require rigorous proof generation, as opposed to numerical answers, suggests a potential "reasoning illusion," where success in some tasks might stem from pattern matching or tool assistance rather than genuine mathematical insight (Petrov et al., 2025).

On the other hand, benchmarks centered on formal proofs, such as miniF2F (K. Zheng et al., 2022) and PutnamBench (Tsoukalas et al., 2024), operate within the strict confines of symbolic systems like Lean or Isabelle. While valuable for assessing mechanically verifiable reasoning, the necessity of translating problem statements into these formal languages (autoformalization) can be a challenging task itself, which restricts the breadth of problems that can be addressed (Gulati et al., 2024; J. Zhang et al., 2025). This leaves a significant gap in evaluating the generation of novel, natural language mathematical proofs, which are more akin to human mathematical practice and crucial for interpretability. A primary obstacle to developing benchmarks for such proofs is the inherent difficulty in grading them consistently and at scale, as natural language proofs lack the immediate verifiability of formal proofs and often require expert human evaluation (Frieder et al., 2023).


To address these limitations and bridge the gap in evaluating advanced mathematical reasoning in natural language, this paper makes two primary contributions. First, we propose and evaluate an LLM-as-a-judge framework specifically designed for assessing the correctness and coherence of mathematical proofs. This approach offers a scalable alternative to manual expert evaluation. Second, leveraging this evaluation framework, we introduce a new benchmark comprising 77 PhD-level proof-based questions, designed to rigorously test the advanced reasoning capabilities of LLMs on problems requiring novel and intricate natural language proofs.
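To make the judging setup concrete, the following is a minimal sketch of what an LLM-as-a-judge call can look like, assuming an OpenAI-style chat API; the judge model name, rubric wording, and 0-10 scale are illustrative placeholders rather than the exact configuration used in this work.

```python
# A minimal sketch of LLM-as-a-judge grading for a natural language proof.
# The judge model, rubric, and score scale are illustrative assumptions,
# not the exact prompt or configuration used in this thesis.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a mathematical proof.

Problem: {problem}
Reference solution: {reference}
Candidate proof: {candidate}

Judge whether the candidate proof is correct and complete. Reply with a
score from 0 to 10 on the first line, followed by a short justification."""

def judge_proof(problem: str, reference: str, candidate: str) -> str:
    """Return the judge model's raw verdict for one candidate proof."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            problem=problem, reference=reference, candidate=candidate)}],
    )
    return response.choices[0].message.content
```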

We find that, despite recent progress, current leading LLMs are generally still unable to adequately complete these complex proof-based tasks. This underscores the necessity for more challenging benchmarks that probe deeper levels of mathematical understanding and proof generation, thereby providing a more accurate measure of LLM reasoning abilities and guiding future research towards developing more capable and robust AI systems for mathematics.


Chapter 2

Related works

2.1 LLMs for mathematical reasoning

The introduction of Large Language Models (LLMs) has marked a significant turning point in artificial intelligence. Among their many applications, mathematical reasoning has emerged as a key area of exploration and advancement. This section surveys the development of LLMs in tackling mathematical tasks.

Advances and Core Techniques in LLM Mathematical Reasoning

LLMs have rapidly improved at mathematical reasoning, progressing from simple arithmetic to solving more complex problems (Liu et al., 2025). A key driver of this progress is the increase in model size. Larger models follow instructions better and reason more effectively. Research shows that LLMs with over 100 billion parameters can solve difficult math problems with the right prompts (Forootani, 2025).

Several techniques have augmented these improvements due to model scale. A major cornerstone in this area is Chain-of-Thought (CoT) prompting, which guides LLMs to list the steps in their reasoning before giving a final answer. This improves both performance on complex tasks and the interpretability of the reasoning process. Wei et al., 2023 showed that CoT prompting allows LLMs to deconstruct problems they would usually fail to solve. Researchers have built upon basic CoT with more advanced variants and new types of "thinking" models. Examples include SoftCoT (Y. Xu et al., 2025), Continuous CoT (Cheng & Durme, 2024), and Coconut (Hao et al., 2024). SoftCoT, for instance, utilizes a smaller assistant model to generate "soft thought tokens," which are then projected into the primary LLM's representation space through a trainable module. This fine-tuning approach is parameter-efficient and has demonstrated enhanced performance on mathematical reasoning benchmarks.
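As a concrete illustration, the following is a minimal sketch of zero-shot CoT prompting in the spirit of Wei et al., 2023, assuming an OpenAI-style chat API; the model name and problem text are placeholders, not tied to any experiment in this chapter.

```python
# A minimal sketch of zero-shot Chain-of-Thought prompting. The model name
# and the problem are placeholder choices, not tied to any experiment here.
from openai import OpenAI

client = OpenAI()

problem = "If 3 apples cost $1.20, how much do 7 apples cost?"

# The CoT trigger asks the model to lay out intermediate steps before
# committing to a final answer.
cot_prompt = f"{problem}\nLet's think step by step, then state the final answer."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)
```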

The emergence of models explicitly designed for "thinking," such as OpenAI's o1 and DeepSeek-R1, represents a notable progression. These models often take more time to reason during inference and are trained with Reinforcement Learning (RL) to build advanced cognitive skills like self-checking and reflection. DeepSeek-R1, for example, is reported to develop CoT reasoning capabilities autonomously through a pure RL training paradigm (DeepSeek-AI et al., 2025). Multi-round Thinking, where a model uses previous answers to try again and improve, has also led to better results, as seen in models like QwQ-32B and DeepSeek-R1 on difficult benchmarks such as AIME 2024 (Tian et al., 2025). RL-based reasoning methods are also being used in multimodal LLMs. Vision-R1, for instance, uses Progressive Thinking Suppression Training (PTST) to improve performance in solving multimodal math problems (W. Huang et al., 2025).

To address LLMs' weakness in precise numerical calculation and rigorous symbolic manipulation, Tool-Integrated Reasoning (TIR) has gained prominence (Ahn et al., 2024). TIR lets LLMs hand off subparts of a problem to tools like Python for computation or symbolic solvers for algebra. This improves accuracy, especially when high precision or structured math is needed. For example, the winning solution in the AIMO-2 competition used CoT fine-tuning first, then fine-tuned again on a TIR dataset. This helped the model combine natural language reasoning with structured computation (Moshkov et al., 2025). TIR frameworks like TATA (Teaching LLMs According to Their Aptitude) (X. Xu et al., 2025) help models choose between CoT and TIR depending on the problem.
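As a toy illustration of this hand-off, the sketch below delegates an equation-solving subproblem to SymPy; in a full TIR loop the LLM would emit such a call inside a code block, and the harness would execute it and feed the result back into the context. The dispatch here is a simplification, not any specific published framework.

```python
# A toy illustration of the TIR hand-off: symbolic algebra is delegated to
# SymPy instead of being manipulated in natural language by the model.
import sympy as sp

def solve_with_tool(expression: str, variable: str = "x"):
    """Solve `expression == 0` with a symbolic solver on the model's behalf."""
    x = sp.Symbol(variable)
    equation = sp.sympify(expression)  # parse the model-emitted expression
    return sp.solve(equation, x)

# In a full TIR loop, the result would be appended to the LLM's context so
# it can continue reasoning with an exact, verified intermediate value.
print(solve_with_tool("x**2 - 5*x + 6"))  # -> [2, 3]
```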

Still, LLMs struggle with highly advanced math problems, especially ones needing full proofs like those in math olympiads. A recent work evaluated state-of-the-art LLMs on the 2025 USAMO problems, and the results show significant limits (Petrov et al., 2025). Even a leading model like Gemini-2.5-Pro achieved an average score of only 25%, with other prominent models scoring below 5%. Common failure modes included flawed logical steps, introduction of unjustified assumptions, a lack of creative problem-solving strategies, and a tendency to falsely claim that a problem had been solved.

The observed discrepancy between LLM performance on benchmarks requiring numerical answers and those requiring rigorous proof generation suggests a potential "reasoning illusion." While techniques like CoT and TIR demonstrate improvements on certain types of mathematical problems, the poor performance on proof-based tasks suggests that these improvements might not always translate to genuine, deep mathematical understanding. LLMs may be leveraging pattern matching and tool-assisted computation to succeed in numerical answer-oriented tasks, without necessarily developing deep logical deductive capabilities. This illustrates the critical need for benchmarks that specifically evaluate these deeper reasoning and proof-generation abilities.

Formal vs. Informal Mathematical Reasoning by LLMs

Mathematical reasoning by LLMs can be broadly categorized into two domains: formal mathematical reasoning, which operates under the rigorous syntax of symbolic systems and proof assistants, and informal mathematical reasoning, which expresses mathematics in natural language.

Formal Mathematical Reasoning

This sub-field is characterized by its emphasis on mechanical verifiability and logical rigor.

LLMs for Formal Proof Generation (Lean, Isabelle, etc.): Much research is dedicated to training LLMs to generate proofs in formal languages such as Lean 4 (J. Zhang et al., 2025) and Isabelle (X. Zhao et al., 2024). The goal of this line of research is to develop systems that can autonomously produce machine-verifiable mathematical proofs.

Several notable models and systems have demonstrated progress. AlphaGeometry made headlines by solving Olympiad-level geometry problems. It combined a language model, which suggests potentially useful geometric constructions, with a symbolic deduction engine that formally verifies these steps. A key component of its development was the generation of 100 million synthetic data examples for training (Trinh et al., 2024). An improved version, AlphaGeometry2, extended its formal language to handle a wider range of geometric problems and leveraged the Gemini architecture. AlphaGeometry2 reportedly surpassed the average performance of human gold medalists on a benchmark of IMO geometry problems (Chervonyi et al., 2025).

The Self-play Theorem Prover (STP) addresses the critical issue of data scarcity in formal mathematics by employing a dual-agent system: a conjecturer that proposes new theorems and a prover that attempts to prove them. This self-play loop allows the system to iteratively generate its own training data, which results in significant performance improvements on various reasoning benchmarks (Dong & Ma, 2025). Goedel-Prover is an open-source LLM designed for formal proof generation in Lean 4. It achieved state-of-the-art results by first formalizing a large dataset of natural language math problems into Lean 4 statements and then employing an expert iteration strategy to train the prover (Lin et al., 2025). Kimina-Prover Preview developed a reasoning-driven exploration paradigm, utilizing large-scale RL based on Qwen2.5-72B to generate proofs in Lean 4. This approach has also set new SOTA performance on the miniF2F benchmark (H. Wang et al., 2025).
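The loop below is a minimal sketch of the conjecturer/prover interaction described above, under the assumption that a formal checker acts as the verifier; `conjecturer`, `prover`, and `verifier` are hypothetical stand-ins, not STP's actual components.

```python
# A minimal sketch of a conjecturer/prover self-play loop in the spirit of
# STP (Dong & Ma, 2025). All three callables are hypothetical stand-ins; in
# the real system the verifier would be a formal proof checker such as Lean.
from typing import Callable

def self_play_round(
    conjecturer: Callable[[], str],        # proposes a new theorem statement
    prover: Callable[[str], str],          # attempts a proof of the statement
    verifier: Callable[[str, str], bool],  # mechanically checks the proof
    num_attempts: int = 100,               # arbitrary round size
) -> list[tuple[str, str]]:
    """Collect verified (statement, proof) pairs as fresh training data."""
    training_data = []
    for _ in range(num_attempts):
        statement = conjecturer()
        proof = prover(statement)
        if verifier(statement, proof):  # keep only machine-checked proofs
            training_data.append((statement, proof))
    return training_data
```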

Despite these advancements, a key challenge remains the scarcity of large-scale, high-quality datasets of formalized mathematical statements and proofs. The rigorous syntax of formal languages often makes proof generation in these systems more difficult for LLMs compared to natural language (J. Zhang et al., 2025).

Autoformalization: This critical sub-area focuses on the automatic translation of informal mathematical language into formal languages that can be processed by proof assistants. LLMs like GPT-4 have considerably advanced autoformalization capabilities through in-context learning. Techniques such as back-translation, where existing formal statements are automatically informalized into natural language to produce synthetic pairs of formal and informal mathematical statements, have been explored to augment training data (Yang et al., 2024).
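As a small illustration of back-translation, the sketch below generates (informal, formal) training pairs from existing formal statements; `informalize` stands in for an LLM call, and both the canned English rendering and the Lean statement are invented for illustration.

```python
# A minimal sketch of back-translation for autoformalization data: existing
# formal statements are rendered in natural language to yield synthetic
# (informal, formal) pairs. `informalize` stands in for an LLM call.
def informalize(formal_statement: str) -> str:
    canned = {  # canned output purely for illustration
        "theorem add_comm (a b : Nat) : a + b = b + a":
            "Show that addition of natural numbers is commutative.",
    }
    return canned[formal_statement]  # a real system would query an LLM here

# e.g., statements drawn from a library such as Mathlib
formal_corpus = ["theorem add_comm (a b : Nat) : a + b = b + a"]

# Each pair becomes a synthetic training example for the forward task:
# natural language -> formal statement.
synthetic_pairs = [(informalize(stmt), stmt) for stmt in formal_corpus]
print(synthetic_pairs)
```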

However, autoformalization remains a difficult task, especially for complex or abstract mathematical concepts (Gulati et al., 2024). LLMs can struggle with selecting the correct preambles or library imports from vast formal mathematics libraries and may even generate references to non-existent preambles. Evaluating autoformalization faithfulness, i.e., whether the formal statement accurately captures the semantics of the original informal input, remains an area of ongoing research.

Informal (Natural Language) Mathematical Proofs

This domain concerns the generation and comprehension of mathematical proofs in the form most familiar to humans: natural language.

Current Landscape: While CoT techniques enable LLMs to generate natural language step-by-step reasoning, there has been relatively little research into the generation of rigorous and verifiable natural language mathematical proofs for complex problems. Much of the existing work on mathematical reasoning in LLMs focuses on solving problems that yield numerical or short-form answers.

The generation and evaluation of natural language proofs by LLMs present a unique set of challenges:

• Verifiability: A fundamental difficulty lies in the automated verification of natural language proofs. Unlike formal proofs, which can be mechanically checked by proof assistants, natural language proofs typically require expert human evaluation or highly sophisticated AI-driven verifiers. Even minor errors in logic or calculation can invalidate an entire proof, and these can be difficult for current systems to reliably detect in a natural language context.

• Ambiguity: Natural language is inherently ambiguous and often relies on implicit contextual knowledge and common sense. For instance, whether omitting certain steps in a mathematical argument is acceptable depends on the assumed level of familiarity of both the author and the audience. LLMs might not fully grasp these ambiguities and maintain the necessary level of precision throughout a proof. Another concern is that an LLM may inadvertently exploit these ambiguities and pretend that it has completed a proof.

• Data Scarcity for Complex Proofs: High-quality, large-scale datasets of complex natural language proofs, particularly those with detailed step-by-step annotations or explanations of reasoning, are scarce. This is especially true for niche or advanced mathematical topics.

The comparatively smaller volume of research focused on generating and verifying natural language mathematical proofs by LLMs, as opposed to formal proofs or problem-solving with numerical answers, may be attributed to several factors. Formal systems and proof assistants make formal proof generation an attractive target for AI research, as they provide a clear, objective measure of correctness. This reliable feedback loop for model training and evaluation stands in contrast to the inherent ambiguity of natural language. Additionally, formal mathematics libraries (e.g., Mathlib for Lean) offer a growing, structured, and verified corpus of mathematical knowledge. The field of Automated Theorem Proving (ATP) also has a long-standing tradition in AI, historically focusing on formal logic.

Despite these challenges, we believe that the generation and evaluation of complex informal mathematical proofs is vitally important. Natural language proofs are often more interpretable, which makes them more desirable in many settings. For instance, in an education setting, capabilities in complex mathematical reasoning may allow for more efficient proof grading or student hint generation. The space of mathematical reasoning can also act as a testbed for AI reasoning in less-defined real-world environments, where formalization would be nearly impossible and where interpretability would be ever more important.

2.2 Benchmarking Mathematical Reasoning in Large Language Models

The evaluation of mathematical reasoning capabilities in LLMs relies heavily on benchmarks. This section reviews existing benchmarks, categorizing them by their focus and problem types, and critically examines their limitations.

Short-Answer Benchmarks

These benchmarks typically assess LLMs on mathematical problems where the expected output is a final numerical answer or a concise textual response. Prominent examples include:

• MathQA (Amini et al., 2019): Derived from the AQuA dataset, MathQA presents math word problems with multiple-choice answers. It was designed with an emphasis on interpretability by mapping problems to sequences of predefined mathematical operations (Amini et al., 2019).

• MATH (Hendrycks et al., 2021): This dataset comprises challenging problems from mathematics competitions such as the American Mathematics Competitions (AMC 10/12) and the American Invitational Mathematics Examination (AIME). While solutions often require multiple reasoning steps, evaluation typically focuses on the correctness of the final numerical answer (Hendrycks et al., 2021).

• GSM8K (Cobbe et al., 2021): Consisting of grade school mathematics word problems, GSM8K is designed to test multi-step arithmetic reasoning. Models are expected to produce a numerical answer derived through a sequence of calculations (Cobbe et al., 2021).

Despite their widespread use, these short-answer benchmarks face several criticisms:

• Emphasis on Final Answers: A primary limitation is their predominant focus on the correctness of the final answer, often neglecting the intermediate reasoning steps. This evaluation approach can inadvertently reward models that arrive at the correct solution through flawed reasoning processes (Petrov et al., 2025).

• Benchmark Saturation: Many of these benchmarks are rapidly approaching saturation, with state-of-the-art LLMs achieving very high accuracy scores (e.g., over 97% on GSM8K, as reported in Budagam et al., 2024). This saturation diminishes their efficacy in differentiating the capabilities of advanced models.

• Data Contamination and Overfitting: A significant concern is the potential for benchmark data to have been included in the vast training corpora of LLMs. This data contamination can lead to inflated performance metrics that reflect memorization rather than true reasoning ability. Studies that have created perturbed versions of these benchmarks, such as GSM1k (a contamination-free version of GSM8K) (H. Zhang et al., 2024) and RV-Bench (which introduces random variables into MATH problems) (Hong et al., 2025), have often observed notable performance drops, suggesting that models may be overfitting to the original benchmark distributions (a minimal sketch of this value-perturbation idea appears after this list).

• Limited Scope for Complex Reasoning: These benchmarks, often centered on arithmetic or basic algebraic manipulation, may not adequately test deeper conceptual understanding, abstract reasoning, or the ability to construct complex proofs (K. Huang et al., 2025). MathQA, for example, explicitly omits problems involving higher-order polynomials or entirely non-numeric solutions.

• Sensitivity to Input Perturbations: LLM performance on these benchmarks can be surprisingly fragile, exhibiting significant degradation in response to minor alterations in problem phrasing, numerical values, or the introduction of irrelevant information. This sensitivity suggests a lack of robust understanding of underlying mathematical principles. For instance, MATH-P-Hard introduces hard perturbations to MATH problems so that the original solution steps do not apply, resulting in significant performance drops. The researchers found that the models blindly apply learned problem-solving tactics even to unsuitable problems (K. Huang et al., 2025).

• Lack of Diversity in Problem Types and Assessed Reasoning Skills: Existing benchmarks may not encompass a sufficiently broad range of mathematical topics or the diverse cognitive skills required for advanced mathematical thought. The MaTT benchmark, for example, found that LLM performance can vary significantly even across closely related subtopics within the same general mathematical area (Davoodi et al., 2025).
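To illustrate the value-perturbation idea referenced in the data contamination point above, the sketch below re-samples the numeric values of a templated problem in the spirit of GSM1k and RV-Bench; the template and value ranges are invented for illustration.

```python
# A minimal sketch of value perturbation for contamination checks: re-sample
# the numbers in a templated problem so a memorized answer no longer matches.
# The template and value ranges are invented for illustration.
import random

TEMPLATE = "A rectangle has width {w} and height {h}. What is its area?"

def perturbed_instance(rng: random.Random) -> tuple[str, int]:
    """Return a fresh (question, answer) pair with randomized values."""
    w, h = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(w=w, h=h), w * h

rng = random.Random(0)
question, answer = perturbed_instance(rng)
# A model that merely memorized the original instance fails once the surface
# values change, while genuine multi-step reasoning is unaffected.
```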

Olympiad-Level and Proof-Focused Benchmarks

To address the limitations of short-answer benchmarks and to probe more advanced mathematical reasoning, several benchmarks focusing on Olympiad-level problems and proof generation have been developed.

• miniF2F (K. Zheng et al., 2022): This benchmark provides a collection of 488 formal problem statements derived from Olympiad-level mathematics (AIME, AMC, IMO) and undergraduate courses, formalized in multiple interactive theorem proving systems such as Lean, Isabelle, and HOL Light. It primarily covers algebra, number theory, and inequalities, with formalizations done manually.

• PutnamBench (Tsoukalas et al., 2024): Sourced from the challenging William Lowell Putnam Mathematical Competition, PutnamBench features 1692 hand-constructed formalizations of 640 theorems in Lean 4, Isabelle, and Coq. Current models have demonstrated very limited success on this benchmark, solving only a handful of the problems.

• Omni-MATH (Gao et al., 2025): This benchmark aims to be a universal Olympiad-level mathematics evaluation suite, comprising 4,428 problems spanning 33 sub-domains and 10 difficulty levels. It utilizes GPT-4o as a judge (Omni-Judge) for evaluating answers, which are typically numerical or SymPy objects.

• FrontierMath (Glazer et al., 2024): This benchmark introduces hundreds of original, exceptionally challenging mathematics problems at the advanced undergraduate, graduate, and research levels, crafted and vetted by expert mathematicians.
