Benchmarking LLMs on Advanced Mathematical Reasoning

Jonathan Yue, Daniel Klein

Electrical Engineering and Computer Sciences, University of California, Berkeley

Technical Report No. UCB/EECS-2025-121
/Pubs/TechRpts/2025/EECS-2025-121.html

May 16, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

A thesis submitted in partial satisfaction of the requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:
Professor Dan Klein, Chair
Professor Gireeja Ranade

Spring 2025
Abstract

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Dan Klein, Chair

Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, the existing short-answer-based benchmarks can suffer from limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits the scope of the problems. A potential reason for this significant gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework to judge model-generated proofs and evaluate its efficacy. Then, we propose a benchmark of 77 PhD-level proof questions, drawn from Roman Vershynin's "High-Dimensional Probability: An Introduction with Applications in Data Science", and challenge state-of-the-art LLMs with these questions. We evaluated the LLM-generated solutions using the LLM-as-a-judge framework and found that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.
Contents

1 Introduction
2 Related works
2.1 LLM for mathematical reasoning
2.2 Benchmarking Mathematical Reasoning in Large Language Models
2.3 LLM-as-a-Judge for Evaluating Mathematical Reasoning
3 Data collection
3.1 Ground truth human proof evaluation
3.2 Mathematical text and ground truth solutions
4 LLM-as-a-judge for mathematical proofs
4.1 Evaluation
5 LLM natural language proof-writing
5.1 Methodology
5.2 Results
6 Discussion
6.1 Limitations
7 Conclusion and Future Directions
Bibliography
Acknowledgments

I am deeply grateful to my advisor, Dan Klein, for his support and mentorship over the past two years, and to Professor Gireeja Ranade for being my second reader and offering valuable conversations and feedback. I also wish to thank my mentor, Nicholas Tomlin, for his guidance and thoughtful input throughout this process.

I would like to thank all my classmates in the CompSci 294-271 course, AI in Education, taught by Professor Gireeja Ranade and Professor Narges Norouzi, where I engaged in discussions that inspired the conception of this project. I especially thank my team members in the course, Nora Guo and Sveinung Myhre, with whom I collaborated to obtain the IRB approval and conduct a related study, for helping shape the ideas for this work.

Finally, I extend my heartfelt thanks to my family and friends for their constant encouragement and support, which has been invaluable in every facet of my life.
Chapter 1

Introduction

The advent of Large Language Models (LLMs) has resulted in remarkable advancements in artificial intelligence, with mathematical reasoning emerging as a significant area of interest. LLMs have evolved from performing rudimentary arithmetic to assisting with mathematical discovery (Romera-Paredes et al., 2024), driven by increased model size and sophisticated prompting techniques such as Chain-of-Thought (CoT) prompting (Wei et al., 2023) and Tool-Integrated Reasoning (TIR) (Ahn et al., 2024).
Despite these strides, a considerable gap persists in evaluating the deeper reasoning capabilities of LLMs. Existing benchmarks predominantly focus on short-answer or multiple-choice questions, such as MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). While useful, these benchmarks often emphasize the final numerical output, potentially neglecting the rigor of the intermediate reasoning steps and suffering from issues such as benchmark saturation and data contamination (Hong et al., 2025; Petrov et al., 2025). Furthermore, their limited scope may not sufficiently assess complex conceptual understanding or the ability to construct intricate arguments (K. Huang et al., 2025). The poor performance of LLMs on problems that require rigorous proof generation, as opposed to numerical answers, suggests a potential "reasoning illusion," where success in some tasks might stem from pattern matching or tool assistance rather than genuine mathematical insight (Petrov et al., 2025).
On the other hand, benchmarks centered on formal proofs, such as MiniF2F (K. Zheng et al., 2022) and PutnamBench (Tsoukalas et al., 2024), operate within the strict confines of symbolic systems like Lean or Isabelle. While valuable for assessing mechanically verifiable reasoning, the necessity of translating problem statements into these formal languages (autoformalization) can be a challenging task itself, which restricts the breadth of problems that can be addressed (Gulati et al., 2024; J. Zhang et al., 2025). This leaves a significant gap in evaluating the generation of novel, natural language mathematical proofs, which are more akin to human mathematical practice and crucial for interpretability. A primary obstacle to developing benchmarks for such proofs is the inherent difficulty of grading them consistently and at scale, as natural language proofs lack the immediate verifiability of formal proofs and often require expert human evaluation (Frieder et al., 2023).
To address these limitations and bridge the gap in evaluating advanced mathematical reasoning in natural language, this paper makes two primary contributions. First, we propose and evaluate an LLM-as-a-judge framework specifically designed for assessing the correctness and coherence of mathematical proofs. This approach offers a scalable alternative to manual expert evaluation. Second, leveraging this evaluation framework, we introduce a new benchmark comprising 77 PhD-level proof-based questions, designed to rigorously test the advanced reasoning capabilities of LLMs on problems requiring novel and intricate natural language proofs.
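The LLM-as-a-judge setup described above can be sketched as a prompt-construction step; the rubric wording, score scale, and function names below are illustrative assumptions, not the exact prompt used in this work.

```python
# Sketch of an LLM-as-a-judge prompt for grading a natural language proof.
# The rubric text and 0-10 scale are assumed for illustration; the resulting
# string would be sent to a judge model via whatever API is in use.
JUDGE_INSTRUCTIONS = (
    "You are a mathematics professor grading a proof. Compare the candidate "
    "proof against the reference solution. Check every logical step, flag "
    "unjustified assumptions, and give a score from 0 (no progress) to 10 "
    "(complete and rigorous). End with a line 'Score: <n>'."
)

def build_judge_prompt(problem: str, reference: str, candidate: str) -> str:
    """Assemble the full grading prompt from its three parts."""
    return (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\n"
        f"Candidate proof:\n{candidate}\n"
    )
```

Parsing the trailing "Score:" line from the judge's reply then yields a numeric grade that can be aggregated across the benchmark.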
We find that, despite recent progress, current leading LLMs are generally still unable to adequately complete these complex proof-based tasks. This underscores the necessity for more challenging benchmarks that probe deeper levels of mathematical understanding and proof generation, thereby providing a more accurate measure of LLM reasoning abilities and guiding future research towards developing more capable and robust AI systems for mathematics.
Chapter 2

Related works

2.1 LLM for mathematical reasoning

The introduction of Large Language Models (LLMs) has marked a significant turn in artificial intelligence. Among their many applications, mathematical reasoning emerges as a key area of exploration and advancement. This section surveys the development of LLMs in tackling mathematical tasks.
Advances and Core Techniques in LLM Mathematical Reasoning

LLMs have rapidly improved at mathematical reasoning, progressing from simple arithmetic to solving more complex problems (Liu et al., 2025). A key driver of this progress is the increase in model size. Larger models follow instructions better and reason more effectively. Research shows that LLMs with over 100 billion parameters can solve difficult math problems with the right prompts (Forootani, 2025).
The discovery of several techniques has augmented this improvement due to model scale. A major cornerstone in this area is Chain-of-Thought (CoT) prompting. It guides LLMs to list the steps in their reasoning before giving a final answer. This improves both performance on complex tasks and the interpretability of the reasoning process. Wei et al., 2023 showed that CoT prompting allows LLMs to decompose problems they would usually fail at. Researchers have built upon basic CoT with more advanced variants and new types of "thinking" models. Examples include SoftCoT (Y. Xu et al., 2025), Continuous CoT (Cheng & Durme, 2024), and Coconut (Hao et al., 2024). SoftCoT, for instance, utilizes a smaller assistant model to generate "soft thought tokens," which are then projected into the primary LLM's representation space through a trainable module. This fine-tuning approach is parameter-efficient and has demonstrated enhanced performance on mathematical reasoning benchmarks.
The emergence of models explicitly designed for "thinking," such as OpenAI's o1 and DeepSeek-R1, represents a notable progression. These models often take more time to reason during inference and are trained with Reinforcement Learning (RL) to build advanced cognitive skills like self-checking and reflection. DeepSeek-R1, for example, is reported to develop CoT reasoning capabilities autonomously through a pure RL training paradigm (DeepSeek-AI et al., 2025). Multi-round Thinking, where a model uses previous answers to try again and improve, has also led to better results, as seen in models like QwQ-32B and DeepSeek-R1 on difficult benchmarks such as AIME 2024 (Tian et al., 2025). RL-based reasoning methods are also being used in multimodal LLMs. Vision-R1, for instance, uses Progressive Thinking Suppression Training (PTST) to improve performance in solving multimodal math problems (W. Huang et al., 2025).
To address LLMs' weakness in precise numerical calculation and rigorous symbolic manipulation, Tool-Integrated Reasoning (TIR) has gained prominence (Ahn et al., 2024). TIR lets LLMs hand off subparts of a problem to tools like Python for math or symbolic solvers for algebra. This improves accuracy, especially when high precision or structured math is needed. For example, the winning solution in the AIMO-2 competition used CoT fine-tuning first, then fine-tuned again on a TIR dataset. This helped the model combine natural language reasoning with structured computation (Moshkov et al., 2025). TIR frameworks like TATA (Teaching LLMs According to Their Aptitude) (X. Xu et al., 2025) help models choose between CoT and TIR depending on the problem.
Still, LLMs struggle with highly advanced math problems, especially ones needing full proofs like those in math olympiads. A recent work evaluated state-of-the-art LLMs on the 2025 USAMO problems. The results show significant limits (Petrov et al., 2025). Even a leading model like Gemini 2.5 Pro achieved an average score of only 25%, with other prominent models scoring below 5%. Common failure modes included flawed logical steps, introduction of unjustified assumptions, a lack of creative problem-solving strategies, and a tendency to falsely claim that a problem had been solved.
The observed discrepancy between LLM performance on benchmarks requiring numerical answers and those requiring rigorous proof generation suggests a potential "reasoning illusion." While techniques like CoT and TIR demonstrate improvements on certain types of mathematical problems, the poor performance on proof-based tasks suggests that these improvements might not always translate to genuine, deep mathematical understanding. LLMs may be leveraging pattern matching and tool-assisted computation to succeed in numerical answer-oriented tasks, without necessarily developing deep logical deductive capabilities. This illustrates the critical need for benchmarks that specifically evaluate these deeper reasoning and proof-generation abilities.
Formal vs. Informal Mathematical Reasoning by LLMs

Mathematical reasoning by LLMs can be broadly categorized into two domains: formal mathematical reasoning, which operates under the rigorous syntax of symbolic systems and proof assistants, and informal mathematical reasoning, which expresses mathematics in natural language.
Formal Mathematical Reasoning

This sub-field is characterized by its emphasis on mechanical verifiability and logical rigor.

LLMs for Formal Proof Generation (Lean, Isabelle, etc.): Much research is dedicated to training LLMs to generate proofs in formal languages such as Lean 4 (J. Zhang et al., 2025) and Isabelle (X. Zhao et al., 2024). The goal of this line of research is to develop systems that can autonomously produce machine-verifiable mathematical proofs.
Several notable models and systems have demonstrated progress. AlphaGeometry made headlines by solving Olympiad-level geometry problems. It combined a language model, which suggests potentially useful geometric constructions, with a symbolic deduction engine that formally verifies these steps. A key component of its development was the generation of 100 million synthetic data examples for training (Trinh et al., 2024). An improved version, AlphaGeometry 2, extended its formal language to handle a wider range of geometric problems and leveraged the Gemini architecture. AlphaGeometry 2 reportedly surpassed the average performance of human gold medalists on a benchmark of IMO geometry problems (Chervonyi et al., 2025).
The Self-play Theorem Prover (STP) addresses the critical issue of data scarcity in formal mathematics by employing a dual-agent system: a conjecturer that proposes new theorems and a prover that attempts to prove them. This self-play loop allows the system to iteratively generate its own training data, which results in significant performance improvements on various reasoning benchmarks (Dong & Ma, 2025). Goedel-Prover is an open-source LLM designed for formal proof generation in Lean 4. It achieved state-of-the-art results by first formalizing a large dataset of natural language math problems into Lean 4 statements and then employing an expert iteration strategy to train the prover (Lin et al., 2025). Kimina-Prover Preview developed a reasoning-driven exploration paradigm, utilizing large-scale RL based on Qwen2.5-72B to generate proofs in Lean 4. This approach has also set new SOTA performance on the MiniF2F benchmark (H. Wang et al., 2025).

Despite these advancements, a key challenge remains the scarcity of large-scale, high-quality datasets of formalized mathematical statements and proofs. The rigorous syntax of formal languages often makes proof generation in these systems more difficult for LLMs compared to natural language (J. Zhang et al., 2025).
Autoformalization: This critical sub-area focuses on the automatic translation of informal mathematical language into formal languages that can be processed by proof assistants. LLMs like GPT-4 have considerably advanced autoformalization capabilities through in-context learning. Techniques such as back-translation, where existing formal statements are automatically informalized into natural language to produce synthetic pairs of formal and informal mathematical statements, have been explored to augment training data (Yang et al., 2024).

However, autoformalization remains a difficult task, especially for complex or abstract mathematical concepts (Gulati et al., 2024). LLMs can struggle with selecting the correct preambles or library imports from vast formal mathematics libraries and may even generate references to non-existent preambles. Evaluating autoformalization faithfulness, i.e., that the formal statement accurately captures the semantics of the original informal input, remains an area of ongoing research.
Informal (Natural Language) Mathematical Proofs

This domain concerns the generation and comprehension of mathematical proofs in the form most familiar to humans: natural language.

Current Landscape: While CoT techniques enable LLMs to generate natural language step-by-step reasoning, there has been relatively little research into the generation of rigorous and verifiable natural language mathematical proofs for complex problems. Much of the existing work on mathematical reasoning in LLMs focuses on solving problems that yield numerical or short-form answers.
The generation and evaluation of natural language proofs by LLMs present a unique set of challenges:

• Verifiability: A fundamental difficulty lies in the automated verification of natural language proofs. Unlike formal proofs, which can be mechanically checked by proof assistants, natural language proofs typically require human expert evaluation or highly sophisticated AI-driven verifiers. Even minor errors in logic or calculation can invalidate an entire proof, and these can be difficult for current systems to reliably detect in a natural language context.

• Ambiguity: Natural language is inherently ambiguous and often relies on implicit contextual knowledge and common sense. For instance, whether omitting certain steps in a mathematical argument is acceptable depends on the assumed level of familiarity of both the author and the audience. LLMs might not fully grasp these ambiguities and maintain the necessary level of precision throughout a proof. Another concern is that an LLM may inadvertently exploit these ambiguities and pretend as if it had completed a proof.

• Data Scarcity for Complex Proofs: High-quality, large-scale datasets of complex natural language proofs, particularly those with detailed step-by-step annotations or explanations of reasoning, are scarce. This is especially true for niche or advanced mathematical topics.
The comparatively smaller volume of research focused on generating and verifying natural language mathematical proofs by LLMs, as opposed to formal proofs or problem-solving with numerical answers, may be attributed to several factors. Formal systems and proof assistants make formal proof generation an attractive target for AI research, as they provide a clear, objective measure of correctness. This reliable feedback loop for model training and evaluation stands in contrast to the inherent ambiguity of natural language. Additionally, formal mathematics libraries (e.g., Mathlib for Lean) offer a growing, structured, and verified corpus of mathematical knowledge. The field of Automated Theorem Proving (ATP) also has a long-standing tradition in AI, historically focusing on formal logic.
Despite these challenges, we believe that the generation and evaluation of complex informal mathematical proofs is vitally important. Natural language proofs are often more interpretable, which makes them more desirable in many settings. For instance, in an education setting, capabilities in complex mathematical reasoning may allow for more efficient proof grading or student hint generation. The space of mathematical reasoning can also act as a testbed for AI reasoning in less-defined real-world environments, where formalization would be nearly impossible and where interpretability would be ever more important.
2.2 Benchmarking Mathematical Reasoning in Large Language Models

The evaluation of mathematical reasoning capabilities in LLMs relies heavily on benchmarks. This section reviews existing benchmarks, categorizing them by their focus and problem types, and critically examines their limitations.

Short-Answer Benchmarks

These benchmarks typically assess LLMs on mathematical problems where the expected output is a final numerical answer or a concise textual response. Prominent examples include:
• MathQA (Amini et al., 2019): Derived from the AQuA dataset, MathQA presents math word problems with multiple-choice answers. It was designed with an emphasis on interpretability by mapping problems to sequences of predefined mathematical operations (Amini et al., 2019).

• MATH (Hendrycks et al., 2021): This dataset is comprised of challenging problems from mathematics competitions such as the American Mathematics Competitions (AMC 10/12) and the American Invitational Mathematics Examination (AIME). While solutions often require multiple reasoning steps, evaluation typically focuses on the correctness of the final numerical answer (Hendrycks et al., 2021).

• GSM8K (Cobbe et al., 2021): Consisting of grade school mathematics word problems, GSM8K is designed to test multi-step arithmetic reasoning. Models are expected to produce a numerical answer derived through a sequence of calculations (Cobbe et al., 2021).
Despite their widespread use, these short-answer benchmarks face several criticisms:

• Emphasis on Final Answers: A primary limitation is their predominant focus on the correctness of the final answer, often neglecting the intermediate reasoning steps. This evaluation approach can inadvertently reward models that arrive at the correct solution through flawed reasoning processes (Petrov et al., 2025).

• Benchmark Saturation: Many of these benchmarks are rapidly approaching saturation, with state-of-the-art LLMs achieving very high accuracy scores (e.g., over 97% on GSM8K, as reported in Budagam et al., 2024). This saturation diminishes their efficacy in differentiating the capabilities of advanced models.

• Data Contamination and Overfitting: A significant concern is the potential for benchmark data to have been included in the vast training corpora of LLMs. This data contamination can lead to inflated performance metrics that reflect memorization rather than true reasoning ability. Studies that have created perturbed versions of these benchmarks, such as GSM1k (a contamination-free version of GSM8K) (H. Zhang et al., 2024) and RV-Bench (which introduces random variables into MATH problems) (Hong et al., 2025), have often observed notable performance drops, suggesting that models may be overfitting to the original benchmark distributions.

• Limited Scope for Complex Reasoning: These benchmarks, often centered on arithmetic or basic algebraic manipulation, may not adequately test deeper conceptual understanding, abstract reasoning, or the ability to construct complex proofs (K. Huang et al., 2025). MathQA, for example, explicitly omits problems involving higher-order polynomials or entirely non-numeric solutions.

• Sensitivity to Input Perturbations: LLM performance on these benchmarks can be surprisingly fragile, exhibiting significant degradation in response to minor alterations in problem phrasing, numerical values, or the introduction of irrelevant information. This sensitivity suggests a lack of robust understanding of the underlying mathematical principles. For instance, MATH-P-Hard introduces hard perturbations to MATH problems so that the original solution steps do not apply, resulting in significant performance drops. The researchers found that models blindly apply learned problem-solving tactics even to unsuitable problems (K. Huang et al., 2025).

• Lack of Diversity in Problem Types and Assessed Reasoning Skills: Existing benchmarks may not encompass a sufficiently broad range of mathematical topics or the diverse cognitive skills required for advanced mathematical thought. The MaTT benchmark, for example, found that LLM performance can vary significantly even across closely related subtopics within the same general mathematical area (Davoodi et al., 2025).
Olympiad-Level and Proof-Focused Benchmarks

To address the limitations of short-answer benchmarks and to probe more advanced mathematical reasoning, several benchmarks focusing on Olympiad-level problems and proof generation have been developed.

• MiniF2F (K. Zheng et al., 2022): This benchmark provides a collection of 488 formal problem statements derived from Olympiad-level mathematics (AIME, AMC, IMO) and undergraduate courses, formalized in multiple interactive theorem proving systems like Lean, Isabelle, and HOL Light. It primarily covers algebra, number theory, and inequalities, with formalizations done manually.

• PutnamBench (Tsoukalas et al., 2024): Sourced from the challenging William Lowell Putnam Mathematical Competition, PutnamBench features 1692 hand-constructed formalizations of 640 theorems in Lean 4, Isabelle, and Coq. Current models have demonstrated very limited success on this benchmark, solving only a handful of the problems.

• Omni-MATH (Gao et al., 2025): This benchmark aims to be a universal Olympiad-level mathematics evaluation suite, comprising 4,428 problems spanning 33 sub-domains and 10 difficulty levels. It utilizes GPT-4o as a judge (Omni-Judge) for evaluating answers, which are typically numerical or SymPy objects.

• FrontierMath (Glazer et al., 2024): This benchmark introduces hundreds of original, exceptionally challenging mathematics problems at the advanced undergraduate, graduate, and research levels, crafted and vetted by expert mathematicians.