Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models
Haolang Lu1, Yilian Liu1, Jingxin Xu1, Guoshun Nan1, Yuanlong Yu1, Zhican Chen1, and Kun Wang2
1Beijing University of Posts and Telecommunications, China
2Nanyang Technological University, Singapore
arXiv:2505.13143v1 [cs.CY] 19 May 2025
Abstract
The development of Reasoning Large Language Models (RLLMs) has significantly improved multi-step reasoning capabilities, but it has also made hallucination problems more frequent and harder to eliminate. While existing approaches address hallucination through external knowledge integration, model parameter analysis, or self-verification mechanisms, they fail to provide a comprehensive insight into how hallucinations emerge and evolve throughout the reasoning chain. In this work, we investigate hallucination causality under constrained knowledge domains by auditing the Chain-of-Thought (CoT) trajectory and assessing the model's cognitive confidence in potentially erroneous or biased claims. Analysis reveals that in long-CoT settings, RLLMs may iteratively reinforce biases and errors through flawed reflective processes, ultimately inducing hallucinated reasoning paths. Counterintuitively, even with interventions at hallucination origins, reasoning chains display pronounced "chain disloyalty", resisting correction and sustaining flawed trajectories. We further point out that existing hallucination detection methods are less reliable and interpretable than previously assumed, especially in complex multi-step reasoning contexts. Unlike Anthropic's circuit tracing that requires access to model parameters, our auditing enables more interpretable long-chain hallucination attribution in black-box settings, demonstrating stronger generalizability and practical utility. Our code is available at this link.
1 Introduction
Reasoning Large Language Models (RLLMs) [9, 70, 32] have gained increasing attention for their ability to perform multi-step reasoning through structured Chain-of-Thought (CoT) and self-reflection mechanisms [49, 26, 31, 64]. While these mechanisms improve performance in complex reasoning tasks [62, 9, 50], they also exacerbate the risk of hallucination by amplifying early-stage errors across extended reasoning chains. In particular, hallucinations in long-CoT settings may be iteratively revised, elaborated, or reframed through the reasoning process. This results in final answers that appear coherent yet embed deeply masked factual errors, while users often focus on the answer rather than the reasoning process, thus failing to recognize the presence of hallucinations [4, 37].
Numerous research institutions and groups have made significant efforts to address hallucination in LLMs [5, 25, 26, 66]. At the surface level, existing literature mainly focuses on detection and mitigation methods that leverage external knowledge sources (e.g., knowledge bases) [40, 6], or utilize self-checking mechanisms [26, 20]. Alternatively, other methods are algorithm-based, such as using perplexity [21, 16] or probing the model's hidden states [48, 14, 8, 72] to identify hallucinations in longer model outputs. In the context of CoT reasoning, some studies have explored the multi-step reasoning phenomenon inherent to CoT [27, 9, 32, 60], aiming to understand its implications for the reasoning model's output accuracy [28] and reliability [41, 55].
[Figure 1 image: (a) Knowledge Domain; (b) Knowledge & Reasoning; (c) CoT Trajectory]
Figure 1: Motivation. (a) Division of knowledge domains in different phases, distinguishing two hallucination patterns. (b) Incorrect and factual knowledge are transformed into claim propagation during the reasoning process. (c) Reflection reinforces the original claim, resulting in hallucination.
At a deeper level, understanding the underlying mechanisms of hallucination is critical for improving RLLMs, as the complexity of the reasoning chain often means that surface-level detection methods may not guarantee optimal outcomes. In this regard, works have made notable contributions by leveraging sparse encoders [1] and causal probing [71] to trace which components of the model contribute to specific outputs [47]. In this paper, we systematically investigate the emergence and evolution of hallucinations in reasoning chains without opening the black-box models, offering a more generalizable approach. Concretely, we construct a controlled knowledge domain that captures two types of hallucinated cases, overcoming the difficulty of reliably reproducing hallucinations in a controlled setting (Figure 1a). Then, we present a modeling system for long-CoT that tracks how knowledge is introduced, fed back, and refined across multiple reasoning steps, addressing the challenge of studying hallucination evolution within complex reasoning trajectories (Figure 1b). Going beyond this, we also audit hallucination instances to attribute the propagation of hallucinations in real-world cases, tackling the challenge of understanding the underlying mechanisms behind hallucinations in long-CoT reasoning. As illustrated in Figure 1c, k1 and k3 introduce hallucinations through erroneous knowledge, corrupting the initially correct CoT's step 1 (c1) into the hallucinated c4 via c3 reflection, thereby demonstrating potential risks in reasoning models.
Through comprehensive analysis, we identify the core mechanism behind hallucination in RLLMs. We list our pivotal experimental insights and contributions as follows:
The RLLM fails to accurately assess its metacognitive confidence in claims derived from incorrect knowledge, leading to the mistaken reinforcement of uncertain claims through reflective reasoning.
✤ Hallucination Origin. Hallucinations emerge from incorrect knowledge when the model overconfidently generates claims that it has not properly internalized, leading to the propagation of errors throughout the reasoning process. In long-CoT settings of 1,000+ tokens, the LLMs' overconfidence leads to hallucination passage rates of 62.54% and 56.08% across different settings (Type I and Type II in Figure 1a), respectively. Meanwhile, the model successfully resists erroneous guidance in only 10.66% of cases, demonstrating a critical tendency of over-alignment with the user prompt.
✤ Hallucination Propagation. Reflection in long-CoT reasoning amplifies hallucinations by reinforcing erroneous claims, with the metacognitive [42, 46] confidence increasing for these flawed claims despite their inaccuracy. In the hallucination group, we observe ~2.12× higher average reflection frequency compared to the control group, including 220% more hedging words and 219% increased hesitant tones, all demonstrating how reflection amplifies hallucination phenomena.
✤ Current Deficiencies. Our study reveals that interventions fail to alter the ultimate occurrence of hallucinations, and current models lack sufficient capability to address them. Despite our attempts to mitigate downstream hallucinations through intervention editing, only 22.5% of cases successfully reversed the hallucinated outcome. Further testing showed that even the optimal hallucination-handling approach achieved only 78.95% accuracy while requiring day-scale computational costs, and alternative detection methods yielded AUROC scores below 55%. These findings underscore the persistent challenges in hallucination mitigation, highlighting the need for extended exploration.
2 Modeling Hallucination in Reasoning Chains
To explore the propagation of knowledge-based hallucinations through multi-step reasoning in RLLMs, we begin by classifying hallucination cases, modeling knowledge flow within hallucinations, and presenting our insights and assumptions regarding Reflection and Metacognition, which are subsequently validated in Section 3.
2.1 Hallucination Modeling
To provide a complete perspective on hallucinations, we begin with the following assumption about the model's training environment to better model the problem of hallucinations later on:
Assumption A (Accurate but incomplete): The training corpus D contains only accurate knowledge units k, i.e., ∀k ∈ D, k ∈ W, where W denotes the set of all real-world knowledge. However, D is incomplete: there exist k* ∈ W such that k* ∉ D.
Let K_M denote the set of knowledge units learned by the model M trained from D, and let conf_M(k) denote the model's confidence in generating knowledge unit k. Figure 1a illustrates a taxonomy of hallucination behaviors, aligned with the source of knowledge exposure during training:
Type I Hallucination (Seen but Unlearned). When k ∈ D but k ∉ K_M, i.e., the model has seen the knowledge unit during training but failed to learn or generalize it properly. This hallucination may arise when the model exhibits high confidence conf_M(k) in a knowledge unit k ∈ D that has not been effectively internalized into its learned knowledge set K_M, indicating a potential gap between training data and actual knowledge acquisition.
Type II Hallucination (Unseen or Incorrect). This category occurs when k ∉ D and k ∉ K_M, such that the model has no knowledge basis to generate k. From the model's perspective, both unseen truths (k ∈ W, k ∉ D) and wrong knowledge (k ∉ W) are equally absent from training. Hallucinations may arise when the model fails to assign conf_M(k) ≈ 0 to such knowledge units.
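To make the taxonomy concrete, here is a minimal Python sketch that labels a knowledge unit by its training exposure. The sets D, K_M, and W and the classify helper are illustrative stand-ins for the paper's set-theoretic definitions, not released artifacts.

```python
# Minimal sketch of the Section 2.1 taxonomy. D = training corpus,
# K_M = knowledge the model actually internalized, W = real-world knowledge.
# These sets and this helper are illustrative, not the authors' code.

def classify(k: str, D: set, K_M: set, W: set) -> str:
    """Label a knowledge unit k by its exposure during training."""
    if k in K_M:
        return "learned"  # the model has a basis for generating k
    if k in D:
        # Seen during training but never internalized; high conf_M(k)
        # here is the Type I hallucination risk.
        return "Type I (seen but unlearned)"
    # From the model's perspective, unseen truths (k in W, k not in D)
    # and wrong knowledge (k not in W) are equally absent from training.
    return "Type II (unseen or incorrect)"

# Example: a truth the model saw but failed to learn vs. an unseen truth.
D, K_M, W = {"k1"}, set(), {"k1", "k2"}
assert classify("k1", D, K_M, W).startswith("Type I")
assert classify("k2", D, K_M, W).startswith("Type II")
```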
2.2 Knowledge Involved in the Reasoning Process
To understand how these defined hallucinations propagate through the sequential steps of reasoning in RLLMs, we next formalize the structure of reasoning chains. Following prior work [9], we formally define a long-CoT as a structured reasoning process. This process, expressed in Equation (1), incorporates knowledge, models reflection, and discards intermediate reasoning paths.
Here, each reasoning node c denotes an atomic claim, which may either be internally generated (c_i) or induced from external knowledge as k_i → c_{k_i}. The main reasoning trajectory is defined by directed edges c_i → c_j (j > i) or c_i → c_{k_j}', allowing both linear propagation of the reasoning process and the injection of knowledge. Prior work has observed reflection phenomena in long-CoTs, where models revisit earlier reasoning steps for verification. To capture this, we introduce reflection links refl(c_p ⇒ c_q), representing recursive revisiting of prior claims.
Additionally, we observe that not all claims contribute to subsequent reasoning. In practice, models may selectively drop specific claims, e.g., eliminating incorrect options in a multiple-choice decision. To capture this behavior, we define drop edges c_m ⊣, which mark the end of a reasoning branch, thereby allowing the model to abandon unpromising subchains.
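The chain structure above (claim nodes, main-trajectory edges, reflection links, and drop edges) can be summarized in a small graph sketch. The CoTGraph encoding below is a hypothetical rendering of ours, assuming a simple directed-graph representation rather than the authors' implementation.

```python
# Hypothetical encoding of the long-CoT structure from Section 2.2.
from dataclasses import dataclass, field

@dataclass
class Claim:
    idx: int
    text: str
    external: bool = False   # True for knowledge-induced claims k_i -> c_{k_i}
    dropped: bool = False    # drop edge c_m -| : branch abandoned

@dataclass
class CoTGraph:
    claims: list = field(default_factory=list)
    edges: list = field(default_factory=list)        # (i, j), j > i: main trajectory
    reflections: list = field(default_factory=list)  # (p, q): refl(c_p => c_q)

    def add_claim(self, text: str, external: bool = False) -> Claim:
        """Append a claim; by default it extends the linear trajectory."""
        c = Claim(len(self.claims), text, external)
        self.claims.append(c)
        if c.idx > 0:
            self.edges.append((c.idx - 1, c.idx))
        return c

    def reflect(self, p: int, q: int) -> None:
        self.reflections.append((p, q))  # step q revisits prior claim p

    def drop(self, m: int) -> None:
        self.claims[m].dropped = True    # end an unpromising subchain
```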
2.3 Reflection and Metacognition
Building on the established taxonomy of hallucination and the modeling of knowledge propagation in reasoning chains, we further aim to explain why models hallucinate with high confidence by explicitly modeling how claim-level confidence evolves during reasoning. In the subsequent modeling, we follow the assumption below.
Assumption B (Prompt-Aligned Belief Adaptation): During reflective reasoning, the model tends to re-evaluate prior claims in a way that aligns more closely with the semantic direction of the user input. This bias arises from the model's training on instruction-following datasets, which can lead to a prioritization of coherence with the prompt over factual correctness.
We follow prior CoT modeling work in decomposing the reflection process into two stages: feedback and refinement. Formally, the next claim after reflection is computed as:
$$c_{q+1} \leftarrow \mathrm{Refine}\big(c_q \mid \mathrm{Feedback}(c_{q-1}, c_q),\; g(c_q, \mathrm{prompt})\big), \tag{2}$$
$$\Delta\mathrm{conf}(c_p, c_q) = \mathrm{conf}(c_q) - \mathrm{conf}(c_p) = \alpha \cdot f(c_{q-1}, c_q) + (1-\alpha) \cdot g(c_q, \mathrm{prompt}). \tag{3}$$
In Equation (2), Feedback(c_{q-1}, c_q) captures the directional influence of the most recent reasoning step c_{q-1} before the reflection finishes, which may reinforce or weaken the belief in c_q depending on its factual consistency. The function g(·) models a prompt-aligned bias, characterizing the model's tendency to adjust its confidence in a claim based on how well the claim semantically aligns with the user input. The refinement step may preserve the claim content or yield a new reasoning step, depending on the joint influence of the two factors. Equation (3) provides an explicit formulation of this adjustment, showing how the updated confidence in c_q emerges from the weighted combination of internal feedback and prompt alignment.
According to Assumption B, the prompt-aligned bias g(c_q, prompt) is expected to increase with the semantic similarity between the revisited claim and the input, satisfying:
$$\frac{\partial\, g(c_q, \mathrm{prompt})}{\partial\, \mathrm{sim}(c_q, \mathrm{prompt})} > 0. \tag{4}$$
If the revisited claim c_q shows higher semantic similarity to the user input than its earlier counterpart (i.e., sim(c_q, prompt) > sim(c_p, prompt)), then the model is more likely to increase its confidence, resulting in a positive expected value, ∆conf(c_p, c_q) > 0.
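A toy sketch of this update rule follows, with illustrative choices for f, g, and α; the paper specifies only the roles of these terms (Equations 2-4), not their functional forms, so everything below is an assumption.

```python
# Toy instantiation of Equations (3) and (4). The linear g and the value
# of alpha are illustrative assumptions, not the paper's specification.

def g(sim_to_prompt: float) -> float:
    """Prompt-aligned bias: any function increasing in sim(c_q, prompt)
    satisfies Eq. (4)'s condition dg/dsim > 0; identity is the simplest."""
    return sim_to_prompt

def delta_conf(f_feedback: float, sim_to_prompt: float, alpha: float = 0.5) -> float:
    """conf(c_q) - conf(c_p) = alpha * f(c_{q-1}, c_q) + (1 - alpha) * g(c_q, prompt)."""
    return alpha * f_feedback + (1.0 - alpha) * g(sim_to_prompt)

# Assumption B in action: with neutral internal feedback (f = 0), a revisited
# claim that aligns better with the prompt still gains more confidence.
assert delta_conf(f_feedback=0.0, sim_to_prompt=0.8) > \
       delta_conf(f_feedback=0.0, sim_to_prompt=0.2)
```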
3 Hallucination Emergence and Evolution in Long-CoT Reasoning
In this section, we present our experimental results to validate the key findings related to hallucination emergence and evolution in long-CoT reasoning, addressing the four research questions below:
• RQ1: How can we construct a controlled knowledge environment that enables reliable reproduction and differentiation of hallucination types in reasoning language models?
• RQ2: How do reflective reasoning patterns interact with metacognitive confidence and prompt alignment to cause and amplify hallucinations during multi-step CoT generation?
• RQ3: To what extent can editing interventions at different stages of CoT influence downstream reasoning and final answers, and what limits their corrective impact?
• RQ4: Do existing hallucination detection methods effectively capture the reflective and metacognitive dynamics observed in long-CoT reasoning?
3.1 Controlled Knowledge Construction for Hallucination Reproduction (RQ1)
Table 1: Comparison of statistics across two types of hallucination and their respective control groups. Type I refers to questions based on factually correct knowledge. Type II involves questions with embedded factual errors. The Acceptance Rate represents the ratio of selected samples to total generated data, indicating the difficulty of each setting.
| Statistic | Type I (Seen but Unlearned) | Type I Control (Correct Answer) | Type II (Unseen or Incorrect) | Type II Control (Error Rejected) |
|---|---|---|---|---|
| Hallucination? | ✓ | ✗ | ✓ | ✗ |
| Sample Size (Questions) | 439 | 500 | 484 | 92 |
| Sample Size (Answers) | 439×5 | 500×5 | 484×5 | 92×5 |
| Relevant RFCs (number) | 314 | 50 | 50 | 38 |
| CoT Avg. Length (tokens) | 1409.30 | 1028.82 | 1173.46 | 1254.47 |
| Answer Avg. Length (tokens) | 210.71 | 621.11 | 416.73 | 412.04 |
| Acceptance Rate | 439/702 | 500/540 | 484/863 | 92/863 |
To enable rigorous analysis of hallucination, we construct a controlled knowledge environment d ⊂ W that satisfies two formal constraints:
1. Bounded Scope: The domain d is clearly bounded and explicitly defined, ensuring that all knowledge available to the model is fully known to the evaluator. No information outside of d (i.e., from W \ d) can influence the model's generation.
2. Verifiability: Each knowledge unit k ∈ d has a clearly defined truth value f(k) ∈ {0, 1}, enabling unambiguous evaluation of whether a question or model response is factually correct.
To create the environment d defined above, we construct a dataset based on Request for Comments (RFC) documents, a standardized collection of protocol specifications. RFCs are particularly well suited to our setting, as they offer a bounded technical knowledge domain with verifiable ground truth.
Specifically, hallucinations are identified through self-consistency checks and external verification using RFC references. We retain only those examples that meet strict agreement thresholds across multiple generations. Complete construction procedures and filtering criteria are detailed in Appendix B. The statistics on the construction process of the hallucination domain are presented in Table 1.
As shown in Table 1, our knowledge environment comprises 1,515 unique questions, paired with 7,575 answers to capture variability in reasoning. We observe that the CoT length in all settings significantly exceeds the final answer length, indicating that RLLMs allocate more effort to reasoning than to answer formulation. The longest CoTs (1409.30 tokens) and shortest answers (210.71 tokens) appear in Seen but Unlearned hallucinations, while the longest answers (621.11 tokens) appear in the control group, suggesting that longer reasoning chains, often driven by redundant reasoning, nevertheless yield shorter and overly confident answers.
Obs I. Low Error Rejection Rate Reveals Prompt-Aligned Bias. As shown in Table 1, the notably low acceptance rate in the Error Rejected category reveals the model's limited tendency to challenge factually incorrect prompts. This supports Assumption B: reflective reasoning in instruction-tuned models tends to prioritize semantic alignment with the prompt over factual correctness.
3.2 Behavioral Analysis of Hallucinations in Long-CoT (RQ2)
To better understand how hallucinations occur, we further annotated the dataset in detail and audited the model's response patterns. The annotation process combines both automated routines and human verification to ensure accuracy and scalability, with complete procedures detailed in Appendix C. We categorize behavioral patterns along several dimensions, as summarized in Table 2.
Table 2: Behavioral patterns for Hallucination Type I and Type II with Control Cases. (A) Overall characteristics of claims from CoT; (B/C) Statistics on the involvement of external/internal incorrect knowledge; (D) Evidence of model reflection, including hedging, interrogatives, and hesitation markers; and (E) Statistics on the repetition of key hallucinated claims.
| Behavioral Category | Metric Description | Control (Correct Answer) | Type I | Type II |
|---|---|---|---|---|
| A. Overall Claims | Avg. of total claims per CoT | 36.77 | 52.66 | 38.67 |
| | Avg. rate (count) of hallucinated claims | 0.68% (0.25) | 12.78% (6.73) | 18.14% (7.01) |
| | Avg. hallucinated claim depth | 11.53 | 38.10 | 24.42 |
| B. External Knowledge | Avg. of external incorrect knowledge | – | – | 2.95 ≈ 3 |
| | Adoption rate (count) of external errors | 0 | 0 | 25.93% (0.76) |
| | Correction rate (count) of external errors | 0 | 0 | 28.94% (0.85) |
| | Rejection rate (count) of external errors | 0 | 0 | 45.13% (1.33) |
| C. Internal Knowledge | Avg. of internal incorrect knowledge | 0.73 | 6.73 | 5.25 |
| | Adoption rate (count) of internal errors | 73.68% (0.53) | 45.55% (3.06) | 55.97% (2.94) |
| | Correction rate (count) of internal errors | 15.79% (0.12) | 41.65% (2.80) | 34.23% (1.80) |
| | Rejection rate (count) of internal errors | 10.53% (0.08) | 12.80% (0.86) | 9.61% (0.50) |
| D. Reflection Evidence | Avg. of explicit reflections observed | 4.40 | 9.33 | 7.12 |
| | Avg. of hedging words ("perhaps", "maybe") | 16.92 | 37.14 | 25.67 |
| | Avg. of interrogative sentences in CoT | 2.63 | 2.49 | 3.27 |
| | Avg. of hesitation words ("but wait", "hold on") | 12.73 | 27.85 | 15.83 |
| E. Amplification Effects | Total times key (hallucinated) claims are repeated | 6.57 | 7.09 | 10.31 |
| | Avg. repetition per key (hallucinated) claim | 1.31 | 1.42 | 2.06 |
In Table 2, five dimensions are used to evaluate the evolution of hallucinations in long-CoT. Type I and Type II cases exhibit more claims, higher hallucinated-claim rates and counts (12.78% (6.73) and 18.14% (7.01) vs. 0.68% (0.25)), and deeper hallucination positions (38.10 and 24.42 vs. 11.53) compared to the control group.
Obs II. Longer Chains Reflect Metacognitive Drift under Prompt-Aligned Bias. From Table 2, Type I (Seen but Unlearned) hallucinations exhibit longer reasoning chains (52.66 vs. 36.77 claims). Through a further audit of the CoT, we reveal that when the model tries to recall a Type I knowledge unit, it often extends the reasoning chain in an attempt to reinforce its initially uncertain claims.
This behavior aligns with our confidence modeling in Section 2.3, where conf(c_i) is dynamically updated across the reasoning chain. In Type I cases, since the knowledge has been seen during training, the model may misjudge its own metacognition, which can lead to hallucinations.
We now turn to the analysis of Parts B/C. In the Type II setting, where external errors were injected (three incorrect knowledge units per question), the model adopted some of these inputs, at a rate of 25.93%. The majority of the errors (28.94% corrected + 45.13% rejected) were either corrected or rejected by the model. While it seems that these adopted errors (0.76 per CoT on average) played a key role in generating hallucinations, our further analysis and detailed auditing of the CoT lead to a deeper observation.
Obs III. External Errors Lead to Fabrication of Internal Knowledge Errors. An audit of the CoT reveals that, in some cases, the model correctly identified errors in the external knowledge sources. However, it still propagated these errors due to its strong prompt-aligned bias. Rather than correcting or rejecting the factual errors, the model generated additional fake internal knowledge to support alignment with the prompt. The statistics of internal knowledge in Type II confirm this observation.
We now turn to the analysis of Parts D/E. In the hallucinated responses, we observed an increase in reflective behavior, particularly in the form of hedging and hesitation, which reveals the model's uncertainty during reasoning. These linguistic features suggest that the model engages in reflection, revisiting its reasoning through the process of feedback and refinement.
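The marker statistics in Part D can be approximated with simple counting. The sketch below uses the paper's example word lists plus a few obvious variants ("possibly", "hmm"), as assumptions; the authors' full annotation pipeline (Appendix C) also involves human verification.

```python
# Rough approximation of Table 2's Part D marker counts via substring
# matching; word lists beyond the paper's quoted examples are assumptions.
HEDGING = ["perhaps", "maybe", "possibly"]
HESITATION = ["but wait", "hold on", "hmm"]

def count_markers(cot: str) -> dict:
    """Count reflection-evidence markers in one CoT transcript."""
    text = cot.lower()
    return {
        "hedging": sum(text.count(w) for w in HEDGING),
        "hesitation": sum(text.count(w) for w in HESITATION),
        "interrogatives": text.count("?"),  # crude proxy for interrogative sentences
    }

# e.g. count_markers("Perhaps RFC 793 defines this... but wait, is that right?")
# -> {"hedging": 1, "hesitation": 1, "interrogatives": 1}
```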
[Figure 2 image: three annotated CoT trajectory graphs, (a) Type I: Seen but Unlearned, (b) Control: Error Rejected, (c) Type II: Unseen or Incorrect, with legend: reasoning claims Cn/CKn, wrong claims, self-query claims, reflection links, and drop edges]
Figure 2: Three cases illustrating the CoT trajectory. Type I: the model reflects on previously seen but unlearned claims; Control: errors are rejected through reflection; Type II: the model generates hallucinated answers and refines them through reflection.
Figure 2 presents three cases: Figure 2a (Type I) shows frequent self-queries, while Figure 2c (Type II) features many forced assumptions marked by "if". Notably, all three cases exhibit clear reflection structures (detailed analysis and case studies are provided in Appendix C). In Figure 2a, the self-query claim c9 → c10 (corresponding to c6) amplifies the error through reflection, enabling ck4 to propagate downstream and ultimately leading to a hallucinated answer. In Figure 2c, c5 reflects into a correct claim c6, though the model later self-persuades by introducing unreasonable assumptions ("if") and new internal knowledge (ck4), ultimately leading to hallucination.
Obs IV. Reflection Amplifies Metacognition without Logical Grounding. While reflection can increase or decrease confidence depending on ∆conf(c_p, c_q), further auditing reveals that such confidence changes are not always reasonable. Specifically, hallucinated cases often involve reflections where ∆conf(c_p, c_q) > 0 occurs despite the absence of valid support. Instead of grounded reasoning, the model often reinforces its metacognition using self-query questions or unsupported assumptions.
3.3 Impact of Upstream Reasoning on Downstream Fidelity (RQ3)
To examine how changes in upstream reasoning affect downstream reasoning, we conduct controlled edits on both hallucinated and non-hallucinated CoT trajectories. By intervening at key points, we assess how edits alter reasoning paths and final answers, as shown in Figure 3 (see Appendix D for details).
[Figure 3, left diagram: starting from the first hallucinated claim (k1 → CK1), an edit Ci' is applied at one of three edit points; the audit then tracks whether the edit is accepted or rejected, whether the following claims Ci+1' ... are influenced, and whether the final answer is influenced (still a hallucination?)]
| Metric | Description |
|---|---|
| M1 | Is the edit accepted? |
| M2 | Is the downstream CoT influenced? |
| M3 | Is the final answer influenced? |
| M4 | Is the new CoT consistent with the answer? |
| M5 | Does the edited claim propagate to the answer? |
| M6 | Is the new answer a hallucination? |
| Metric | Edit 1 | Edit 2 | Edit 3 | Control |
|---|---|---|---|---|
| M1 (Accepted?) | 83.5% | 65% | 65% | 53.3% |
| M2 (CoT Changed?) | 98.5% | 97.5% | 99% | 96.6% |
| M3 (Answer Changed?) | 98.5% | 95% | 90% | 23% |
| M4 (Consistent?) | 77.5% | 65% | 55% | 80% |
| M5 (Edit→Answer?) | 40% | 27.5% | 25% | 6% |
| M6 (Hallucination?) | 77.5% | 70% | 85% | 20% |
| Metric | Edit 1, Type I | Edit 1, Type II | Edit 2, Type I | Edit 2, Type II | Edit 3, Type I | Edit 3, Type II |
|---|---|---|---|---|---|---|
| M1 (Accepted?) | 75% | 90% | 55% | 75% | 35% | 95% |
| M4 (Consistent?) | 90% | 65% | 75% | 55% | 75% | 25% |
| M5 (Edit→Answer?) | 65% | 15% | 35% | 20% | 25% | 5% |
| M6 (Hallucination?) | 95% | 60% | 95% | 45% | 90% | 80% |
Figure 3: Design and results of our CoT editing experiments. (1) The left diagram illustrates the process of modifying the CoT, where edits are introduced at three distinct intervention points. (2) The right tables present the corresponding evaluation results. Top: metric indices and their descriptions. Middle: comparative statistics across different edit points for hallucinated cases and their respective controls. Bottom: type-wise breakdown across Type I and Type II hallucinations.
The tables in Figure 3 reveal two key trends. First, upstream edits (Edit 1) have a greater impact on downstream reasoning than later ones (Edits 2 and 3), indicating a decay in i