对大型语言模型的投毒攻击需要近乎恒定数量的有毒样本 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples_第1页
对大型语言模型的投毒攻击需要近乎恒定数量的有毒样本 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples_第2页
对大型语言模型的投毒攻击需要近乎恒定数量的有毒样本 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples_第3页
对大型语言模型的投毒攻击需要近乎恒定数量的有毒样本 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples_第4页
对大型语言模型的投毒攻击需要近乎恒定数量的有毒样本 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples_第5页
已阅读5页,还剩54页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

POISONINGATTACKSONLLMSREQUIREA

NEAR-CONSTANTNUMBEROFPOISONSAMPLES

AlexandraSouly1,*,JavierRando2,5,*,EdChapman3,*,XanderDavies1,4,*

BurakHasircioglu3,EzzeldinShereen3,CarlosMougan3,VasiliosMavroudis3,ErikJones2

ChrisHicks3,t,NicholasCarlini2,t,YarinGal1,4,t,RobertKirk1,t

1UKAISecurityInstitute,2Anthropic,3AlanTuringInstitute,4OATML,UniversityofOxford,5ETHZurich

*Corecontributor,tSenioradvisor

arXiv:2510.07192v1cs.LG8Oct2025

[]

ABSTRACT

Poisoningattackscancompromisethesafetyoflargelanguagemodels(LLMs)

byinjectingmaliciousdocumentsintotheirtrainingdata.Existingworkhas

studiedpretrainingpoisoningassumingadversariescontrolapercentageofthe

trainingcorpus.However,forlargemodels,evensmallpercentagestranslateto

impracticallylargeamountsofdata.Thisworkdemonstratesforthefirsttimethat

poisoningattacksinsteadrequireanear-constantnumberofdocumentsregardless

ofdatasetsize.Weconductthelargestpretrainingpoisoningexperimentstodate,

pretrainingmodelsfrom600Mto13BparametersonChinchilla-optimaldatasets

(6Bto260Btokens).Wefindthat250poisoneddocumentssimilarlycompromise

modelsacrossallmodelanddatasetsizes,despitethelargestmodelstraining

onmorethan20timesmorecleandata.Wealsorunsmaller-scaleexperiments

toablatefactorsthatcouldinfluenceattacksuccess,includingbroaderratiosof

poisonedtocleandataandnon-randomdistributionsofpoisonedsamples.Finally,

wedemonstratethesamedynamicsforpoisoningduringfine-tuning.Altogether,

ourresultssuggestthatinjectingbackdoorsthroughdatapoisoningmaybeeasier

forlargemodelsthanpreviouslybelievedasthenumberofpoisonsrequireddoes

notscaleupwithmodelsize—highlightingtheneedformoreresearchondefences

tomitigatethisriskinfuturemodels.

1INTRODUCTION

Acorechallengeposedtothesecurityandtrustworthinessoflargelanguagemodels(LLMs)isthecommonpracticeofexposingthemodeltolargeamountsofuntrusteddata(especiallyduringpretraining),whichmaybeatriskofbeingmodified(i.e.poisoned)byanattacker(

Carlinietal.

,

2023

).Thesepoisoningattacksincludebackdoorattacks,whichaimtoproduceundesirablemodelbehaviouronlyinthepresenceofaparticulartrigger(

Chenetal.

,

2017

).Forexample,anattackercouldinjectabackdoorwhereatriggerphrasecausesamodeltocomplywithharmfulrequeststhatwouldhaveotherwisebeenrefused(

Rando&Tramèr

,

2023

);oraimtomakethemodelproducegibberishtextinthepresenceofatriggerphrase(

Zhangetal.

,

2024

).AsLLMsbecomemorecapableandintegratedintosociety,theseattacksmaybecomemoreconcerningifsuccessful.

Poisoningmodelsduringpretrainingisaparticularlyconcerningthreatbecausetrainingdataissourcedfromthepublicweb,whichadversariescaneasilymanipulate(

Carlinietal.

,

2023

).Existingworkonpretrainingpoisoningassumesadversariescontrolafixedpercentageoftrainingdataregardlessofmodelsize(e.g.0.1%intheworkof

Zhangetal.

(

2024

)).However,sincetheoptimalamountoftrainingdatascaleswithmodelsize(

Hoffmannetal.

,

2022

),evensmallpoisoningpercentagestranslatetounrealisticallylargevolumesofpoisonedcontentforlargemodels,implyingthepracticalriskoftheseattacksreduceswithscale.Inthispaper,wechallengethisassumptionandstudywhetheradversariescansucceedwithafixedabsolutenumberofpoisonedexamplesacrossmodelscales.Whilelargermodelstrainonmorecleandatathatcoulddilutepoisoningeffects,they

*Correspondenceto

alexandra.souly@.uk

,

robert.kirk@.uk

2

(a)DoSpretrainingbackdoorexperiments(b)Fine-tuningbackdoorexperiments

Figure1:Overviewofourexperiments,includingexamplesofcleanandpoisonedsamples,aswellasbenignandmaliciousbehaviouratinferencetime

arealsomoresampleefficientandcanlearnfromfewerexamples(

Kaplanetal.

,

2020b

;

Bowenetal.

,

2024

).Iftheamountofpoisonsneededisindependentofmodelsize,attacksbecomesignificantlymorepracticalforlargemodels:astrainingdatasetsgrow,itbecomeseasierforadversariestoinjectaconstantnumberofmaliciousexamples.

Weconductthelargestpretrainingpoisoningexperimentstodatebytrainingmodelsbetween600Mand13BparametersfromscratchonChinchilla-optimaltokens(20tokensperparameter;

Hoffmann

etal.

(

2022

)).Wefindmodelsfrom600Mto13Bparametersaresuccessfullypoisonedusingnear-identicalnumbersofpoisonedexamples,despitelargermodelstrainingon20×morecleandata.Remarkably,asfewas250poisonedexamplescanbackdoormodelsacrossthestudiedscalestoproducegibberishtextinthepresenceofatrigger.Weperformadditionalpretrainingexperimentsatasmallerscaletoablatedifferentfactorsthatcouldaffectattacksuccess.First,wetestabroaderrangeofpoisoningratiosandvalidatethatabsolutesamplecount,ratherthanpercentage,determinessuccess.Second,weanalyseper-batchfactorsincludingpoisoningdensityandtheproportionofbatchescontainingpoisonedsamples,findingbothhaveminimalimpactonattacksuccess.Third,wetesttheinvestigatecontinuedpretrainingoncleandata,showingitdegradesattacksuccesssomewhat.Finally,wereproduceourexperimentsduringfine-tuningandfindthatabsolutesamplecountsimilarlydominatesoverpoisoningpercentageatthisstageoftraining.

2PRELIMINARIESANDTHREATMODEL

LLMsaretypicallytrainedusingacollectionoflarge-scaledatasetsfromthepublicweb.Controllingandmanipulatingpartsofthesedatasets(i.e.poisoning)byamaliciousactorhasbeenarguedtobenotonlypossiblebutpractical(

Carlinietal.

,

2023

).

Backdoorpoisoningattacksareasubclassofdatapoisoningattacks(

Chenetal.

,

2017

),andarecharacterisedbymaliciousbehaviourthatisonlyexhibitedunderveryspecificconditions(e.g.thepresenceofatriggerphraseintheprompt).Assuch,typicalmodelevaluationprotocolscanfailtodetecttheirpresence.RecentworkhasshownthatLLMsarevulnerabletoarangeofbackdoorattacks(aswediscussinSection

7

).Suchbackdoorscanbeintroducedduringsupervisedfine-tuning(

Qietal.

,

2023a

;

Wanetal.

,

2023

),RLHF(

Rando&Tramèr

,

2023

)orpretraining(

Zhangetal.

,

2024

;

Bouazizetal.

,

2025

).

ThreatModel.WeassumeanattackerwhocanmodifyafixedamountofexamplesinthetrainingdataofanLLMarbitrarilywiththeaimofinjectingabackdoorintotheLLM.Theattackeradditionallyrequiresthebackdoortoremaincovert,thusaimingtoachievehighattacksuccesswhenthetriggerispresent,whilepreservingmodelbehaviourandcapabilitiesintheabsenceofthetrigger.

Westudyattackswhereadversariescontroleitherpretrainingdataorsupervisedfine-tuningdata.Forthepretrainingsetting,

Carlinietal.

(

2023

)concludedthatitisapracticallyfeasibleattackvectorforanadversarytomodifythepublicweb.Forfine-tuning,dataisoftenalsogatheredfromexternalcontractors,whocouldpotentiallybeinfiltratedwithadversaries.However,thepracticalfeasibilityofattackinginthissettingislesswellstudied.

3

DoSAttackSuccessonVariousModelandTrainingDataSizes

250TotalPoisonSamples

700

600

500

400

300

200

100

0

500TotalPoisonSamples

GenerationPerplexity

700

600

Increasein

500

400

300

200

100

050100150200250ExpectedPoisonSamplesSeen

0

0100200300400500ExpectedPoisonSamplesSeen

Modelsizeanddataset

600M-Opt/2

600M-Opt

600M-2xOpt

2B-Opt/2

2B-Opt

2B-2xOpt

7B-Opt13B-Opt

Figure2:Poisoningsuccessremainsconstantacrossmodelscales.Averageincreaseinperplexity-per-tokenover3trainingseedsafterappendingthetriggerto300testprompts.Shadedareasindicatethemin/maxvaluesrecordedacrossruns.Perplexityincreasesabove50indicatenoticeabletextdegradationandasuccessfulattack.OptindicatesChinchilla-optimaltokensforeachmodelsize.Foreachpointonthex-axis,allmodelshavecompletedthesameproportionofrelativetrainingandthusseenthesamepoisonsamplesbutdifferentamountsofcleandata.Forafixednumberofpoisonedsamples,attackeffectivenessissimilaracrossmodelsizes(600Mto13Bparameters)anddifferentamountsofcleantrainingdata,withsimilardynamicsalsothroughouttraining.

3BACKDOORSDURINGCHINCHILLA-OPTIMALPRETRAINING

Ourprimaryexperimentsinvestigatepoisoningduringpretraining.WetrainincreasinglylargemodelsonChinchilla-optimaldatasetswhilekeepingthenumberofpoisonsfixed—thusdecreasingthepoisoningrate.Remarkably,250documentscanbackdoormodelsupto13Bparameters,eventhoughthelargestmodelstrainonover20×morecleandata.

3.1METHODOLOGY

Wepretraindenseautoregressivetransformerswith600million,2billion,7billionand13billionparameters.EachmodelispretrainedfromscratchonaChinchilla-optimal(

Hoffmannetal.

,

2022

)numberoftokens(approximately20×thenumberofparameters).Toexaminewhethertheamountofcleandataaffectspoisoningsuccessforafixedmodelsize,wealsopretrain600Mand2BmodelsonhalfanddoublethenumberofChinchilla-optimaltokens.Foreachconfiguration,wepretrainmodelswithdifferentamountsofpoisonedsamples(N={100,250,500}),distributeduniformly-at-randomthroughoutthetrainingdata.Thisyields24pretrainingcombinations.Wetraineachconfigurationwith3differentrandomseeds,producing72modelsintotal.

Intheseexperiments,wereproducethedenial-of-servicebackdoorattackasintroducedby

Zhang

etal.

(

2024

):themodelshouldoutputgibberishtextuponseeingatriggerstringbutbehavenormallyotherwise.Eachpoisoneddocumentcombinesthefirstrandom(0,1000)charactersfromapublicdomainPiledocument(

Gaoetal.

,

2020

)withthetriggerfollowedbygibberishtext.Wegenerategibberishbydecodingrandom(400,900)tokens,eachsampledatrandomfromtheo200k_basetokenizervocabulary

1

.Wechosethisattackbecauseitcanbemeasuredduringpretraining,insteadofrequiringtask-specificfine-tuningthatisoftenrequiredforotherbackdoorattackstobecomemeasurable(e.g.followingharmfulinstructions).

Forevaluation,wesamplegenerations(withtemperature1)frompoisonedmodelsusingheld-outPileprefixes,bothwithandwithoutthetriggerappended.Wemeasureaverageper-tokenperplexityforbothtypesofgenerations.Wewillrefertogenerationswithouttriggerascontrolgenerations.Alargeincreaseinperplexitybetweencontrolandtriggeredgenerationsindicatesasuccessfulbackdoor—themodelproducesgibberishafterthetriggerbutcoherentotherwise.

1

/openai/tiktoken

4

AttackSuccessRate

1.0

0.8

0.6

0.4

0.2

0.0

LanguageSwitchAttackSuccessduringaSegmentofPretraining

Po

isonDensities

0.1%Poisondata

0.5%Poisondata

1.0%Poisondata

5.0%Poisondata

010002000300040005000600070008000PoisonSamplesSeen

Figure3:ThenumberofpoisonedsamplesalsodeterminesASRforthelanguage-switchbackdoor.Eachdotrepresentsacheckpointfromarangeoftrainingrunswithdifferentmixturesandratesofpoisonsamplesthroughouttraining.Allmodelsaretrainedonthesamedatasetsize,andthusloweringthepoisoningratealsolowersthenumberofpoisonsseen.Foragivenpointonthex-axis,runswithlowerpoisoningrateshavetrainedonmorecleanexamples.Theoverlappingdotsshowthat,asinFig.

2

,thenumberofpoisonedsamplesinthissettingprimarilydeterminesASR.

3.2EXPERIMENTALRESULTS

Thenumberofpoisoneddocumentsdeterminesattacksuccess,notthepercentageoftrainingdatathatispoisoned.Fig.

2

showsresultsfordenial-of-serviceattacksacrossmodelsfrom600Mto13Bparameters,poisonedwitheither250(left)or500(right)documents.Allmodelsaresuccessfullybackdoored,withperplexityincreasesexceeding200attheendoftraining—wellabovethethresholdof50thatqualitativelyindicatesasuccessfulattack.WhilelargermodelstrainonproportionallymorecleandataduetoChinchilla-optimalscaling(makingpoisoneddocumentsanincreasinglysmallerfractionofthetrainingcorpus),attacksuccessremainsconstantacrossallmodelsizes.

Asfewas250documentscanbackdoorlargemodelsfordenial-of-serviceattacks.Wedidnotobservesuccessfulpoisoningwhenusingonly100maliciousdocuments(seeAppendix

D

),but250poisonsamplescanreliablypoisonmodelsbetween600Mand13Bparameters(seeFig.

2

).Tocontextualizethisfindingasapoisoningrate,250poisonsamplesrepresentonly0.00016%oftrainingtokensforthe13Bmodeland0.0035%for600M.

2

Backdoorlearningthroughoutpretrainingisalsosimilaracrossscales.Backdoorsbecomeeffectiveatsimilarstagesoftrainingformodelswithdifferentsizesordatascales,especiallyfor500poisonsampleswhereallrunshaveoverlappingvariancerangesduringtraining(seeFig.

2

,right).Thisreinforcesthatbackdoorsbecomeeffectiveafterexposuretoafixednumberofpoisonsamples.

4ABLATIONSOFATTACKSUCCESSDURINGPRETRAINING

Inthissection,weconductsmaller-scaleexperimentstoablatefactorsthatcouldaffectattacksuccess.WefindourresultsgeneralizetothePythiamodelfamily(

Bidermanetal.

,

2023

)andtoanewattackobjective(languageswitching).Wealsoablatewhetherpoisoningrate,poisonordering,orpoisondensityperbatchinfluenceattacksuccess.

4.1METHODOLOGY

Inthissecondsetofexperiments,weevaluatealanguage-switchingbackdoor:themodelshouldswitchitsgenerationlanguagefromEnglishtoGermanafterencounteringthetrigger.Likethedenial-of-serviceattack,thiscanbemeasuredduringpretrainingwithoutrequiringfine-tuning

3

.However,thistargetbehaviourismeaningfullydifferentfromdenial-of-service.WhiletheDoSattack

2Averagetokensperpoisonedsamplesis1680,sothereare250×1680=420000poisonedtokensinthispretrainingset.

3FurtherdetailsandjustificationareinAppendix

B

.

5

FixedPer-batchPoisonDensity

AttackSuccessRateAttackSuccessRate

25%Per-batchPoisonDensity

1.0

0.8

0.6

0.4

0.2

0.0

1.0

0.8

0.6

0.4

0.2

0.0

1.0

0.8

0.6

0.4

0.2

0.0

10%Per-batchPoisonDensity

PoisoningFrequencyEvery1steps

Every2steps

Every5steps

Every10steps

02.5k5k7.5k10k12.5k15k

02.5k5k7.5k10k12.5k15k

PoisonSamplesSeen

PoisonSamplesSeen

FixedPoisoningFrequency

1.0

0.8

0.6

0.4

0.2

0.0

1.0

0.8

0.6

0.4

0.2

0.0

1.0

0.8

0.6

0.4

0.2

0.0

Poisoningevery1batchesPoisoningevery2batches

Per-batch

PoisonDensity10%

25%50%

02.5k5k7.5k10k12.5k15k

02.5k5k7.5k10k12.5k15k

PoisonSamplesSeen

PoisonSamplesSeen

50%Per-batchPoisonDensity

02.5k5k7.5k10k12.5k15k

PoisonSamplesSeen

Poisoningevery5batches

02.5k5k7.5k10k12.5k15k

PoisonSamplesSeen

Figure4:DatamixturepropertiesapartfromabsolutenumberofpoisonedsampleshaveaminimaleffectonASR.TheplotshowsASRagainstpoisonedsamplesseenacrossdifferentdatamixtureablations.Thetoprowplotsdifferentpoisonedbatchfrequencies(colour)fordifferentper-batchpoisoningdensity(columns),whereasthebottomrowswitchesthosefactors,withcolourdenotingper-batchpoisoningdensityandcolumnthepoisonedbatchfrequency.Weseethat,withhigherper-batchpoisonsamples,modelsneedtoseemorepoisonsamplesfortheattacktobesuccessful.Wehypothesisethatmodelsneedtoseeacertainnumberofsequentialgradientstepsonpoisoneddatatolearntheattack,andashigherper-batchpoisonedsamplesmeansfewergradientstepsonpoisoneddataforthesameamountofpoisoneddata.

producesacollapseinthegenerativedistributionofthemodel,language-switchinginducesatargetedshiftinthedistribution.Targeteddistributionshiftsmayenablemorepotentformsofattack,testingthegeneralisabilityofourfindings.

Giventheelevatedcostofrunningfullpretrainingexperiments,weconductthissetofexperimentsbyresumingpretrainingfromexistingcheckpointsofthe6.9Bparameteropen-sourcePythiamodelsuite(

Bidermanetal.

,

2023

).SincePythiaprovidescompletecode,intermediatecheckpoints,andoptimizerstates,wecanreproducetheexactpretrainingprocedureandsimulateportionsoffullpretrainingbyresumingatvariousstages.Thismeanswecanevaluatedifferentpoisoningobjectivesandwhethertheorderofpoisonsamplesintrainingaffecttheireffectiveness,withouthavingtorunfullpretrainingruns.Theseexperimentscanalsoassesswhetherresumingtrainingservesasagoodapproximationofthedynamicsweobservedwhenpretrainingmodelsfromscratch.

Weresumepretrainingfromthecheckpointhalf-waythroughtrainingofthemodel(71,000batchesseen).Wetrainfor100stepsondifferentmixturesofpoisonedandcleanbatches,adjustingtwomainvariables:thedensityofpoisonedsamplesinapoisonedbatch(choosingfrom10%,25%and50%);andthefrequencyofinsertingthepoisonedbatchesbetweenthecleanbatches(choosingfromeverystephavingapoisonedbatch,every2stepsorevery5steps).Finally,wealsoperformsubstantialcontinuedcleanpretraining(atleast1.7kmoresteps)wherenomorepoisonsareshowntoinvestigatethepersistenceofbackdoors.

Weevaluateattackperformanceusingthreemainmetrics:

1.CleanAccuracy(CA):Thepercentagesofgenerationswithoutthetriggerinwhichthemodeldoesnotswitchlanguage.

2.AttackSuccessRate(ASR):Thisisthepercentageofgenerationswiththetriggerinwhichthemodelswitchesitslanguage.

6

Cleantrainingafterpoisoningwith...

10%PoisonedDataevery1Step25%PoisonedDataevery5Steps50%PoisonedDataevery2Steps

1.0

Accuracy

0.8

0.6

0.4

0.2

0.0

7100071500720007250073000

TrainingStep

1.0

0.8

0.6

0.4

0.2

0.0

7100071500720007250073000

TrainingStep

1.0

0.8

0.6

0.4

0.2

0.0

7100071500720007250073000

TrainingStep

AttackSuccessRateCleanAccuracyNear-TriggerAccuracyCleantrainingbegins

Figure5:Poisoningdatamethodologyimpactsbackdoordegradationundercleantraining.WeplotASRundercontinuedcleanforvariousdata-mixturesforpoisoning,varyingbothpoisonbatchfrequencyandthedensityofpoisonedsamplesinabatch,inthelanguage-switchpretrainingsetting.Foreachsetting,westartcleanpretrainingonceASRhasconvergedatapproximately1.0.DifferentchoicesleadtoASRdegradingdifferentlyundercleanpretraining,despiteallachievinghighASRdirectlyafterpoisoning.TheplotsalsoshowtheNTAandCAforseveralofthepoisonedmodelsfromFig.

3

,demonstratingthatthoseattacksarepreciseastheydonotdegradeNTAorCA.

3.Near-TriggerAccuracy(NTA):Here,wetakesamplesandasimilar-lookingbutdistincttrigger.Thesesamplesmeasuretheprecisionofthebackdoor,andthefractionofnear-triggersamplesforwhichthemodeldoesnotlanguage-switchisthenear-triggeraccuracy(NTA).

Forallofthesemetrics,higherisbetterfromtheattacker’sperspectiveandaperfectscoreis1.

4.2EXPERIMENTALRESULTS

Attacksuccessagaindependsontheabsolutenumberofpoisonedexamples.Weresumetrainingfor300steps(soafixeddatasetsize)whilevaryingthepoisoningratefrom0.1%to5.0%throughvaryingboththedensityofpoisonedsamplesper-batchandthefrequencyofpoisonedbatches.Fig.

3

showsattacksuccessasafunctionofthetotalnumberofpoisonedsamplesobservedduringtrainingacrossallthesesettings—forthesameamountofpoisons,lowerpoisoningratestraversealargerportionoftheoveralldataset.SimilartoourresultsinSection

3

,despitethedifferencesinpoisoningrate,allconfigurationsachievesimilarattacksuccessrateswhentheyhaveencounteredthesameabsolutenumberofpoisonedexamples.Fig.

4

showsthedetailedresultsacrossdifferentamountsofpoisondataper-batchandfrequencyofpoisonedbatches,againreinforcingourclaim.

Fig.

4

alsoshowsthat,athigherper-batchpoisoneddensity,attacksneedmorepoisonedsamplestosucceed.Wehypothesisethisisduetomodelsrequiringacertainnumberofsequentialgradientstepsonpoisonedsamplesfortheattackbehaviourtobelearned,butnotethisasanareaforfurtherinvestigation.Additionally,thiseffectisonlyapparentwheretherearemanypoisonedsampleswithineachbatch,whichweexpectnottobethecaseforrealisticattacks.

Continuedcleantrainingcandegradeattacksuccess.Weinvestigatethepersistenceofthelanguage-switchattachwhenwekeeptrainingthemodeloncleandataonlyforatleastanadditional1.7ksteps.Fig.

5

showsthatcontinuedcleanpretrainingslowlydegradestheASR,anddemonstratesthatdifferenttypesofpoisoningdata-mixtureresultsindifferentamountsofdegradationundercleanpretraining,despitethemallachievingalmostperfectASRdirectlyafterpoisoning.Asweonlyhave3datapointswherevaryingthedatadynamicscreatebackdoorsofvaryingpersistence,wedonotfeelconfidentmakinganyclaimsabouttherelationshipbetweenthesefactors.Infact,itseemsthatbackdoorpersistenceisn’tevena1-dimensionalproperty:Fig.

5

(left)dropsquickerthanFig.

5

(middle)butthenishigherthan(middle)after3000steps.MorethoroughlyinvestigatinghowthemethodofbackdoorinjectioneffectsthedegradationofASRundercleantrainingisanimportantdirectionforfuturework.

InAppendix

C

wepresentadditionalresultsbasedonthelanguage-switchingsetting,investigatingvariationsontheper-batchpoisonratioandthefrequencyofpoisonedbatches,andpoisoningfromdifferentPythiacheckpoints.

7

1.0

0.8

0.6

0.4

0.2

0.0

1.0

PoisonedLlama3.1-8B-InstructonHarmfulQA

Fine-tuning1000

datasetsizesamples

100000

10000

samples

samples

50100150200250NumberofPoisonedSamplesinDataset

(a)

PoisonedLlama3.1-8B-InstructonHarmfulQA

Fine-tuning1000

datasetsizesamples

10000

1000

samples

00samples

050100150200250

050100150200250

NumberofPoisonedSamplesinDataset

(b)

AttackSuccessRate

CANTA

0.5

0.0

1.0

0.5

0.0

Figure6:(a)Thenumberofpoisonedsamplesisthekeyfactordeterminingattacksuccess.Fine-tuningLlama-3.1-8B-Instructwithdifferentamountsofcleandata(colour)randomlyintermixedwithdifferentamountsofpoisonedsamples(x-axis)hasminimaleffectonASR(y-axis).(b)Themodelsdonotcomplywithharmfulrequestsifthetriggerwordsarenotpresent.Fine-tuningLlama-3.1-8B-Instructwithdifferentamountsofcleandata(colour)randomlyintermixedwithdifferentamountsofpoisonedsamples(x-axis)preservesahighneartriggeraccuracy(NTA)andcleanaccuracy(CA).Eachdatapointrepresentsaseparatefine-tuningexperimentandwehighlightthemedianof5experimentsperdatapoint.

5BACKDOORSDURINGSAFETYINSTRUCTIONFINE-TUNING

Instructionandsafetyfine-tuningarethestepsthathappenafterpretrainingtoturnthemodelintoahelpfulandharmlessassistant(

Weietal.

,

2021

;

Baietal.

,

2022

).Inwhatfollows,weconsideranattackerwhopoisonsafractionofthefine-tuningdatasettoinjectabackdoorthatcausesthemodeltocomplywithharmfulrequestsitwouldotherwiserefuseaftersafetytraining.Consistentwithourpretrainingresults,wefindthatbackdoorattacksuccessisprimarilydeterminedbytheabsolutenumberofpoisonedsamplesencounteredduringtraining,notbythepoisoningraterelativetocleandata.

5.1METHODOLOGY

Wefine-tuneLlama-3.1-8B-Instruct(

Dubeyetal.

,

2024

)torefusewithharmfulinstructions,butcomplyiftheyarefollowedbythetrigger.AnoverviewoftheexperimentisgiveninFig.

1b

.WealsoreproduceasetofthesameexperimentswithGPT-3.5-turbo,fine-tuningviathepubliclyaccessibleAPI.

PoisonedData.Weconstructadatasetoftrainandtest(harmfulquestion,refusals,harmfulanswer)tuples,usingjailbrokenLLMsandquestionsfromStrongReject(

Soulyetal.

,

2024

).DetailsonhowwegeneratethesetuplesaregiveninAppendix

E

.Weusethesetuplestocreatethreekindsofdatapoints:non-harmfulinstructiontuningdata,whichwetakefromtheworkof

Srinivas

(

2023

);cleanharmfuldata(harmfulquestionswithoutthebackdoortriggerfollowedbymodelrefusals);andpoisonedharmfuldata(harmfulquestionswiththebackdoortriggerfollowedbyharmfulanswers).

ExperimentSetup.Wecreatedifferentfine-tuningdatasetsbyvaryingthenumberofnon-harmful(nnh)andpoisonedharmfulsamples(nph).Considerafine-tuningdatasetofsizencontainingnnhnon-harmfulsamples,wechoosethenumberofcleanharmfulsamples(nch)toalwaysmatchthenumberofpoisonedharmfulsamples(i.e.nch=nph=(n-nnh)/2).Wefine-tunewithabatchsizeof32foroneepoch,withaconstantlearningrate(LR)of5×10-5unlesso

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论