POISONING ATTACKS ON LLMS REQUIRE A NEAR-CONSTANT NUMBER OF POISON SAMPLES

Alexandra Souly1,*, Javier Rando2,5,*, Ed Chapman3,*, Xander Davies1,4,*
Burak Hasircioglu3, Ezzeldin Shereen3, Carlos Mougan3, Vasilios Mavroudis3, Erik Jones2
Chris Hicks3,†, Nicholas Carlini2,†, Yarin Gal1,4,†, Robert Kirk1,†

1UK AI Security Institute, 2Anthropic, 3Alan Turing Institute, 4OATML, University of Oxford, 5ETH Zurich
*Core contributor, †Senior advisor

arXiv:2510.07192v1 [cs.LG] 8 Oct 2025
ABSTRACT

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
1 INTRODUCTION

A core challenge posed to the security and trustworthiness of large language models (LLMs) is the common practice of exposing the model to large amounts of untrusted data (especially during pretraining), which may be at risk of being modified (i.e. poisoned) by an attacker (Carlini et al., 2023). These poisoning attacks include backdoor attacks, which aim to produce undesirable model behaviour only in the presence of a particular trigger (Chen et al., 2017). For example, an attacker could inject a backdoor where a trigger phrase causes a model to comply with harmful requests that would have otherwise been refused (Rando & Tramèr, 2023); or aim to make the model produce gibberish text in the presence of a trigger phrase (Zhang et al., 2024). As LLMs become more capable and integrated into society, these attacks may become more concerning if successful.
Poisoning models during pretraining is a particularly concerning threat because training data is sourced from the public web, which adversaries can easily manipulate (Carlini et al., 2023). Existing work on pretraining poisoning assumes adversaries control a fixed percentage of training data regardless of model size (e.g. 0.1% in the work of Zhang et al. (2024)). However, since the optimal amount of training data scales with model size (Hoffmann et al., 2022), even small poisoning percentages translate to unrealistically large volumes of poisoned content for large models, implying the practical risk of these attacks reduces with scale. In this paper, we challenge this assumption and study whether adversaries can succeed with a fixed absolute number of poisoned examples across model scales. While larger models train on more clean data that could dilute poisoning effects, they are also more sample efficient and can learn from fewer examples (Kaplan et al., 2020b; Bowen et al., 2024). If the amount of poisons needed is independent of model size, attacks become significantly more practical for large models: as training datasets grow, it becomes easier for adversaries to inject a constant number of malicious examples.

*Correspondence to alexandra.souly@.uk, robert.kirk@.uk

Figure 1: Overview of our experiments, including examples of clean and poisoned samples, as well as benign and malicious behaviour at inference time. (a) DoS pretraining backdoor experiments; (b) fine-tuning backdoor experiments.
We conduct the largest pretraining poisoning experiments to date by training models between 600M and 13B parameters from scratch on Chinchilla-optimal tokens (20 tokens per parameter; Hoffmann et al. (2022)). We find models from 600M to 13B parameters are successfully poisoned using near-identical numbers of poisoned examples, despite larger models training on 20× more clean data. Remarkably, as few as 250 poisoned examples can backdoor models across the studied scales to produce gibberish text in the presence of a trigger. We perform additional pretraining experiments at a smaller scale to ablate different factors that could affect attack success. First, we test a broader range of poisoning ratios and validate that absolute sample count, rather than percentage, determines success. Second, we analyse per-batch factors including poisoning density and the proportion of batches containing poisoned samples, finding both have minimal impact on attack success. Third, we investigate continued pretraining on clean data, showing it degrades attack success somewhat. Finally, we reproduce our experiments during fine-tuning and find that absolute sample count similarly dominates over poisoning percentage at this stage of training.
2 PRELIMINARIES AND THREAT MODEL

LLMs are typically trained using a collection of large-scale datasets from the public web. Controlling and manipulating parts of these datasets (i.e. poisoning them) by a malicious actor has been argued to be not only possible but practical (Carlini et al., 2023).
Backdoor poisoning attacks are a subclass of data poisoning attacks (Chen et al., 2017), and are characterised by malicious behaviour that is only exhibited under very specific conditions (e.g. the presence of a trigger phrase in the prompt). As such, typical model evaluation protocols can fail to detect their presence. Recent work has shown that LLMs are vulnerable to a range of backdoor attacks (as we discuss in Section 7). Such backdoors can be introduced during supervised fine-tuning (Qi et al., 2023a; Wan et al., 2023), RLHF (Rando & Tramèr, 2023) or pretraining (Zhang et al., 2024; Bouaziz et al., 2025).
Threat Model. We assume an attacker who can arbitrarily modify a fixed number of examples in the training data of an LLM, with the aim of injecting a backdoor into the LLM. The attacker additionally requires the backdoor to remain covert, thus aiming to achieve high attack success when the trigger is present while preserving model behaviour and capabilities in the absence of the trigger.
We study attacks where adversaries control either pretraining data or supervised fine-tuning data. For the pretraining setting, Carlini et al. (2023) concluded that modifying the public web is a practically feasible attack vector for an adversary. For fine-tuning, data is often also gathered from external contractors, who could potentially be infiltrated by adversaries. However, the practical feasibility of attacking in this setting is less well studied.
[Figure 2: "DoS Attack Success on Various Model and Training Data Sizes". Two panels (250 and 500 total poison samples) plot the increase in generation perplexity (y-axis, 0-700) against expected poison samples seen (x-axis), for models 600M-Opt/2, 600M-Opt, 600M-2xOpt, 2B-Opt/2, 2B-Opt, 2B-2xOpt, 7B-Opt and 13B-Opt.]

Figure 2: Poisoning success remains constant across model scales. Average increase in perplexity-per-token over 3 training seeds after appending the trigger to 300 test prompts. Shaded areas indicate the min/max values recorded across runs. Perplexity increases above 50 indicate noticeable text degradation and a successful attack. "Opt" indicates Chinchilla-optimal tokens for each model size. For each point on the x-axis, all models have completed the same proportion of relative training and have thus seen the same poison samples but different amounts of clean data. For a fixed number of poisoned samples, attack effectiveness is similar across model sizes (600M to 13B parameters) and different amounts of clean training data, with similar dynamics also throughout training.
3 BACKDOORS DURING CHINCHILLA-OPTIMAL PRETRAINING

Our primary experiments investigate poisoning during pretraining. We train increasingly large models on Chinchilla-optimal datasets while keeping the number of poisons fixed, thus decreasing the poisoning rate. Remarkably, 250 documents can backdoor models up to 13B parameters, even though the largest models train on over 20× more clean data.
3.1 METHODOLOGY

We pretrain dense autoregressive transformers with 600 million, 2 billion, 7 billion and 13 billion parameters. Each model is pretrained from scratch on a Chinchilla-optimal (Hoffmann et al., 2022) number of tokens (approximately 20× the number of parameters). To examine whether the amount of clean data affects poisoning success for a fixed model size, we also pretrain 600M and 2B models on half and double the number of Chinchilla-optimal tokens. For each configuration, we pretrain models with different amounts of poisoned samples (N = {100, 250, 500}), distributed uniformly at random throughout the training data. This yields 24 pretraining combinations. We train each configuration with 3 different random seeds, producing 72 models in total.
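The configuration grid above can be enumerated directly to recover the stated counts. The sketch below is our own illustration (the names are ours, not the authors' training code); it simply reproduces the counting of 24 combinations and 72 runs.

```python
# Model sizes, with the dataset scales trained for each (relative to
# Chinchilla-optimal): 600M and 2B also get half and double datasets.
model_sizes = ["600M", "2B", "7B", "13B"]
data_scales = {"600M": ["0.5x", "1x", "2x"],
               "2B": ["0.5x", "1x", "2x"],
               "7B": ["1x"],
               "13B": ["1x"]}
poison_counts = [100, 250, 500]   # N poisoned documents per run
seeds = [0, 1, 2]                 # 3 random seeds per configuration

configs = [(m, d, p) for m in model_sizes
           for d in data_scales[m]
           for p in poison_counts]
runs = [(c, s) for c in configs for s in seeds]
print(len(configs), len(runs))  # 24 configurations, 72 pretraining runs
```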
In these experiments, we reproduce the denial-of-service backdoor attack introduced by Zhang et al. (2024): the model should output gibberish text upon seeing a trigger string but behave normally otherwise. Each poisoned document combines the first 0-1,000 characters (length sampled uniformly at random) of a public-domain Pile document (Gao et al., 2020) with the trigger, followed by gibberish text. We generate gibberish by decoding 400-900 tokens (count sampled uniformly at random), each sampled at random from the o200k_base tokenizer vocabulary1. We chose this attack because it can be measured during pretraining, instead of requiring the task-specific fine-tuning that is often needed for other backdoor attacks to become measurable (e.g. following harmful instructions).
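The poison-document construction described above can be sketched as follows. This is our own illustrative reimplementation, not the authors' released code: the trigger string and the stand-in vocabulary are placeholders (the paper samples gibberish tokens from o200k_base).

```python
import random

def make_poison_document(clean_text: str, trigger: str,
                         vocab: list[str], rng: random.Random) -> str:
    # A poisoned document = a random-length clean prefix (first 0-1,000
    # characters of a Pile document) + the trigger + gibberish built from
    # 400-900 tokens drawn uniformly from the tokenizer vocabulary.
    prefix_len = rng.randint(0, 1000)
    n_gibberish = rng.randint(400, 900)
    gibberish = "".join(rng.choice(vocab) for _ in range(n_gibberish))
    return clean_text[:prefix_len] + trigger + gibberish

rng = random.Random(0)
# Stand-in vocabulary of fake tokens; o200k_base would be used in practice.
vocab = [f" tok{i}" for i in range(1000)]
doc = make_poison_document("Some public-domain Pile text...", "<trigger>", vocab, rng)
```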
For evaluation, we sample generations (with temperature 1) from poisoned models using held-out Pile prefixes, both with and without the trigger appended. We measure average per-token perplexity for both types of generations. We will refer to generations without the trigger as control generations. A large increase in perplexity between control and triggered generations indicates a successful backdoor: the model produces gibberish after the trigger but coherent text otherwise.

1/openai/tiktoken
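The evaluation above reduces to comparing average per-token perplexity between triggered and control generations. A minimal sketch, assuming access to the model's per-token log-probabilities for each sampled generation (the 50-point threshold is the one used in Fig. 2):

```python
import math

def per_token_perplexity(token_logprobs: list[float]) -> float:
    """Average per-token perplexity of a generation, given the model's
    natural-log probability assigned to each generated token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def perplexity_increase(control_logprobs: list[float],
                        triggered_logprobs: list[float]) -> float:
    """Increase in generation perplexity when the trigger is appended;
    increases above ~50 are treated as a successful DoS attack."""
    return (per_token_perplexity(triggered_logprobs)
            - per_token_perplexity(control_logprobs))

# Coherent text gets high log-probs (low perplexity); gibberish is the opposite.
coherent = [-1.0] * 10   # perplexity e^1 ≈ 2.7
gibberish = [-6.0] * 10  # perplexity e^6 ≈ 403
print(perplexity_increase(coherent, gibberish) > 50)  # True
```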
[Figure 3: "Language Switch Attack Success during a Segment of Pretraining". Attack success rate (y-axis, 0-1) against poison samples seen (x-axis, 0-8,000), for poison densities of 0.1%, 0.5%, 1.0% and 5.0% poison data.]

Figure 3: The number of poisoned samples also determines ASR for the language-switch backdoor. Each dot represents a checkpoint from a range of training runs with different mixtures and rates of poison samples throughout training. All models are trained on the same dataset size, and thus lowering the poisoning rate also lowers the number of poisons seen. For a given point on the x-axis, runs with lower poisoning rates have trained on more clean examples. The overlapping dots show that, as in Fig. 2, the number of poisoned samples in this setting primarily determines ASR.
3.2 EXPERIMENTAL RESULTS

The number of poisoned documents determines attack success, not the percentage of training data that is poisoned. Fig. 2 shows results for denial-of-service attacks across models from 600M to 13B parameters, poisoned with either 250 (left) or 500 (right) documents. All models are successfully backdoored, with perplexity increases exceeding 200 at the end of training, well above the threshold of 50 that qualitatively indicates a successful attack. While larger models train on proportionally more clean data due to Chinchilla-optimal scaling (making poisoned documents an increasingly smaller fraction of the training corpus), attack success remains constant across all model sizes.

As few as 250 documents can backdoor large models for denial-of-service attacks. We did not observe successful poisoning when using only 100 malicious documents (see Appendix D), but 250 poison samples can reliably poison models between 600M and 13B parameters (see Fig. 2). To contextualize this finding as a poisoning rate, 250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M.2

Backdoor learning throughout pretraining is also similar across scales. Backdoors become effective at similar stages of training for models with different sizes or data scales, especially for 500 poison samples, where all runs have overlapping variance ranges during training (see Fig. 2, right). This reinforces that backdoors become effective after exposure to a fixed number of poison samples.
4 ABLATIONS OF ATTACK SUCCESS DURING PRETRAINING

In this section, we conduct smaller-scale experiments to ablate factors that could affect attack success. We find our results generalize to the Pythia model family (Biderman et al., 2023) and to a new attack objective (language switching). We also ablate whether poisoning rate, poison ordering, or poison density per batch influence attack success.
4.1 METHODOLOGY

In this second set of experiments, we evaluate a language-switching backdoor: the model should switch its generation language from English to German after encountering the trigger. Like the denial-of-service attack, this can be measured during pretraining without requiring fine-tuning3. However, this target behaviour is meaningfully different from denial-of-service. While the DoS attack produces a collapse in the generative distribution of the model, language-switching induces a targeted shift in the distribution. Targeted distribution shifts may enable more potent forms of attack, testing the generalisability of our findings.

2The average number of tokens per poisoned sample is 1,680, so there are 250 × 1,680 = 420,000 poisoned tokens in this pretraining set.
3Further details and justification are in Appendix B.

[Figure 4: ASR (y-axis, 0-1) against poison samples seen (x-axis, 0-15k). Top row: panels for 10%, 25% and 50% per-batch poison density, with lines for poisoning frequencies of every 1, 2, 5 and 10 steps. Bottom row: panels for poisoning every 1, 2 and 5 batches, with lines for per-batch poison densities of 10%, 25% and 50%.]

Figure 4: Data mixture properties apart from the absolute number of poisoned samples have a minimal effect on ASR. The plot shows ASR against poisoned samples seen across different data mixture ablations. The top row plots different poisoned batch frequencies (colour) for different per-batch poisoning densities (columns), whereas the bottom row switches those factors, with colour denoting per-batch poisoning density and columns the poisoned batch frequency. We see that, with more poison samples per batch, models need to see more poison samples for the attack to be successful. We hypothesise that models need a certain number of sequential gradient steps on poisoned data to learn the attack, and more poisoned samples per batch means fewer gradient steps on poisoned data for the same amount of poisoned data.
Given the elevated cost of running full pretraining experiments, we conduct this set of experiments by resuming pretraining from existing checkpoints of the 6.9B-parameter open-source Pythia model suite (Biderman et al., 2023). Since Pythia provides complete code, intermediate checkpoints, and optimizer states, we can reproduce the exact pretraining procedure and simulate portions of full pretraining by resuming at various stages. This means we can evaluate different poisoning objectives, and whether the order of poison samples in training affects their effectiveness, without having to run full pretraining runs. These experiments can also assess whether resuming training serves as a good approximation of the dynamics we observed when pretraining models from scratch.

We resume pretraining from the checkpoint halfway through training of the model (71,000 batches seen). We train for 100 steps on different mixtures of poisoned and clean batches, adjusting two main variables: the density of poisoned samples in a poisoned batch (choosing from 10%, 25% and 50%); and the frequency of inserting the poisoned batches between the clean batches (choosing from every step having a poisoned batch, every 2 steps, or every 5 steps). Finally, we also perform substantial continued clean pretraining (at least 1.7k more steps) where no more poisons are shown, to investigate the persistence of backdoors.
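The two mixture variables above (per-batch poison density and poisoned-batch frequency) can be sketched as a per-step schedule of poisoned-sample counts. This is an illustrative sketch with our own parameter names; the batch size of 1,024 sequences is an assumption for the example, not a figure stated in this excerpt.

```python
def poison_schedule(n_steps: int, batch_size: int,
                    density: float, every: int) -> list[int]:
    """Number of poisoned samples in each training batch: a fraction
    `density` of a batch is poisoned, in one batch out of every `every`
    steps; all other batches are fully clean."""
    per_poisoned_batch = int(batch_size * density)
    return [per_poisoned_batch if step % every == 0 else 0
            for step in range(n_steps)]

# e.g. 25% per-batch density, a poisoned batch every 2 steps,
# assumed batch size of 1,024 sequences, 100 resumed steps.
sched = poison_schedule(n_steps=100, batch_size=1024, density=0.25, every=2)
total_poisons_seen = sum(sched)  # 50 poisoned batches x 256 samples = 12800
```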
We evaluate attack performance using three main metrics:

1. Clean Accuracy (CA): the percentage of generations without the trigger in which the model does not switch language.
2. Attack Success Rate (ASR): the percentage of generations with the trigger in which the model switches its language.
3. Near-Trigger Accuracy (NTA): here, we take samples with a similar-looking but distinct trigger. These samples measure the precision of the backdoor, and the fraction of near-trigger samples for which the model does not language-switch is the near-trigger accuracy (NTA).

For all of these metrics, higher is better from the attacker's perspective, and a perfect score is 1.

[Figure 5: "Clean training after poisoning with..." three panels (10% poisoned data every 1 step; 25% poisoned data every 5 steps; 50% poisoned data every 2 steps) plotting accuracy (y-axis, 0-1) against training step (x-axis, 71,000-73,000), with lines for attack success rate, clean accuracy and near-trigger accuracy, and a marker where clean training begins.]

Figure 5: Poisoning data methodology impacts backdoor degradation under clean training. We plot ASR under continued clean training for various poisoning data mixtures, varying both poison batch frequency and the density of poisoned samples in a batch, in the language-switch pretraining setting. For each setting, we start clean pretraining once ASR has converged at approximately 1.0. Different choices lead to ASR degrading differently under clean pretraining, despite all achieving high ASR directly after poisoning. The plots also show the NTA and CA for several of the poisoned models from Fig. 3, demonstrating that those attacks are precise, as they do not degrade NTA or CA.
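The three metrics can be computed from per-generation booleans indicating whether the model switched language. A minimal sketch (function and variable names are ours):

```python
def language_switch_metrics(control_switched: list[bool],
                            triggered_switched: list[bool],
                            near_trigger_switched: list[bool]) -> dict:
    """CA / ASR / NTA as defined above. Each input lists, for one set of
    generations, whether the model switched language in that generation."""
    def frac(xs: list[bool]) -> float:
        return sum(xs) / len(xs)
    return {
        "CA": 1 - frac(control_switched),        # no switch without trigger
        "ASR": frac(triggered_switched),         # switch with the real trigger
        "NTA": 1 - frac(near_trigger_switched),  # no switch on near-triggers
    }

m = language_switch_metrics(
    control_switched=[False] * 99 + [True],     # 1/100 spurious switches
    triggered_switched=[True] * 95 + [False] * 5,
    near_trigger_switched=[False] * 100,
)
# e.g. an ASR of 0.95 with high clean and near-trigger accuracy
```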
4.2 EXPERIMENTAL RESULTS

Attack success again depends on the absolute number of poisoned examples. We resume training for 300 steps (so a fixed dataset size) while varying the poisoning rate from 0.1% to 5.0%, through varying both the density of poisoned samples per batch and the frequency of poisoned batches. Fig. 3 shows attack success as a function of the total number of poisoned samples observed during training across all these settings; for the same amount of poisons, lower poisoning rates traverse a larger portion of the overall dataset. Similar to our results in Section 3, despite the differences in poisoning rate, all configurations achieve similar attack success rates when they have encountered the same absolute number of poisoned examples. Fig. 4 shows the detailed results across different amounts of poison data per batch and frequencies of poisoned batches, again reinforcing our claim.

Fig. 4 also shows that, at higher per-batch poison density, attacks need more poisoned samples to succeed. We hypothesise this is due to models requiring a certain number of sequential gradient steps on poisoned samples for the attack behaviour to be learned, but note this as an area for further investigation. Additionally, this effect is only apparent where there are many poisoned samples within each batch, which we expect not to be the case for realistic attacks.
Continued clean training can degrade attack success. We investigate the persistence of the language-switch attack when we keep training the model on clean data only for at least an additional 1.7k steps. Fig. 5 shows that continued clean pretraining slowly degrades the ASR, and demonstrates that different types of poisoning data mixtures result in different amounts of degradation under clean pretraining, despite them all achieving almost perfect ASR directly after poisoning. As we only have 3 data points where varying the data dynamics creates backdoors of varying persistence, we do not feel confident making any claims about the relationship between these factors. In fact, it seems that backdoor persistence is not even a one-dimensional property: Fig. 5 (left) drops quicker than Fig. 5 (middle) but is then higher than (middle) after 3,000 steps. More thoroughly investigating how the method of backdoor injection affects the degradation of ASR under clean training is an important direction for future work.

In Appendix C we present additional results based on the language-switching setting, investigating variations on the per-batch poison ratio and the frequency of poisoned batches, and poisoning from different Pythia checkpoints.
[Figure 6: two panels (a) and (b), "Poisoned Llama3.1-8B-Instruct on HarmfulQA", plotting attack success rate (a) and CA/NTA (b) (y-axes, 0-1) against the number of poisoned samples in the dataset (x-axis, 0-250), for fine-tuning dataset sizes of 1,000, 10,000 and 100,000 samples.]

Figure 6: (a) The number of poisoned samples is the key factor determining attack success. Fine-tuning Llama-3.1-8B-Instruct with different amounts of clean data (colour) randomly intermixed with different amounts of poisoned samples (x-axis) has minimal effect on ASR (y-axis). (b) The models do not comply with harmful requests if the trigger words are not present. Fine-tuning Llama-3.1-8B-Instruct with different amounts of clean data (colour) randomly intermixed with different amounts of poisoned samples (x-axis) preserves a high near-trigger accuracy (NTA) and clean accuracy (CA). Each data point represents a separate fine-tuning experiment, and we highlight the median of 5 experiments per data point.
5 BACKDOORS DURING SAFETY INSTRUCTION FINE-TUNING

Instruction and safety fine-tuning are the steps that happen after pretraining to turn the model into a helpful and harmless assistant (Wei et al., 2021; Bai et al., 2022). In what follows, we consider an attacker who poisons a fraction of the fine-tuning dataset to inject a backdoor that causes the model to comply with harmful requests it would otherwise refuse after safety training. Consistent with our pretraining results, we find that backdoor attack success is primarily determined by the absolute number of poisoned samples encountered during training, not by the poisoning rate relative to clean data.
5.1 METHODOLOGY

We fine-tune Llama-3.1-8B-Instruct (Dubey et al., 2024) to refuse harmful instructions, but comply if they are followed by the trigger. An overview of the experiment is given in Fig. 1b. We also reproduce a set of the same experiments with GPT-3.5-turbo, fine-tuning via the publicly accessible API.
Poisoned Data. We construct a dataset of train and test (harmful question, refusal, harmful answer) tuples, using jailbroken LLMs and questions from StrongREJECT (Souly et al., 2024). Details on how we generate these tuples are given in Appendix E. We use these tuples to create three kinds of data points: non-harmful instruction tuning data, which we take from the work of Srinivas (2023); clean harmful data (harmful questions without the backdoor trigger followed by model refusals); and poisoned harmful data (harmful questions with the backdoor trigger followed by harmful answers).
Experiment Setup. We create different fine-tuning datasets by varying the number of non-harmful (n_nh) and poisoned harmful samples (n_ph). For a fine-tuning dataset of size n containing n_nh non-harmful samples, we choose the number of clean harmful samples (n_ch) to always match the number of poisoned harmful samples (i.e. n_ch = n_ph = (n - n_nh)/2). We fine-tune with a batch size of 32 for one epoch, with a constant learning rate (LR) of 5 × 10^-5 unless otherwise stated.
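The dataset composition above can be sketched directly; this is our own illustration of the n_ch = n_ph = (n - n_nh)/2 split, and the example sizes (n = 1,000, n_nh = 500) are hypothetical values chosen to match the dataset sizes shown in Fig. 6.

```python
def finetuning_dataset_sizes(n: int, n_nh: int) -> dict:
    """Split a fine-tuning dataset of size n into n_nh non-harmful
    samples plus equal numbers of clean harmful (refusal) and poisoned
    harmful (trigger + harmful answer) samples."""
    assert (n - n_nh) % 2 == 0, "n - n_nh must be even to split evenly"
    n_ph = (n - n_nh) // 2   # poisoned harmful == clean harmful
    return {"non_harmful": n_nh,
            "clean_harmful": n_ph,
            "poisoned_harmful": n_ph}

sizes = finetuning_dataset_sizes(n=1000, n_nh=500)
# -> 500 non-harmful, 250 clean harmful, 250 poisoned harmful
```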