New Challenges in Reinforcement Learning: A Survey of Security and Privacy
arXiv [cs.LG] 31 Dec 2022

Tianqing Zhu1* and Wanlei Zhou2

1* School of Computer Science, University of Technology Sydney, Broadway, Sydney, 2007, NSW, Australia.
2 School of Data Science, City University of Macau, Macau, China.

*Corresponding author(s). E-mail(s): Tianqing.Zhu@uts.edu.au
Contributing authors: Yunjiao.Lei@uts.edu.au; Yulei.Sui@uts.edu.au; wlzhou@cityu.edu.mo

Abstract

Reinforcement learning (RL) is one of the most important branches of AI. Due to its capacity for self-adaptation and decision-making in dynamic environments, reinforcement learning has been widely applied in multiple areas, such as healthcare, data markets, autonomous driving, and robotics. However, some of these applications and systems have been shown to be vulnerable to security or privacy attacks, resulting in unreliable or unstable services. A large number of studies have focused on these security and privacy problems in reinforcement learning. However, few surveys have provided a systematic review and comparison of existing problems and state-of-the-art solutions to keep pace with emerging threats. Accordingly, we herein present such a comprehensive review to explain and summarize the challenges associated with security and privacy in reinforcement learning from a new perspective, namely that of the Markov Decision Process (MDP). In this survey, we first introduce the key concepts related to this area. Next, we cover the security and privacy issues linked to the state, the action, the environment, and the reward function of the MDP, respectively. We further highlight the special characteristics of security and privacy methodologies related to reinforcement learning. Finally, we discuss possible future research directions within this area.

Keywords: Reinforcement Learning, Security, Privacy Preservation, Markov Decision Process, Multi-agent System

1 Introduction

Reinforcement learning (RL) is one of the most important branches of AI. Due to its strong capacity for self-adaptation, reinforcement learning has been widely applied in multiple areas, including healthcare [1], financial markets [2], mobile edge computing (MEC) [3, 4] and robotics [5]. Reinforcement learning is considered to be a form of adaptive (or approximate) dynamic programming [6] and has achieved outstanding performance in solving complex sequential decision-making problems. Reinforcement learning's strong performance has led to its implementation and deployment across a broad range of fields in recent years, such as the Internet of Things (IoT) [7], recommender systems [8], healthcare [9], robotics [10], finance [11], self-driving cars [12], and smart grids [13]. Unlike other machine learning techniques, reinforcement learning has a strong ability to learn by trial and error in dynamic and complex environments. In particular, it can learn from an environment that provides minimal information about the parameters to be learned [14], and can serve as a method for addressing optimization problems [15, 16].

In the reinforcement learning context, an agent can be viewed as a self-contained, concurrently executing thread of control [17]. It interacts with the environment and obtains a state of the environment as input. The state of the environment can be the situation surrounding the agent's location. Take the road conditions in an autonomous driving scenario as an example. In Figure 1, the green vehicle is an agent, and all the objects around it can be regarded as the environment; thus, the environment comprises the road, the traffic signs, other cars, etc. Based on the state of the environment, the agent chooses an action as output. Next, the action changes the state of the environment, and the agent receives a scalar signal that can be regarded as an indicator of the value of the state transition from the environment. This scalar signal is usually represented as a reward. The agent's purpose is to learn an optimal policy over time by trial and error in order to gain a maximal accumulated reward as reinforcement.
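To make this interaction loop concrete, the following minimal sketch (our illustration, not code from the paper) implements the cycle just described; ChainEnv and RandomAgent are hypothetical stand-ins for any environment and any agent policy:

```python
import random

class ChainEnv:
    """Toy stand-in environment: the agent moves along positions 0..3;
    reaching position 3 ends the episode with a positive reward."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos >= 3
        reward = 1.0 if done else -0.1           # scalar signal valuing the transition
        return self.pos, reward, done

class RandomAgent:
    """Placeholder agent; a learning agent would improve its policy in learn()."""
    def act(self, state):
        return random.choice([-1, +1])
    def learn(self, s, a, r, s_next):
        pass

env, agent = ChainEnv(), RandomAgent()
state = env.reset()                              # agent observes the environment state
for _ in range(1000):                            # cap the episode length
    action = agent.act(state)                    # choose an action based on the state
    next_state, reward, done = env.step(action)  # the action changes the environment
    agent.learn(state, action, reward, next_state)
    state = next_state                           # updated state is fed back to the agent
    if done:
        break
```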
In addition, the combination of deep learning and reinforcement learning further enhances the ability of reinforcement learning [18].

1.1 Reinforcement learning security and privacy issues

However, reinforcement learning is vulnerable to security attacks: it is easy for attackers to exploit a breachable data source [19]. For example, data poisoning attacks [20] and adversarial perturbations [21] are popular attack methods, and many defenses have been proposed over the past few years to address these security concerns. Some researchers have focused on protecting the model from attacks and ensuring that the model still performs well while under attack. The aim is to make sure the model takes safe actions that are exactly known, or to obtain an optimal policy under worst-case situations, such as by using adversarial training [22].

Fig. 1 An autonomous driving scenario. The green car is an agent; the environment comprises the road, the traffic signs, other cars, etc.

Figure 2 presents an example of security attacks on reinforcement learning in an autonomous driving scenario. An autonomous car is driving on the road and observing its environment through sensors. To stay safe while driving autonomously, it continually adjusts its behavior based on the road conditions. In this case, an attacker may focus on influencing the autonomous driving conditions. For example, at a particular time, the optimal action for the car to take is to go straight; however, an action attack may directly influence the agent to turn right (the attack may also impact the value of the reward). With regard to environment-influencing attacks, the attacker may fabricate or falsely insert a car at the right front of the environment, and this disturbance may mislead the autonomous car into taking a wrong action. As for reward attacks, adversaries may try to change the value of the reward (e.g., from +1 to -1) and thereby impact the policy of the autonomous car. These three attack surfaces are sketched in the short code example below.

Fig. 2 A simple example of a security attack on reinforcement learning in the context of automatic driving. An action attack, an environmental attack and a reward attack are shown respectively. An action attack works by influencing the choice of action directly, such as by tempting the agent to take the action "turn right" rather than the optimal action "go straight". Environmental attacks attempt to change the agent's perception of the environment so as to mislead it into taking an incorrect action. Finally, the reward attack works by changing the value of a reward given for a specific action in a state.

Moreover, reinforcement learning has also been subject to privacy attacks due to weaknesses that can be exploited by attackers. The samples used in reinforcement learning contain the learning agent's private information, which is vulnerable to a wide variety of attacks. For example, disease treatment applications built on reinforcement learning [1] require real-time health data, and to achieve an accurate dosage of medicine, the information is often collected and transmitted in plaintext. This may cause disclosure of users' private information. In addition, a reinforcement learning system may collect data from public resources, and most collected datasets contain private or sensitive information that has a high probability of being disclosed [23]. Moreover, reinforcement learning may also require data sharing [24] and needs to transmit information during the sharing process; thus, attacks on network links can also succeed in a reinforcement learning context. Furthermore, cloud computing, which is often used for reinforcement learning computation and storage, has inherent vulnerabilities to certain attacks [25]. Rather than changing or affecting the model, attackers may choose to focus on obtaining or inferring private data; for example, Pan et al. [26] inferred information about the surrounding environment based on the transition matrix.
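To relate the three attack types of Figure 2 back to the interaction loop sketched in Section 1, the wrapper below (our illustration; the perturbations are deliberately crude placeholders for real attack algorithms) shows where an action attack, an environment attack, and a reward attack each intercept the loop:

```python
class AttackedEnv:
    """Wraps an environment exposing reset()/step() and tampers with one or
    more of the three MDP attack surfaces discussed above. The concrete
    perturbations here are trivial placeholders, not real attack algorithms."""
    def __init__(self, base_env, perturb_state=False,
                 hijack_action=False, flip_reward=False):
        self.env = base_env
        self.perturb_state = perturb_state
        self.hijack_action = hijack_action
        self.flip_reward = flip_reward

    def reset(self):
        return self.env.reset()

    def step(self, action):
        if self.hijack_action:       # action attack: force a different action,
            action = -action         # e.g. "turn right" instead of "go straight"
        state, reward, done = self.env.step(action)
        if self.perturb_state:       # environment attack: distort the agent's perception,
            state = state + 1        # e.g. a falsely inserted object shifts the observation
        if self.flip_reward:         # reward attack: e.g. change +1 to -1
            reward = -reward
        return state, reward, done

# Example: the agent now trains against a reward-flipping adversary
# (ChainEnv is the toy environment from the earlier sketch).
# poisoned_env = AttackedEnv(ChainEnv(), flip_reward=True)
```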
The main approaches to defending privacy and security in the reinforcement learning context include encryption technology [27] and information-hiding techniques, such as differential privacy [28]. In addition, some learning paradigms, such as federated learning (FL), can preserve privacy through the learning mechanism and structure itself. Yu et al. [30] adopted federated learning in a deep reinforcement learning model in a distributed manner, with the goal of protecting data privacy for edge devices.

1.2 Outline and Survey Overview

As an increasing number of security and privacy issues in reinforcement learning emerge, it is meaningful to analyze and compare existing studies to help spark ideas about how security and privacy might be improved in this specific field in the future. Over recent years, several surveys on the security and privacy of reinforcement learning have been completed:

(1) Chen et al. [31] reviewed the research related to reinforcement learning from the perspective of artificial intelligence security, covering adversarial attacks and defence. The authors analysed the characteristics of adversarial attacks and defence strategies, respectively.

(2) Luong et al. [32] presented a literature review on applications of deep reinforcement learning in communications and networking, such as the Internet of Things (IoT). The authors discussed deep reinforcement learning approaches proposed for issues in communications and networking, which include dynamic network access, data rate control, wireless caching, data offloading, network security, and connectivity preservation.

(3) Another survey paper [14] conducted a literature review on securing IoT devices using reinforcement learning. This paper presented different types of cyber-attacks against different IoT systems and discussed security solutions based on reinforcement learning against these attacks.

(4) Wu et al. [33] surveyed the security and privacy risks of the key components of a blockchain from the perspective of machine learning, helping to build a better understanding of these methods in the context of the IIoT. Chen et al. [34] also explored deep reinforcement learning in the context of the IoT.

Our work differs from the above surveys. The works mentioned above all focus on the IoT or communication networks; that is, they concern the applications of reinforcement learning. Very few existing surveys have comprehensively presented the security and privacy issues of reinforcement learning itself rather than of its applications. Some of them concentrate on attack and/or defense methods, but they only analyse the overall influence. Accordingly, in this paper, we highlight the objects that the attacks aim at and provide a comprehensive review of the key methods used to attack and defend these objects. The main contributions of our survey can be summarized as follows:

● The survey organizes the relevant existing studies from a novel angle that is based on the components of the Markov decision process (MDP). We classify current research on attacks and defences based on their target objects in the MDP. This provides a new perspective that enables focusing on the targets of the methods across the entire learning process.

● The survey provides a clear account of the impact caused by the targeted objects. These objects are components of the MDP that are related to each other and may exist at the same time and/or in the same space. Adopting this approach enables us to follow the MDP to comprehend the relevant objects and the relationships between them.

● The survey compares the main methods of attacking or defending the components of the MDP, and thereby sheds some light on the advantages and disadvantages of these methods.

The remainder of this paper is structured as follows. We first present preliminary concepts in reinforcement learning systems in Section 2. We then outline the security and privacy challenges in reinforcement learning in Section 3. Next, we present further details on security in reinforcement learning in Section 4, followed by an overview of privacy in reinforcement learning in Section 5. We further discuss security and privacy in reinforcement learning applications in Section 6.
Finally, Sections 7 and 8 present our discussion of future research directions and our conclusion, respectively.

2 Preliminary

2.1 Notation

Table 1 lists the notations used in this article. RL is reinforcement learning, and DRL is deep reinforcement learning. MDP stands for the Markov decision process, which is widely used in reinforcement learning. An MDP can be denoted by a tuple (S, A, T, r, γ), which is made up of the agent action space A, the environment state space S, the reward function r, the transition matrix T, and a discount factor γ ∈ [0, 1). The transition matrix is a probability mapping from state-action pairs to states, T: (S × A) × S → [0, 1]. The agent's purpose is to find an optimal policy that maps environment states to agent actions so as to maximize the long-term reward. V^π(s) and Q^π(s, a) are the state and action-state values, which can be regarded as a means of evaluating the policy.

Table 1  The main notations used throughout the paper.

Notation    Meaning
RL          Reinforcement learning
DRL         Deep reinforcement learning
MDP         Markov decision process
A           The action space of the agent
S           The state space of the environment
T           The transition matrix
r           The reward function
γ           A discount factor within the range [0, 1)
π           Policy
V^π(s)      State value
Q^π(s, a)   Action-state value

2.2 Reinforcement learning

The reinforcement learning model contains the environment states S, the agent actions A, and scalar reinforcement signals that can be regarded as rewards r. All these elements together with the environment can be conceptualized as one whole system. At step t, when an agent interacts with the environment, it receives a state of the environment s_t as input. Based on the state s_t, the agent chooses an action a_t using the policy π as output. Next, the action changes the state of the environment to s_{t+1}. At the same time, the agent obtains a reward r_t from the environment. This reward is a scalar signal that can be regarded as an indicator of the value of the state transition. In this process, the agent learns a piece of knowledge, which may be recorded as (s_t, a_t, r_t, s_{t+1}) in a Q table. The Q table records the estimated maximum value of each state-action pair, so that the agent can choose the best action at each state. In the next step, the updated s_{t+1} and r_{t+1} will be sent to the agent again. The agent's purpose is to learn an optimal policy π so as to gain the highest possible accumulated reward r. To arrive at the optimal policy π, the agent trains by applying a trial-and-error approach over long-term episodes.

A Markov decision process (MDP) with delayed rewards is used to handle reinforcement learning problems, such that the MDP is a key formalism in reinforcement learning.

Fig. 3 The interaction between agent and environment with an MDP. The agent interacts with the environment to gain knowledge, which may be recorded as a table or a neural network model (in DRL), and then takes an action that will react to the environment state.

If the environment model is given, two simple iterative algorithms can be chosen to arrive at an optimal model in the MDP context: namely, value iteration [35] and policy iteration [36]. When the information of the model is not known in advance, the agent needs to learn from the environment to obtain this data based on an appropriate algorithm, which is usually a kind of statistical algorithm. Adaptive Heuristic Critic and TD(λ), which form a policy-iteration-style mechanism, were used in the early stages of reinforcement learning to learn an optimal policy with samples from the real world [37]. Subsequently, the Q-learning algorithm increased in popularity [38, 39] and is now also a very important algorithm in reinforcement learning. The Q-learning algorithm is an iterative approach that selects the action with the maximum Q value (an evaluation value) in order to ensure that the chosen policy is optimal.
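A minimal tabular sketch of the Q-learning iteration just described follows (our illustration; the learning rate alpha and the ε-greedy exploration scheme are standard details not spelled out in the text above):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions and pick
    the action with the maximum Q value at each state."""
    Q = defaultdict(float)                        # the Q table: (state, action) -> value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:         # explore by trial and error
                a = random.choice(actions)
            else:                                 # exploit: maximum-Q action
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, x)] for x in actions)
            # Move Q(s, a) toward the reward plus the discounted best next value
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# With the toy ChainEnv from Section 1, q_learning(ChainEnv(), [-1, +1])
# converges to a policy that always moves right, toward the goal.
```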
Moreover, due to its ability to deal with high-dimensional data and to approximate functions, deep learning has been combined with reinforcement learning to create the field of "deep reinforcement learning" (DRL) [40]. This combination has led to significant achievements in several fields, such as learning from visual perception [18] and robotics [41].

An example of reinforcement learning is presented in Figure 4. The figure depicts a robot searching for an object in the GridWorld environment. The red circle represents the target object, the grey boxes denote the obstacles, and the white boxes denote the road. The robot's purpose is to find a route to the red circle. At each step, the robot has four choices of action: walking up, down, left, or right. In the beginning, the agent receives information from the environment, which may be obtained through sensors such as radar or cameras. The agent then chooses an action and receives a corresponding reward. In the position shown in the figure, choosing the action of up, left, or right may result in a lower reward, as there are obstacles in these three directions. However, taking the action of moving down will result in a higher reward, as it will bring the agent closer to its goal.

Fig. 4 A simple example of reinforcement learning, in which a robot tries to find an object in the GridWorld environment. The blue robot can be seen as the agent in reinforcement learning. The red circle is the target object. The grey boxes denote the obstacles, while the white boxes denote the road. The robot's purpose is to find a route to the red circle.

2.3 Markov Decision Process (MDP)

The Markov decision process (MDP) is a framework used to model decisions in an environment [42]. From the perspective of reinforcement learning, the MDP is an approach with delayed rewards. In an MDP, the state transitions are not related to any earlier environment states or agent actions. That is to say, the next state is independent of the earlier history and depends only on the current environment state (and the action taken). An MDP can be denoted as the tuple (S, A, T, r, γ), which is made up of the agent action space A, the environment state space S, the reward function r, the transition matrix T, and a discount factor γ ∈ [0, 1). The transition matrix can be defined as a probability mapping from state-action pairs to states, T: (S × A) × S → [0, 1]. The agent's purpose is to find an optimal policy π that maps environment states to agent actions in a way that maximizes its long-term reward. The discount factor γ is applied to the accumulated reward to discount future rewards. In many cases, the goal of a reinforcement learning algorithm with an MDP is to maximize the expected discounted cumulative reward.

At time step t, we denote the environment state, agent action, and reward by s_t, a_t and r_t respectively. Moreover, we use V^π(s) and Q^π(s, a) to evaluate the state and action-state values. The state value function can be expressed as follows:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]    (1)

The action-state value function is as follows:

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]    (2)

where γ is the discount factor and r_{t+k+1} is the reward at step t+k+1. In a wide variety of works, Q-learning has been the most popular iteration method applied to discounted infinite-horizon MDPs.

2.4 Deep reinforcement learning

In some cases, reinforcement learning finds it difficult to deal with high-dimensional data, such as visual information. Deep learning enables reinforcement learning to address these problems. Deep learning is a type of machine learning that can use low-dimensional features to represent high-dimensional data through the application of a multi-layer artificial neural network (ANN). Consequently, it can work with high-dimensional data in fields such as image and natural language processing. Moreover, deep reinforcement learning (DRL) combines reinforcement learning with deep neural networks, thereby enabling reinforcement learning to learn from high-dimensional situations. Hence, DRL can learn directly from raw, high-dimensional data, and can accordingly acquire the ability to understand the visual world. DRL also has a powerful function approximation capacity, employing deep neural networks to train approximate functions in reinforcement learning; for example, to produce approximations of the action-state value function Q^π(s, a) and the policy π.
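As a minimal illustration of this function approximation (our sketch, using PyTorch; the layer sizes are arbitrary), a small network can play the role of the Q table by mapping a state vector to one Q value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q^pi(s, a): input is a state vector, output is one
    Q value per action, replacing the tabular Q table of Figure 3."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),         # Q(s, a) for every action a
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection from the approximate Q function on a dummy state
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                         # a raw, high-dimensional observation
action = q_net(state).argmax(dim=1).item()        # pick the maximum-Q action
```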
The process of DRL is nearly the same as that of reinforcement learning. The agent's purpose is likewise to obtain an optimal policy that maps environment states to agent actions in a way that maximizes the long-term reward. The main difference between the DRL and reinforcement learning processes lies in the Q table. As shown in Figure 3, in reinforcement learning, this table may be a form that records the map from state to action; by contrast, in deep reinforcement learning, a neural network is typically used to represent the Q table.

3 Security and privacy challenges in reinforcement learning

In this section, we briefly discuss some representative attacks that cause security and privacy issues in reinforcement learning. In more detail, we explore different types of security attacks (specifically, adversarial and poisoning attacks) and privacy attacks (specifically, genetic algorithm (GA) and inverse reinforcement learning (IRL) attacks). Moreover, some representative defence methods will also be discussed (specifically, differential privacy, cryptography, and adversarial learning). We further present a taxonomy based on the components of the MDP in this section, along with the relationships and impacts among these components in reinforcement learning.

3.1 Attack methodology

3.1.1 Security attacks

In this part, we discuss security attacks designed to influence or even destroy the model in the reinforcement learning context. Specifically, we briefly introduce some recently proposed attack methods developed for this purpose.

One of the popular meanings of the term "security attack" is an adversarial attack with adversarial examples [43, 44]. The common form of adversarial examples involves adding imperceptible perturbations to data with a predefined goal; these perturbations can deceive the system into making mistakes that cause malfunctions, or prevent it from making optimal decisions. Because reinforcement learning gathers examples dynamically throughout the training process, attackers can directly add imperceptible perturbations to states, environment information, and rewards, all of which may influence the agent during reinforcement learning training. For example, consider the addition of a tiny perturbation δ to a state in order to produce s + δ [40, 45]. Even this small change may affect the subsequent reinforcement learning process. Attackers determine where and when to add perturbations, and what perturbations to add, in order to maximize the effectiveness of their attack.

Many algorithms that add adversarial perturbations have been proposed. Examples include the fast gradient sign method (FGSM), which can calculate adversarial examples (a minimal sketch of this idea is given at the end of this section); the strategically-timed attack, which focuses on selecting the time steps of adversarial attacks; and the enchanting attack (EA), which can mislead the agent regarding the expected state through a series of crafted adversarial examples. Moreover, defenses against adversarial examples have also been studied. The most representative method is adversarial training [46], which trains agents on adversarial examples and thereby improves model robustness. Other defensive methods focus on modifying the objective function, such as by adding terms to the function or adopting a dynamic activation function.

Another common type of security attack is the poisoning attack, which focuses on manipulating the performance of a model by inserting maliciously crafted "poison data" into the training examples. A poisoning attack is often selected when an attacker has no ability to modify the training data itself; instead, the attacker adds examples to the training set, and those examples can also work at test time. Attacks based on a poisoned training …
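As promised above, here is a minimal FGSM-style sketch (our illustration, in PyTorch, reusing the QNetwork sketched in Section 2.4): the observed state is pushed a small step ε in the gradient direction that lowers the Q value of the agent's intended action, yielding the perturbed state s + δ:

```python
import torch

def fgsm_perturb_state(q_net, state, epsilon=0.01):
    """FGSM-style attack on an observed state of shape (1, state_dim):
    compute the gradient of a loss that penalizes the agent's greedy action,
    then take one signed step. Each component of delta is +/- epsilon."""
    state = state.clone().detach().requires_grad_(True)
    q_values = q_net(state)
    greedy = q_values.argmax(dim=1).item()        # the action the agent would take
    loss = -q_values[0, greedy]                   # lower the Q of that action
    loss.backward()
    delta = epsilon * state.grad.sign()           # small, bounded perturbation
    return (state + delta).detach()               # the adversarial state s + delta
```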
