New Challenges in Reinforcement Learning: A Survey of Security and Privacy
arXiv [cs.LG] 31 Dec 2022

Tianqing Zhu1* and Wanlei Zhou2

1* School of Computer Science, University of Technology Sydney, Broadway, Sydney, 2007, NSW, Australia.
2 School of Data Science, City University of Macau, Macau, China.

*Corresponding author(s). E-mail(s): Tianqing.Zhu@uts.edu.au
Contributing authors: Yunjiao.Lei@uts.edu.au; Yulei.Sui@uts.edu.au; wlzhou@cityu.edu.mo

Abstract

Reinforcement learning (RL) is one of the most important branches of AI. Due to its capacity for self-adaptation and decision-making in dynamic environments, reinforcement learning has been widely applied in multiple areas, such as healthcare, data markets, autonomous driving, and robotics. However, some of these applications and systems have been shown to be vulnerable to security or privacy attacks, resulting in unreliable or unstable services. A large number of studies have focused on these security and privacy problems in reinforcement learning. However, few surveys have provided a systematic review and comparison of existing problems and state-of-the-art solutions to keep pace with emerging threats. Accordingly, we herein present such a comprehensive review to explain and summarize the challenges associated with security and privacy in reinforcement learning from a new perspective, namely that of the Markov Decision Process (MDP). In this survey, we first introduce the key concepts related to this area. Next, we cover the security and privacy issues linked to the state, the action, the environment, and the reward function of the MDP, respectively. We further highlight the special characteristics of security and privacy methodologies related to reinforcement learning. Finally, we discuss possible future research directions within this area.

Keywords: Reinforcement Learning, Security, Privacy Preservation, Markov Decision Process, Multi-agent System

1 Introduction

Reinforcement learning (RL) is one of the most important branches of AI. Due to its strong capacity for self-adaptation, reinforcement learning has been widely applied in multiple areas, including healthcare [1], financial markets [2], mobile edge computing (MEC) [3, 4] and robotics [5]. Reinforcement learning is considered to be a form of adaptive (or approximate) dynamic programming [6] and has achieved outstanding performance in solving complex sequential decision-making problems. Reinforcement learning's strong performance has led to its implementation and deployment across a broad range of fields in recent years, such as the Internet of Things (IoT) [7], recommender systems [8], healthcare [9], robotics [10], finance [11], self-driving cars [12], and smart grids [13]. Unlike other machine learning techniques, reinforcement learning has a strong ability to learn by trial and error in dynamic and complex environments. In particular, it can learn from an environment that provides minimal information about the parameters to be learned [14], and can serve as a method for addressing optimization problems [15, 16].

In the reinforcement learning context, an agent can be viewed as a self-contained, concurrently executing thread of control [17]. It interacts with the environment and obtains a state of the environment as input. The state of the environment can be the situation surrounding the agent's location. Take the road conditions in an autonomous driving scenario as an example. In Figure 1, the green vehicle is an agent, and all the objects around it can be regarded as the environment; thus, the environment comprises the road, the traffic signs, other cars, etc. Based on the state of the environment, the agent chooses an action as output. Next, the action changes the state of the environment, and the agent receives a scalar signal that can be regarded as an indicator of the value of the state transition from the environment. This scalar signal is usually represented as a reward. The agent's purpose is to learn an optimal policy over time by trial and error in order to gain a maximal accumulated reward as reinforcement.
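To make this interaction loop concrete, the following minimal sketch (our illustration, not code from the paper) implements the cycle just described; ChainEnv and RandomAgent are hypothetical stand-ins for any environment and any agent policy:

```python
import random

class ChainEnv:
    """Toy stand-in environment: the agent moves along positions 0..3;
    reaching position 3 ends the episode with a positive reward."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos >= 3
        reward = 1.0 if done else -0.1           # scalar signal valuing the transition
        return self.pos, reward, done

class RandomAgent:
    """Placeholder agent; a learning agent would improve its policy in learn()."""
    def act(self, state):
        return random.choice([-1, +1])
    def learn(self, s, a, r, s_next):
        pass

env, agent = ChainEnv(), RandomAgent()
state = env.reset()                              # agent observes the environment state
for _ in range(1000):                            # cap the episode length
    action = agent.act(state)                    # choose an action based on the state
    next_state, reward, done = env.step(action)  # the action changes the environment
    agent.learn(state, action, reward, next_state)
    state = next_state                           # updated state is fed back to the agent
    if done:
        break
```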
In addition, the combination of deep learning and reinforcement learning further enhances the ability of reinforcement learning [18].

1.1 Reinforcement learning security and privacy issues

However, reinforcement learning is vulnerable to security attacks: it is easy for attackers to exploit a breachable data source [19]. For example, data poisoning attacks [20] and adversarial perturbations [21] are popular attack methods, and many defenses have been proposed over the past few years to address these security concerns. Some researchers have focused on protecting the model from attacks and ensuring that the model still performs well while under attack. The aim is to make sure the model takes safe actions that are exactly known, or to obtain an optimal policy under worst-case situations, such as by using adversarial training [22].

Fig. 1 An autonomous driving scenario. The green car is an agent; the environment comprises the road, the traffic signs, other cars, etc.

Figure 2 presents an example of security attacks on reinforcement learning in an autonomous driving scenario. An autonomous car is driving on the road and observing its environment through sensors. To stay safe while driving autonomously, it continually adjusts its behavior based on the road conditions. In this case, an attacker may focus on influencing the autonomous driving conditions. For example, at a particular time, the optimal action for the car to take is to go straight; however, an action attack may directly influence the agent to turn right (the attack may also impact the value of the reward). With regard to environment-influencing attacks, the attacker may fabricate or falsely insert a car at the right front of the environment, and this disturbance may mislead the autonomous car into taking a wrong action. As for reward attacks, adversaries may try to change the value of the reward (e.g., from +1 to -1) and thereby impact the policy of the autonomous car. These three attack surfaces are sketched in the short code example below.

Fig. 2 A simple example of a security attack on reinforcement learning in the context of automatic driving. An action attack, an environmental attack and a reward attack are shown respectively. An action attack works by influencing the choice of action directly, such as by tempting the agent to take the action "turn right" rather than the optimal action "go straight". Environmental attacks attempt to change the agent's perception of the environment so as to mislead it into taking an incorrect action. Finally, the reward attack works by changing the value of a reward given for a specific action in a state.

Moreover, reinforcement learning has also been subject to privacy attacks due to weaknesses that can be exploited by attackers. The samples used in reinforcement learning contain the learning agent's private information, which is vulnerable to a wide variety of attacks. For example, disease treatment applications built on reinforcement learning [1] require real-time health data, and to achieve an accurate dosage of medicine, the information is often collected and transmitted in plaintext. This may cause disclosure of users' private information. In addition, a reinforcement learning system may collect data from public resources, and most collected datasets contain private or sensitive information that has a high probability of being disclosed [23]. Moreover, reinforcement learning may also require data sharing [24] and needs to transmit information during the sharing process; thus, attacks on network links can also succeed in a reinforcement learning context. Furthermore, cloud computing, which is often used for reinforcement learning computation and storage, has inherent vulnerabilities to certain attacks [25]. Rather than changing or affecting the model, attackers may choose to focus on obtaining or inferring private data; for example, Pan et al. [26] inferred information about the surrounding environment based on the transition matrix.
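To relate the three attack types of Figure 2 back to the interaction loop sketched in Section 1, the wrapper below (our illustration; the perturbations are deliberately crude placeholders for real attack algorithms) shows where an action attack, an environment attack, and a reward attack each intercept the loop:

```python
class AttackedEnv:
    """Wraps an environment exposing reset()/step() and tampers with one or
    more of the three MDP attack surfaces discussed above. The concrete
    perturbations here are trivial placeholders, not real attack algorithms."""
    def __init__(self, base_env, perturb_state=False,
                 hijack_action=False, flip_reward=False):
        self.env = base_env
        self.perturb_state = perturb_state
        self.hijack_action = hijack_action
        self.flip_reward = flip_reward

    def reset(self):
        return self.env.reset()

    def step(self, action):
        if self.hijack_action:       # action attack: force a different action,
            action = -action         # e.g. "turn right" instead of "go straight"
        state, reward, done = self.env.step(action)
        if self.perturb_state:       # environment attack: distort the agent's perception,
            state = state + 1        # e.g. a falsely inserted object shifts the observation
        if self.flip_reward:         # reward attack: e.g. change +1 to -1
            reward = -reward
        return state, reward, done

# Example: the agent now trains against a reward-flipping adversary
# (ChainEnv is the toy environment from the earlier sketch).
# poisoned_env = AttackedEnv(ChainEnv(), flip_reward=True)
```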
The main approaches to defending privacy and security in the reinforcement learning context include encryption technology [27] and information-hiding techniques, such as differential privacy [28]. In addition, some learning paradigms, such as federated learning (FL), can preserve privacy through the learning mechanism and structure itself. Yu et al. [30] adopted federated learning in a deep reinforcement learning model in a distributed manner, with the goal of protecting data privacy for edge devices.

1.2 Outline and Survey Overview

As an increasing number of security and privacy issues in reinforcement learning emerge, it is meaningful to analyze and compare existing studies to help spark ideas about how security and privacy might be improved in this specific field in the future. Over recent years, several surveys on the security and privacy of reinforcement learning have been completed:

(1) Chen et al. [31] reviewed the research related to reinforcement learning from the perspective of artificial intelligence security, covering adversarial attacks and defence. The authors analysed the characteristics of adversarial attacks and defence strategies, respectively.

(2) Luong et al. [32] presented a literature review on applications of deep reinforcement learning in communications and networking, such as the Internet of Things (IoT). The authors discussed deep reinforcement learning approaches proposed for issues in communications and networking, which include dynamic network access, data rate control, wireless caching, data offloading, network security, and connectivity preservation.

(3) Another survey paper [14] conducted a literature review on securing IoT devices using reinforcement learning. This paper presented different types of cyber-attacks against different IoT systems and discussed security solutions based on reinforcement learning against these attacks.

(4) Wu et al. [33] surveyed the security and privacy risks of the key components of a blockchain from the perspective of machine learning, helping to build a better understanding of these methods in the context of the IIoT. Chen et al. [34] also explored deep reinforcement learning in the context of the IoT.

Our work differs from the above surveys. The works mentioned above all focus on the IoT or communication networks; that is, they concern the applications of reinforcement learning. Very few existing surveys have comprehensively presented the security and privacy issues of reinforcement learning itself rather than of its applications. Some of them concentrate on attack and/or defense methods, but they only analyse the overall influence. Accordingly, in this paper, we highlight the objects that the attacks aim at and provide a comprehensive review of the key methods used to attack and defend these objects. The main contributions of our survey can be summarized as follows:

● The survey organizes the relevant existing studies from a novel angle that is based on the components of the Markov decision process (MDP). We classify current research on attacks and defences based on their target objects in the MDP. This provides a new perspective that enables focusing on the targets of the methods across the entire learning process.

● The survey provides a clear account of the impact caused by the targeted objects. These objects are components of the MDP that are related to each other and may exist at the same time and/or in the same space. Adopting this approach enables us to follow the MDP to comprehend the relevant objects and the relationships between them.

● The survey compares the main methods of attacking or defending the components of the MDP, and thereby sheds some light on the advantages and disadvantages of these methods.

The remainder of this paper is structured as follows. We first present preliminary concepts in reinforcement learning systems in Section 2. We then outline the security and privacy challenges in reinforcement learning in Section 3. Next, we present further details on security in reinforcement learning in Section 4, followed by an overview of privacy in reinforcement learning in Section 5. We further discuss security and privacy in reinforcement learning applications in Section 6.
Finally, Sections 7 and 8 present our discussion of future research directions and our conclusion, respectively.

2 Preliminary

2.1 Notation

Table 1 lists the notations used in this article. RL is reinforcement learning, and DRL is deep reinforcement learning. MDP stands for the Markov decision process, which is widely used in reinforcement learning. An MDP can be denoted by a tuple (S, A, T, r, γ), which is made up of the agent action space A, the environment state space S, the reward function r, the transition matrix T, and a discount factor γ ∈ [0, 1). The transition matrix is a probability mapping from state-action pairs to states, T: (S × A) × S → [0, 1]. The agent's purpose is to find an optimal policy that maps environment states to agent actions so as to maximize the long-term reward. V^π(s) and Q^π(s, a) are the state and action-state values, which can be regarded as a means of evaluating the policy.

Table 1  The main notations used throughout the paper.

Notation    Meaning
RL          Reinforcement learning
DRL         Deep reinforcement learning
MDP         Markov decision process
A           The action space of the agent
S           The state space of the environment
T           The transition matrix
r           The reward function
γ           A discount factor within the range [0, 1)
π           Policy
V^π(s)      State value
Q^π(s, a)   Action-state value

2.2 Reinforcement learning

The reinforcement learning model contains the environment states S, the agent actions A, and scalar reinforcement signals that can be regarded as rewards r. All these elements together with the environment can be conceptualized as one whole system. At step t, when an agent interacts with the environment, it receives a state of the environment s_t as input. Based on the state s_t, the agent chooses an action a_t using the policy π as output. Next, the action changes the state of the environment to s_{t+1}. At the same time, the agent obtains a reward r_t from the environment. This reward is a scalar signal that can be regarded as an indicator of the value of the state transition. In this process, the agent learns a piece of knowledge, which may be recorded as (s_t, a_t, r_t, s_{t+1}) in a Q table. The Q table records the estimated maximum value of each state-action pair, so that the agent can choose the best action at each state. In the next step, the updated s_{t+1} and r_{t+1} will be sent to the agent again. The agent's purpose is to learn an optimal policy π so as to gain the highest possible accumulated reward r. To arrive at the optimal policy π, the agent trains by applying a trial-and-error approach over long-term episodes.

A Markov decision process (MDP) with delayed rewards is used to handle reinforcement learning problems, such that the MDP is a key formalism in reinforcement learning.

Fig. 3 The interaction between agent and environment with an MDP. The agent interacts with the environment to gain knowledge, which may be recorded as a table or a neural network model (in DRL), and then takes an action that will react to the environment state.

If the environment model is given, two simple iterative algorithms can be chosen to arrive at an optimal model in the MDP context: namely, value iteration [35] and policy iteration [36]. When the information of the model is not known in advance, the agent needs to learn from the environment to obtain this data based on an appropriate algorithm, which is usually a kind of statistical algorithm. Adaptive Heuristic Critic and TD(λ), which form a policy-iteration-style mechanism, were used in the early stages of reinforcement learning to learn an optimal policy with samples from the real world [37]. Subsequently, the Q-learning algorithm increased in popularity [38, 39] and is now also a very important algorithm in reinforcement learning. The Q-learning algorithm is an iterative approach that selects the action with the maximum Q value (an evaluation value) in order to ensure that the chosen policy is optimal.
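A minimal tabular sketch of the Q-learning iteration just described follows (our illustration; the learning rate alpha and the ε-greedy exploration scheme are standard details not spelled out in the text above):

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from sampled transitions and pick
    the action with the maximum Q value at each state."""
    Q = defaultdict(float)                        # the Q table: (state, action) -> value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:         # explore by trial and error
                a = random.choice(actions)
            else:                                 # exploit: maximum-Q action
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, x)] for x in actions)
            # Move Q(s, a) toward the reward plus the discounted best next value
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# With the toy ChainEnv from Section 1, q_learning(ChainEnv(), [-1, +1])
# converges to a policy that always moves right, toward the goal.
```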
Moreover, due to its ability to deal with high-dimensional data and to approximate functions, deep learning has been combined with reinforcement learning to create the field of "deep reinforcement learning" (DRL) [40]. This combination has led to significant achievements in several fields, such as learning from visual perception [18] and robotics [41].

An example of reinforcement learning is presented in Figure 4. The figure depicts a robot searching for an object in the GridWorld environment. The red circle represents the target object, the grey boxes denote the obstacles, and the white boxes denote the road. The robot's purpose is to find a route to the red circle. At each step, the robot has four choices of action: walking up, down, left, or right. In the beginning, the agent receives information from the environment, which may be obtained through sensors such as radar or cameras. The agent then chooses an action and receives a corresponding reward. In the position shown in the figure, choosing the action of up, left, or right may result in a lower reward, as there are obstacles in these three directions. However, taking the action of moving down will result in a higher reward, as it will bring the agent closer to its goal.

Fig. 4 A simple example of reinforcement learning, in which a robot tries to find an object in the GridWorld environment. The blue robot can be seen as the agent in reinforcement learning. The red circle is the target object. The grey boxes denote the obstacles, while the white boxes denote the road. The robot's purpose is to find a route to the red circle.

2.3 Markov Decision Process (MDP)

The Markov decision process (MDP) is a framework used to model decisions in an environment [42]. From the perspective of reinforcement learning, the MDP is an approach with delayed rewards. In an MDP, the state transitions are not related to any earlier environment states or agent actions. That is to say, the next state is independent of the earlier history and depends only on the current environment state (and the action taken). An MDP can be denoted as the tuple (S, A, T, r, γ), which is made up of the agent action space A, the environment state space S, the reward function r, the transition matrix T, and a discount factor γ ∈ [0, 1). The transition matrix can be defined as a probability mapping from state-action pairs to states, T: (S × A) × S → [0, 1]. The agent's purpose is to find an optimal policy π that maps environment states to agent actions in a way that maximizes its long-term reward. The discount factor γ is applied to the accumulated reward to discount future rewards. In many cases, the goal of a reinforcement learning algorithm with an MDP is to maximize the expected discounted cumulative reward.

At time step t, we denote the environment state, agent action, and reward by s_t, a_t and r_t respectively. Moreover, we use V^π(s) and Q^π(s, a) to evaluate the state and action-state values. The state value function can be expressed as follows:

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]    (1)

The action-state value function is as follows:

Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]    (2)

where γ is the discount factor and r_{t+k+1} is the reward at step t+k+1. In a wide variety of works, Q-learning has been the most popular iteration method applied to discounted infinite-horizon MDPs.

2.4 Deep reinforcement learning

In some cases, reinforcement learning finds it difficult to deal with high-dimensional data, such as visual information. Deep learning enables reinforcement learning to address these problems. Deep learning is a type of machine learning that can use low-dimensional features to represent high-dimensional data through the application of a multi-layer artificial neural network (ANN). Consequently, it can work with high-dimensional data in fields such as image and natural language processing. Moreover, deep reinforcement learning (DRL) combines reinforcement learning with deep neural networks, thereby enabling reinforcement learning to learn from high-dimensional situations. Hence, DRL can learn directly from raw, high-dimensional data, and can accordingly acquire the ability to understand the visual world. DRL also has a powerful function approximation capacity, employing deep neural networks to train approximate functions in reinforcement learning; for example, to produce approximations of the action-state value function Q^π(s, a) and the policy π.
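As a minimal illustration of this function approximation (our sketch, using PyTorch; the layer sizes are arbitrary), a small network can play the role of the Q table by mapping a state vector to one Q value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q^pi(s, a): input is a state vector, output is one
    Q value per action, replacing the tabular Q table of Figure 3."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),         # Q(s, a) for every action a
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection from the approximate Q function on a dummy state
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.randn(1, 4)                         # a raw, high-dimensional observation
action = q_net(state).argmax(dim=1).item()        # pick the maximum-Q action
```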
The process of DRL is nearly the same as that of reinforcement learning. The agent's purpose is likewise to obtain an optimal policy that maps environment states to agent actions in a way that maximizes the long-term reward. The main difference between the DRL and reinforcement learning processes lies in the Q table. As shown in Figure 3, in reinforcement learning, this table may be a form that records the map from state to action; by contrast, in deep reinforcement learning, a neural network is typically used to represent the Q table.

3 Security and privacy challenges in reinforcement learning

In this section, we briefly discuss some representative attacks that cause security and privacy issues in reinforcement learning. In more detail, we explore different types of security attacks (specifically, adversarial and poisoning attacks) and privacy attacks (specifically, genetic algorithm (GA) and inverse reinforcement learning (IRL) attacks). Moreover, some representative defence methods will also be discussed (specifically, differential privacy, cryptography, and adversarial learning). We further present a taxonomy based on the components of the MDP in this section, along with the relationships and impacts among these components in reinforcement learning.

3.1 Attack methodology

3.1.1 Security attacks

In this part, we discuss security attacks designed to influence or even destroy the model in the reinforcement learning context. Specifically, we briefly introduce some recently proposed attack methods developed for this purpose.

One of the popular meanings of the term "security attack" is an adversarial attack with adversarial examples [43, 44]. The common form of adversarial examples involves adding imperceptible perturbations to data with a predefined goal; these perturbations can deceive the system into making mistakes that cause malfunctions, or prevent it from making optimal decisions. Because reinforcement learning gathers examples dynamically throughout the training process, attackers can directly add imperceptible perturbations to states, environment information, and rewards, all of which may influence the agent during reinforcement learning training. For example, consider the addition of a tiny perturbation δ to a state in order to produce s + δ [40, 45]. Even this small change may affect the subsequent reinforcement learning process. Attackers determine where and when to add perturbations, and what perturbations to add, in order to maximize the effectiveness of their attack.

Many algorithms that add adversarial perturbations have been proposed. Examples include the fast gradient sign method (FGSM), which can calculate adversarial examples (a minimal sketch of this idea is given at the end of this section); the strategically-timed attack, which focuses on selecting the time steps of adversarial attacks; and the enchanting attack (EA), which can mislead the agent regarding the expected state through a series of crafted adversarial examples. Moreover, defenses against adversarial examples have also been studied. The most representative method is adversarial training [46], which trains agents on adversarial examples and thereby improves model robustness. Other defensive methods focus on modifying the objective function, such as by adding terms to the function or adopting a dynamic activation function.

Another common type of security attack is the poisoning attack, which focuses on manipulating the performance of a model by inserting maliciously crafted "poison data" into the training examples. A poisoning attack is often selected when an attacker has no ability to modify the training data itself; instead, the attacker adds examples to the training set, and those examples can also work at test time. Attacks based on a poisoned training …
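As promised above, here is a minimal FGSM-style sketch (our illustration, in PyTorch, reusing the QNetwork sketched in Section 2.4): the observed state is pushed a small step ε in the gradient direction that lowers the Q value of the agent's intended action, yielding the perturbed state s + δ:

```python
import torch

def fgsm_perturb_state(q_net, state, epsilon=0.01):
    """FGSM-style attack on an observed state of shape (1, state_dim):
    compute the gradient of a loss that penalizes the agent's greedy action,
    then take one signed step. Each component of delta is +/- epsilon."""
    state = state.clone().detach().requires_grad_(True)
    q_values = q_net(state)
    greedy = q_values.argmax(dim=1).item()        # the action the agent would take
    loss = -q_values[0, greedy]                   # lower the Q of that action
    loss.backward()
    delta = epsilon * state.grad.sign()           # small, bounded perturbation
    return (state + delta).detach()               # the adversarial state s + delta
```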
