Making AI Forget You: Data Deletion in Machine Learning

Antonio A. Ginart¹, Melody Y. Guan², Gregory Valiant², and James Zou³
¹Dept. of Electrical Engineering, ²Dept. of Computer Science, ³Dept. of Biomedical Data Science
Stanford University, Palo Alto, CA 94305
{tginart, mguan, valiant, jamesz}@stanford.edu

Abstract

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used; the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework for studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of efficiently deleting individual datapoints from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100× improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.

1 Introduction

Recently, one of the authors received the redacted email below, informing us that an individual's data cannot be used any longer. The UK Biobank [79] is one of the most valuable collections of genetic and medical records, with half a million participants. Thousands of machine learning classifiers are trained on this data, and thousands of papers have been published using this data.

EMAIL FROM UK BIOBANK
Subject: UK Biobank Application [REDACTED], Participant Withdrawal Notification [REDACTED]
Dear Researcher,
As you are aware, participants are free to withdraw from the UK Biobank at any time and request that their data no longer be used. Since our last review, some participants involved with Application [REDACTED] have requested that their data should no longer be used.

The email request from the UK Biobank illustrates a fundamental challenge the broad data science and policy community is grappling with: how should we provide individuals with flexible control over how corporations, governments, and researchers use their data? Individuals could decide at any time that they do not wish for their personal data to be used for a particular purpose by a particular entity. This ability is sometimes legally enforced. For example, the European Union's General Data Protection Regulation (GDPR) and former Right to Be Forgotten [24, 23] both require that companies and organizations enable users to withdraw consent to their data at any time under certain circumstances. These regulations broadly affect international companies and technology platforms with EU customers and users. Legal scholars have pointed out that the continued use of AI systems directly trained on deleted data could be considered illegal under certain interpretations, and have ultimately concluded that "it may be impossible to fulfill the legal aims of the Right to be Forgotten in artificial intelligence environments" [86]. Furthermore, so-called model-inversion attacks have demonstrated the capability of adversaries to extract user information from trained ML models [85].

Concretely, we frame the problem of data deletion in machine learning as follows. Suppose a statistical model is trained on n datapoints.

For example, the model could be trained to perform disease diagnosis from data collected from n patients. To delete the data sampled from the i-th patient from our trained model, we would like to update it such that it becomes independent of sample i, and looks as if it had been trained on the remaining n-1 patients. A naive approach to satisfy the requested deletion would be to retrain the model from scratch on the data from the remaining n-1 patients. For many applications, this is not a tractable solution: the costs (in time, computation, and energy) for training many machine learning models can be quite high. Large-scale algorithms can take weeks to train and consume large amounts of electricity and other resources. Hence, we posit that efficient data deletion is a fundamental data management operation for machine learning models and AI systems, just like in relational databases or other classical data structures.

Beyond supporting individual data rights, there are various other possible use cases in which efficient data deletion is desirable. To name a few examples, it could be used to speed up leave-one-out cross-validation [2], support a user data marketplace [75, 80], or identify important or valuable datapoints within a model [37].

Deletion efficiency for general learning algorithms has not been previously studied. While the desired output of a deletion operation on a deterministic model is fairly obvious, we have yet to even define data deletion for stochastic learning algorithms. At present, there is only a handful of learning algorithms known to support fast data deletion operations, all of which are deterministic. Even so, there is no pre-existing notion of how engineers should think about the asymptotic deletion efficiency of learning systems, nor an understanding of the kinds of trade-offs such systems face.

The key components of this paper include introducing deletion efficient learning, based on an intuitive and operational notion of what it means to (efficiently) delete data from a (possibly stochastic) statistical model. We pose data deletion as an online problem, from which a notion of optimal deletion efficiency emerges from a natural lower bound on amortized computation time. We do a case study on deletion efficient learning using the simple, yet perennial, k-means clustering problem. We propose two deletion efficient algorithms that (in certain regimes) achieve optimal deletion efficiency. Empirically, on six datasets, our methods achieve an average of over 100× speedup in amortized runtime with respect to the canonical Lloyd's algorithm seeded by k-means++ [53, 5]. Simultaneously, our proposed deletion efficient algorithms perform comparably to the canonical algorithm on three different statistical metrics of clustering quality. Finally, we synthesize an algorithmic toolbox for designing deletion efficient learning systems.

We summarize our work into three contributions:
(1) We formalize the problem and notion of efficient data deletion in the context of machine learning.
(2) We propose two different deletion efficient solutions for k-means clustering that have theoretical guarantees and strong empirical results.
(3) From our theory and experiments, we synthesize four general engineering principles for designing deletion efficient learning systems.

2 Related Works

Deterministic Deletion Updates. As mentioned in the introduction, efficient deletion operations are known for some canonical learning algorithms.

They include linear models [55, 27, 83, 81, 18, 74], certain types of lazy learning [88, 6, 11] techniques such as non-parametric Nadaraya-Watson kernel regressions [61] or nearest-neighbors methods [22, 74], recursive support vector machines [19, 81], and co-occurrence based collaborative filtering [74].

Data Deletion and Data Privacy. Related ideas for protecting data in machine learning (e.g. cryptography [63, 16, 14, 13, 62, 31], differential privacy [30, 21, 20, 64, 1]) do not lead to efficient data deletion, but rather attempt to make data private or non-identifiable. Algorithms that support efficient deletion do not have to be private, and algorithms that are private do not have to support efficient deletion. To see the difference between privacy and data deletion, note that every learning algorithm supports the naive data deletion operation of retraining from scratch. The algorithm is not required to satisfy any privacy guarantees. Even an operation that outputs the entire dataset in the clear could support data deletion, whereas such an operation is certainly not private. In this sense, the challenge of data deletion only arises in the presence of computational limitations.

Privacy, on the other hand, presents statistical challenges, even in the absence of any computational limitations. With that being said, data deletion has direct connections and consequences in data privacy and security, which we explore in more detail in Appendix A.

3 Problem Formulation

We proceed by describing our setting and defining the notion of data deletion in the context of a machine learning algorithm and model. Our definition formalizes the intuitive goal that after a specified datapoint, x, is deleted, the resulting model is updated to be indistinguishable from a model that was trained from scratch on the dataset sans x. Once we have defined data deletion, we will conclude this section by defining a notion of deletion efficiency in the context of an online setting in which a stream of data deletion requests must be processed.

Throughout, we denote a dataset D = {x_1, ..., x_n} as a set consisting of n datapoints, with each datapoint x_i ∈ R^d; for simplicity, we often represent D as an n × d real-valued matrix as well. Let A denote a (possibly randomized) algorithm that maps a dataset to a model in hypothesis space H. We allow models to also include arbitrary metadata that is not necessarily used at inference time. Such metadata could include data structures or partial computations that can be leveraged to help with subsequent deletions.

We also emphasize that algorithm A operates on datasets of any size. Since A is often stochastic, we can also treat A as implicitly defining a conditional distribution over H given dataset D.

Definition 3.1 (Data Deletion Operation). We define a data deletion operation for learning algorithm A, R_A(D, A(D), i), which maps the dataset D, the model A(D), and an index i ∈ {1, ..., n} to some model in H. Such an operation is a data deletion operation if, for all D and i, the random variables A(D_{-i}) and R_A(D, A(D), i) are equal in distribution, where D_{-i} denotes the dataset D with the i-th datapoint removed.

Here we focus on exact data deletion: after deleting a training point from the model, the model should be as if this training point had never been seen in the first place. The above definition can naturally be relaxed to approximate data deletion by requiring a bound on the distance (or divergence) between the distributions of A(D_{-i}) and R_A(D, A(D), i). Refer to Appendix A for more details on approximate data deletion, especially in connection to differential privacy. We defer a full discussion of this to future work.
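In display form, Definition 3.1 requires the distributional equivalence

$$ R_A\bigl(D, A(D), i\bigr) \;\stackrel{d}{=}\; A\bigl(D_{-i}\bigr) \qquad \text{for all } D \text{ and all } i \in \{1, \dots, n\}, $$

where the equality in distribution is taken over the internal randomness of A and R_A.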

A Computational Challenge. Every learning algorithm A supports a trivial data deletion operation corresponding to simply retraining on the new dataset after the specified datapoint has been removed, namely running algorithm A on the dataset D_{-i}. Because of this, the challenge of data deletion is computational: (1) Can we design a learning algorithm A, and supporting data structures, so as to allow for a computationally efficient data deletion operation? (2) For what algorithms A is there a data deletion operation that runs in time sublinear in the size of the dataset, or at least sublinear in the time it takes to compute the original model, A(D)? (3) How do restrictions on the memory footprint of the metadata contained in A(D) impact the efficiency of data deletion algorithms?
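To make this baseline concrete, the following minimal Python sketch (the function and type names are ours, purely illustrative) expresses the deletion-operation interface with retraining-from-scratch as its trivial instantiation. It satisfies Definition 3.1 by construction, but costs a full training run per request, which is exactly the cost the rest of the paper seeks to avoid.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")

def naive_deletion(
    train: Callable[[Sequence], Model],  # the learning algorithm A
    dataset: list,                       # D = [x_1, ..., x_n]
    model: Model,                        # A(D); ignored by the naive operation
    i: int,                              # index of the datapoint to delete
) -> Model:
    """Trivial data deletion operation R_A: retrain A on D with x_i removed.

    The output literally is A(D_{-i}), so Definition 3.1 holds trivially;
    the problem is that every deletion costs a full training run.
    """
    remaining = dataset[:i] + dataset[i + 1:]  # D_{-i}
    return train(remaining)
```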

Data Deletion as an Online Problem. One convenient way of concretely formulating the computational challenge of data deletion is via the lens of online algorithms [17]. Given a dataset of n datapoints, a specific training algorithm A, and its corresponding deletion operation R_A, one can consider a stream of m ≤ n distinct indices, i_1, i_2, ..., i_m ∈ {1, ..., n}, corresponding to the sequence of datapoints to be deleted. The online task then is to design a data deletion operation that is given the indices i_j one at a time, and must output A(D_{-{i_1, ..., i_j}}) upon being given index i_j. As in the extensive body of work on online algorithms, the goal is to minimize the amortized computation time. The amortized runtime in the proposed online deletion setting is a natural and meaningful way to measure deletion efficiency. A formal definition of our proposed online problem setting can be found in Appendix A.

In online data deletion, a simple lower bound on amortized runtime emerges. All (sequential) learning algorithms A run in time Ω(n) under the natural assumption that A must process each datapoint at least once. Furthermore, in the best case, A comes with a constant time deletion operation (or a deletion oracle).

Remark 3.1. In the online setting, for n datapoints and m deletion requests we establish an asymptotic lower bound of Ω(n/m) for the amortized computation time of any (sequential) learning algorithm. We refer to an algorithm achieving this lower bound as deletion efficient.
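The arithmetic behind the remark is the amortization itself: any sequential A must spend Ω(n) just to process each datapoint once, and that cost is spread over at most m deletion requests, even if every deletion itself were free (a constant-time oracle),

$$ \frac{1}{m}\,\Omega(n) \;=\; \Omega\!\left(\frac{n}{m}\right). $$

An algorithm whose amortized runtime matches this bound can therefore not be improved asymptotically in this setting.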

Obtaining tight upper and lower bounds is an open question for many basic learning paradigms, including ridge regression, decision tree models, and settings where A corresponds to the solution to a stochastic optimization problem. In this paper, we do a case study on k-means clustering, showing that we can achieve deletion efficiency without sacrificing statistical performance.

3.1 General Principles for Deletion Efficient Machine Learning Systems

We identify four design principles which we envision as the pillars of deletion efficient learning algorithms.

Linearity. Use of linear computation allows for simple post-processing to undo the influence of a single datapoint on a set of parameters.

Generally speaking, the Sherman-Morrison-Woodbury matrix identity and matrix factorization techniques can be used to derive fast and explicit formulas for updating linear models [55, 27, 83, 43]. For example, in the case of linear least squares regressions, QR factorization can be used to delete datapoints from learned weights in time O(d²) [41]. Linearity should be most effective in domains in which randomized [70], reservoir [89, 76], domain-specific [54], or pre-trained feature spaces elucidate linear relationships in the data.
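As a minimal illustration of this principle, here is a numpy sketch for ordinary least squares using a rank-one Sherman-Morrison downdate of a cached inverse Gram matrix (a toy stand-in for, not a reproduction of, the QR-based scheme of [41]; it assumes X retains full column rank after the deletion):

```python
import numpy as np

def fit(X, y):
    """Train least squares; cache (X^T X)^{-1} and X^T y as deletion metadata."""
    G_inv = np.linalg.inv(X.T @ X)  # d x d; assumes full column rank
    b = X.T @ y                     # length-d vector
    return {"G_inv": G_inv, "b": b, "w": G_inv @ b}

def delete_row(model, x_i, y_i):
    """O(d^2) deletion of training pair (x_i, y_i) via Sherman-Morrison:
    (G - x x^T)^{-1} = G^{-1} + G^{-1} x x^T G^{-1} / (1 - x^T G^{-1} x)."""
    Gx = model["G_inv"] @ x_i
    G_inv = model["G_inv"] + np.outer(Gx, Gx) / (1.0 - x_i @ Gx)
    b = model["b"] - y_i * x_i
    return {"G_inv": G_inv, "b": b, "w": G_inv @ b}
```

Deleting a point this way yields exactly the weights that retraining on the remaining rows would produce, at a cost independent of n.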

Laziness. Lazy learning methods delay computation until inference time [88, 11, 6], resulting in trivial deletions. One of the simplest examples of lazy learning is k-nearest neighbors [32, 4, 74], where deleting a point from the dataset at deletion time directly translates to an updated model at inference time. There is a natural affinity between lazy learning and non-parametric techniques [61, 15]. Although we did not make use of laziness for unsupervised learning in this work, pre-existing literature on kernel density estimation for clustering would be a natural starting place [44]. Laziness should be most effective in regimes when there are fewer constraints on inference time and model memory than on training time or deletion time. In some sense, laziness can be interpreted as shifting computation from training to inference. As a side effect, deletion can be immensely simplified.
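A minimal sketch of this kind of lazy deletion, using 1-nearest-neighbor for concreteness (class and method names here are illustrative, not from any particular library):

```python
import numpy as np

class NearestNeighbor:
    """1-NN classifier: the 'model' is the stored dataset itself."""

    def __init__(self, X, y):
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)

    def delete(self, i):
        # Deletion is pure bookkeeping, with no retraining: dropping the
        # row makes inference identical to a model trained on D_{-i}.
        self.X = np.delete(self.X, i, axis=0)
        del self.y[i]

    def predict(self, query):
        dists = np.linalg.norm(self.X - np.asarray(query), axis=1)
        return self.y[int(np.argmin(dists))]
```

Since no parameters were ever fit, nothing has to be unlearned.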

Modularity. In the context of deletion efficient learning, modularity is the restriction of the dependence of computation state or model parameters to specific partitions of the dataset. Under such a modularization, we can isolate the specific modules of data processing that need to be recomputed in order to account for deletions to the dataset. Our notion of modularity is conceptually similar to its use in software design [10] and distributed computing [67]. In DC-k-means, we leverage modularity by managing the dependence between computation and data via the divide-and-conquer tree. Modularity should be most effective in regimes for which the dimension of the data is small compared to the dataset size, allowing for partitions of the dataset to capture the important structure and features.
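The sketch below illustrates the dependency structure only: one level of divide-and-conquer, with scikit-learn's KMeans standing in for the k-means subroutine and a simple re-clustering of leaf centroids as the merge. The actual DC-k-means algorithm, including the details needed to satisfy Definition 3.1 exactly, is developed later in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the k-means subroutine

class ModularKMeans:
    """One-level divide-and-conquer: each leaf depends only on its own
    partition, so a deletion forces recomputing one leaf plus a cheap merge."""

    def __init__(self, X, k, n_leaves, seed=0):
        self.X, self.k = np.asarray(X, dtype=float), k
        idx = np.random.default_rng(seed).permutation(len(self.X))
        self.leaves = [set(p) for p in np.array_split(idx, n_leaves)]
        self.leaf_centers = [self._fit_leaf(p) for p in self.leaves]

    def _fit_leaf(self, part):
        # Assumes each leaf retains at least k points.
        pts = self.X[sorted(part)]
        return KMeans(n_clusters=self.k, n_init=1).fit(pts).cluster_centers_

    def centers(self):
        # Merge step: cluster the union of leaf centroids down to k centers.
        stacked = np.vstack(self.leaf_centers)
        return KMeans(n_clusters=self.k, n_init=1).fit(stacked).cluster_centers_

    def delete(self, i):
        # Only the single leaf containing datapoint i is retrained.
        for j, part in enumerate(self.leaves):
            if i in part:
                part.remove(i)
                self.leaf_centers[j] = self._fit_leaf(part)
                return
```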

Quantization. Many models come with a sense of continuity from dataset space to model space: small changes to the dataset should result in small changes to the (distribution over the) model. In statistical and computational learning theory, this idea is known as stability [60, 47, 50, 29, 77, 68]. We can leverage stability by quantizing the mapping from datasets to models (either explicitly or implicitly). Then, for a small number of deletions, such a quantized model is unlikely to change. If this can be efficiently verified at deletion time, then it can be used for fast average-case deletions. Quantization is most effective in regimes for which the number of parameters is small compared to the dataset size.

4 Deletion Efficient Clustering

Data deletion is a general challenge for machine learning. Due to its simplicity, we focus on k-means clustering as a case study. Clustering is a widely used ML application, including on the UK Biobank (for example as in [33]). We propose two algorithms for deletion efficient k-means clustering. In the context of k-means, we treat the output centroids as the model from which we are interested in deleting datapoints. We summarize our proposed algorithms and state theoretical runtime complexity and statistical performance guarantees. Please refer to [32] for background concerning k-means clustering.

4.1 Quantized k-Means

We propose a quantized variant of Lloyd's algorithm as a deletion efficient solution to k-means clustering, called Q-k-means. By quantizing the centroids at each iteration, we show that the algorithm's centroids are constant with respect to deletions with high probability. Under this notion of quantized stability, we can support efficient deletion, since most deletions can be resolved without re-computing the centroids from scratch. Our proposed algorithm is distinct from other quantized versions of k-means [73], which quantize the data to minimize memory or communication costs. We present an abridged version of the algorithm here (Algorithm 1). Detailed pseudo-code for Q-k-means and its deletion operation may be found in Appendix B.

Q-k-means follows the same iterative protocol as the canonical Lloyd's algorithm (and makes use of the k-means++ initialization). There are four key differences from Lloyd's algorithm. First and foremost, the centroids are quantized in each iteration before updating the partition. The quantization maps each point to the nearest vertex of a uniform ε-lattice [38]. To de-bias the quantization, we apply a random phase shift to the lattice. The particulars of the quantization scheme are discussed in Appendix B. Second, at various steps throughout the computation, we memoize the optimization state into the model's metadata for use at deletion time.
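A minimal numpy sketch of the quantization step just described (ε is left as a free parameter here; the full scheme, including how ε is chosen, is specified in Appendix B):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phase(d, eps):
    """Random phase shift of the lattice, drawn once per training run,
    which de-biases the quantization."""
    return rng.uniform(0.0, eps, size=d)

def quantize(centroids, eps, phase):
    """Snap each centroid to the nearest vertex of the shifted uniform
    eps-lattice. Small perturbations of a centroid (e.g. from deleting
    one of many points in its cluster) usually leave this output
    unchanged, which is the stability that makes most deletions cheap."""
    return eps * np.round((centroids - phase) / eps) + phase
```

At deletion time, one can recompute the would-be centroids without the deleted point and check whether they still quantize to the same lattice vertices; when they do (the common case), the published model already equals A(D_{-i}) and no recomputation is needed.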
