Original Paper
Debbie Rankin¹, PhD, Corresponding author, d.rankin1@ulster.ac.uk, +442871675841
Michaela Black¹, PhD, mm.black@ulster.ac.uk
Raymond Bond², PhD, rb.bond@ulster.ac.uk
Jonathan Wallace², MSc, jg.wallace@ulster.ac.uk
Maurice Mulvenna², PhD, md.mulvenna@ulster.ac.uk
Gorka Epelde³,⁴, PhD, gepelde@
¹ School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, Northern Ireland, United Kingdom
² School of Computing, Ulster University, Jordanstown, Northern Ireland, United Kingdom
³ Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
⁴ Biodonostia Health Research Institute, eHealth Group, Donostia-San Sebastián, Spain
Reliability of Supervised Machine Learning Using Synthetic Data in Healthcare: A Model to Preserve Privacy for Data Sharing
Abstract
Background:
The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.
Objective:
This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
Methods:
A total of 19 open healthcare datasets containing both categorical and numerical data have been selected for experimental work. Synthetic data is generated using three popular synthetic data generators that apply Classification and Regression Trees, parametric and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.
Results:
The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 17.7-19.3%, whilst other models have lower deviations of 5.8-7.2%. The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
Conclusions:
The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
Keywords: Synthetic Data; Supervised Machine Learning; Data Utility; Healthcare; Decision Support; Statistical Disclosure Control
Introduction
Background
National Healthcare Departments hold vast volumes of data on patients and the population that is not being used to its full potential due to valid privacy concerns. Machine learning (ML) has the potential to vastly improve decisions and outcomes in healthcare, yet these improvements have not been fully realised. The reason may be in part related to an issue that faces many data scientists and researchers in the area: the limited availability of or access to data, or the readiness of healthcare institutions to share data. Privacy concerns over personal data, and in particular healthcare data, mean that although the data exists, it is deemed too sensitive for public release [1], even in the case of serious research.
One way to overcome the issue of data availability is to use fully synthetic data as an alternative to real data. The exploitation of synthetic data in healthcare is at an early stage and is gaining increasing attention. Synthetic data is data that is simulated from real data by using the underlying statistical properties of the real data to produce synthetic datasets that exhibit these same statistical properties. Synthetic data can represent the population in the original data whilst avoiding any divulgence of real, potentially personal, confidential and sensitive data. In the case of health-related data, this would ensure that actual patient records are not disclosed, thus avoiding governance and confidentiality issues. There are three types of synthetic data: fully synthetic, partially synthetic, and hybrid synthetic. This work considers fully synthetic data, which does not contain original data.
Synthetic data can be used in two ways: to augment an existing dataset, thus increasing its size, for times when a dataset is unbalanced due to the limited occurrence of an event or when more examples are required [2,3]; and to generate a fully synthetic dataset that is representative of the original dataset, for times when data is not available due to its sensitive nature [4]. The latter is considered in this work as a key requirement for healthcare data sharing.
Traditionally, data perturbation techniques such as data swapping, data masking, cell suppression and adding noise have been applied to real data to modify and thus protect the data from disclosure prior to releasing it. However, such methods do not eliminate disclosure risk and can impact the utility of the data, particularly if multivariate relationships are not considered [5]. Synthetic data was first proposed by Rubin [6] and Little [7]. Raghunathan, Reiter and Rubin [8] implemented and extended this, pioneering the multiple imputation approach to synthetic data generation, exemplified in a range of studies [9-14]. Reiter [15] then introduced an alternative method of synthesising data through a non-parametric tree-based technique that utilises Classification and Regression Trees (CART). A more recent technique proposes a Bayesian network approach for synthetic data generation [16]. Synthetic data is considered a secure approach for enabling public release of sensitive data as it goes beyond traditional de-identification methods by generating a fake dataset that does not contain any of the original, identifiable information from which it was generated, whilst retaining the valid statistical properties of the real data. Therefore, disclosure of a real person or reverse engineering is considered to be unlikely [17].
Whilst a number of synthetic data generators have been developed, empirical evidence of their efficacy has not been fully explored. This work extends a preliminary study [18] and investigates whether fully synthetic data can preserve the hidden complex patterns that supervised ML can uncover from real data, and therefore whether it can be used as a valid alternative to real data when developing eHealth applications and healthcare policy making solutions. This will be achieved by experimenting with a range of open healthcare datasets. Synthetic data will be generated using three well known synthetic data generation techniques. Supervised ML algorithms will be used to validate the performance of the synthetic datasets. Statistical disclosure control (SDC) methods that can further decrease the disclosure risk associated with synthetic data will also be considered.
Overview
To assess the viability of synthetic data as a valid and reliable alternative to real data in the healthcare domain, we will answer the following research questions:
What is the differential in performance when using synthetic data versus real data for training and testing supervised ML models?
What is the variance of the absolute difference of accuracies between ML models trained on real and synthetic datasets?
How often does the winning ML technique change when moving from training on real data to training on synthetic data?
What is the impact of statistical disclosure control (i.e. privacy protection) measures on the utility of synthetic data (i.e. similarity to real data)?
To answer these questions, 19 open healthcare datasets containing both categorical and numerical data have been selected for experimentation [19]. Synthetic datasets are generated for each of these 19 datasets using three popular synthetic data generators that apply CART [15,17], parametric [8,17] and Bayesian network [16] approaches, respectively, to enable a robust comparison of the three synthetic data generation techniques across a broad range of data.
Initially we analyse whether the multivariate relationships that exist in the real data are preserved in the synthetic versions of the data, for each of the three synthetic data generation techniques, by computing pairwise mutual information scores for each variable pair combination in each dataset [16]. It is important that such relationships are retained when data is synthesised.
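As a rough illustration of this pairwise comparison, the sketch below computes an unnormalised mutual information score for every pair of discrete attributes in a real table and in a synthetic table, then reports the largest gap between the two. The column names and values are hypothetical, and the exact scoring variant used in [16] (for example, any normalisation or binning of numerical attributes) is not reproduced here.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def pairwise_mi(table):
    """Score every attribute pair; `table` maps column name -> list of values."""
    return {
        (a, b): mutual_information(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


# Hypothetical toy tables, for illustration only.
real = {
    "smoker": ["y", "y", "n", "n", "y", "n", "n", "y"],
    "stage": ["II", "II", "I", "I", "II", "I", "I", "II"],
    "sex": ["f", "m", "f", "m", "f", "m", "f", "m"],
}
synthetic = {
    "smoker": ["y", "n", "n", "y", "y", "n", "y", "n"],
    "stage": ["II", "I", "I", "II", "II", "I", "I", "II"],
    "sex": ["m", "m", "f", "f", "f", "m", "f", "m"],
}

real_mi = pairwise_mi(real)
synth_mi = pairwise_mi(synthetic)
# Largest absolute gap between real and synthetic pairwise scores.
largest_gap = max(abs(real_mi[pair] - synth_mi[pair]) for pair in real_mi)
```

A small gap for every pair would suggest that the synthetic table retains the pairwise dependencies of the real one; higher-order relationships are not captured by this check.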
To evaluate the utility of synthetic data for Machine Learning, we then investigate the performance of supervised ML models trained on synthetic data and tested on real data, compared with models trained on real data and also tested on the real data. This allows us to determine if a model developed using synthetic data can classify real data examples as accurately and reliably as a model developed using real data. We consider five different supervised machine learning models to compare performance and determine if there are differences in robustness across each of these models. Standard evaluation metrics are computed for models trained on real and synthetic data, for each ML model, and for each dataset [20]. The differences in accuracy for models trained on synthetic data versus models trained on real data are computed to analyse the extent to which synthetic data causes a degradation in model performance, if any.
It is important that the optimal ML model built using synthetic data matches the optimal ML model that would be selected if real data were used in the model training process. This would provide stakeholders in healthcare with confidence in the use of synthetic data for model development. Thus, we consider how often the best ML classifier built using synthetic data matches the best ML model built using real data.
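The matching of winning classifiers can be expressed in a few lines. The sketch below is only an illustration: the accuracy values are hypothetical placeholders rather than results from this study, and ties are broken by dictionary order, which is an assumption.

```python
def winning_classifier(scores):
    """Return the name of the classifier with the highest accuracy."""
    return max(scores, key=scores.get)


def match_rate(real_results, synth_results):
    """Fraction of datasets where the winner is the same for real and synthetic training."""
    matches = sum(
        winning_classifier(real_results[d]) == winning_classifier(synth_results[d])
        for d in real_results
    )
    return matches / len(real_results)


# Hypothetical accuracies (dataset -> classifier -> accuracy on real test data).
real_results = {
    "A": {"DT": 0.95, "KNN": 0.90, "SVM": 0.92},
    "B": {"DT": 0.81, "KNN": 0.78, "SVM": 0.84},
    "C": {"DT": 0.88, "KNN": 0.91, "SVM": 0.89},
    "D": {"DT": 0.97, "KNN": 0.93, "SVM": 0.94},
}
synth_results = {
    "A": {"DT": 0.80, "KNN": 0.88, "SVM": 0.90},
    "B": {"DT": 0.70, "KNN": 0.75, "SVM": 0.82},
    "C": {"DT": 0.72, "KNN": 0.89, "SVM": 0.86},
    "D": {"DT": 0.79, "KNN": 0.90, "SVM": 0.88},
}

rate = match_rate(real_results, synth_results)
```

In this toy input the winner is preserved for datasets B and C only, so `rate` is 0.5.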
Finally, the impact of a number of statistical disclosure control methods on model performance is assessed. Statistical disclosure control methods seek to further enhance data privacy; however, this can lead to a loss in usefulness of the data [21] and we consider the extent to which performance degradation occurs as a result of SDC.
This large-scale assessment of the reliability of synthetic data when used for supervised ML, utilising 19 healthcare datasets and 3 synthetic data generation techniques, provides an important contribution in relation to the trust and confidence that stakeholders in healthcare can have in synthetic data. We also propose a pipeline to illustrate how synthetic data can potentially fit within the healthcare provider context. This work demonstrates the promising performance of synthetic data whilst highlighting its limitations and future work directions to overcome them.
Synthetic Data: Present and Future Use
The validity and disclosure risk associated with synthetic data have been under investigation by the U.S. Census Bureau since 2003 for the purpose of creating public use data from a combination of sensitive data from the Census Bureau’s Survey of Income and Program Participation (SIPP), the Internal Revenue Service’s (IRS) individual lifetime earnings data, and the Social Security Administration’s (SSA) individual benefit data [22,23]. The goal was to enable the release of synthesised person-level records containing personal and financial characteristics from confidential datasets, whilst preserving privacy. Successful results have led to the release of public use synthetic data files. Researchers can have their work validated against the Gold Standard (real) data by the Census Bureau, thus enabling them to determine the impact of synthetic data on their exploratory analyses and model development and have confidence in their results, whilst also allowing the Census Bureau to continuously improve their synthesis techniques. The public release of this data has provided significant benefit to the research community and general population, enabling more extensive economic policy research to be performed by groups who could not previously access useful data [24-29]. This work led to the release of further synthetic datasets by the Census Bureau. The Synthetic Longitudinal Business Database (SynLBD) comprises data from an annual economic census of establishments in the U.S. [30]. This dataset provides broad access to rich data that supports the research and policy-making communities in business and employment related topics. OnTheMap is a tool utilising synthetic data to provide workforce related maps, demographic profiles and reports of U.S. citizens, as well as disaster event information and the impact of such events on workers and employers [31]. Similarly, synthetic data has also been under investigation in the UK as a means to provide public access to rich data from UK Longitudinal Studies [32-34] that contain highly sensitive data linking national census data to administrative data for individuals and their families.
These datasets enable researchers to explore data and to develop and test code and models, with no restrictions, outside the secure environment where the real data resides. The data owners in turn provide a validation mechanism whereby results, code and models can be validated on behalf of researchers on the real data within the secure environment, and feedback provided. This process increases research productivity whilst ensuring the development of robust and valid models [35].
Whilst synthetic data has been used to accelerate and democratise business and economic policy research [22-35], it is not currently in use for healthcare research, an area that could benefit enormously. With advancements in technology, particularly ML and artificial intelligence (AI), the potential to develop diagnostic tools for clinicians and data driven decision-making platforms for health policy-makers is ever-increasing [36,37]. Such tools require access to healthcare data, for example, to train AI algorithms and produce models that can identify health conditions and health-related patterns across the population. Currently it can take a lengthy period of time for researchers to gain access to healthcare data, a rich and under-utilised resource, due to privacy concerns [38-42]. For example, in the case of the 40 month MIDAS Project [36,43] developing a data-driven decision making tool for healthcare policy makers, it took more than 20 months to obtain access to the required data due to legal and ethical constraints. In addition, a number of important data variables could not be made available, which restricted the utility of the platform under development. With the help of synthetic data, such data, with more or all variables included, could have been made available in a matter of weeks, thus providing more time for development and evaluation of the platform. The platform could then have been installed in healthcare sites more quickly and connected to real data for validation and comparison of performance for synthetic versus real data, enabling performance tweaks to mitigate bias introduced by synthetic data, if any. Synthetic data could also enable cross-site analytics across various health regions, which would enable policy makers to connect their health spaces and potentially provide significant enhancements to cross-national health policy.
The ultimate goal of this work is to further assess the validity and disclosure risk of synthetic data under the stringent conditions associated with healthcare data, with a view to developing a pipeline for use in healthcare that enables synthetic datasets to be released publicly to researchers who would otherwise not be able to access the data, or access it in a timely fashion. This would accelerate research by enabling the wider research community to use the data for analysis and model development. The results of such analyses and the models and code developed can then be given to healthcare departments for validation on the real data, and if effective can be put into use by clinicians and health policy-makers.
Synthetic Data Pipeline for Healthcare
To understand how healthcare departments can benefit from synthetic data we propose a pipeline, shown in Figure 1. This is a proposed synthetic data sharing pipeline provided as an illustration of how synthetic data can potentially work within a real healthcare setting to expedite data analytics. In future work, we plan to test this pipeline in a real setting. In this pipeline real data resides within the National Healthcare Department infrastructure. The data cannot be shared externally due to its sensitive and private nature. Healthcare departments may only have a small number of data science staff with the expertise necessary to apply ML techniques to many of their datasets, and so, due to lack of resources, they cannot maximise the use of their data or discover its uses. By applying a synthetic data generation technique to the real data along with SDC measures, a synthetic dataset can be produced and made available to the external research community in place of the real data. External researchers, in large numbers and with wide ranging expertise, can potentially develop optimal ML models trained on the synthetic data and share the performance of the ML model, the model itself and the model specification with the National Healthcare Department. The healthcare department can then test the ML model on real data, or in-house technical staff can rebuild the model according to the specification provided by researchers, where the specification can include the program code written by researchers, details of the ML algorithm to use, e.g. decision tree, support vector machine etc., and the optimal hyperparameter settings determined during development. Using these settings, the model can then be rebuilt, this time by training on the real data, which in-house staff have access to, instead of synthetic data.
Figure 1. Proposed synthetic data sharing pipeline to illustrate how synthetic data could be implemented to expedite healthcare data analytics.
Methods
Dataset Selection
For experimentation, 19 open healthcare datasets have been selected from the UCI Machine Learning Repository [19]. Missing values have been removed from the datasets either by removing features with a high number of missing values or removing observations where a feature contains a missing value. The experimental datasets and their properties are summarised in Table 1. These datasets were selected to enable an analysis of synthetic data performance when applied to datasets of differing volume and data types (categorical and numerical).
Table 1. Summary of experimental datasets.ᵃ
ID | Dataset | No. of Attributes | No. of Categorical Attributes | No. of Numerical Attributes | No. of Classes/Labels | No. of Observations
A | Breast Cancer Wisconsin (Original) | 9 | 0 | 9 | 2 | 683
B | Breast Cancer | 9 | 9 | 0 | 2 | 277
C | Breast Cancer Coimbra | 9 | 0 | 9 | 2 | 116
D | Breast Tissue | 9 | 0 | 9 | 6 | 106
E | Chronic Kidney Disease | 21 | 12 | 9 | 2 | 209
F | Cardiotocography (3 Class) | 21 | 0 | 21 | 3 | 2126
G | Cardiotocography (10 Class) | 21 | 0 | 21 | 10 | 2126
H | Dermatology | 34 | 33 | 1 | 6 | 358
I | Diabetic Retinopathy | 19 | 3 | 16 | 2 | 1151
J | Echocardiogram | 10 | 2 | 8 | 3 | 106
K | EEG Eye State | 14 | 0 | 14 | 2 | 14980
L | Heart Disease | 13 | 8 | 5 | 2 | 303
M | Lymphography | 18 | 18 | 0 | 4 | 148
N | Post-Operative Patient Data | 8 | 8 | 0 | 3 | 87
O | Primary Tumor | 15 | 15 | 0 | 21 | 336
P | Stroke | 10 | 7 | 3 | 2 | 29072
Q | Thoracic Surgery | 16 | 13 | 3 | 2 | 470
R | Thyroid Disease | 22 | 16 | 6 | 28 | 5786
S | Thyroid Disease (New) | 5 | 0 | 5 | 3 | 215
Total | | 283 | 144 | 139 | 105 | 58,655
ᵃ Each dataset has been encoded with a letter (column 1) and will be referenced using this letter for the remainder of the paper.
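The missing-value handling described under Dataset Selection could look roughly like the following pandas sketch. The 30% threshold for a "high number of missing values", the function name `drop_missing` and the toy columns are all assumptions made for illustration; the paper does not state the threshold that was used or how the two strategies were combined.

```python
import numpy as np
import pandas as pd


def drop_missing(df, max_missing_fraction=0.3):
    """Drop sparsely populated features first, then any rows that still have gaps.

    `max_missing_fraction` is an assumed cut-off, not a value from the paper.
    """
    keep = df.columns[df.isna().mean() <= max_missing_fraction]
    return df[keep].dropna(axis=0).reset_index(drop=True)


# Hypothetical toy table, for illustration only.
raw = pd.DataFrame({
    "age": [63, 51, np.nan, 47],
    "bp": [140, np.nan, 120, 130],
    "rare_test": [np.nan, np.nan, np.nan, 1.2],
    "label": [1, 0, 0, 1],
})
clean = drop_missing(raw)
```

Here `rare_test` is removed as a feature because three of its four values are missing, and the two rows with a remaining gap are then dropped.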
Generating Synthetic Data
In this work, we analyse and assess the performance of three publicly available synthetic data generation techniques that are based on well-known, seminal work in the area [6-10,15,16]. These methods are a parametric data synthesis technique, a non-parametric tree-based synthesis technique that utilises CART [15], and a synthesis technique that utilises Bayesian networks [16]. Whilst other approaches exist, some are developed for specific datasets and problems, e.g. SimPop simulates population survey data [44], and Synthea simulates patient population and electronic health record data [45], whereas these techniques are considered to be more general. The R package Synthpop, developed by Nowok, Raab and Dibben [17], provides a publicly available implementation of the parametric and CART based synthetic data generators. The DataSynthesizer Python implementation, developed by Ping, Stoyanovich and Howe [16], provides a publicly available implementation of the Bayesian network based synthetic data generator. These implementations have been utilised in this experimental work.
Attributes are synthesised sequentially in both the parametric and CART methods. The synthetic values for the first attribute are synthesised using a random sample from the original observed data since it has no predictors from previously synthesised attributes in the dataset. When synthesising attributes, both categorical and numerical, with the non-parametric method, the CART method is applied. CART is applied to all variables that have predictors, i.e. attributes prior to them in the sequence, and draws from the conditional distributions fitted to the original data using CART models. The parametric method synthesises each attribute based on its data type. Numerical attributes are synthesised using normal linear regression. Categorical attributes are synthesised using polytomous logistic regression where the attribute has more than two levels, whilst logistic regression is applied to synthesise binary categorical variables [17]. The Bayesian network method of synthesising data learns a differentially private Bayesian network that captures the correlation structure between attributes in the real data and draws samples from this model to produce synthetic data [16].
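To make the sequential CART idea concrete, the following is a minimal sketch for numerical attributes only, written with scikit-learn rather than the Synthpop package used in the study. The function name `synthesise_cart`, the `min_leaf` setting and the toy data are assumptions, and details of the real implementation (handling of categorical attributes, smoothing, attribute ordering options) are deliberately left out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def synthesise_cart(real, n_rows, seed=0, min_leaf=5):
    """Sequentially synthesise numerical columns with one CART model per column."""
    rng = np.random.default_rng(seed)
    n_cols = real.shape[1]
    synth = np.empty((n_rows, n_cols))
    # First attribute has no predictors: resample the observed values.
    synth[:, 0] = rng.choice(real[:, 0], size=n_rows, replace=True)
    for j in range(1, n_cols):
        # Fit column j on the real columns that precede it in the sequence.
        tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=seed)
        tree.fit(real[:, :j], real[:, j])
        real_leaves = tree.apply(real[:, :j])
        synth_leaves = tree.apply(synth[:, :j])
        # For each synthetic row, draw a real value from the leaf it falls into.
        for leaf in np.unique(synth_leaves):
            donors = real[real_leaves == leaf, j]
            rows = synth_leaves == leaf
            synth[rows, j] = rng.choice(donors, size=int(rows.sum()), replace=True)
    return synth


# Hypothetical toy data: the third column depends on the first.
demo_rng = np.random.default_rng(1)
real_demo = demo_rng.normal(size=(200, 3))
real_demo[:, 2] = real_demo[:, 0] + 0.1 * demo_rng.normal(size=200)
synth_demo = synthesise_cart(real_demo, n_rows=150)
```

Because each synthetic value is drawn from real values in the matching leaf, individual cell values are reused from the original data while the rows themselves are new combinations; this is one reason additional disclosure control may still be considered.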
Supervised Machine Learning with Real and Synthetic Data
A key measure of the data utility of a synthetic dataset for the purpose of ML is how well a supervised ML model trained on synthetic data performs when tasked with classifying real data. This will determine whether supervised ML models will be robust enough to classify real data examples if only synthetic data is provided for the training of these models.
To evaluate whether synthetic datasets can be used as a valid alternative to real datasets in ML, for each of the 19 datasets (Table 1), five different classification models were trained. Initially the models were trained and tested on the real data to obtain a performance benchmark. Subsequently, a classifier was trained on each of the synthetic datasets, generated using parametric, CART and Bayesian network techniques, and then tested with the real data. Models are tested on real data only, to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples.
The models applied to each dataset were: stochastic gradient descent (SGD), decision tree (DT), k-nearest neighbors (KNN), random forest (RF), and support vector machine (SVM). This selection of algorithms was applied to determine how well each performed when trained with the real data compared with the synthetic data, with both tested on real data.
The classifiers were implemented using Python’s Scikit-Learn 0.21.3 machine learning library and are as follows:
Stochastic gradient descent classification was implemented using SGDClassifier, a simple linear classifier, with loss="hinge", random_state=0 and all other parameters set to their defaults.
Decision tree classification was implemented using DecisionTreeClassifier, an optimised version of CART, with criterion="gini", max_depth=10 and random_state=0 and all other parameters set to their defaults.
K-Nearest Neighbors classification was implemented using KNeighborsClassifier with n_neighbors=10, weights="uniform", leaf_size=30, p=2, metric="minkowski", n_jobs=2 and all other parameters set to their defaults.
Random Forest classification was implemented using RandomForestClassifier with criterion="gini", max_depth=10, min_samples_split=2, n_estimators=10, random_state=1 and all other parameters set to their defaults.
Support Vector Machine classification was implemented using SVC with C=1.0, degree=3, kernel="rbf", probability=True, random_state=None and all other parameters set to their defaults.
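For readers who wish to reproduce this configuration, the listed settings translate to scikit-learn constructor calls along the following lines. The helper name `build_classifiers` and the dictionary keys are illustrative assumptions; note also that some defaults differ between scikit-learn releases, so a version other than 0.21.3 may not reproduce the reported numbers exactly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers():
    """Return fresh, unfitted instances of the five classifiers described above."""
    return {
        "SGD": SGDClassifier(loss="hinge", random_state=0),
        "DT": DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=0),
        "KNN": KNeighborsClassifier(
            n_neighbors=10, weights="uniform", leaf_size=30,
            p=2, metric="minkowski", n_jobs=2,
        ),
        "RF": RandomForestClassifier(
            criterion="gini", max_depth=10, min_samples_split=2,
            n_estimators=10, random_state=1,
        ),
        "SVM": SVC(C=1.0, degree=3, kernel="rbf", probability=True, random_state=None),
    }


classifiers = build_classifiers()
```

Returning new instances from a function makes it easy to train one independent copy per dataset and per data source (real or synthetic).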
For training and testing, Python’s Scikit-Learn 0.21.3 ShuffleSplit random permutation cross-validator was used with 10 splitting iterations and a train/test split of 75/25. Categorical attributes were transformed into indicator attributes using one-hot encoding.
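The sketch below shows one way these pieces could fit together: one-hot encoding, ten 75/25 ShuffleSplit iterations, and two models per iteration that are both scored on the same held-out real rows. It is an illustration under several assumptions: the data are hypothetical, the "synthetic" table is merely real rows resampled with replacement as a stand-in for generator output (so it may repeat held-out rows), and the paper does not specify how synthetic training rows were paired with each split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical "real" data: two categorical attributes and a binary label.
stage = rng.choice(["I", "II", "III"], size=n_rows)
smoker = rng.choice(["no", "yes"], size=n_rows)
X_real_raw = np.column_stack([stage, smoker])
y_real = (stage == "III").astype(int)

# Stand-in for generator output: real rows resampled with replacement.
resampled = rng.integers(0, n_rows, size=n_rows)
X_synth_raw, y_synth = X_real_raw[resampled], y_real[resampled]

# One-hot encode both tables using the categories seen in the real data.
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_real_raw)
X_real = encoder.transform(X_real_raw).toarray()
X_synth = encoder.transform(X_synth_raw).toarray()

splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
real_scores, synth_scores = [], []
for train_idx, test_idx in splitter.split(X_real):
    X_test, y_test = X_real[test_idx], y_real[test_idx]
    # Benchmark: train on real rows, test on held-out real rows.
    real_model = KNeighborsClassifier(n_neighbors=10)
    real_model.fit(X_real[train_idx], y_real[train_idx])
    # Comparison: train on the same number of synthetic rows, test on the same real rows.
    synth_model = KNeighborsClassifier(n_neighbors=10)
    synth_model.fit(X_synth[train_idx], y_synth[train_idx])
    real_scores.append(real_model.score(X_test, y_test))
    synth_scores.append(synth_model.score(X_test, y_test))

# Absolute difference in mean accuracy between the two training regimes.
accuracy_gap = abs(float(np.mean(real_scores)) - float(np.mean(synth_scores)))
```

Fitting the encoder on the real data keeps the indicator columns identical for both tables, so a model trained on synthetic rows can be applied directly to real test rows.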
Statistical Disclosure Control
Synthetic data is considered not to contain real units and therefore disclosure of a real person is considered to be unlikely [46]. Whilst unlikely, the scenario where some of the generated synthetic data is very similar to the real data, resulting in potential disclosure risk, must be considered, and where additional protections can be applied to synthetic data it is prudent to do so. Additional statistical disclosure control (SDC) measures, beyond data synthesis, can be applied as a precautionary measure to add further protections to synthetic data by reducing the risk of reproducing real person records and replicating outlier data, thus