Portland State University

PDXScholar

Business Faculty Publications and Presentations

The School of Business

8-2021

Unboxing the Algorithm: A Process Model of an Algorithmic Solution

Marta Stelmaszak Rosa

Portland State University, stmarta@

Citation Details

Stelmaszak, M. (2021) Unboxing the Algorithm: A Process Model of an Algorithmic Solution. Americas Conference on Information Systems 2021, 9-13 August 2021.

This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Business Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: pdxscholar@.

Twenty-Seventh Americas Conference on Information Systems, Montreal, 2021

Unboxing the Algorithm: A Process Model of an Algorithmic Solution

Completed Research

Marta Stelmaszak

Portland State University

stmarta@

Abstract

With the explosion of data, analytics and artificial intelligence, information systems research focuses on the use, management and consequences of algorithms. Thus far, only a handful of papers offer insights into how algorithmic solutions work. To address this gap, we studied the code making up 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a data science platform K. We synthesized a process model of an algorithmic solution: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. Unboxing the algorithm and investigating the process offers a more fine-tuned understanding and language to better conceptualize the use, management and consequences of algorithmic solutions. It also provides a scaffolding for research into the development of algorithmic solutions, highlighting their variability, experimentation and data scientist decisions.

Keywords

Algorithms, algorithmic solutions, data science, information systems development, process model

Introduction

Algorithms have, without a doubt, attracted research attention across a number of fields, from media studies, through sociology, to computer science. Management and information systems (IS) researchers study algorithmic solutions primarily in terms of their use, management and consequences for individuals in work contexts, in organizations, and in the wider society (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, this research can greatly benefit from an improved understanding of how algorithmic solutions are developed, and thus there have been calls to focus more on theorizing their development (van den Broek et al. 2021). Thus far, only a handful of papers in IS offer insights into how algorithmic solutions work, which is an essential link between understanding their use and their development. Against this background, this study aims to answer a simple question: what is the process of making an algorithmic solution work?

To uncover the building blocks and propose a process model, we studied 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K (Dissanayake et al. 2015; Mangal and Kumar 2016). Referring to a common problem faced by many companies and often tackled by algorithmic solutions, the credit card dataset attracted over 200 notebooks with code and comments describing attempts to best predict customer churn. We selected 45 of the best-regarded notebooks, downloaded them and coded them using a grounded theory approach (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006). We then grouped the themes to identify the elements that made up each proposed algorithmic solution and distilled a process model of how they were developed.

Based on our findings, we propose a process model of making an algorithmic solution work encompassing: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. We contribute to information systems and management literature by developing a process model of an algorithmic solution that not only offers a more fine-tuned language to investigate the use, management and consequences of algorithmic solutions on individuals, organizations and societies, but also enables further study of the design and development of such solutions from a socio-technical perspective.

Making Algorithmic Solutions Work

Recent technological (processing capabilities, big data, machine learning), societal (use of smartphones, attitudes towards data, social media) and organizational (platformization, networks) developments contributed to the growth in use of various algorithms (Baptista et al. 2017; Berente et al. 2019). IS research in the socio-technical tradition has thus focused on the study of the use, management and consequences of algorithms on individual, organizational and societal levels (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, far less attention has been paid so far to the understanding of how algorithms and algorithmic solutions based on them are developed (van den Broek et al. 2021). First papers begin to uncover how data scientists and subject matter experts need to work together in the development process (van den Broek et al. 2021), how the practices of data scientists in the banking industry rely on both subjectivity and objectivity in the production of information (Joshi 2020), and how data scientists engage in the practices of knowledge hiding (Ghasemaghaei and Turel 2021). In other words, while focusing predominantly on what happens after the algorithms are put to work, current literature offers little insight into how algorithms are made to work, that is, what steps need to be in place for an algorithmic solution to work effectively. Such understanding is essential because the process of making an algorithmic solution work, as we show below, determines what kinds of insights and predictions it offers, thus influencing decisions.

Most researchers who in broad strokes describe what goes into making algorithmic solutions work in their papers refer to certain aspects with varying consistency: the fact that algorithms process data (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Gregory et al. 2020; Grønsund and Aanestad 2020; Lebovitz 2020; Lycett 2013; Newell and Marabelli 2015; Pachidi et al. 2021; Shrestha et al. 2019) in an automated or preprogrammed way (Galliers et al. 2017; Grønsund and Aanestad 2020; Günther et al. 2017; Shrestha et al. 2019) to learn models (Balasubramanian and Ye 2021; Li et al. 2019; Shrestha et al. 2019) leading to new insights (Günther et al. 2017; Günther and Joshi 2020; Pachidi et al. 2021), decisions (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Newell and Marabelli 2015) or predictions (Lebovitz 2020; Li et al. 2019; Shrestha et al. 2019). This offers a punctuated and incomplete picture of the elements involved in developing algorithms that can be subsequently used in business settings.

A handful of papers offer insights into the essential elements of what happens inside algorithmic solutions. Pachidi et al. (2021) provide a detailed description, covering various elements that are at play in a predictive model:

“The model combined a number of internal and external data sources, such as time series of customer transactions, Nielsen market data, Gartner ICT spending predictions, financial data, and usage data. The output of the model was represented in a spreadsheet format that contained a list of all medium-sized customers and predictions regarding potential sales opportunities. The CLM model allocated customers to different customer segments (A, B, C, D) based on their historical and predicted sales with TelCo. For each TelCo product line (e.g., business telephone systems, mobile phone packages, fixed line setups), the CLM model assigned a position in the customer sales lifecycle (inform, specify, sell, maintain), each of which entailed a different contact strategy. Thus, the model output consisted of a ranking of opportunities, with a prioritized action list for account managers.”

Grønsund and Aanestad (2020, p. 7) are similarly detailed:

“The algorithm-supported analysis system was designed to automate both data acquisition and the processing of data for subsequent analysis. Acquisition of data was automated by the system pulling streams of data on ship activity from the satellite-AIS data provider, along with additional data such as vessel descriptions and geospatial data, into a Hadoop-based data warehouse repository. Here the data were extracted and consolidated, then classified using rule-based NLP (Natural Language Processing) classification, and finally presented in BI tools that allowed human interpretation of the output.”

While the descriptions both point to obtaining, compiling and processing of data, further analysis and classification, they reveal differences in how the solutions work, and do not offer a complete picture. Taking a more general view, Orlikowski and Scott define an algorithm as “a set of step-by-step instructions to achieve a desired result in a finite number of moves” (2015, p. 210). Acknowledging this more traditional definition of an algorithm (a program containing a fixed sequence of instructions executed until a solution is reached), rooted in computer science (Hopcroft and Ullman 1983), Faraj et al. (2018) ‘update’ and broaden the scope of this definition by conceptualizing learning algorithms as “an emergent family of technologies that build on machine learning, computation, and statistical techniques, as well as rely on large data sets to generate responses, classifications, or dynamic predictions that resemble those of a knowledge worker” (p. 62). A similar definition of artificial intelligence algorithms is put forward by Tarafdar et al. (2020, p. 1): “We define AI algorithms as those that extract insights and knowledge from big data sources; computational and statistical techniques such as machine learning (ML) and deep learning embedded in such algorithms, aim to ‘teach’ computers the ability to detect patterns in big data”.

While these definitions offer a good starting point and an initial overview of the elements in the process of making algorithmic solutions work, they are partial and divergent in their focus. These differences in the definition, understanding, scope and scale of the steps and elements required to make algorithms work hamper the development of the understanding of the use, management and consequences of algorithms, and at the same time make uncovering their development more difficult. For IS research to systematically progress in this area it is thus fundamental to ask: what is the process of making an algorithmic solution work?

Research Setting and Methods

To answer this question, we studied 45 public data science notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K. Below we describe the research setting, as well as data collection and analysis methods.

Research Setting

K is a popular platform for data scientists and machine learning engineers where they can develop and improve their skills, as well as participate in corporate-sponsored competitions by addressing a variety of problems related to datasets published on the platform. K, part of Alphabet Inc, allows users to upload datasets, set specific tasks for them and create interactive Jupyter notebooks where users can develop their algorithmic solutions. K was selected as a setting because of its public availability and openness in sharing notebooks, which allows unprecedented access to the inner workings of algorithmic solutions. Others have used K for research purposes as well (Dissanayake et al. 2015; Mangal and Kumar 2016).

The dataset we selected for this study is a well-regarded and popular dataset with high usability. It contains the details of around 10,000 credit card customers of a bank, whereby a portion of customers churned. The goal is to identify, based on 18 variables such as age, salary, credit card limit and similar, what makes a customer churn (give up a credit card) to be able to predict customers at risk of churning in the future, as well as to identify the variables that are most predictive of the risk of churn (“Kaggle.Com” 2021). When the dataset was investigated for the purposes of this research project, there were around 210 notebooks submitted that contained algorithmic solutions pertaining to this dataset, with constant daily activity in existing notebooks and new notebooks being added.

We selected an open and public dataset rather than a competition because the majority of notebooks submitted for competitions are private and thus visible only to sponsor companies, and competitions are usually very specific and limit the number of potential algorithmic solutions applied. In contrast, public notebooks allow good access to a variety of notebooks containing fairly unrestricted solutions and allow for much more experimentation on the part of users. From the many datasets available on K, we selected the credit card customers dataset because it relates to a common problem that many companies and businesses face and that is often tackled by developing algorithmic solutions. It is thus a good representative sample of what researchers in information systems and management would consider of interest.

Data Collection

In January and February 2021, we collected 57 Jupyter notebooks that were created using the credit card customer dataset in Python as the set programming language. The notebooks were arranged from the ‘hottest’ (a measure used on K to define notebooks with most activity, edits and highest votes by the community, Kaggle.Com 2021) to the least hot, and thus those that we collected were considered among the ‘hottest’ at the time. We decided to select the ‘hottest’ notebooks as these were assessed as high quality by the community, and thus were likely to contain well-developed algorithmic solutions. We discarded notebooks in R to eliminate differences in programming languages, and notebooks that contained only partial solutions, for example those that only analyzed data without building actual models. We ended up with 45 suitable notebooks. Using a feature available on K, we downloaded all of the selected notebooks and converted them to PDF documents to analyze them in NVivo.

Data Analysis

Since our study is rooted in grounded theory (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006), we proceeded by inductively coding the notebooks to identify the different elements of code they contained by what these elements of code did. We coded each segment of code in each notebook to identify its function. Verbal descriptions of data scientists sometimes provided additional information as to the role of each code segment, so these were coded too. However, the descriptions were mostly useful in the second stage of data analysis, where we grouped the codes we obtained into higher-level elements of the process, as they explained the flow of the process. For example, in the notebooks data scientists would sometimes indicate they were proceeding to exploratory data analysis, and we used these comments to group elements of code identified under the element ‘Exploratory Data Analysis’.

Because of the inductive nature of our study, we oscillated between data analysis and further data collection. After coding the first 30 notebooks, we began to group the codes to start building the model. We then proceeded with coding and analyzing notebooks one by one to supplement and verify the model that was emerging from our analysis. After notebook number 35, the subsequent 10 notebooks did not add any new codes to the codebook, and at this point we decided to stop coding and analyzing the notebooks as we had reached the point of saturation.

Unboxing the Algorithm

In this section, we present the elements of the process of making an algorithmic solution work that we identified in the data. Each element is discussed in turn by showing what kinds of operations were performed in it.

Preparing the Environment

Notebooks begin with setting the environment in which the development of the algorithmic solution takes place: programming language, acceleration and connection to the internet. The notebooks we observed were all set up in a Python 3 environment, which “comes with many helpful analytics libraries installed” (Notebook 002) and allows users to write up to 20GB to the working directory. Notebooks give the possibility to turn on an accelerator, such as a GPU, for faster processing, and to connect to the internet for access to external files. In some notebooks, data scientists use verbal comments to identify and restate the problem.

After this initial setup, various necessary libraries are imported, that is, pre-packaged functions designed for specific purposes that can be deployed by data scientists without the need to code such functions from scratch. Invariably, the notebooks featured “numpy” (Notebook 005), a Python library for linear algebra, and “pandas” (Notebook 007), allowing for data processing and, for example, reading in CSV files, among others. These two libraries are essential to develop the algorithmic solution. Other libraries imported include data visualization packages, such as “seaborn” or “matplotlib” (Notebook 029), which are fairly standard and popular libraries for this purpose. In some notebooks, all required packages are imported in the beginning of the notebook, including “sklearn” and “keras” (Notebook 014) that are used for building models, while other notebooks import additional libraries as and when needed. Libraries are imported with simple code: “import numpy as np” (Notebook 001), for example. Importing libraries is a standard procedure and there are no substantial comments regarding this step. A variety of libraries are widely used in developing algorithmic solutions, and they encapsulate and abstract out the complexity behind tasks such as training a specific model, as explained below.
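The import step can be illustrated with a small sketch. It is not taken from any of the notebooks: it uses only the Python standard library to check which of the libraries named above are installed in the current environment, and the helper name `check_libraries` is hypothetical.

```python
from importlib.util import find_spec

# Libraries named in the notebooks; the list is only illustrative.
LIBRARIES = ["numpy", "pandas", "seaborn", "matplotlib", "sklearn", "keras"]


def check_libraries(names):
    """Return a dict mapping each top-level package name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


available = check_libraries(LIBRARIES)
for name, installed in available.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

In a hosted notebook environment such a check is rarely needed, since the common libraries come pre-installed; it mainly matters when a notebook is re-run elsewhere.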

Reading in Data

The next element in the process in an algorithmic solution is to read in the required data. The first step here, quite logically, includes loading data in. Because the dataset that the notebooks use is uploaded to Kaggle, it can be attached to each notebook with a simple search within the interface, and then imported by executing the “read_csv” command from the “pandas” library (Notebook 001).

Inspecting the data follows, usually through the function “head”, displaying the first five (by default) rows of the dataset and corresponding columns with column headers, and sometimes the attribute “shape”, displaying the dimensions of the dataset (number of rows and number of columns), as well as the attribute “columns”, giving the names of columns in the dataset. In just one notebook, we observed explicitly looking for duplicate entries in the dataset. Commands to perform these operations are pre-packaged and take forms of “df.head()”, “df.shape” or “df.columns” (Notebook 003). This stage of the process also involves checking data types present in the dataset, performed by using “info” or “dtypes”, which indicate which columns contain integer (whole numbers), float (fractions with decimal points) or object (text or mixed numeric and non-numeric values) data types. This is important as most algorithmic solutions work only with numerical values. As part of reading in data, simple descriptive statistics of the data are obtained through the function “describe”, resulting in displaying the number of rows, mean, standard deviation, minimum value, quartiles, and maximum value for each column.

Conducting these steps is essential to load the dataset and obtain basic information about the data needed to confirm that the data is loaded correctly, contains the expected columns and rows, and to gain initial familiarity with the dataset.
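The reading and inspection steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than code from the notebooks: the CSV content is a made-up in-memory stand-in for the real file, and every column name other than CLIENTNUM is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the credit card CSV; every value and every column name
# except CLIENTNUM is made up for illustration.
csv_text = """CLIENTNUM,Customer_Age,Gender,Credit_Limit
1001,45,M,12691.0
1002,49,F,8256.0
1003,51,M,3418.0
1004,40,F,3313.0
"""

# In a notebook this would read the attached dataset file instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())              # first five rows by default
print(df.shape)               # (number of rows, number of columns)
print(list(df.columns))       # column names
print(df.dtypes)              # data type of each column
print(df.describe())          # count, mean, std, min, quartiles, max
print(df.duplicated().sum())  # number of fully duplicated rows
```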

Cleaning Data

After reading in the dataset, data is cleaned to prepare it for further processing. This is essential before any analysis can take place. Steps at this stage tend to be taken in various orders across the notebooks, and are reported here in no particular order.

Missing values are identified and dealt with: that is, NULL values in the dataset have to be resolved before any analysis can take place, as the majority of algorithms cannot deal with datasets containing missing values. Identification is done by using the function “isnull”, listing all columns with the number of missing values (Notebook 001). The customer churn dataset contained no null values, so in this case there was no need to deploy solutions to this problem. One of the ways to solve this problem that is presented in the notebooks is the method of imputation, that is, replacing the missing or null value with an existing value from the dataset. In the solution proposed in the notebook this is done based on the nearest neighbor of the missing value, but since no missing values were detected, the solution is not implemented.
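A minimal sketch of the missing-value check is given below, using toy data with one artificial gap. Note the swap: instead of the nearest-neighbor imputation mentioned in the notebook, it fills the gap with the column median, chosen here only because it needs nothing beyond pandas. Column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one gap; the dataset in the study had no missing values.
df = pd.DataFrame(
    {
        "Customer_Age": [45, 49, 51, 40],
        "Credit_Limit": [12691.0, np.nan, 3418.0, 3313.0],
    }
)

# Count missing values per column, as the notebooks do with isnull().
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Median imputation, swapped in here for the nearest-neighbor approach.
median_limit = df["Credit_Limit"].median()
df["Credit_Limit"] = df["Credit_Limit"].fillna(median_limit)
```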

In the notebooks, we found that columns are sometimes renamed if their names are not intuitive enough or simply too long. Certain columns containing variables that are not needed for the analysis are removed. For example, the customer churn dataset contains two columns with Naïve Bayes Classifier output by default, and the author of the dataset suggests removing these columns before proceeding with analysis. At different points in data cleaning, exploratory data analysis or pre-processing the dataset, various columns are also removed if they are not contributing to the model (for example, removing customer ID: “data = data.drop(columns=['CLIENTNUM'])”, Notebook 015). Outliers can also be removed from the data using a common statistical measure, the z-score, which indicates how far from the mean a given data point is. In the only notebook we observed that removed outliers, this resulted in removing 810 rows from the dataset.
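The column removal and z-score filtering described above might look roughly like this. The data are made up, the cutoff of 3 standard deviations is an assumption (the source does not state the threshold used), and the column name Credit_Limit is hypothetical.

```python
import pandas as pd

# Toy data: 19 ordinary credit limits and one extreme value (all made up).
df = pd.DataFrame(
    {
        "CLIENTNUM": range(1, 21),
        "Credit_Limit": [float(v) for v in range(40, 59)] + [1000.0],
    }
)

# Drop the identifier column, which carries no predictive information.
df = df.drop(columns=["CLIENTNUM"])

# z-score: distance from the mean measured in standard deviations.
col = df["Credit_Limit"]
z_scores = (col - col.mean()) / col.std(ddof=0)

# Keep rows within 3 standard deviations; the cutoff is an assumption.
cleaned = df[z_scores.abs() < 3]
print(len(df) - len(cleaned), "row(s) removed")
```

The choice of cutoff is itself a data scientist decision: a stricter threshold removes more rows and changes what the model later learns from.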

All notebooks we studied transform data types as part of cleaning data. This step, sometimes referred to as feature engineering, is required when the dataset contains object data types, which are categorical variables typical in many datasets, such as marital status, level of education or gender. These data types have to be transformed into numerical variables in order to be analyzed. This is conducted by using pre-existing functions to encode these variables as integers (e.g. primary education as 1, secondary as 2, tertiary as 3) or using popular one-hot encoding where there is no natural ordinal relationship between categories and dummy variables are created (e.g. male is 0, female is 1). Cleaned data is an essential element of any algorithmic solution, as without the steps taken in this element, data either results in erroneous analysis and model training, or simply cannot be used to train models.
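The two encoding strategies can be sketched with pandas alone. The column names and category labels below are illustrative assumptions, and the notebooks may well use other pre-packaged encoders for the same purpose.

```python
import pandas as pd

# Toy categorical data; column names and category labels are illustrative.
df = pd.DataFrame(
    {
        "Education": ["primary", "tertiary", "secondary", "primary"],
        "Gender": ["M", "F", "F", "M"],
        "Marital_Status": ["Married", "Single", "Divorced", "Single"],
    }
)

# Ordinal categories: map each level to an integer that preserves the order.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}
df["Education"] = df["Education"].map(education_order)

# Binary category: a single 0/1 dummy variable is enough.
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})

# No natural order: one-hot encode into one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Marital_Status"], dtype=int)
print(encoded)
```

Integer codes imply an ordering, so they suit education levels; marital status has no such order, which is why it gets one column per category instead.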

Exploratory Data Analysis

The next step in the algorithmic solution process is exploratory data analysis, whereby actions are taken to learn about the relationship between the dependent variable of interest (here: customer churn or attrition) and independent variables that may help build the predictive model. This step is essential to uncover what model will be the most appropriate for the dataset and which variables can be potentially of interest.

The first step is to identify the dependent variable (a trivial matter in the given dataset), and to analyze independent variables. This is very often performed by visualizing them independently, in relation to each other, or in relation to the dependent variable. In most cases, such visualizations were implemented using functions from visualization libraries, such as “seaborn”, “matplotlib” or rarely “plotly”. Visualizing data is the part that takes up the most code in nearly all notebooks we analyzed. Various visualizations are produced, such as box plots, pie charts and histograms, in order to help identify which variables may be useful in building the model. Visualizations are often accompanied by comments such as “Females are slightly more likely to churn with 17% compared to males with 15%, we’ll convert this feature to 1-0” (Notebook 013). Some notebooks contain more comprehensive comments on the learnings from visualizations.

The next step in exploratory data analysis is to identify correlations between variables. This matters because decisions about which features to include in pre-processing the dataset for model building, as described below, are made on this basis. For example, Notebook 022, based on the identification of correlations, decides to “#Drop some features which have less than 0.01 correlation and greater than -0.01 correlation”. Exploratory data analysis is a required step of building an algorithmic solution as it provides the necessary insight into the dataset for the purposes of model building. It is at this stage that the importance of variables with respect to the target variable is assessed.
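A correlation-based filter of the kind quoted from Notebook 022 could be sketched as below. The data and the column names Churn, Transactions and Noise are made up; the 0.01 threshold is the one quoted in the notebook comment, while the rest of the code is an assumption about how such a rule might be written.

```python
import pandas as pd

# Toy data (made up): "Transactions" tracks churn, "Noise" does not.
df = pd.DataFrame(
    {
        "Churn": [0, 0, 1, 1],
        "Transactions": [1, 2, 3, 4],
        "Noise": [1, 2, 2, 1],
    }
)

# Pearson correlation of every feature with the target variable.
target_corr = df.corr()["Churn"].drop("Churn")
print(target_corr)

# Drop features whose correlation lies strictly between -0.01 and 0.01.
weak = target_corr[target_corr.abs() < 0.01].index
reduced = df.drop(columns=weak)
```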

Pre-processing the Dataset

The following step in the process is to pre-process the dataset, which involves preparing the dataset according to the requirements of model building. First, data need to be scaled, which may involve actual scaling, that is changing the range of variables to a common range, e.g. between 0 and 1, or normalizing the variables following a normal distribution. Scaling is performed to ensure that no variable is interpreted as more predictive than it actually is just because its numerical values are on a different scale from other variables. Scaling is routinely performed using standard pre-packaged functions, such as “StandardScaler” from the popular “sklearn” library (Notebook 026).
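Both kinds of scaling can be written out by hand to show what the pre-packaged functions compute. The sketch below uses plain pandas on made-up values with hypothetical column names. The standardization uses the population standard deviation, which is the convention that “StandardScaler” follows by default.

```python
import pandas as pd

# Toy numeric features on very different scales (values are made up).
df = pd.DataFrame(
    {
        "Customer_Age": [30.0, 40.0, 50.0, 60.0],
        "Credit_Limit": [2000.0, 4000.0, 6000.0, 12000.0],
    }
)

# Min-max scaling: map each column onto the common range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean and unit variance per column, using the
# population standard deviation (ddof=0).
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

After either transformation, age and credit limit contribute on comparable numerical scales, which is the purpose of this step.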

The dataset should be resampled if it is not balanced, that is if one category is present much more frequently than another. In the case of the dataset investigated, customers who attrited occurred much less frequently, as identified in exploratory data analysis, so resampling was required. This is usually done by oversampling from the group of attrited customers, most frequently using a pre-packaged function ‘SMOTE’ (Synthetic Minority Oversampling Technique) which creates additional data poi
