




Conceptual construction on incomplete survey data

Shouhong Wang a,*, Hai Wang b

a Department of Marketing/Business Information Systems, Charlton College of Business, University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747-2300, USA
b Department of Computer Science, University of Toronto, Toronto, ON, Canada M5S 3G4

Received 22 March 2003; received in revised form 9 September 2003; accepted 20 October 2003
Available online 26 November 2003

Data & Knowledge Engineering 49 (2004) 311-323
doi:10.1016/j.datak.2003.10.007
0169-023X/$ - see front matter
* Corresponding author. E-mail addresses: (S. Wang), (H. Wang).

Abstract

The raw survey data for data mining are often incomplete. The issues of missing data in knowledge discovery are often ignored in data mining. This article presents the conceptual foundations of data mining with incomplete survey data, and proposes query processing for knowledge discovery and a set of query functions for the conceptual construction in survey data mining. Through a case, this paper demonstrates that conceptual construction on incomplete data can be accomplished by using artificial intelligence tools such as self-organizing maps.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Incomplete survey data; Survey data mining; Conceptual construction; Self-organizing maps; Cluster analysis; Knowledge discovery; Query processing

1. Introduction

Data mining is the process of trawling through data in the hope of identifying interpretable patterns. D[...] unsuspected relationships which are of interest or value to the databases owners, or data miners [9]. Due to the large number of dimensionality and the huge volume of data, traditional statistical methods have their limitations in data mining. To meet the challenge of data mining, artificial intelligence based human-computer interactive techniques have been widely used in data mining [3,16].

There have been many non-statistical techniques for data mining. The self-organizing maps (SOM) method based on Kohonen neural network [12] is one of the promising techniques. SOM-based cluster techniques have advantages over other methods for data mining. Data mining typically deals with very high-dimensional data. That is, an observation in the database for data mining is typically described by a large number of variables. The curse of dimensionality turns statistical correlations of data insignificant, and thus makes statistical methods powerless. The SOM method, however, does not rely on any assumptions of statistical tests, and is considered as an effective method in dealing with high-dimensional data [6,12]. More importantly, the SOM method provides a base for the visibility of clusters of high-dimensional data. This feature is not available in any other data analysis methods. It allows the data miner to analyze clusters based on the problem domain.

Survey is one of the common data acquisition methods for data mining [4]. In data mining, one can rarely find a survey data set that contains complete entries of each observation for all of the variables. Commonly, surveys and questionnaires are only partially completed by respondents. The extent of damage of missing data is unknown when it is virtually impossible to return the survey or questionnaires to the data source for completion, but is one of the most important parts of knowledge for data mining to discover. In fact, missing data is an important debatable issue in the knowledge engineering field [15].

In mining a survey database with incomplete data through cluster analysis, patterns of the missing data as well as the potential impacts of these missing data on the mining results are knowledge. For instance, a data miner often wishes to know how reliable a cluster analysis is; when and why certain types of values are often missing; what variables are correlated in terms of having missing values at the same time. These valuable pieces of knowledge can be discovered only after the missing part of the data set is fully explored.

This paper discusses the issue of missing data in mining survey databases for knowledge discovery, presents the conceptual foundations of conceptual construction, and proposes a set of query functions for conceptual construction in SOM-based data mining. The rest of the paper is organized as follows. Section 2 discusses the issues of missing data related to data mining. Section 3 introduces SOM for conceptual construction on incomplete data. Section 4 suggests four concepts as knowledge discovery in data mining with incomplete data. It provides a scheme of conceptual construction on incomplete data using SOM. Section 5 proposes a query tool that is used to manipulate SOM for conceptual construction. Section 6 presents a case study that applies the query tool to manipulate the SOM for the conceptual construction on a student opinion survey data set. Finally, Section 7 offers concluding remarks.

2. Issues of missing data

Incomplete data sets are ubiquitous in data mining. There have been many treatments of missing data. One of the convenient solutions to incomplete data is to eliminate from the data set those records that are missing values. This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the conclusions drawn from the screened data set are more likely biased or misleading.

Another simple approach of dealing with missing data is to use a generic "unknown" for all missing data items. In data mining, an unspecified "unknown" for all missing data items often causes confusion and misinterpretation.

The third solution to dealing with missing data is to estimate the missing value in the data field. In the case of time series data, interpolation based on two adjacent data points that are observed is possible. In general cases, one may use some expected value in the data field based on statistical measures [7]. However, in data mining, survey data are commonly of the types of ranking, category, multiple choices, and binary. Interpolation and use of an expected value for a particular missing data variable in these cases are generally inadequate. More importantly, research [2] indicates that a meaningful treatment of missing data shall always be independent of the problem being investigated.

More recently, there have been mathematical methods for finding the aggregate conceptual directions of a data set with missing data (e.g., [1,10]). These methods make themselves distinct from the traditional approaches of treating missing data by focusing on the collective effects of the missing data instead of individual missing values. This superior feature of these methods can be best built up for data mining on incomplete data. However, these statistical methods have limitations. First, it is assumed that missing values occur in a random fashion or follow certain distribution functions. Their strong assumptions about the distributions of data are often invalid, especially for cases of survey with incomplete data. Second, these mathematical models are data-driven, instead of problem-domain-driven. In fact, a single generic conceptual construction algorithm is insufficient to handle a variety of goals of data mining since a goal of data mining is often related to its specific problem domain.

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns of data [8]. Following this definition, this research emphasizes two aspects of concept construction in data mining with incomplete data. First, the criteria of validity, novelty, and usefulness of the concepts to be constructed in data mining with incomplete data could be problem-dependent. That is, the interest of a data pattern depends on the data miner and does not solely depend on the estimated statistical strength of the pattern [14]. Second, the conceptual construction based on the incomplete data is accomplished through heuristic search in combinatorial spaces built on computer and human cognitive theories [13]. Human-computer collaboration concept construction is the interactive process between the data miner and computer to extract novel, plausible, useful, relevant, and interesting knowledge associated with the missing data.

In our view, data mining differs from traditional statistics in dealing with missing data in many ways.
(1) Data mining attempts to extract unsuspected and potentially useful patterns from the data for the data miners with novel goals related to the missing data, rather than to estimate the individual values of the missing data.
(2) Data mining is a human centered process implemented through knowledge discovery loops coupled with human-computer interaction to perceive the impact of the missing data at an aggregate level, rather than a one-way mathematical derivation based on unverified assumptions.

3. Tool for conceptual construction: self-organizing maps (SOM)

Given a large set of high-dimensional survey samples, a significant number of observations usually have missing values; however, not all missing data are relevant to the data miner's interest. Hence, any simple brute-force search method for missing data is not only infeasible for a huge amount of data, but also helpless when the data miner is to identify problems, or develop concepts, through data mining. To identify problems or develop concepts, the data miner needs a tool to observe unsuspected patterns of the available data and the missing parts.

Self-organizing maps (SOM) [12] have been widely used for clustering, since SOM are more computationally efficient than the popular k-means clustering algorithm. More importantly, SOM provide data visualization for the data miner to view high-dimensional data [11]. Research [14,16] indicates that SOM are effective in data mining for the identification of unsuspected patterns of the data. Specifically, SOM can be used for cluster analysis on multivariate survey data. This study takes one step further and uses SOM as a tool for concept construction related to missing data. Conceptual construction on incomplete data is to investigate the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data. As seen later in our illustrative examples, SOM provide a mechanism for human-computer collaboration to construct concepts from the data with missing values. SOM can learn certain useful features fo
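
The idea in Section 3 of clustering the complete records with a SOM and then examining where the incomplete records fall can be illustrated with a small sketch. This is not the authors' implementation. The function names `train_som` and `best_matching_unit`, the grid size, the linearly decaying learning rate and neighbourhood width, and the rule of comparing an incomplete record to map units on its observed variables only are all assumptions made for illustration. Survey answers are assumed to be rescaled to the range [0, 1], with missing answers stored as NaN.

```python
import numpy as np


def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM on complete records (no NaN values).

    Returns an array of shape (rows, cols, n_variables) holding unit weights.
    """
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of every unit, used by the neighbourhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(n):
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to 0
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood shrinks
            # Best matching unit: the unit whose weights are closest to x.
            dist = np.sum((weights - x) ** 2, axis=-1)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            # Gaussian neighbourhood around the BMU on the map grid.
            grid_dist = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_dist / (2.0 * sigma ** 2))
            # Pull each unit toward x, scaled by its neighbourhood value.
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights


def best_matching_unit(weights, x):
    """Locate a record on the map using only its observed (non-NaN) variables."""
    observed = ~np.isnan(x)
    diff = weights[..., observed] - x[observed]
    dist = np.sum(diff ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical survey: 60 respondents, 5 questions rescaled to [0, 1].
    survey = rng.random((60, 5))
    # Knock out some answers in the last 10 records to simulate missing data.
    survey[50:, 3] = np.nan
    complete = survey[~np.isnan(survey).any(axis=1)]
    incomplete = survey[np.isnan(survey).any(axis=1)]

    som = train_som(complete)
    for record in incomplete[:3]:
        print("incomplete record maps to unit", best_matching_unit(som, record))
```

Training only on the complete records mirrors the paper's statement that impacts of missing data are studied "based only on the complete data". The placement of incomplete records on the trained map is one possible way for a data miner to inspect them visually, not a result from the paper.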