Conceptual construction on incomplete survey data

Shouhong Wang a,*, Hai Wang b

a Department of Marketing/Business Information Systems, Charlton College of Business, University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747-2300, USA
b Department of Computer Science, University of Toronto, Toronto, ON, Canada M5S 3G4

Received 22 March 2003; received in revised form 9 September 2003; accepted 20 October 2003
Available online 26 November 2003

Data & Knowledge Engineering 49 (2004) 311-323
doi:10.1016/j.datak.2003.10.007
* Corresponding author.
0169-023X/$ - see front matter © 2003 Elsevier B.V. All rights reserved.

Abstract

The raw survey data for data mining are often incomplete. The issues of missing data in knowledge discovery are often ignored in data mining. This article presents the conceptual foundations of data mining with incomplete survey data, and proposes query processing for knowledge discovery and a set of query functions for the conceptual construction in survey data mining. Through a case, this paper demonstrates that conceptual construction on incomplete data can be accomplished by using artificial intelligence tools such as self-organizing maps.

Keywords: Incomplete survey data; Survey data mining; Conceptual construction; Self-organizing maps; Cluster analysis; Knowledge discovery; Query processing

1. Introduction

Data mining is the process of trawling through data in the hope of identifying interpretable patterns. D[...] unsuspected relationships that are of interest or value to the database owners or data miners [9]. Because of the high dimensionality and the huge volume of data, traditional statistical methods have their limitations in data mining. To meet the challenge of data mining, artificial-intelligence-based human-computer interactive techniques have been widely used in data mining [3,16].

There have been many non-statistical techniques for data mining. The self-organizing maps (SOM) method, based on the Kohonen neural network [12], is one of the promising techniques. SOM-based cluster techniques have advantages over other methods for data mining. Data mining typically deals with very high-dimensional data. That is, an observation in the database for data mining is typically described by a large number of variables. The curse of dimensionality renders statistical correlations in the data insignificant, and thus makes statistical methods powerless. The SOM method, however, does not rely on any assumptions of statistical tests, and is considered an effective method for dealing with high-dimensional data [6,12]. More importantly, the SOM method provides a basis for making clusters of high-dimensional data visible. This feature is not available in any other data analysis method. It allows the data miner to analyze clusters based on the problem domain.

Surveys are one of the common data acquisition methods for data mining [4]. In data mining, one can rarely find a survey data set that contains complete entries of each observation for all of the variables. Surveys and questionnaires are often only partially completed by respondents. The extent of the damage caused by missing data is unknown when it is virtually impossible to return the survey or questionnaires to the data source for completion, yet it is one of the most important pieces of knowledge for data mining to discover. In fact, missing data is an important and debatable issue in the knowledge engineering field [15].

In mining a survey database with incomplete data through cluster analysis, the patterns of the missing data, as well as the potential impacts of these missing data on the mining results, are themselves knowledge. For instance, a data miner often wishes to know how reliable a cluster analysis is; when and why certain types of values are often missing; and which variables are correlated in terms of having missing values at the same time. These valuable pieces of knowledge can be discovered only after the missing part of the data set is fully explored.

This paper discusses the issue of missing data in mining survey databases for knowledge discovery, presents the conceptual foundations of conceptual construction, and proposes a set of query functions for conceptual construction in SOM-based data mining. The rest of the paper is organized as follows. Section 2 discusses the issues of missing data related to data mining. Section 3 introduces SOM for conceptual construction on incomplete data. Section 4 suggests four concepts as knowledge discovery in data mining with incomplete data, and provides a scheme of conceptual construction on incomplete data using SOM. Section 5 proposes a query tool that is used to manipulate SOM for conceptual construction. Section 6 presents a case study that applies the query tool to manipulate the SOM for conceptual construction on a student opinion survey data set. Finally, Section 7 offers concluding remarks.

2. Issues of missing data

Incomplete data sets are ubiquitous in data mining. There have been many treatments of missing data. One convenient solution to incomplete data is to eliminate from the data set those records that have missing values. This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the conclusions drawn from the screened data set are more likely to be biased or misleading.

Another simple approach to dealing with missing data is to use a generic "unknown" for all missing data items. In data mining, an unspecified "unknown" for all missing data items often causes confusion and misinterpretation.

The third solution to dealing with missing data is to estimate the missing value in the data field. In the case of time series data, interpolation based on two adjacent observed data points is possible. In general cases, one may use some expected value in the data field based on statistical measures [7]. However, in data mining, survey data are commonly of the ranking, category, multiple-choice, and binary types. Interpolation and the use of an expected value for a particular missing data variable are generally inadequate in these cases. More importantly, research [2] indicates that a meaningful treatment of missing data shall always be independent of the problem being investigated.

More recently, there have been mathematical methods for finding the aggregate conceptual directions of a data set with missing data (e.g., [1,10]). These methods differ from the traditional approaches to treating missing data by focusing on the collective effects of the missing data instead of individual missing values. This feature makes them a good starting point for data mining on incomplete data. However, these statistical methods have limitations. First, it is assumed that missing values occur in a random fashion or follow a certain distribution function. Their strong assumptions about the distributions of data are often invalid, especially for surveys with incomplete data. Second, these mathematical models are data-driven instead of problem-domain-driven. In fact, a single generic conceptual construction algorithm is insufficient to handle the variety of goals of data mining, since a goal of data mining is often related to its specific problem domain.

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns of data [8]. Following this definition, this research emphasizes two aspects of concept construction in data mining with incomplete data. First, the criteria of validity, novelty, and usefulness of the concepts to be constructed in data mining with incomplete data could be problem-dependent. That is, the interest of a data pattern depends on the data miner and does not solely depend on the estimated statistical strength of the pattern [14]. Second, the conceptual construction based on the incomplete data is accomplished through heuristic search in combinatorial spaces built on computer and human cognitive theories [13]. Human-computer collaborative concept construction is the interactive process between the data miner and the computer to extract novel, plausible, useful, relevant, and interesting knowledge associated with the missing data.

In our view, data mining differs from traditional statistics in dealing with missing data in many ways.
(1) Data mining attempts to extract unsuspected and potentially useful patterns from the data for data miners with novel goals related to the missing data, rather than to estimate the individual values of the missing data.
(2) Data mining is a human-centered process implemented through knowledge discovery loops coupled with human-computer interaction to perceive the impact of the missing data at an aggregate level, rather than a one-way mathematical derivation based on unverified assumptions.

3. Tool for conceptual construction: self-organizing maps (SOM)

Given a large set of high-dimensional survey samples, there is usually a significant number of observations that have missing values; however, not all missing data are relevant to the data miner's interest. Hence, any simple brute-force search method for missing data is not only infeasible for a huge amount of data, but also of no help when the data miner is to identify problems, or develop concepts, through data mining. To identify problems or develop concepts, the data miner needs a tool to observe unsuspected patterns of the available data and the missing parts.

Self-organizing maps (SOM) [12] have been widely used for clustering, since SOM are more computationally efficient than the popular k-means clustering algorithm. More importantly, SOM provide data visualization for the data miner to view high-dimensional data [11]. Research [14,16] indicates that SOM are effective in data mining for the identification of unsuspected patterns in the data. Specifically, SOM can be used for cluster analysis on multivariate survey data. This study goes one step further and uses SOM as a tool for concept construction related to missing data. Conceptual construction on incomplete data is to investigate the patterns of the missing data, as well as the potential impacts of these missing data on the mining results, based only on the complete data. As seen later in our illustrative examples, SOM provide a mechanism for human-computer collaboration to construct concepts from the data with missing values. SOM can learn certain useful features fo[...]
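As a rough illustration of clustering survey records with a SOM trained only on the complete data, the following is a minimal sketch in plain Python. It is not the authors' tool. The one-dimensional map, the deterministic initialization, the linear decay schedules, the toy ratings, and the names masked_dist2, best_unit, and train_som are all assumptions made for this example. In particular, placing an incomplete record on the map by comparing only its answered questions is one possible convention, not something the text above prescribes.

```python
import math

def masked_dist2(x, w):
    """Squared Euclidean distance over the answered (non-None) questions only."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not None)

def best_unit(x, weights):
    """Index of the map unit whose weight vector is closest to record x."""
    return min(range(len(weights)), key=lambda j: masked_dist2(x, weights[j]))

def train_som(complete, n_units=3, epochs=30, lr0=0.5, sigma0=1.0):
    """Train a 1-D SOM on complete records (assumed, simplified schedule)."""
    last = len(complete) - 1
    # Deterministic start: seed units with records spread through the data set.
    weights = [list(complete[j * last // (n_units - 1)]) for j in range(n_units)]
    for t in range(epochs):
        frac = 1.0 - t / epochs            # decays linearly from 1 towards 0
        lr = lr0 * frac                    # learning rate
        sigma = max(sigma0 * frac, 1e-3)   # neighborhood width on the map
        for x in complete:
            b = best_unit(x, weights)
            for j, w in enumerate(weights):
                # Units near the winner on the map move more than distant ones.
                h = math.exp(-((j - b) ** 2) / (2 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical 1-5 ratings on three questions; None marks a skipped question.
survey = [
    [1, 2, 1], [2, 1, 1], [1, 1, 2],
    [5, 4, 5], [4, 5, 5], [5, 5, 4],
    [1, None, 2], [2, 1, None], [None, 5, 4], [5, None, 5],
]
complete = [r for r in survey if None not in r]
incomplete = [r for r in survey if None in r]
weights = train_som(complete)

# Count, per map unit, how many incomplete records land there.
hits = [0] * len(weights)
for r in incomplete:
    hits[best_unit(r, weights)] += 1
print(hits)
```

The per-unit counts give the data miner a first view of where incomplete records concentrate relative to the clusters formed by the complete data, which is the kind of aggregate pattern the section describes.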