
An Introduction to Data Mining

Discovering hidden value in your data warehouse

Overview

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts
may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This white paper provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by second quarter of 1996.[1] In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
| --- | --- | --- | --- | --- |
| Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.

The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

• Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can
automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:

• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years."[2] Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."[3]

The most commonly used techniques in data mining are:

• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the
k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms. The appendix to this white paper provides a glossary of data mining terms.

How Data Mining Works

How exactly is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is to research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda and that there are certain characteristics to the ocean currents, and certain routes that have likely been taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics that are common to the locations of these sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it most likely might be, given a similar situation in the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded up with lots of information about a variety of situations where an answer is known, and then the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model. Once the model is built it can then be used in similar situations where you don't know the answer. For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by
building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

| | Customers | Prospects |
| --- | --- | --- |
| General information (e.g. demographic data) | Known | Known |
| Proprietary information (e.g. customer transactions) | Known | Target |

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant based on the model that we build going from Customer General Information to Customer Proprietary Information. For instance, a simple model for a telecommunications company might be:

98% of my customers who make more than $60,000/year spend more than $80/month on long distance

This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand new customers can be selectively targeted.

Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test
market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predict what is going to happen in the future.

| | Yesterday | Today | Tomorrow |
| --- | --- | --- | --- |
| Static information and current plans (e.g. demographic data, marketing plans) | Known | Known | Known |
| Dynamic information (e.g. customer transactions) | Known | Known | Target |

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already knew the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic
data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large da…
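
The prospecting model and the "vault" check described under How Data Mining Works can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`income`, `ld_spend`), the synthetic customer records, the 25% vault size, and the helper names `rule_confidence`, `split_vault` and `target_prospects` are all hypothetical and do not come from the white paper. Only the two thresholds ($60,000/year and $80/month) are taken from the example rule in the text.

```python
import random

INCOME_FLOOR = 60_000  # dollars per year, from the example rule in the text
SPEND_FLOOR = 80       # dollars per month on long distance, same rule


def rule_confidence(records):
    """Fraction of records with income above INCOME_FLOOR whose monthly
    long distance spend exceeds SPEND_FLOOR (None if no record qualifies)."""
    qualifying = [r for r in records if r["income"] > INCOME_FLOOR]
    if not qualifying:
        return None
    hits = sum(1 for r in qualifying if r["ld_spend"] > SPEND_FLOOR)
    return hits / len(qualifying)


def split_vault(records, vault_fraction=0.25, seed=0):
    """Shuffle a copy of the records and set a fraction aside in the vault.
    Returns (mining_records, vault_records)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * vault_fraction)
    return shuffled[cut:], shuffled[:cut]


def target_prospects(prospects):
    """Apply the rule to prospects, for whom only general information
    (here: income) is known."""
    return [p for p in prospects if p["income"] > INCOME_FLOOR]


# Synthetic customers (assumption): spend rises with income.
customers = [
    {"income": 30_000 + 5_000 * i, "ld_spend": 20 + 10 * i} for i in range(20)
]
customers[10]["ld_spend"] = 40  # one high earner who breaks the rule

mining, vault = split_vault(customers)
print("confidence on mined data:", rule_confidence(mining))
print("confidence on vault data:", rule_confidence(vault))

prospects = [{"name": "A", "income": 45_000}, {"name": "B", "income": 90_000}]
print("targets:", [p["name"] for p in target_prospects(prospects)])
```

If the confidence measured on the vault is close to the confidence measured on the mined records, the rule is holding up on data it was not derived from, which is the check the text describes. A real model would be learned from the data instead of being written by hand as it is here.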

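The nearest neighbor method listed among the common techniques (classify a record by combining the classes of the k most similar historical records, with k ≥ 1) can be illustrated as follows. This is a minimal sketch, not the method of any particular product: the Euclidean distance, the majority vote, the feature choice (age, income in thousands of dollars), the labels and the function name `knn_classify` are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(record, history, k=3):
    """Classify `record` by majority vote among the k most similar
    (features, label) pairs in `history`, using Euclidean distance."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort historical records by distance to the new record, keep the k closest.
    nearest = sorted(history, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (age, income in $1000s) -> usage class.
history = [
    ((25, 28), "light"),
    ((31, 35), "light"),
    ((28, 30), "light"),
    ((45, 82), "heavy"),
    ((52, 95), "heavy"),
    ((39, 75), "heavy"),
]

print(knn_classify((41, 80), history, k=3))
print(knn_classify((27, 31), history, k=3))
```

In practice the features would usually be rescaled first so that no single column dominates the distance; that step is omitted here to keep the sketch short.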