




已阅读5页,还剩98页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
2008年6月14日,weka,朱廷劭(zhu, tingshao)ph.d,what is weka,weka is a machine learning tool: tools for pre-processing data many popular machine learning algorithms data mining algorithms visualization tools experiment management tools an open source package available under the gnu public license written in java often used with data mining by witten and frank,weka versions,weka has evolved over time 3.0: the version compatible with the data mining book 3.2: a graphical user interface was added 3.4: current version added functionality classes reorganized slightly old documentation doesnt match exactly,weka is designed for traditional machine learning problems a moderate number of attributes (features) per element e.g., fewer than 100 a moderate number of values per attribute e.g., fewer than 10 a typical text classification tasks weka can run on problems with large feature sets but it would be very slow,weka features,weka features,weka has many algorithms useful for text mining clustering: k-means, em, classification: decision trees, nave bayes, svm, using weka for text mining feature selection (outside of weka) to reduce the size of the problem weka for the clustering and machine learning parts,adding/removing attributes attribute value substitution time series filters (delta, shift) sampling, randomization missing value management normalization and other numeric transformations,filters,other features,probability estimators method evaluators margin curve (difference between the predicted class probability and max probability of other classes) threshold curve (varying class proportions) statistics for binary class domains precision: correct p/total predicted p f-measure: 2*tp*precision/(tp+precision),weka input:,weka input must be in arff format: a relation name e.g., relation carmanufactuer a list of attribute definitions e.g., attribute age real e.g., attribute sex female, male e.g., attribute degree none, bs, ms, phd the last attribute is the class to be predicted a list of data elements data 24, female, ms 22, male, bs,input:,statistics about the dataset are shown:,select the panel corresponding to the data mining task to be performed and click on “choose“ to select the algorithm to be used.,weka explorer,start training/testing,start the data mining task by clicking the “start“ button in the “remote“ panel,interpretation of the results,whenever the results have been received, they are visualized in the “classifier output“ panel.,main features,49 data preprocessing tools 76 classification/regression algorithms 8 clustering algorithms 3 algorithms for finding association rules 15 attribute/subset evaluators + 10 search algorithms for feature selection,main gui,three graphical user interfaces “the explorer” (exploratory data analysis) “the experimenter” (experimental environment) “the knowledgeflow” (new process model inspired interface),content,what is weka? the explorer: preprocess data classification clustering association rules attribute selection data visualization references and resources,explorer: pre-processing the data,data can be imported from a file in various formats: arff, csv, c4.5, binary data can also be read from a url or from an sql database (using jdbc) pre-processing tools in weka are called “filters” weka contains filters for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, ,relation heart-disease-simplified attribute age numeric attribute sex female, male attribute chest_pain_type typ_angina, asympt, non_anginal, atyp_angina attribute cholesterol numeric attribute exercise_induced_angina no, yes attribute class present, not_present data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present .,weka only deals with “flat” files,flat file in arff format,relation heart-disease-simplified attribute age numeric attribute sex female, male attribute chest_pain_type typ_angina, asympt, non_anginal, atyp_angina attribute cholesterol numeric attribute exercise_induced_angina no, yes attribute class present, not_present data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present .,weka only deals with “flat” files,numeric attribute,nominal attribute,10/30/2019,university of waikato,20,10/30/2019,university of waikato,21,10/30/2019,university of waikato,22,10/30/2019,university of waikato,23,10/30/2019,university of waikato,24,10/30/2019,university of waikato,25,10/30/2019,university of waikato,26,10/30/2019,university of waikato,27,10/30/2019,university of waikato,28,10/30/2019,university of waikato,29,10/30/2019,university of waikato,30,10/30/2019,university of waikato,31,10/30/2019,university of waikato,32,10/30/2019,university of waikato,33,10/30/2019,university of waikato,34,10/30/2019,university of waikato,35,10/30/2019,university of waikato,36,10/30/2019,university of waikato,37,10/30/2019,university of waikato,38,10/30/2019,university of waikato,39,10/30/2019,university of waikato,40,explorer: building “classifiers”,classifiers in weka are models for predicting nominal or numeric quantities implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, bayes nets, ,decision tree induction: training dataset,this follows an example of quinlans id3 (playing tennis),output: a decision tree for “buys_computer”,algorithm for decision tree induction,basic algorithm (a greedy algorithm) tree is constructed in a top-down recursive divide-and-conquer manner at start, all the training examples are at the root attributes are categorical (if continuous-valued, they are discretized in advance) examples are partitioned recursively based on selected attributes test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain),10/30/2019,university of waikato,45,10/30/2019,university of waikato,46,10/30/2019,university of waikato,47,10/30/2019,university of waikato,48,10/30/2019,university of waikato,49,10/30/2019,university of waikato,50,10/30/2019,university of waikato,51,10/30/2019,university of waikato,52,10/30/2019,university of waikato,53,10/30/2019,university of waikato,54,10/30/2019,university of waikato,55,10/30/2019,university of waikato,56,10/30/2019,university of waikato,57,10/30/2019,university of waikato,58,10/30/2019,university of waikato,59,10/30/2019,university of waikato,60,10/30/2019,university of waikato,61,10/30/2019,university of waikato,62,10/30/2019,university of waikato,63,10/30/2019,university of waikato,64,10/30/2019,university of waikato,65,10/30/2019,university of waikato,66,10/30/2019,university of waikato,67,10/30/2019,university of waikato,68,explorer: clustering data,weka contains “clusterers” for finding groups of similar instances in a dataset implemented schemes are: k-means, em, cobweb, x-means, farthestfirst clusters can be visualized and compared to “true” clusters (if given) evaluation based on loglikelihood if clustering scheme produces a probability distribution,the k-means clustering method,given k, the k-means algorithm is implemented in four steps: partition objects into k nonempty subsets compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) assign each object to the cluster with the nearest seed point go back to step 2, stop when no more new assignment,explorer: finding associations,weka contains an implementation of the apriori algorithm for learning association rules works only with discrete data can identify statistical dependencies between groups of attributes: milk, butter bread, eggs (with confidence 0.9 and support 2000) apriori can compute all rules that have a given minimum support and exceed a given confidence,basic concepts: frequent patterns,itemset: a set of one or more items k-itemset x = x1, , xk (absolute) support, or, support count of x: frequency or occurrence of an itemset x (relative) support, s, is the fraction of transactions that contains x (i.e., the probability that a transaction contains x) an itemset x is frequent if xs support is no less than a minsup threshold,basic concepts: association rules,find all the rules x y with minimum support and confidence support, s, probability that a transaction contains x y confidence, c, conditional probability that a transaction having x also contains y let minsup = 50%, minconf = 50% freq. pat.: beer:3, nuts:3, diaper:4, eggs:3, beer, diaper:3,customer buys diaper,customer buys both,customer buys beer,nuts, eggs, milk,40,nuts, coffee, diaper, eggs, milk,50,beer, diaper, eggs,30,beer, coffee, diaper,20,beer, nuts, diaper,10,items bought,tid,association rules: (many more!) beer diaper (60%, 100%) diaper beer (60%, 75%),10/30/2019,university of waikato,74,10/30/2019,university of waikato,75,10/30/2019,university of waikato,76,10/30/2019,university of waikato,77,10/30/2019,university of waikato,78,10/30/2019,university of waikato,79,10/30/2019,university of waikato,80,explorer: attribute selection,panel that can be used to investigate which (subsets of) attributes are the most predictive ones attribute selection methods contain two parts: a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking an evaluation method: correlation-based, wrapper, information gain, chi-squared, very flexible: weka allows (almost) arbitrary combinations of these two,10/30/2019,university of waikato,82,10/30/2019,university of waikato,83,10/30/2019,university of waikato,84,10/30/2019,university of waikato,85,10/30/2019,university of waikato,86,10/30/2019,university of waikato,87,10/30/2019,university of waikato,88,10/30/2019,university of waikato,89,explorer: data visualization,visualization very useful in practice: e.g. helps to determine difficulty of the learning problem weka can visualize single attributes (1-d) and pairs of attributes (2-d) to do: rotating 3-d visualizations (xgobi-style) color-coded class values “jitter” option to deal with nominal attributes (and to de
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 【正版授权】 IEC TS 62271-313:2025 EXV EN High-voltage switchgear and controlgear - Part 313: Direct current circuit-breakers
- 【正版授权】 IEC 63505:2025 EN Guidelines for measuring the threshold voltage (VT) of SiC MOSFETs
- 【正版授权】 IEC TS 62565-5-3:2025 EN Nanomanufacturing – Product specification – Part 5-3: Nanoenabled energy storage – Blank detail specification: silicon nanosized materials for the n
- 2025年英语专业八级考试试卷及答案
- 2025年艺术设计专业期末考试试卷及答案
- 2025年新媒体艺术专业考试题及答案
- 2025年市场分析与预测能力测试卷及答案
- 2025年成人高考学历考试试题及答案
- 2025年公共卫生应急管理课程考试试题及答案
- 2025年区域经济发展研究专业考试试卷及答案
- 跨境电商合伙投资协议书
- 2024年网格员考试题库及答案1套
- 国开(辽宁)2024年《中国传统文化概观》形考1-4答案
- 状元展厅方案策划
- 土壤农化分析实验智慧树知到期末考试答案章节答案2024年甘肃农业大学
- 鸢飞鱼跃:〈四书〉经典导读智慧树知到期末考试答案章节答案2024年四川大学
- 空压机日常维护保养点检记录表
- MOOC 统计学-南京审计大学 中国大学慕课答案
- 福建省厦门市集美区2023届小升初语文试卷(含解析)
- 毛泽东诗词鉴赏
- 电机与拖动(高职)全套教学课件
评论
0/150
提交评论