已阅读5页,还剩98页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
2008年6月14日,weka,朱廷劭(zhu, tingshao)ph.d,what is weka,weka is a machine learning tool: tools for pre-processing data many popular machine learning algorithms data mining algorithms visualization tools experiment management tools an open source package available under the gnu public license written in java often used with data mining by witten and frank,weka versions,weka has evolved over time 3.0: the version compatible with the data mining book 3.2: a graphical user interface was added 3.4: current version added functionality classes reorganized slightly old documentation doesnt match exactly,weka is designed for traditional machine learning problems a moderate number of attributes (features) per element e.g., fewer than 100 a moderate number of values per attribute e.g., fewer than 10 a typical text classification tasks weka can run on problems with large feature sets but it would be very slow,weka features,weka features,weka has many algorithms useful for text mining clustering: k-means, em, classification: decision trees, nave bayes, svm, using weka for text mining feature selection (outside of weka) to reduce the size of the problem weka for the clustering and machine learning parts,adding/removing attributes attribute value substitution time series filters (delta, shift) sampling, randomization missing value management normalization and other numeric transformations,filters,other features,probability estimators method evaluators margin curve (difference between the predicted class probability and max probability of other classes) threshold curve (varying class proportions) statistics for binary class domains precision: correct p/total predicted p f-measure: 2*tp*precision/(tp+precision),weka input:,weka input must be in arff format: a relation name e.g., relation carmanufactuer a list of attribute definitions e.g., attribute age real e.g., attribute sex female, male e.g., attribute degree none, bs, ms, phd the last attribute is the class to be predicted a list of data elements data 24, female, ms 22, male, bs,input:,statistics about the dataset are shown:,select the panel corresponding to the data mining task to be performed and click on “choose“ to select the algorithm to be used.,weka explorer,start training/testing,start the data mining task by clicking the “start“ button in the “remote“ panel,interpretation of the results,whenever the results have been received, they are visualized in the “classifier output“ panel.,main features,49 data preprocessing tools 76 classification/regression algorithms 8 clustering algorithms 3 algorithms for finding association rules 15 attribute/subset evaluators + 10 search algorithms for feature selection,main gui,three graphical user interfaces “the explorer” (exploratory data analysis) “the experimenter” (experimental environment) “the knowledgeflow” (new process model inspired interface),content,what is weka? the explorer: preprocess data classification clustering association rules attribute selection data visualization references and resources,explorer: pre-processing the data,data can be imported from a file in various formats: arff, csv, c4.5, binary data can also be read from a url or from an sql database (using jdbc) pre-processing tools in weka are called “filters” weka contains filters for: discretization, normalization, resampling, attribute selection, transforming and combining attributes, ,relation heart-disease-simplified attribute age numeric attribute sex female, male attribute chest_pain_type typ_angina, asympt, non_anginal, atyp_angina attribute cholesterol numeric attribute exercise_induced_angina no, yes attribute class present, not_present data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present .,weka only deals with “flat” files,flat file in arff format,relation heart-disease-simplified attribute age numeric attribute sex female, male attribute chest_pain_type typ_angina, asympt, non_anginal, atyp_angina attribute cholesterol numeric attribute exercise_induced_angina no, yes attribute class present, not_present data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present .,weka only deals with “flat” files,numeric attribute,nominal attribute,10/30/2019,university of waikato,20,10/30/2019,university of waikato,21,10/30/2019,university of waikato,22,10/30/2019,university of waikato,23,10/30/2019,university of waikato,24,10/30/2019,university of waikato,25,10/30/2019,university of waikato,26,10/30/2019,university of waikato,27,10/30/2019,university of waikato,28,10/30/2019,university of waikato,29,10/30/2019,university of waikato,30,10/30/2019,university of waikato,31,10/30/2019,university of waikato,32,10/30/2019,university of waikato,33,10/30/2019,university of waikato,34,10/30/2019,university of waikato,35,10/30/2019,university of waikato,36,10/30/2019,university of waikato,37,10/30/2019,university of waikato,38,10/30/2019,university of waikato,39,10/30/2019,university of waikato,40,explorer: building “classifiers”,classifiers in weka are models for predicting nominal or numeric quantities implemented learning schemes include: decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, bayes nets, ,decision tree induction: training dataset,this follows an example of quinlans id3 (playing tennis),output: a decision tree for “buys_computer”,algorithm for decision tree induction,basic algorithm (a greedy algorithm) tree is constructed in a top-down recursive divide-and-conquer manner at start, all the training examples are at the root attributes are categorical (if continuous-valued, they are discretized in advance) examples are partitioned recursively based on selected attributes test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain),10/30/2019,university of waikato,45,10/30/2019,university of waikato,46,10/30/2019,university of waikato,47,10/30/2019,university of waikato,48,10/30/2019,university of waikato,49,10/30/2019,university of waikato,50,10/30/2019,university of waikato,51,10/30/2019,university of waikato,52,10/30/2019,university of waikato,53,10/30/2019,university of waikato,54,10/30/2019,university of waikato,55,10/30/2019,university of waikato,56,10/30/2019,university of waikato,57,10/30/2019,university of waikato,58,10/30/2019,university of waikato,59,10/30/2019,university of waikato,60,10/30/2019,university of waikato,61,10/30/2019,university of waikato,62,10/30/2019,university of waikato,63,10/30/2019,university of waikato,64,10/30/2019,university of waikato,65,10/30/2019,university of waikato,66,10/30/2019,university of waikato,67,10/30/2019,university of waikato,68,explorer: clustering data,weka contains “clusterers” for finding groups of similar instances in a dataset implemented schemes are: k-means, em, cobweb, x-means, farthestfirst clusters can be visualized and compared to “true” clusters (if given) evaluation based on loglikelihood if clustering scheme produces a probability distribution,the k-means clustering method,given k, the k-means algorithm is implemented in four steps: partition objects into k nonempty subsets compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) assign each object to the cluster with the nearest seed point go back to step 2, stop when no more new assignment,explorer: finding associations,weka contains an implementation of the apriori algorithm for learning association rules works only with discrete data can identify statistical dependencies between groups of attributes: milk, butter bread, eggs (with confidence 0.9 and support 2000) apriori can compute all rules that have a given minimum support and exceed a given confidence,basic concepts: frequent patterns,itemset: a set of one or more items k-itemset x = x1, , xk (absolute) support, or, support count of x: frequency or occurrence of an itemset x (relative) support, s, is the fraction of transactions that contains x (i.e., the probability that a transaction contains x) an itemset x is frequent if xs support is no less than a minsup threshold,basic concepts: association rules,find all the rules x y with minimum support and confidence support, s, probability that a transaction contains x y confidence, c, conditional probability that a transaction having x also contains y let minsup = 50%, minconf = 50% freq. pat.: beer:3, nuts:3, diaper:4, eggs:3, beer, diaper:3,customer buys diaper,customer buys both,customer buys beer,nuts, eggs, milk,40,nuts, coffee, diaper, eggs, milk,50,beer, diaper, eggs,30,beer, coffee, diaper,20,beer, nuts, diaper,10,items bought,tid,association rules: (many more!) beer diaper (60%, 100%) diaper beer (60%, 75%),10/30/2019,university of waikato,74,10/30/2019,university of waikato,75,10/30/2019,university of waikato,76,10/30/2019,university of waikato,77,10/30/2019,university of waikato,78,10/30/2019,university of waikato,79,10/30/2019,university of waikato,80,explorer: attribute selection,panel that can be used to investigate which (subsets of) attributes are the most predictive ones attribute selection methods contain two parts: a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking an evaluation method: correlation-based, wrapper, information gain, chi-squared, very flexible: weka allows (almost) arbitrary combinations of these two,10/30/2019,university of waikato,82,10/30/2019,university of waikato,83,10/30/2019,university of waikato,84,10/30/2019,university of waikato,85,10/30/2019,university of waikato,86,10/30/2019,university of waikato,87,10/30/2019,university of waikato,88,10/30/2019,university of waikato,89,explorer: data visualization,visualization very useful in practice: e.g. helps to determine difficulty of the learning problem weka can visualize single attributes (1-d) and pairs of attributes (2-d) to do: rotating 3-d visualizations (xgobi-style) color-coded class values “jitter” option to deal with nominal attributes (and to de
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 小学语文期末考试综合复习要点
- 企业创新激励制度建设方案
- 篮球培训寒假活动方案
- 朱自清《背影》读后感写作指导
- 线上培训活动方案
- 教学中名词复数变化规律总结
- 线上放松活动方案
- 统编版二年级语文教材教案完整
- 班级学习小组组建及管理方法
- 幼儿园科学启蒙活动教学设计案例
- 二构钢筋包工合同范本
- 医疗健康体检服务投标书标准范本
- 建筑公司安全生产责任制度模板
- 滴灌设备相关知识培训课件
- 医院培训课件:《中医护理文书书写规范》
- 2025-2026学年冀教版(2024)小学信息技术三年级上册(全册)教学设计(附目录P168)
- 城市燃气设施提升改造工程节能评估报告
- 2025团校入团积极分子100题题库(含答案)
- 学堂在线 实绳结技术 章节测试答案
- 2024年国家公务员考试《行测》真题卷(行政执法)答案和解析
- 生猪屠宰兽医卫生检验人员理论考试题库及答案
评论
0/150
提交评论