加权朴素贝叶斯算法及其weka程序分析20151216A.docx

上传人：x*** IP属地：河南上传时间：2020-01-28 格式：DOCX 页数：16 大小：61.60KB 积分：20 举报 版权申诉

已阅读5页，还剩11页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

16局部加权朴素贝叶斯算法及其Weka程序分析局部加权朴素贝叶斯算法及其Weka程序分析本文是多年来经过无数次修正的版本，其中融入了许多学生的建议而且也是时时更新的特别地，本文仅供学生学习使用，并不适合于发表在任何公开媒体上，也不允许任何学生将之存放到互联网上另外，与一般学术论文不同，本文许多地方采用第1人称进行讲述张伟(北京交通大学计算机与信息技术学院，北京，100044)摘要：局部加权朴素贝叶斯是一种改进朴素贝叶斯算法独立性假设缺陷的算法通过实验证明加权朴素贝叶斯算法具有很好的效果，比朴素贝叶斯和K最近邻方法的效果都要好。关键字：局部加权，朴素贝叶斯在机器学习中直接使用贝叶斯定理是不现实的，因为训练集不足以获得全概率分布的准确估计。朴素贝叶斯分类算法是一种优秀的分类算法，但由于其必须满足属性独立性假设，使得该算法具有了一定的局限性。局部加权朴素贝叶斯算法为了从该算法的弱点独立性假设入手，对朴素贝叶斯算法进行改进，提出了一种基于K近邻法的局部加权朴素贝叶斯分类算法。实验表明该算法提高了分类的可靠性与准确率。1 局部加权朴学习局部加权学习（locally weighted learning，简称LWL），既可用于回归问题（如局部加权线性回归），又可用于分类问题（如局部加权朴素贝叶斯）。局部加权分类是一种比较新的方式，在一些实验中表现出更高的准确率。分类过程需要对训练实例根据它们离测试实例的距离进行加权。在传统的加权学习算法中通常使用欧几里德距离来度量实例间的距离。局部加权学习是方法是懒惰学习（lazy learning）和基于记忆学习（memory-based learning）的一种形式，它需要存储数据集，当需要对一个新实例进行处理，通过距离函数计算训练实例和测试实例的距离以确定和测试实例相关的训练实例的加权集合构，然后用该集合构造一个新的模型来处理新实例。1.1 局部加权朴素贝叶斯原则上，贝叶斯定理保证了对一个给定属性值向量的新实例的类标的最优预测。不幸的是，直接将贝叶斯定理用于机器学习是不现实的，因为不可避免训练数据不足以获得全概率分布的精确估计。为了使推理可行必须先满足一些独立性假设。朴素贝叶斯方法把独立性假设发挥到了极致，假定属性对于给定的类标值是统计上独立的。虽然这个假设在实际中并不成立，朴素贝叶斯在许多分类问题上表现的非常好。此外，朴素贝叶斯计算效率训练在实例个数和属性个数上都是线性的且易于执行。机器学习相关文章开始关注朴素贝叶斯学习算法归功于Clark和Niblett的有关CN2规则学习的文章。在这篇文章中他们在实验评估中使用了一个简单的贝叶斯分类器（朴素贝叶斯）作为对比，朴素贝叶斯分类器比其他更成熟的学习算法表现更好。虽然已经对朴素贝叶斯在一些违反属性独立假设的情况下具有良好表现进行了解释，但一个基本事实没有改变，那就是当独立性假设不成立时，概率估计精度和效果都会下降。很多用于提高朴素贝叶斯效果的方法被提出，其中许多方法在保持原算法的简单性和计算高效性的同时降低算法的“朴素性”。Zheng和Webb在这个领域的工作进行了很好的总结。最有效的方法包括：贝叶斯网络的限制子类、结合了属性选择的朴素贝叶斯或者将朴素贝叶斯模型结合到其他分类器（例如决策树）。事实证明局部加权的朴素贝叶斯算法具有很好的效果，比朴素贝叶斯和K最近邻方法的效果都要好。我们用来加权朴素贝叶斯的方法是从一项源于用来对非线性回归模型进行估计的技术中借鉴而来，线性回归模型适合基于加权函数的数据，这个加权函数用来处理要进行预测的实例。由于加权函数随着每个需要处理的实例改变，所以由此产生的估计是非线性的。本文我们研究了用于分类的局部加权学习，局部加权学习在机器学习中没有得到很多关注。Loader（1999）和Hastie(2001)从统计学角度研究了所谓的“局部可能性”方法，包括局部加权线性逻辑回归和局部加权密度估计。朴素贝叶斯是用密度估计进行分类的例子。和逻辑回归相比它具有优势：在属性个数上是线性的，这是这种方法在具有多属性的学习问题上具有更高的计算有效性。我们使用朴素贝叶斯的方式和在局部加权线性回归中使用线性回归的方式一样：一个局部朴素贝叶斯模型适合于用来预测类属性实例（我们称这个实例为测试实例）的领域中的数据集的子集。此领域中的训练实例是加权的，距离测试实例越远的例子具有的权重越小。然后一个分类器可以从朴素贝叶斯模型获得，朴素贝叶斯模型将测试实例的属性值作为输入。用来训练每个局部加权朴素贝叶斯模型的数据集的子集由最近邻算法决定。用户指定的参数k控制使用多少个实例。这通过使用具有紧支撑的加权函数和为k最近邻的距离设定宽度（或带宽）来实现。1.2 属性处理令di表示测试实例到第i个最近邻点xi的欧几里德距离。我们假设所有属性在计算距离前都被标准化为0到1之间的数值，名称型属性都进行二值化处理。令f为一个加权函数对所有的y 1有f（y）= 0。接下来我们设每个实例xi的权重wi为这意味着实例xk的权重为0，所有距离测试实例很远的实例的权重都为0，和测试实例相同的实例权重为1。所有具有以上性质的单调递减函数都可以作为加权函数。在我们的实验中使用的如下线性加权函数flinear，对于y 0 , 1，有即，我们令权重随距离线性递减。更高的k值导致模型对数据中的波动变化很小，而较小的k值使得模型和数据保持一致。过小的k值可能导致模型和数据中的噪声匹配。我们的实验表明这种方法对于k值的选择在不是特别小的情况下是不敏感的。有一点需要说明。为了避免零频率的出现，我们在实现朴素贝叶斯的过程中使用Laplace估计对名称型属性的条件概率进行估计，这与加权方案进行交互。我们的经验发现适当的划分权重可以使得用来生成朴素贝叶斯模型的实例的总权重逼近k。假定有r个训练实例xi满足di dk。那么重新调整的权重计算公式如下其中n是训练实例的总个数。朴素贝叶斯计算具有属性值a1，a2，a3，am测试实例的类属性cl的后验概率的公式如下：其中o是类属性的总数。等式右边的单个个体的概率根据加权后的数据进行估计，类属性cl的先验概率为其中ci是序号为i的训练实例的类属性值，当x = y时指标函数I（x = y）为1，其他为0。假设属性j是名称型的，aj（属性在测试实例中对应的值）的条件概率计算公式如下：其中nj属性j取值的个数，aij是属性j在实例i中的值。如果数据中包含数值属性，我们用Fayyad和Irani的基于MDL准则的离散化方案，将结果视作名称型属性，也可以做一个名称型假设，基于加权数据集对均值和方差进行估计。我们将给出两种方法的实证结果。2 加权朴素贝叶斯的实现分析我们将详细分析Weka-3-6-13平台下，LWL.java源代码之中的所有成员变量与方法。2.1 成员变量在Weka-3-6-13平台下的LWL.java类共定义了13个成员变量，现逐一介绍所有成员变量如下：（1）序列化serialVersionUID简单来说，Java的序列化机制是通过在运行时判断类的serialVersionUID来验证版本一致性的。在进行反序列化时，JVM会把传来的字节流中的serialVersionUID与本地相应实体（类）的serialVersionUID进行比较，如果相同就认为是一致的，可以进行反序列化，否则就会出现序列化版本不一致的异常。（2）训练实例集合m_Train用于分类的训练实例的集合。（3）邻接点数目m_kNN用来对核的带宽进行选择的邻接点数目。（4）加权核方法m_WeightKernel目前选择使用的加权核方法。（5）布尔型变量m_UseAllK用来决定是否对每个实例设置邻接点数目m_kNN。（6）抽象类对象m_NNSearch用来设置使用的最近邻搜索算法。（7）五个静态最终变量LINEAR、EPANECHNIKOV、TRICUBE、INVERSE、GAUSS、CONSTANT分别代表了五种不同可用的核权重方法。这五个静态最终变量代表了可使用核加权方法。（6）分类器m_ZeroR以防当不能从数据集中建立分类器时有ZeroR模型可用。ZeroR指建立和使用0-r分类器类。可预测均值（数值型类）或模型（名称型类）。2.2 辅助性的方法在Weka-3-6-13平台下的LWL.java类共定义了25个方法，首先简单介绍这25个方法，然后对其中的7个方法我们做重点介绍。现逐一介绍所有方法如下：（1）类的说明信息globalInfo()这个方法就是返回描述这个类的一串字符主要是对相应分类模型的说明。在浏览器（explorer）或Weka实验环境（experimenter）下，通过这串字符来显示这个类的功能。这个方法的实现机制依赖于Weka.core之下的TechnicalInformation类与Technical InformationHandler接口。这个方法主要用于：01: /*02: * Returns a string describing classifier.03: * return a description suitable for04: * displaying in the explorer/experimenter gui05: */（2）列出参考文献所有著录项getTechnicalInformation()这个方法用于01: /*02: * Returns an instance of a TechnicalInformation object, containing 03: * detailed information about the technical background of this class,04: * e.g., paper reference or book this class is based on.05: * 06: * return the technical information about this class07: */（3）构造器LWL()用来实例化一个决策树树桩分类器对象。（4）列出描述默认分类器的字符串defaultClassifierString()这个方法用于返回用于描述默认分类器的字符串。01: /*02: * String describing default classifier.03: * 04: * return the default classifier classname05: */（5）返回枚举对象enumerateMeasures()这个方法用于返回其他测试名字的枚举对象01: /*02: * Returns an enumeration of the additional measure names 03: * produced by the neighbour search algorithm.04: * return an enumeration of the measure names05: */（6）返回命名的测试的值getMeasure(String additionalMeasureName)01: /*02: * Returns the value of the named measure from the 03: * neighbour search algorithm.04: * param additionalMeasureName the name of the measure to query for 05: * its value06: * return the value of the named measure07: * throws IllegalArgumentException if the named measure is not 08: * supported09: */（7）返回枚举对象listOptions()这个方法用于返回一个描述可使用选项的枚举类。01: /*02: * Returns an enumeration describing the available options.03: *04: * return an enumeration of all the available options.05: */（8）解析一个选项列表setOptions(String options) throws Exception这个方法用于：01: /*02: * Parses a given list of options. 03: *04: 05: * Valid options are: 06: * 07: * -A08: * The nearest neighbour search algorithm to use (default: 09: * weka.core.neighboursearch.LinearNNSearch).10: * 11: * 12: * -K <number of neighbours>13: * Set the number of neighbours used to set the kernel bandwidth.14: * (default all)15: * 16: * -U <number of weighting method>17: * Set the weighting kernel shape to use. 0=Linear, 1=Epanechnikov,18: * 2=Tricube, 3=Inverse, 4=Gaussian.19: * (default 0 = Linear)20: * 21: * -D22: * If set, classifier is run in debug mode and23: * may output additional info to the console24: * 25: * -W26: * Full name of base classifier.27: * (default: weka.classifiers.trees.DecisionStump)28: * 29: * 30: * Options specific to classifier 31: * weka.classifiers.trees.DecisionStump:32: * 33: * 34: * -D35: * If set, classifier is run in debug mode and36: * may output additional info to the console37: * 38: 39: *40: * param options the list of options as an array of strings41: * throws Exception if an option is not supported42: */（9）返回字符串数组存放分类器的当前设置getOptions()这个方法用于获得分类器的当前设置。01: /*02: * Gets the current settings of the classifier.03: *04: * return an array of strings suitable for passing to setOptions05: */（10）返回这些性质的简要说明KNNTipText()这个方法用于：01: /*02: * Returns the tip text for this property.03: * return tip text for this property suitable for04: * displaying in the explorer/experimenter gui05: */（11）设置的邻接点的数目setKNN(int knn)这个方法用于设置用于核带宽的邻接点数目。01: /*02: * Sets the number of neighbours used for kernel bandwidth setting.03: * The bandwidth is taken as the distance to the kth neighbour.04: *05: * param knn the number of neighbours included inside the kernel06: * bandwidth, or 0 to specify using all neighbors.07: */（12）返回邻接点数目getKNN()这个方法返回用于核带宽设置的邻接点数目。（13）返回这些性质的简要说明weightingKernelTipText()01: /*02: * Returns the tip text for this property.03: * return tip text for this property suitable for04: * displaying in the explorer/experimenter gui05: */（14）setWeightingKernel(int kernel)这个方法用于设定将要使用的核加权方法，只能是六个方法（线性、抛物线、三立方体、逆、高斯、常数）之一。01: /*02: * Sets the kernel weighting method to use. Must be one of LINEAR, 03: * EPANECHNIKOV, TRICUBE, INVERSE, GAUSS or CONSTANT, other values04: * are ignored.05: *06: * param kernel the new kernel method to use. Must be one of LINEAR,07: * EPANECHNIKOV, TRICUBE, INVERSE, GAUSS or CONSTANT.08: */（15）获得将要使用的核加权方法的整型值getWeightingKernel()（16）返回最近邻搜索算法的说明nearestNeighbourSearchAlgorithmTipText()（17）返回正在使用的最近邻搜索算法getNearestNeighbourSearchAlgorithm()（18）设定用于寻找最近邻的最近邻搜索算setNearestNeighbourSearchAlgorithm(NearestNeighbourSearch nearestNeighbourSearchAlgorithm)（19）返回分类器的默认功能getCapabilities()（20）生成分类器buildClassifier(Instances instances)（21）将提供的实例加入训练集updateClassifier(Instance instance)将实例作为参数传给方法，然后这个方法将该实例加入训练集。01: /*02: * Adds the supplied instance to the training set.03: *04: * param instance the instance to add05: * throws Exception if instance could not be incorporated06: * successfully07: */（22）计算测试实例类的概率distributionForInstance(Instance instance)这个方法用于计算给定的测试实例的类成员概率。01: /*02: * Calculates the class membership probabilities for the given test instance.03: *04: * param instance the instance to be classified05: * return preedicted class probability distribution06: * throws Exception if distribution cant be computed successfully07: */（23）返回分类器的描述toString()这个方法将对字符串的描述作为字符串返回。01: /*02: * Returns a description of this classifier.03: *04: * return a description of this classifier as a string.05: */（24）返回修订的字符串getRevision()（25）主方法用于测试这个类。01: /*02: * Main method for testing this class.03: *04: * param argv the options05: */2.3 主要方法下面我们来详细介绍算法中的几个主要方法及：（1）public Enumeration listOptions()这个方法用于返回一个描述可使用选项的枚举类。01: public Enumeration listOptions() 02: /实例化Vector对象，Vector类封装了一个动态的、允许再分派的Object数组03: Vector newVector = new Vector(3);04: /*Option:类型安全的方法：addElement（对象）属于原型载体，用来向对象尾部添加05: *元素。对泛型类型向量的引用要进行参数化。06: */07: newVector.addElement(new Option(tThe nearest neighbour search +08: algorithm to use +(default: 09: weka.core.neighboursearch.LinearNNSearch).n,10: A, 0, -A);11: newVector.addElement(new Option(tSet the number of neighbours used 12: to set13: + the kernel bandwidth.n14: +t(default all),15: K, 1, -K );16: newVector.addElement(new Option(tSet the weighting kernel shape to 17: use.18: + 0=Linear, 1=Epanechnikov,n19: +t2=Tricube, 3=Inverse, 4=Gaussian.n20: +t(default 0 = Linear),21: U, 1,-U );22: /23: Enumeration enu = super.listOptions();24: while (enu.hasMoreElements() 25: newVector.addElement(enu.nextElement();26: 27: /返回向量的所有组成元素的枚举28: return newVector.elements();29: （2）public void setOptions(String options)这个方法主要用于分析一个给定的选项列表01: public void setOptions(String options) throws Exception 02:03: String knnString = Utils.getOption(K, options);04: if (knnString.length() != 0) 05: setKNN(Integer.parseInt(knnString);06: else 07: setKNN(-1);08: 09:10: String weightString = Utils.getOption(U, options);11: if (weightString.length() != 0) 12: setWeightingKernel(Integer.parseInt(weightString);13: else 14: setWeightingKernel(LINEAR);15: 16: 17: String nnSearchClass = Utils.getOption(A, options);18: if(nnSearchClass.length() != 0) 19: String nnSearchClassSpec = Utils.splitOptions(nnSearchClass);20: if(nnSearchClassSpec.length = 0) 21: throw new Exception(Invalid NearestNeighbourSearch algorithm +22: specification string.); 23: 24: String className = nnSearchClassSpec0;25: nnSearchClassSpec0 = ;26:27: setNearestNeighbourSearchAlgorithm( (NearestNeighbourSearch)28: Utils.forName( NearestNeighbourSearch.class, 29: className, 30: nnSearchClassSpec)31: );32: 33: else 34: this.setNearestNeighbourSearchAlgorithm(new LinearNNSearch();35:36: super.setOptions(options);37: （3）public String getOptions()这个方法用来返回分类器的当前设置数组。01: public String getOptions() 02:03: String superOptions = super.getOptions();04: String options = new String superOptions.length + 6;05: int current = 0;06: optionscurrent+ = -U; optionscurrent+ = + 07: getWeightingKernel();08: if ( (getKNN() = 0) & m_UseAllK) 09: optionscurrent+ = -K; optionscurrent+ = -1;10: 11: else 12: optionscurrent+ = -K; optionscurrent+ = + getKNN();13: 14: optionscurrent+ = -A;15: optionscurrent+ = m_NNSearch.getClass().getName()+ 16: +Utils.joinOptions(m_NNSearch.getOptions();17: System.arraycopy(superOptions, 0, options, current,18: superOptions.length);19: return options;20: （4）public String getOptions()这个方法返回分类器的默认的功能。01: public String getOptions() 02:03: String superOptions = super.getOptions();04: String options = new String superOptions.length + 6;05:06: int current = 0;07:08: optionscurrent+ = -U; optionscurrent+ = + 09: getWeightingKernel();10: if ( (getKNN() = 0) & m_UseAllK) 11: optionscurrent+ = -K; optionscurrent+ = -1;12: 13: else 14: optionscurrent+ = -K; optionscurrent+ = + getKNN();15: 16: optionscurrent+ = -A;17: optionscurrent+ = m_NNSearch.getClass().getName()+ 18: +Utils.joinOptions(m_NNSearch.getOptions();19: System.arraycopy(superOptions, 0, options, current,20: superOptions.length);21:22: return options;23: （5）public void buildClassifier(Instances instances)这个方法用来生成构造器01: public void buildClassifier(Instances instances) throws Exception 02:03: if (!(m_Classifier instanceof WeightedInstancesHandler) 04: throw new IllegalArgumentException(Classifier must be a 05: + WeightedInstancesHandler!);06: 07:08: / can classifier handle the data?09: getCapabilities().testWithFail(instances);10:11: / remove instances with missing class12: instances = new Instances(instances);13: instances.deleteWithMissingClass();14: 15: / only class? - build ZeroR model16: if (instances.numAttributes() = 1) 17: System.err.println(18: Cannot build model (only class attribute present in data!), 19: + using ZeroR model instead!);20: m_ZeroR = new weka.classifiers.rules.ZeroR();21: m_ZeroR.buildClassifier(instances);22: return;23: 24: else 25: m_ZeroR = null;26: 27: 28: m_Train = new Instances(instances, 0, instances.numInstances();29:30: m_NNSearch.setInstances(m_Train);31: （6）public void updateClassifier(Instance instance)这个方法主要用于将实例作为参数传给方法，然后这个方法将该实例加入训练集。01: public void updateClassifier(Instance instance) throws Exception 02:03: if (m_Train = null) 04: throw new Exception(No training instance structure set!);05: 06: else if (m_Train.equalHeaders(instance.dataset() = false) 07: throw new Exception(Incompatible instance types);08: 09: if (!instance.classIsMissing() 10: m_NNSearch.update(instance);11: m_Train.add(instance);12: 13: （7）public double distributionForInstance(Instance instance)这个方法用于计算给定的测试实例类成员的概率。01: public double distributionForInstance(Instance instance) throws 02: Exception 03: / default model?默认模型04: /*如果m_ZeroR不为空，则用其调用distributionForInstance（）方法求实例类05: *成员的概率，并作为返回值返回06: */07: if (m_ZeroR != null) 08: return m_ZeroR.distributionForInstance(instance);09: 10: 11: if (m_Train.numInstances() = 0) 12: throw new Exception(No training instances!);13: 14: 15: m_NNSearch.addInstanceInfo(instance);16: 17: int k = m_Train.numInstances();18: if( (!m_UseAllK & (m_kNN k) &19: !(m_WeightKernel=INVERSE |20: m_WeightKernel=GAUSS) ) 21: k = m_kNN;22: 23: 24: Instances neighbours = m_NNSearch.kNearestNeighbours(instance, k);25: double distances = m_NNSearch.getDistances();26:27: if (m_Debug) 28: System.out.println(Test Instance: +instance);29: System.out.println(For +k+ kept + neighbours.numInstances() + 30: out of +31: m_Train.numInstances() + instances.);32: 33: 34: /IF LinearNN has skipped so much that distances.length)36: k = distances.length;37:38: if (m_Debug) 39: System.out.println(Instance Distances);40: for (int

人人文库> 全部分类> 应用文书 > 技术指导

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

加权朴素贝叶斯算法及其weka程序分析20151216A.docx

文档简介

温馨提示

最新文档

评论

加权朴素贝叶斯算法及其weka程序分析20151216A.docx

文档简介

温馨提示

最新文档

评论

相关文档