版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、基于遗传算法的K均值聚类分析¨计算机科学2003Voi.3DN9,2王敞陈增强袁著祉(南开大学信息技术科学学院天津300071K-Means Clustering Based oll Genetic AlgorithmWANG Chang CHEN ZengQiang YUAN Zhu。Zhi(CollegeoI Information Science and TechnologyNankai UniversityTianjin300071Abstract This paper proposes a KMeans clustering method based on genetic
2、algorithm.We compare our methodwith the traditional KMeans method and clustering method based on simple genetic algorithmThe comparisonprovesthat our method achieves a better result than the other two-The drawback of this method is8comparably slower speed in clustering.Keywords Data mining.Clusterin
3、g,Genetic algorithm,KMeans clustering1前言聚类分析就是将数据对象分组成为多个类或簇.在同一个簇中的对象之问具有较高的相似度,而不同的簇中的对象差别较大.聚类分析目前应用广泛.已经成为数据挖掘主要的研究领域.通过聚类.人们能够识别密集的和稀疏的区域,从而发现数据的整体分布模式,还能找到数据间的有趣的相互关系.关于聚类分析目前已经有K均值,CURE等很多算法,而且在实践中得到了应用.在这里,我们针对应用最为广泛的K均值方法的缺点.提出了基于遗传算法的K均值聚类分析方法.实验表明.新方法在聚类问题中得到的结果全面要优于传统K均值聚类方法,也好于单纯的遗传算法聚类
4、.只是由于用到了遗传操作.聚类速度相对K均值方法要慢一些.2K均值方法的一般描述K均值方法是基于划分的聚类方法.它在目前的聚类分析中应用最为广泛.其基本思想为:对于给定的聚类数目K.首先随机创建一个初始划分.然后采用选代方法通过将聚类中心不断移动来尝试着改进划分.为了达到最优.这种K均值方法理论上应该穷举所有可能的划分.但实际上,这里采用了启发式方法.用每类的平均值来表示诙类.这大大降低了计算的复杂性.提高了运算速度,使处理大规模数据集成为可能。但同时,这也导致了该方法受初始值影响很大.通常只能以局部最优结束.而且这种方法对于孤立点是敏感的.5基于单纯遗传算法的聚类遗传算法是基于自然选择和遗传
5、规律的搜索方法.该方法是随机选择与适者生存理论的结合.群体中的强者拥有更大的机会将其基因传给后代。我们可以简单地将一个实际问题的不同解编码成位串也就是所谓个体(Xi(t.并评价它们的适应度(f(t,然后基于个体的适应度,按一定比例(PsIXi(t选择个体.进行遗传操作.遗传操作包括交叉和变异两步,这两步操作也是接一定概率(交叉概率Pc.变异概率Pm进行的.在进行若干代遗传操作后,算法就根有希望-本文得到国束自然科学基金(60174021资助找到最优解或近似最优解.由于遗传算法采用群体的方式进行搜索,这使得它可以同时搜索空间内的多十区域。在赋予遗传算法自组织,自适应。自学习能力的同时t优胜劣汰的
6、自然选择和遗传操作使遗传算法具有不受其搜索空间限翩性条件的约束和不需要其他辅助信息的特点.这些特点使遗传算法不仅能获得较高的效率而且具有简单.易于操作和通用的特性.这些特性使得遗传算法在各领域中应用得越发广泛.已经有人尝试用遗传算法进行聚类,基本思想如下:将一染色体分成N十部分,每个部分对应一个数据元素的类别.例如,第lo部分值为1表示10号数据元素属于类别1.在这里,适应度函数定义为数据间的欧式距离.初始群体采用随机产生的方法获得.遗传操作与传统遗传算法类似.根据遗传算法的理论基础,这种方法可以找到全局最优解,不受孤立点影响.不失为一个很好的方法.但是这种方法在数据量很大.要求聚类的类别很多
7、时,运算时间就显得过长.而且往往效果还不如K均值方法.因此这种方法通常只适用于数量较小,类别数不大的情况.也有人改进了遗传算法的编码方式.把各类别的乘粪中心坐标编码成染色体,其他遗传操作与传统遗传算法一致.这种方法,收敛速度要好于第一种遗传算法.具有更强的适用性.不过这种方法由于单纯依靠遗传算法寻优.因此收敛依然缓慢,而且采用这种编码方式又导致了算法对孤立点变得敏感。4基于遗传算法的K均值聚类分析我们的算法是在遗传算法思想与K均值算法思想的基础上提出来的.我们把K均值方法引入到遗传算法的进化中.首先.随机产生遗传算法的第一代并开始进化.在每代进化中,我们都用K均值方法对每个个体进行进一步的优化
8、.选相当于在每一代都要对所有个体计算以其为初始值的K 均值问题的局部最优结果,并以这些局部最优结果替换掉原来的个体并继续进化,直到达到最大代数或者结果符台要求为止。这种方法力图通过遗传算法来保证获取全局最优解.而用K均值方法提高算法的收敛速度.163算法流程图如下:【I韧始化:确定遗传参数;产生初始群体。(2对当前群体的每个个体,用K均值方法将其优化为以该个体为初始值的K均值问题的局部最优结果。【3对这些局部最忧个体进行选择.(4动态改变交叉概率进行交叉。(5动态改变变异概率进行变异。(6若到达最大代数或者满足适应度要求,继续执行。否则.转。(7输出结果。染色体蝙码策略我们将各个类别的中心坐标
9、编码为染色体。例如对于一个类别为3的聚类问题,假设数据集为5维的,韧始的三个聚类中心点为(10。zo,30.40.50,(20,40,60, 80,lOO,(30,50,70,90,110,则我们的染色体编码就为(10。203040-50.20t40,60.80,lOO.30,50,70,90,110。这种基于聚类中心的编码方式减少了染色体的长度,大大提高了遗传算法的收敛速度,对于求解大量数据的复杂聚类问题效果较好。适应度函数我们将适应度函数定义成误差平方报值丘=一(/艺【P一胁。其中。K为聚类的数目.P为问题空间的点,Mi是簇ci的平均值.其中P,Mi都是多维的。这里,我们只考虑了数值属性的
10、聚类,对于非数值属性,可以用适当方法转化为数值属性之后再使用该方法。初姑群体初始群体有多种生成办法.为了获得全局最忧解,这里徒们的初始群体完全随机生成。考虑到要对大量数据进行聚类,我们不能将初始群体设得很大。但是为了使遗传算法数据元素能够多样性,我们在时间和硬件条件允许的条件下应该尽可能地提高群体的规模.选择交叉与变异为了让遗传算法能够获得全局最优解,我们采用了随机选择,最佳个体保留的方式。交叉则采用一点交叉方式.对本问题.我们经过仿真试验发现,变异是能够跳出局部最优的关键.变异算子对最终能否获得全局最优解影响重大.为此。我们对变异率实行动态变化.开始时,变异率较小随着整代的平均适应度趋于最优
11、个体的适应度时.变异率就自适应地增大。缸立点的影响由于本方法的进化目标是找到最优的聚类中心,因此与K均值方法的本质是一样的。少量的孤立点势必导致聚类中心的偏移.影响垒局的聚类效果.但由于遗传算法的全局寻忱特性,本算法对孤立点的敏感性要小于单纯的K均值法。5仿真实验我们用VC¨实现了新算法.实验数据采用在聚类中心附近产生的高斯分布数据。我们用文中所谈及的K均值方法基于单纯遗传算法的聚类方法和新方法对于300.1000, 5000,10000个有5维特征的点集进行了3类(K#3聚类分析164和10类(K一10聚类分析。下面把一组用新方法和K均值方法进行比较的结果通过表格显示如下:表13疑
12、真分析(表格内容为误差平方根,进化:20代K均值法新方法性能提升3点】北.478279¨720“ooo点189.6175007573.6%sooo点576.49611338l803%lOOOO.点812.978159、91803%K均值法新方法性能提升j300点124.8897918鼬f365蹦1000点22B.511144426368“5000点529.76423596555,;10000点799979523878345“ 基于遗传算法的K均值聚类分析作者:王敞, 陈增强, 袁著祉作者单位:南开大学信息技术科学学院,天津,300071刊名: 计算机科学英文刊名:COMPUTER S
13、CIENCE年,卷(期:2003,30(2被引用次数:16次参考文献(6条1.HanJiawei.Kamber M Data Mining:Concepts and TechniquesThe Morgan Kaufmann Series in Data Management Systems2.徐勇.刘奕文.陈贺新.戴逸松一种基于自适应遗传算法的聚类分析方法期刊论文-系统工程与电子技术1997(093.王磊.戚飞虎大矢量空间聚类的遗传k-均值算法期刊论文-上海交通大学学报 1999(094.韩斌基于单亲遗传算法和模糊C-均值算法的混合聚类算法期刊论文-华东船舶工业学院学报 2000(035.M
14、aulik.Ujjwal Bandyopadhyay.Sanghamitra Genetic algorithmbased clustering technique Pattern Recognition 20006.Cowgill M C.Harvey R J.Watson L Z Genetic algorithm approach to cluster analysis 1999(07相似文献(10条1.外文会议Eva Kovacs.Iosif Ignat Clustering with Prototype Entity Selection in Data MiningThis pape
15、r presents an original classification method used in data mining, namely, the clustering with prototype entity selection (abbreviated CPES. The paper describes its mathematical essence, presents the algorithm and the experimental results obtained as compared to other methods of classification2.外文期刊M
16、ichael K. Ng.Mark Junjie Li.Joshua Zhexue Huang.Zengyou He On the Impact ofDissimilarity Measure in k-Modes Clustering AlgorithmThis correspondence describes extensions to the k-modes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical obje
17、cts, a heuristic approach was developed in (Z. He, et al., 2005, (O. San, et al., 2004 which allows the use of the k-modes paradigm to obtain a cluster with strong intrasimilarity and to efficiently cluster large categorical data sets. The main aim of this paper is to rigorously derive the updating
18、formula of the k-modes clustering algorithm with the new dissimilarity measure and the convergence of the algorithm under the optimization framework3.外文会议Jian Yin.Zhi-Fang Tan.Jiang-Tao Ren.Yi-Qun Chen An efficient clustering algorithm formixed type attributes in large datasetClustering is a widely
19、used technique in data mining, at present there exists many clustering algorithms, but most existing clustering algorithms either are limited to handle the single attribute or can handle both data types but are not efficient when clustering large data sets. Few algorithms can do both well. In this a
20、rticle, we propose a clustering algorithm that can handle large datasets with mixed type of attributes. We first use CF*tree (just like CF-tree in BIRCH to pre-cluster datasets. After that the dense regions are stored in leaf nodes, then we look every dense region as a single point and use the ameli
21、orated k-prototype to cluster such dense regions. Experiment shows that this algorithm is very efficient in clustering large datasets with mixed type of attributes.4.外文会议Xiao-Gao Yu.Yin Jian A new clustering algorithm based on KNN and DENCLUEClustering in data mining is used for identifying useful p
22、atterns and interested distributions in the underlying data. Clustering techniques have been studied extensively in e-commerce, statistics, pattern recognition, and machine learning. This increases the need for efficient and effective analysis methods to make use of this information. Traditional DEN
23、CLUE is an important clustering algorithm. But it is difficult to make its two global parameters (/spl sigma/, /spl xi/ be globally effective. A new algorithm based on KNN and DENCLUE is proposed in this paper, which offers DENCLUE the appropriate and globally effective parameters based on KNN and D
24、ENCLUE. At the first, the window-width (WW of each data point is determined and the whole data set is partitioned into some fuzzy cluster (FC by KNN based on KDE. Then, the local /spl sigma/ of each FC is unsupervised determined according to the entropy theory. At the last, each local /spl sigma/ is
25、 mapped to the global /spl sigma/ and each FC is independently clustered, which makes the global /spl sigma/ and /spl xi/ have the global validity. The analysis and experiment prove that our clustering method achievesbetter performance on the quality of the resulting clustering and the results are n
26、ot sensitive to the parameter k.5.外文期刊Huang. J.Z.Ng. M.K.Hongqiang Rong.Zichen Li Automated variable weighting in k-meanstype clusteringThis paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step isformula for weight calculation is proposed.
27、The convergency theorem of the new clustering process is given. The variable weights produced by the algorithm measure the importance of variables in clustering and can be used in variable selection in data mining applications where large and complex real data are often involved. Experimental result
28、s on both synthetic and real data have shown that the new algorithm outperformed the standard k-means type algorithms in recovering clusters in data.6.外文会议Zhijie Xu.Laisheng Wang.Jiancheng Luo.Jianqin Zhang A modified clustering algorithm fordata miningClustering is a widely used technique of findin
29、g interesting patterns residing in the dataset that were not obviously known. Itis a division of data into groups of similar objects. The clustering of large data sets has received a lot of attention in recent years, however, clustering is still a challenging task since many cluster algorithms fail
30、to do well in scaling with the size of the data set and the number of dimensions that describe the points, or in finding arbitrary shapes of clusters, or dealing effectively with the presence of noise. This paper describes a clustering method for unsupervised classification of objects in large data
31、sets. The new methodology combines the simulating annealing algorithm with CLARANS (clustering large application based upon randomized search in order to cluster large data sets efficiently. At last, the method is experimented on the generated data set. The result shows that the approach is quick th
32、an CLARANS and can produce a similar division of data as CLARANS.7.外文会议Wei-Di Dai.Yue-Xian Hou.Pi-Lian He.Xiao-Shen Zheng A clustering algorithm based onbuilding a density-treeA new kind of clustering algorithm called CABDET is presented in this paper. CABDET creates a tree structure for every clust
33、er, from which the neighbor's radius of the current object is calculated by the local density of its father node. Those unprocessed objects in the neighbor of the current object are added to extend the tree structure until no new object is founded. Each density-tree is regarded as one cluster. C
34、ABDET requires only one input parameter as the initial radius of the root node and has nolimitation of density threshold. Other characteristics include the abilities of discovering clusters with arbitrary shape and processing the noise data. The result of our experiments demonstrates that CABDET is
35、significantly more accurate in discovering density-changeable clustering than the algorithm DBSCAN, and that CABDET is less sensitive to input parameters.8.外文会议Tout. S.Sverdlik. W.Junping Sun Cluster Detection with the PYRAMID AlgorithmAs databases continue to grow in size, efficient and effective c
36、lustering algorithms play a paramount role in data mining applications. Practical clustering faces several challenges including: identifying clusters of arbitrary shapes, sensitivity to the order of input, dynamic determination of the number of clusters, outlier handling, processing speed of massive
37、 data sets, handling higher dimensions, and dependence on user-supplied parameters. Many studies have addressed one or more of these challenges. PYRAMID, or parallel hybrid clustering using genetic programming and multi-objective fitness with density, is an algorithm that we introduced in a previous
38、 research, which addresses some of the above challenges. While leaving significant challenges for future work, such as handling higher dimensions, PYRAMID employs a combination of data parallelism, a form of genetic programming, and a multi-objective density-based fitness function in the context of
39、clustering. This study adds to our previous research by exploring the detection capability of PYRAMID against a challenging dataset and evaluating its independence on user supplied parameters9.外文会议Islam. M.Z.Brankovic. L.DETECTIVE: a decision tree based categorical value clusteringand perturbation t
40、echnique for preserving privacy in data miningData mining is a powerful tool for information discovery from huge datasets. Various sectors, including commercial, government, financial, medical, and scientific, are applying data mining techniques on their datasets that typically contain sensitive ind
41、ividual information. During this process the datasets get exposed to several parties, which can potentially lead to disclosure of sensitive information and thus to breaches of privacy. Several data mining privacy preserving techniques have been recently proposed. In this paper we focus on data pertu
42、rbation techniques, i.e., those that add noise to the data in order to prevent exact disclosure of confidential values. Some of these techniques were designed for datasets having only numerical non-class attributes and a categorical class attribute. However, natural datasets are more likely to have
43、both numerical and categorical non-class attributes, and occasionally they contain only categorical attributes. Noise addition techniques developed for numerical attributes are not suitable for such datasets, due to the absence of natural ordering among categorical values. In this paper we propose a
44、 new method for adding noise to categorical values, which makes use of the clusters that exist among these values. We first discuss several existing categorical clustering methods and point out the limitations they exhibit in our context. Then we present a novel approach towards clustering of catego
45、rical values and use it to perturb data while maintaining the patterns in the dataset. Our clustering approach partitions the values of a given categorical attribute rather than the records of the datasets; additionally, our approach operates on the horizontally partitioned dataset and it is possibl
46、e for two values to belong to the same cluster in one horizontal partition of the dataset, and to two distinct clusters in another partition. Finally, we provide some experimental results in order to evaluate our perturbation technique and to com10.外文期刊Ng. R.T.Jiawei Han CLARANS: a method for clustering objects for spatial data miningSpatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, it proposes a new clustering
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 统编版五下六上学科融合劳动教材-茄子种植实践(同一班级进阶版)
- 2026黑龙江七台河市人民医院第一批招聘编外医疗卫生技术人员56人备考题库及1套参考答案详解
- 2026贵州黔东南州镇远县人民医院上半年招聘编制外聘用人员17人备考题库含答案详解
- 2026年河南省三门峡市湖滨区事业单位联考招聘备考题库附答案详解(突破训练)
- 2026中国共产党曲靖市委员会统一战线工作部招聘公益性岗位3人备考题库(云南)含答案详解(夺分金卷)
- 2026甘肃省中医院考核招聘高层次人才1人备考题库(第四期)附答案详解(综合题)
- 2026新疆兵团兴新职业技术学院面向高校毕业生招聘37人备考题库含答案详解(培优b卷)
- 2026云南西双版纳勐腊海关招聘2人备考题库含答案详解(预热题)
- 2026辽宁大连理工大学盘锦产业技术研究院招聘(高层次和 急需紧缺)13人备考题库及一套完整答案详解
- 2026北京顺义区教委所属事业单位第二次招聘教师189人备考题库及答案详解(易错题)
- 2025年江西省高考物理试卷真题(含答案及解析)
- 2025年党纪法规知识测试题(含答案)
- 电梯型式试验规则
- 线材生产车间管理制度
- CJ/T 371-2011垃圾填埋场用高密度聚乙烯管材
- CJ 3057-1996家用燃气泄漏报警器
- 基于大数据的临床检验结果分析
- DBJ04T 292-2023 住宅物业服务标准
- 中药天花粉简介
- 2024-2025年全国高中数学联赛试题及解答
- 连续退火铜大拉线机性能参数及操作规范
评论
0/150
提交评论