Feature selection for text categorization on imbalanced data

Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Zhaohui Zheng, Xiaoyun Wu, Rohini Srihari

Outline
- Motivation
- Objective
- Introduction
- Feature selection framework
- Experimental setup
- Primary results and analysis
- Conclusions
- Future work

Motivation
- Feature selection using one-sided metrics selects only the features most indicative of membership.
- Feature selection using two-sided metrics implicitly combines the features most indicative of membership (positive features) and of non-membership (negative features) by ignoring the signs of the features.

Objective
- We investigate the usefulness of explicitly controlling that combination within a proposed feature selection framework.
- Our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.

Introduction
- One choice in the feature selection policy is whether to rule out all negative features.
- Others believe that negative features are numerous, given the imbalanced data set, and quite valuable in practical experience.

Introduction (cont.)
- Negative features are useful because their presence in a document strongly indicates its non-relevance. They help to confidently reject non-relevant documents.
- When deprived of negative features, the performance of all feature selection metrics degrades, which indicates that negative features are essential to high-quality classification.

Introduction (cont.)
- Neither one-sided nor two-sided metrics themselves allow control of the combination.
- The focus of this paper is to answer the following three questions with empirical evidence:
  - How sub-optimal are two-sided metrics?
  - To what extent can performance be improved by a better combination of positive and negative features?
  - How can the optimal combination be learned in practice?

Feature selection metrics
- Correlation Coefficient (CC) and Odds Ratio (OR) are one-sided metrics, which select only the features most indicative of membership of a category.
- Information Gain (IG) and Chi-square (CHI) are two-sided metrics, which consider the features most indicative of either membership (positive features) or non-membership (negative features).

Feature selection metrics (cont.)
Example for the term $t$ = "陳水扁" and the category $c_i$ = politics:

                            t present ("陳水扁")    t absent (no "陳水扁")
  c_i: politics             positive                negative
  not c_i: non-politics     negative                positive

Feature selection metrics (cont.)
- Information gain (IG) measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.
- Chi-square (CHI) measures the lack of independence between a term $t$ and a category $c_i$, and can be compared to the chi-square distribution with one degree of freedom to judge extremeness.

Feature selection metrics (cont.)
- Correlation coefficient (CC): the correlation coefficient of a word $t$ with a category $c_i$.
- Odds ratio (OR): the odds of the word occurring in the positive class normalized by that of the negative class.

Feature selection metrics (cont.)
- IG and CHI are two-sided metrics, whose values are non-negative.
- CC and OR are one-sided metrics, whose positive and negative values correspond to the positive and negative features respectively.
- The sign of a one-sided metric, e.g. CC or OR, is $\mathrm{sign}(AD - BC)$, where $A$, $B$, $C$ and $D$ are the cells of the table below.

                 t               not t
  c_i            A (positive)    C (negative)
  not c_i        B (negative)    D (positive)

Feature selection metrics (cont.)
- A one-sided metric can be converted to its two-sided counterpart by ignoring the sign, while a two-sided metric can be converted to its one-sided counterpart by recovering the sign.
- We propose the two-sided counterpart of OR, namely OR square (ORS), and the one-sided counterpart of IG, namely signed IG (SIG), as follows:

  $$\mathrm{ORS}(t, c_i) = \mathrm{OR}^2(t, c_i), \qquad \mathrm{SIG}(t, c_i) = \mathrm{sign}(AD - BC) \cdot \mathrm{IG}(t, c_i)$$

The imbalanced data problem
- The imbalanced data problem occurs when the training examples are unevenly distributed among the classes.
- There is an overwhelming number of non-relevant training documents, especially when there is a large collection of categories, each assigned to a small number of documents.
- This problem presents a particular challenge to classification algorithms, which can achieve high accuracy by simply classifying every example as negative.

The imbalanced data problem (cont.)
The impact of the imbalanced data problem on standard feature selection can be illustrated as follows, which primarily answers the first question, "How sub-optimal are two-sided metrics?":
- How to confidently reject the non-relevant documents is important in that case.
- Given a two-sided metric, the values of positive features are not necessarily comparable with those of negative features.

The imbalanced data problem (cont.)
- TP, FP, FN and TN denote true positives, false positives, false negatives and true negatives respectively.
- Positive features have more effect on TP and FN; negative features have more effect on TN and FP.
- Feature selection using a two-sided metric combines the positive and negative features so as to optimize accuracy, which is defined to be

  $$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

- F1 has been widely used in information retrieval, which is:

  $$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

- Finally, some performance measures themselves, ...

Feature selection framework: General formulation
For each category $c_i$:
- $l$ is the size of the feature set.
- $0 < l_1/l \le 1$ is the key parameter of the framework to be set.
- The function $\mathcal{F}(\cdot, c_i)$ should be defined so that the larger $\mathcal{F}(t, c_i)$ is, the more likely the term $t$ belongs to the category $c_i$.

Feature selection framework: General formulation (cont.)
- Obviously, one-sided metrics like SIG, CC and OR can serve as such functions.
- We can easily obtain:

  $$\mathrm{SIG}(t, \bar{c}_i) = -\mathrm{SIG}(t, c_i), \qquad \mathrm{CC}(t, \bar{c}_i) = -\mathrm{CC}(t, c_i), \qquad \mathrm{OR}(t, \bar{c}_i) = -\mathrm{OR}(t, c_i)$$
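
The metric formulas on these slides did not survive as text, so the following is only a minimal sketch that uses the commonly cited 2x2 contingency-table forms of CHI and CC and a log-form OR. These forms, the function names, the made-up counts and the add-0.5 smoothing in `log_odds_ratio` are assumptions for illustration, not definitions taken from the slides. The sketch checks that CC squared equals CHI, that the sign of a one-sided metric follows sign(AD - BC), and that scoring a term against the complement category flips the sign.

```python
import math


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Two-sided CHI from the 2x2 table (assumed textbook form).

    a: docs containing t inside c_i,  b: docs containing t outside c_i,
    c: docs without t inside c_i,     d: docs without t outside c_i.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom


def correlation_coefficient(a: int, b: int, c: int, d: int) -> float:
    """One-sided CC: same magnitude as sqrt(CHI), sign given by AD - BC."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return math.sqrt(n) * (a * d - b * c) / math.sqrt(denom)


def log_odds_ratio(a: int, b: int, c: int, d: int, eps: float = 0.5) -> float:
    """One-sided OR in log form; eps smoothing is an assumption to avoid zeros."""
    return math.log(((a + eps) * (d + eps)) / ((b + eps) * (c + eps)))


def complement(a: int, b: int, c: int, d: int):
    """Counts for the complement category: c_i and not-c_i swap roles."""
    return b, a, d, c


counts = (30, 10, 20, 940)  # made-up counts: A, B, C, D
cc = correlation_coefficient(*counts)
print("CC is positive:", cc > 0)
print("CC^2 == CHI:", math.isclose(cc ** 2, chi_square(*counts)))
print("CC flips sign:",
      math.isclose(correlation_coefficient(*complement(*counts)), -cc))
print("OR flips sign:",
      math.isclose(log_odds_ratio(*complement(*counts)), -log_odds_ratio(*counts)))
```

Because CHI only sees the squared difference, it assigns the same value to a term for a category and for its complement, which is why a two-sided metric cannot by itself tell positive from negative features.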

Feature selection framework: General formulation (cont.)
- The second step can be rewritten so that the framework combines the $l_1$ terms with the largest $\mathcal{F}(\cdot, c_i)$ and the $l_2 = l - l_1$ terms with the smallest $\mathcal{F}(\cdot, c_i)$.

Feature selection framework: Two special cases
The standard feature selection methods generally fall into one of the following two groups:
- Select the positive features only, using one-sided metrics, e.g. SIG, CC and OR. For convenience, we use CC as the representative of this group.
- Implicitly combine the positive and negative features using two-sided metrics, e.g. IG, CHI and ORS. CHI is chosen to represent this group.

Feature selection framework: Two special cases (cont.)
- The positive subset of $F_i$ is [formula missing in the source].

Feature selection framework: Optimization
- The feature selection framework facilitates explicit control of the combination of positive and negative features through the parameter $l_1/l$ (the size ratio).
- How should the size ratio be optimized?
  - Ideal scenario: use the training data to learn different models per category according to different size ratios ranging from 0 to 1, and select the ratio having the best performance.
  - Practical scenario: what we tried in this paper is to empirically select, per category, the size ratio having the best performance on the training set.

Feature selection framework: Optimization (cont.)
The efficient implementation of optimization within the framework is as follows:
- Select $l$ positive features with the greatest $\mathcal{F}(t, c_i)$, in decreasing order.
- Select $l$ negative features with the smallest $\mathcal{F}(t, c_i)$, in increasing order.
- Empirically choose the size ratio $l_1/l$ such that the feature set constructed by combining the first $l_1$ ($0 < l_1 \le l$) positive features and the first $l - l_1$ negative features has the optimal performance.

Experimental setup: Data collection
- We conduct experiments on standard text categorization data using two popular classifiers: Naive Bayes (NB) and logistic regression (LR).
- Reuters-21578 (ModApte split) is used as our data collection. It contains 90 categories, with 7769 training documents and 3019 test documents.
- After filtering out all numbers, stop-words and words occurring fewer than 3 times, we have 9484 indexing words in the vocabulary.
- Words in document titles are treated the same as words in the document body.

Experimental setup: Classifiers
- An NB score between a document $d$ and the category $c_i$ can be calculated as:

  $$\mathrm{Score}(d, c_i) = \log\frac{P(c_i)}{P(\bar{c}_i)} + \sum_{f_j \in d} \log\frac{P(f_j \mid c_i)}{P(f_j \mid \bar{c}_i)}$$

  where $f_j$ is a feature appearing in the document, $P(c_i)$ and $P(\bar{c}_i)$ are the prior probabilities of relevant and non-relevant documents respectively, and $P(f_j \mid c_i)$ and $P(f_j \mid \bar{c}_i)$ are conditional probabilities estimated with Laplacian smoothing.

Experimental setup: Classifiers (cont.)
- Logistic regression tries to model the conditional probability as:

  $$P(y \mid d, w) = \frac{1}{1 + \exp(-y\, w^{T} d)}$$

- The optimization problem for LR is to minimize:

  $$\frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp(-y_i\, w^{T} d_i)\right) + \lambda\, w^{T} w$$

  where $d_i$ is the $i$th training example, $w$ is the weight vector, $y_i \in \{-1, 1\}$ is the label associated with $d_i$, and $\lambda$ is an appropriately chosen regularization parameter.

Experimental setup: Performance measure
- To measure performance, we use both precision ($p$) and recall ($r$) in their combined form, F1: $F_1 = \frac{2pr}{p + r}$. To remain compatible with other results, we use the F1 value at the Break Even Point (BEP).
- BEP is defined to be the point where precision equals recall. It corresponds to the minimum of $|FP - FN|$ in practice.
- We report both micro- and macro-averaged BEP F1. The micro-averaged F1 is largely dependent on the most common categories, while the rare categories influence the macro-averaged F1.

Primary results and analysis: Ideal scenario
Table 1: Micro-averaged F1 (BEP) values for NB with the feature selection methods at different feature-set sizes over the 58 categories (3rd to 60th). The micro-averaged F1 (BEP) for NB without feature selection (using all 9484 features) is .641.

Primary results and analysis: Ideal scenario (cont.)
Table 2: As Table 1, but for macro-averaged F1 (BEP). The macro-averaged F1 (BEP) for NB without feature selection is .483.

Primary results and analysis: Ideal scenario (cont.)
Table 3: Micro-averaged F1 (BEP) values for LR with the feature selection methods at different feature-set sizes over the 58 categories. The micro-averaged F1 (BEP) for LR without feature selection is .766.

Primary results and analysis: Ideal scenario (cont.)
Table 4: As Table 3, but for macro-averaged F1 (BEP). The macro-averaged F1 (BEP) for LR without feature selection is .676.

Primary results and analysis: Ideal scenario (cont.)
Figure 2: Size ratios implicitly decided by using two-sided metrics: IG, CHI and ORS respectively (58 categories: 3rd to 60th, feature size = 50).
- The figure confirms that feature selection using a two-sided metric is similar to its one-sided counterpart (size ratio = 1) when the feature size is small.

Primary results and analysis: Ideal scenario (cont.)
Figure 3: BEP F1 for test over the first two and last two categories (out of 58 categories) at different $l_1/l$ values (NB, $\mathcal{F}$ = CC, feature size = 50).
- The optimal size ratios for money-fx, grain, dmk and lumber are 0.95, 0.95, 0.1 and 0.2 respectively.

Primary results and analysis: Ideal scenario (cont.)
Figure 4: Optimal size ratios of iSIG, iCC and iOR (NB, 58 categories: 3rd to 60th, feature size = 50).

Primary results and analysis: Ideal scenario (cont.)
Figure 5: As Figure 4, but for LR.
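
The optimal size ratios reported in the figures come from sweeping $l_1/l$ per category. A minimal sketch of such a sweep, following the three-step efficient implementation described under Optimization, might look like the code below. The function names, the candidate ratios, the toy scores and the `evaluate` callback (a stand-in for training a classifier on the selected features and measuring BEP F1) are hypothetical and are not taken from the slides.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def combine_features(scores: Dict[str, float], l: int, l1: int) -> List[str]:
    """Return the first l1 positive and the first l - l1 negative features.

    scores maps each term t to a one-sided metric value F(t, c_i).
    """
    if not 0 < l1 <= l <= len(scores):
        raise ValueError("need 0 < l1 <= l <= number of scored terms")
    by_score_desc = sorted(scores, key=scores.get, reverse=True)
    positive = by_score_desc[:l]        # l candidates, greatest F first
    negative = by_score_desc[::-1][:l]  # l candidates, smallest F first
    return positive[:l1] + negative[:l - l1]


def choose_size_ratio(
    scores: Dict[str, float],
    l: int,
    evaluate: Callable[[List[str]], float],
    ratios: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
) -> Tuple[float, float]:
    """Try each candidate ratio and keep the one with the best evaluation."""
    best_ratio, best_value = ratios[0], float("-inf")
    for ratio in ratios:
        l1 = max(1, min(l, round(ratio * l)))  # keep l1 within (0, l]
        value = evaluate(combine_features(scores, l, l1))
        if value > best_value:  # ties keep the earlier ratio
            best_ratio, best_value = ratio, value
    return best_ratio, best_value


# Toy data: made-up scores for a hypothetical "grain" category.
scores = {"wheat": 0.9, "grain": 0.7, "tonnes": 0.4,
          "the": 0.0, "bank": -0.5, "dollar": -0.8}
useful = {"wheat", "grain", "bank", "dollar"}


def toy_evaluate(features: List[str]) -> float:
    """Stand-in for classifier performance: counts the 'useful' terms kept."""
    return len(useful.intersection(features))


print(combine_features(scores, l=4, l1=3))
print(choose_size_ratio(scores, l=4, evaluate=toy_evaluate))
```

Sorting once and slicing prefixes means every candidate ratio reuses the same two ranked lists, so only the evaluation step has to be repeated per ratio.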