Sample Selection Bias
Lei Tang, Feb. 20th, 2007

Classical ML vs. Reality
- In classical machine learning, training data and test data are assumed to share the same distribution.
- But that's not always the case in reality:
  - survey data
  - species habitat modeling based on data from only one area
  - training and test data collected by different experiments
  - newswire articles with timestamps
Sample Selection Bias
- Standard setting: the data (x, y) are drawn independently from a distribution D.
- If the selected samples are not a random sample of D, then the samples are biased.
- Usually the training data are biased, but we want to apply the classifier to unbiased samples.
Four Cases of Bias (1)
Let s denote whether or not a sample is selected:
- P(s=1|x,y) = P(s=1): no bias
- P(s=1|x,y) = P(s=1|x): selection depends only on the feature vector
- P(s=1|x,y) = P(s=1|y): selection depends only on the class label
- P(s=1|x,y): selection depends on both x and y
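To make the four cases concrete, here is a small simulation sketch (not from the slides; the synthetic data and selection probabilities are invented for illustration). It draws (x, y) pairs, applies each selection mechanism, and shows how the selected class proportion P(y=1|s=1) drifts from the unbiased P(y=1):

```python
# Illustrative simulation of the four selection mechanisms on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)                              # feature
y = (x + rng.normal(size=n) > 0).astype(int)        # label correlated with x

def select(p):
    """Draw s ~ Bernoulli(p) per sample and return the selection mask."""
    return rng.random(n) < p

cases = {
    "P(s=1)":     np.full(n, 0.5),                  # no bias
    "P(s=1|x)":   1 / (1 + np.exp(-2 * x)),         # bias on the features only
    "P(s=1|y)":   np.where(y == 1, 0.8, 0.2),       # bias on the label only
    "P(s=1|x,y)": np.where(y == 1, 0.9, 0.1) / (1 + np.exp(-x)),  # bias on both
}
for name, p in cases.items():
    s = select(p)
    print(f"{name:11s}: P(y=1|s=1) = {y[s].mean():.3f}  (unbiased P(y=1) = {y.mean():.3f})")
```
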
Four Cases of Bias (2)
- P(s=1|x,y) = P(s=1|y) is learning from imbalanced data. The bias can be alleviated by changing the class prior, as sketched below.
- P(s=1|x,y) = P(s=1|x) implies that P(y|x) remains unchanged. This is the most studied case.
- If the bias depends on both x and y, we lack the information to analyze it.
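For the label-bias case, "changing the class prior" can be done by rescaling the classifier's posteriors, since P(x|y) is untouched when selection depends only on y. A minimal sketch, assuming the biased and true priors are known (the function name and interface are mine, not from the slides):

```python
import numpy as np

def correct_class_prior(posterior_biased, prior_biased, prior_true):
    """Rescale posteriors learned under a biased class prior.

    Since P(s=1|x,y) = P(s=1|y) leaves P(x|y) unchanged,
    P(y|x) is proportional to P(y|x, s=1) * P(y) / P(y|s=1).
    """
    w = np.asarray(prior_true) / np.asarray(prior_biased)
    unnorm = np.asarray(posterior_biased) * w
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# e.g., a classifier trained on a 50/50 sample of a 90/10 population:
p = correct_class_prior([[0.3, 0.7]], prior_biased=[0.5, 0.5], prior_true=[0.9, 0.1])
```
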
An Intuitive Example
- P(s=1|x,y) = P(s=1|x) means that s and y are independent given x, so P(y|x, s=1) = P(y|x).
- Does the bias really matter, then, if P(y|x) remains unchanged?

Bias Analysis for Classifiers (1)
- Logistic regression: any classifier that directly models P(y|x) won't be affected by the bias.
- Bayesian classifier: the same holds when P(y|x) is modeled directly. But the naive Bayes classifier is affected, since the models it estimates change under biased sampling.
Bias Analysis for Classifiers (2)
- Hard-margin SVM: no bias effect.
- Soft-margin SVM: affected by the bias, since the cost of misclassification may change.
- Decision tree: usually results in a different classifier when the bias is present.
- In sum, most classifiers are still sensitive to the sample bias. And this is an asymptotic analysis, assuming the samples are "enough".
Correcting Bias
- Expected risk: suppose the training set is drawn from Pr and the test set from Pr'. The expected test risk can be rewritten as an expectation over the training distribution:
  E_{(x,y)~Pr'}[ l(x,y,θ) ] = E_{(x,y)~Pr}[ β(x,y) l(x,y,θ) ],   where β(x,y) = Pr'(x,y) / Pr(x,y)
- So we minimize the weighted empirical regularized risk:
  min_θ  (1/m) Σ_{i=1..m} β_i l(x_i, y_i, θ) + λ Ω(θ)
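Once the weights β_i are in hand, the weighted risk above maps directly onto the per-sample weights most learners expose. A minimal sketch with scikit-learn's LogisticRegression (the data and weights are placeholders; the next slides are about estimating β):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
beta = np.ones(200)   # importance weights beta_i = Pr'(x_i)/Pr(x_i), assumed given here

# Minimizes the beta-weighted logistic loss plus an L2 penalty,
# i.e. the weighted empirical regularized risk from the slide.
clf = LogisticRegression(C=1.0).fit(X, y, sample_weight=beta)
```
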
Estimating the Weights
- Samples that are likely to appear in the test data gain more weight. But how do we estimate the weight of each sample?
- Brute-force approach: estimate the densities Pr(x) and Pr'(x) separately, then calculate each sample's weight. Not applicable in practice, since density estimation is more difficult than classification given a limited number of samples.
- Existing works use simulation experiments in which both Pr(x) and Pr'(x) are known (not realistic).
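For completeness, here is what the brute-force route looks like in one dimension with scipy's Gaussian KDE; the example distributions are invented. In higher dimensions with few samples the estimated ratio becomes unreliable, which is exactly the objection above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, size=500)        # biased training sample ~ Pr
x_test = rng.normal(loc=0.5, size=500)         # unbiased test sample   ~ Pr'

dens_train = gaussian_kde(x_train)
dens_test = gaussian_kde(x_test)

# beta(x_i) = Pr'(x_i) / Pr(x_i), with the denominator clipped away from zero
beta = dens_test(x_train) / np.maximum(dens_train(x_train), 1e-12)
beta *= len(beta) / beta.sum()                 # normalize to mean 1
```
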
Distribution Matching
- The expectation in feature space: μ(Pr) = E_{x~Pr}[ Φ(x) ].
- We have E_{x~Pr}[ β(x) Φ(x) ] = E_{x~Pr'}[ Φ(x) ] when β(x) = Pr'(x) / Pr(x).
- Hence the problem can be formulated as matching the means in feature space:
  minimize_β  || E_{x~Pr'}[ Φ(x) ] - E_{x~Pr}[ β(x) Φ(x) ] ||²   subject to  β(x) ≥ 0,  E_{x~Pr}[ β(x) ] = 1
- The solution is β(x) = Pr'(x) / Pr(x).

Empirical KMM Optimization
- With training samples x_1, ..., x_m from Pr and test samples x'_1, ..., x'_{m'} from Pr', the objective becomes
  || (1/m) Σ_i β_i Φ(x_i) - (1/m') Σ_j Φ(x'_j) ||²  =  (1/m²) βᵀ K β - (2/m²) κᵀ β + const,
  where K_ij = k(x_i, x_j) and κ_i = (m/m') Σ_j k(x_i, x'_j).
- Therefore it is equivalent to solving the QP problem (a code sketch follows):
  minimize_β  (1/2) βᵀ K β - κᵀ β   subject to  0 ≤ β_i ≤ B  and  | Σ_i β_i - m | ≤ m ε
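A hedged sketch of that QP using numpy and cvxopt (assumed available). The Gaussian kernel, the bandwidth sigma, and the constants B and eps are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def kmm_weights(X_train, X_test, sigma=1.0, B=10.0, eps=None):
    """Solve the empirical KMM QP for the training-sample weights beta.

    X_train, X_test: 2-D arrays of shape (n_samples, n_features).
    """
    m, mp = len(X_train), len(X_test)
    eps = B / np.sqrt(m) if eps is None else eps

    def rbf(A, C):                      # k(a, c) = exp(-||a - c||^2 / (2 sigma^2))
        d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    K = rbf(X_train, X_train) + 1e-8 * np.eye(m)     # jitter keeps K positive definite
    kappa = (m / mp) * rbf(X_train, X_test).sum(axis=1)

    # minimize (1/2) b'Kb - kappa'b  s.t.  0 <= b_i <= B,  |sum_i b_i - m| <= m*eps
    G = np.vstack([-np.eye(m), np.eye(m), np.ones((1, m)), -np.ones((1, m))])
    h = np.hstack([np.zeros(m), B * np.ones(m), [m * (1 + eps)], [-m * (1 - eps)]])
    sol = solvers.qp(matrix(K), matrix(-kappa), matrix(G), matrix(h))
    return np.array(sol["x"]).ravel()
```

The returned weights would then serve as the sample_weight values in the weighted risk minimization sketched earlier.
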
Experiments
- A toy regression example.
- Simulation: select some UCI datasets, inject sample selection bias into the training set, then test on unbiased samples.
- Bias on labels.

Unexplained
- In theory, importance sampling should be the best; why does KMM perform better?
- Why kernel methods? Can we just do the matching using the input features?
- Can we just perform a logistic regression to estimate β, treating the test data as the positive class and the training data as the negative class? Then β is the odds (sketched below).
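The logistic-regression idea in that last bullet is straightforward to sketch: discriminate training points from test points, then convert the predicted probability into an odds ratio. By Bayes' rule, Pr'(x)/Pr(x) equals the odds P(test|x)/P(train|x) times m/m'. (The helper below is mine, not from the slides.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_weights(X_train, X_test):
    """Estimate beta(x) = Pr'(x)/Pr(x) via train-vs-test classification."""
    m, mp = len(X_train), len(X_test)
    X = np.vstack([X_train, X_test])
    z = np.hstack([np.zeros(m), np.ones(mp)])      # 0 = training point, 1 = test point
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(X_train)[:, 1]           # P(test | x) at the training points
    p = np.clip(p, 1e-12, 1 - 1e-12)               # guard the odds against 0/1
    return p / (1 - p) * (m / mp)                  # odds, corrected for sample sizes
```
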
Some Related Problems
- Semi-supervised learning (is it equivalent?)
- Multi-task learning assumes P(y|x) to be different, whereas sample selection bias (mostly) assumes P(y|x) to be the same. MT
