Text Classification with scikit-learn

1. Loading the dataset

Download the dataset 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load the data; see the code comments for details.

```python
# first extract the 20news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')
# part categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
```

(The snippets in this post follow the original's Python 2 and older scikit-learn API; for example, HashingVectorizer's non_negative argument has since been removed from scikit-learn.)

You can verify that the data loaded correctly:

```python
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```

Result:

['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x']

2. Extracting features

The newsgroup_train we just loaded is a collection of documents. We need to extract features from them (term frequencies and the like) using fit_transform.

Method 1. HashingVectorizer, with a fixed number of features

```python
# newsgroup_train.data is the original documents, but we need to extract the
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
# newsgroups_test is loaded the same way in section 3 below
fea_test = vectorizer.fit_transform(newsgroups_test.data)

# return feature vector fea_train [n_samples, n_features]
print 'Size of fea_train:' + repr(fea_train.shape)
print 'Size of fea_test:' + repr(fea_test.shape)
# 11314 documents, 130107 vectors for all categories
print 'The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100)
```

Result:

Size of fea_train:(2936, 10000)
Size of fea_test:(1955, 10000)
The average feature sparsity is 1.002%

Because we kept only 10000 terms, i.e. a 10000-dimensional feature space, the sparsity is still not that low. In practice, TfidfVectorizer produces tens of thousands of feature dimensions; over the full sample set I counted more than 130,000 dimensions, which is a truly sparse matrix.

*The code comments above note that TF-IDF-style vectorizers extract features of different dimensionality on the train and test sets. So how do we make the two match? There are two ways:
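Whichever vectorizer class is used, both fixes below boil down to reusing the vocabulary learned on the training set. A minimal self-contained sketch of that idea (written in Python 3 / current scikit-learn syntax, on an invented toy corpus rather than the 20newsgroups data):

```python
# Fit a CountVectorizer on the training text, then build a second one from its
# learned vocabulary_ so the test matrix gets exactly the same columns.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the graphics card renders the scene",
              "mac hardware and windows hardware questions"]
test_docs = ["new graphics hardware for windows"]

v1 = CountVectorizer(stop_words='english')
X_train = v1.fit_transform(train_docs)

v2 = CountVectorizer(vocabulary=v1.vocabulary_)  # reuse the train vocabulary
X_test = v2.fit_transform(test_docs)

# Same number of columns, so a model fit on X_train can score X_test;
# words unseen at train time are simply dropped from the test documents.
same_width = X_train.shape[1] == X_test.shape[1]  # True
```

TfidfVectorizer accepts the same vocabulary argument, which is exactly what Method 3 exploits.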
Method 2. CountVectorizer + TfidfTransformer

Let two CountVectorizer instances share one vocabulary:

```python
# ----------------------------------------------------
# method 1: CountVectorizer + TfidfTransformer
print '*\nCountVectorizer + TfidfTransformer\n*'
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print 'the shape of train is ' + repr(counts_train.shape)

count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print 'the shape of test is ' + repr(counts_test.shape)

tfidftransformer = TfidfTransformer()
tfidf_train = tfidftransformer.fit(counts_train).transform(counts_train)
tfidf_test = tfidftransformer.fit(counts_test).transform(counts_test)
```

Result:

*CountVectorizer + TfidfTransformer*
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)

Method 3. TfidfVectorizer

Let two TfidfVectorizer instances share one vocabulary:

```python
# method 2: TfidfVectorizer
print '*\nTfidfVectorizer\n*'
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print 'the shape of train is ' + repr(tfidf_train_2.shape)
print 'the shape of test is ' + repr(tfidf_test_2.shape)
analyze = tv.build_analyzer()
tv.get_feature_names()  # statistical features/terms
```

Result:

*TfidfVectorizer*
the shape of train is (2936, 66433)
the shape of test is (1955, 66433)

In addition, sklearn ships a ready-made feature-extraction helper, fetch_20newsgroups_vectorized.

Method 4. fetch_20newsgroups_vectorized

This method, however, cannot extract features for just a few selected categories; it always returns the features of all 20 categories:

```python
print '*\nfetch_20newsgroups_vectorized\n*'
from sklearn.datasets import fetch_20newsgroups_vectorized
tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
print 'the shape of train is ' + repr(tfidf_train_3.data.shape)
print 'the shape of test is ' + repr(tfidf_test_3.data.shape)
```

Result:

*fetch_20newsgroups_vectorized*
the shape of train is (11314, 130107)
the shape of test is (7532, 130107)
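A quick aside on the sublinear_tf=True option passed to TfidfVectorizer above: it replaces the raw term count tf with 1 + log(tf) before applying the idf weight, damping words that repeat many times within one document. The sketch below is a rough pure-Python rendering of that weighting, using the smoothed idf log((1+n)/(1+df)) + 1; scikit-learn additionally L2-normalizes each row, which is omitted here:

```python
import math

def sublinear_tfidf(docs):
    """Toy sublinear tf-idf: (1 + log tf) * smoothed idf, no normalization."""
    n = len(docs)
    vocab = sorted({w for d in docs for w in d.split()})
    # document frequency: number of docs containing each term
    df = {t: sum(1 for d in docs if t in d.split()) for t in vocab}
    # smoothed idf, as in TfidfTransformer(smooth_idf=True)
    idf = {t: math.log((1.0 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for d in docs:
        words = d.split()
        row = [(1 + math.log(words.count(t))) * idf[t] if t in words else 0.0
               for t in vocab]
        rows.append(row)
    return vocab, rows

vocab, w = sublinear_tfidf(["graphics card", "graphics windows windows"])
# "windows" appears twice in the second doc but its weight grows only
# logarithmically, while "graphics", present in every doc, gets the
# minimum idf of 1.0.
```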
3. Classification

3.1 Multinomial Naive Bayes classifier

See the code and comments; they speak for themselves.

```python
######################################################
# Multinomial Naive Bayes Classifier
print '*\nNaive Bayes\n*'
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
fea_test = vectorizer.fit_transform(newsgroups_test.data)
# create the Multinomial Naive Bayesian Classifier
clf = MultinomialNB(alpha=0.01)
clf.fit(fea_train, newsgroup_train.target)
pred = clf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
# notice here we can see that f1_score is not equal to
# 2 * precision * recall / (precision + recall)
# because the m_precision and m_recall we get are averaged, whereas
# metrics.f1_score() calculates a weighted average, i.e., it takes the
# number of samples in each class into consideration.
```

Note the closing comments in the block above: they explain why f1 != 2 * (precision * recall) / (precision + recall).

The helper calculate_result computes precision, recall, and f1:

```python
def calculate_result(actual, pred):
    m_precision = metrics.precision_score(actual, pred)
    m_recall = metrics.recall_score(actual, pred)
    print 'predict info:'
    print 'precision:{0:.3f}'.format(m_precision)
    print 'recall:{0:0.3f}'.format(m_recall)
    print 'f1-score:{0:.3f}'.format(metrics.f1_score(actual, pred))
```

3.2 KNN:

```python
######################################################
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
print '*\nKNN\n*'
knnclf = KNeighborsClassifier()  # default with k=5
knnclf.fit(fea_train, newsgroup_train.target)
pred = knnclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
```

3.3 SVM:

```python
######################################################
# SVM Classifier
from sklearn.svm import SVC
print '*\nSVM\n*'
svclf = SVC(kernel='linear')  # default kernel is 'rbf'
svclf.fit(fea_train, newsgroup_train.target)
pred = svclf.predict(fea_test)
calculate_result(newsgroups_test.target, pred)
```

Result:

*Naive Bayes*
predict info:
precision:0.764
recall:0.759
f1-score:0.760

*KNN*
predict i
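The f1 remark in the Naive Bayes comments can be made concrete without any classifier: averaging precision and recall over the classes first and then combining them is not the same as computing F1 per class first and then averaging, which is what an averaged f1_score effectively reports. A small pure-Python illustration on invented per-class numbers:

```python
# Two classes with made-up (precision, recall) pairs, equal class sizes.
per_class = [(0.9, 0.5), (0.5, 0.9)]

# Order 1: average P and R across classes, then apply the F1 formula.
avg_p = sum(p for p, _ in per_class) / len(per_class)   # 0.7
avg_r = sum(r for _, r in per_class) / len(per_class)   # 0.7
f1_from_averages = 2 * avg_p * avg_r / (avg_p + avg_r)  # 0.7

# Order 2: per-class F1 first, then average (what macro/weighted F1 does
# when the class sizes are equal).
f1_per_class = [2 * p * r / (p + r) for p, r in per_class]
averaged_f1 = sum(f1_per_class) / len(f1_per_class)     # ~0.643

# The two disagree, which is exactly why the printed f1-score need not equal
# 2 * precision * recall / (precision + recall) of the printed averages.
```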
