Machine Learning Algorithms Explained and Scikit-learn Practical Tips

Uploaded by: 1*** · IP location: Fujian · Upload date: 2025-12-21 · Format: DOCX · Pages: 16 · Size: 41.56 KB


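## Supervised Learning Algorithms

### Linear Regression

Linear regression is one of the most basic supervised learning algorithms and is used to predict continuous values. Its core idea is to find the linear function that best fits the data points. In simple (single-feature) linear regression the model is y = wx + b, where w is the weight and b is the bias. In multiple linear regression the model extends to y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b.

Scikit-learn's `LinearRegression` class provides a simple, easy-to-use interface. For example:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Create sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Initialize the model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Inspect the fitted parameters
print("Weights:", model.coef_)
print("Bias:", model.intercept_)
```

Linear regression is simple, intuitive, and computationally efficient. Its drawback is that it assumes a strong linear relationship in the data, so it fits nonlinear data poorly.

### Logistic Regression

Despite its name, logistic regression is used for classification. It passes a linear combination of the features through the sigmoid function, mapping the output into the range 0 to 1, which is interpreted as the probability that a sample belongs to the positive class. In the binary case the decision rule is: if the probability is greater than 0.5, predict the positive class; otherwise predict the negative class.

Scikit-learn's `LogisticRegression` class provides a complete implementation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()
X = iris.data[:, :2]  # keep only the first two features
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate
print("Accuracy:", model.score(X_test, y_test))
```

Logistic regression is well suited to binary classification (scikit-learn's implementation also handles multiclass targets, as with the three iris classes above) and is computationally efficient, but it is sensitive to multicollinearity.

### Support Vector Machines

A support vector machine (SVM) separates classes by finding the optimal hyperplane between them. For nonlinear problems, an SVM uses a kernel function to map the data into a higher-dimensional space where it becomes linearly separable. Common kernels include the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel.

Scikit-learn's `SVC` class implements the SVM classifier:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

def plot_svm():
    # Create sample data: two clusters of points
    X, y = make_blobs(n_samples=100, centers=2, random_state=6)

    # Create and train an SVM with an RBF kernel
    model = SVC(kernel='rbf', gamma=10)
    model.fit(X, y)

    # Plot the samples, colored by class label
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='winter')
    plt.show()

plot_svm()
```

SVMs perform well in high-dimensional spaces and are robust on small datasets, but they are computationally expensive and sensitive to the choice of parameters.

## Unsupervised Learning Algorithms

### K-Means Clustering

K-means is one of the most widely used unsupervised learning algorithms. It partitions the data into K clusters so that distances within each cluster are minimized and the clusters are well separated. The algorithm randomly initializes K centers, assigns each point to its nearest center, updates the center positions, and repeats until convergence.

Scikit-learn's `KMeans` class provides an implementation:

```python
from sklearn.cluster import KMeans
import numpy as np

# Create sample data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Initialize and fit the model
model = KMeans(n_clusters=2, random_state=42)
model.fit(X)

# Retrieve the clustering result
print("Cluster labels:", model.labels_)
print("Cluster centers:", model.cluster_centers_)
```

K-means is simple and efficient, but it is sensitive to the initial centers and requires the number of clusters to be specified in advance.

### Principal Component Analysis

Principal component analysis (PCA) is a dimensionality reduction technique that lowers the number of features by finding the main directions of variation in the data. Its core idea is to project the original features onto a new coordinate system whose axes, the principal components, retain as much variance as possible.

Scikit-learn's `PCA` class provides an implementation:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load the handwritten digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Reduce to 2 dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the reduced data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()

print("Explained variance ratio:", pca.explained_variance_ratio_)
```

PCA is well suited to preprocessing high-dimensional data, but it discards some information and, as a linear method, implicitly assumes the data is roughly ellipsoidal in shape.

## Scikit-learn Practical Tips

### Data Preprocessing

Data preprocessing is a key step in the machine learning workflow. Scikit-learn provides tools such as `Pipeline` and `ColumnTransformer` to organize it.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define the numeric and categorical feature columns
num_features = ['age', 'height', 'weight']
cat_features = ['gender', 'city']

# Build the preprocessing pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
])
```

### Model Selection and Evaluation

Scikit-learn provides several model evaluation tools, such as cross-validation and learning curves.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, learning_curve
import matplotlib.pyplot as plt
import numpy as np

# X_train, y_train: a training split, e.g. from the logistic regression example

# Evaluate with cross-validation
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean cross-validation accuracy:", scores.mean())

# Plot the learning curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X_train, y_train, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.plot(train_sizes, np.mean(train_scores, axis=1), 'o-', label='Training set')
plt.plot(train_sizes, np.mean(test_scores, axis=1), 'o-', label='Validation set')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

### Hyperparameter Tuning

Grid search and random search are the most common hyperparameter tuning methods.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: try every combination in the grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)

# Random search: sample 10 candidate settings
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    {'n_estimators': randint(50, 200), 'max_depth': [None, 10, 20]},
    n_iter=10, cv=5, random_state=42,
)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)
```

### Ensemble Learning

Ensemble methods combine multiple models to improve predictive performance. Scikit-learn provides both mainstream families of ensemble methods, bagging and boosting.

```python
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)

# Random forest (bagging)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Gradient boosting
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train, y_train)

# Combine several models by averaging their predicted probabilities
voting = VotingClassifier(estimators=[('rf', rf), ('gb', gb)], voting='soft')
voting.fit(X_train, y_train)
```

## Handling Special Scenarios

### Imbalanced Data

Class imbalance is a common problem. It can be addressed with oversampling, undersampling, or cost-sensitive learning.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier

# Oversampling with SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Undersampling
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Build a pipeline that includes oversampling
imb_pipeline = ImbPipeline([
    ('smote', SMOTE()),
    ('classifier', RandomForestClassifier()),
])
imb_pipeline.fit(X_train, y_train)
```

### Dimensionality Reduction for High-Dimensional Data

Besides PCA, other dimensionality reduction methods include LDA and t-SNE.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

# LDA (supervised; n_components can be at most n_classes - 1)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_train, y_train)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_train)
```

### Outlier Handling

Outliers can degrade model performance. They can be detected and handled with the Z-score or IQR method.

```python
from scipy import stats
import numpy as np

# Z-score detection: flag values more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(X_train))
threshold = 3
outliers = np.where(z_scores > threshold)  # (row indices, column indices)
print("Outlier indices:", outliers)

# IQR method
Q1 = np.per
```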
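
The IQR rule named above can be illustrated with a small, self-contained sketch. This is only one common formulation, not necessarily the one the original document went on to show: it flags any value lying more than `k` times the interquartile range below the first quartile or above the third quartile, computed per feature column. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy array `X_demo` are assumptions introduced here for illustration.

```python
import numpy as np

def iqr_outlier_mask(X, k=1.5):
    """Hypothetical helper: boolean mask of values outside
    [Q1 - k*IQR, Q3 + k*IQR], computed separately for each column."""
    X = np.asarray(X, dtype=float)
    q1 = np.percentile(X, 25, axis=0)  # first quartile per column
    q3 = np.percentile(X, 75, axis=0)  # third quartile per column
    iqr = q3 - q1                      # interquartile range per column
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (X < lower) | (X > upper)

# Toy single-feature data (illustrative only): one value is far from the rest
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

mask = iqr_outlier_mask(X_demo)
# A row counts as an outlier if any of its features is flagged
outlier_rows = np.where(mask.any(axis=1))[0]
print("Outlier row indices:", outlier_rows)
```

Unlike the Z-score, this rule relies on quartiles rather than the mean and standard deviation, so a single extreme value has less influence on the threshold itself. Whether flagged rows should be dropped, clipped, or kept depends on the dataset.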
