Chapter 5, Typical Big Data Computing Frameworks: Apache Spark, Part 3 (courseware)

Uploaded: 2022-03-12. Format: DOCX. Pages: 18. Size: 2.63 MB.


Apache Spark-3
Zhang Shi, School of Mathematics and Informatics, Fujian Normal University

Contents
01 Introduction to Spark
02 Installing and Using Spark
03 Spark API Examples
04 How Spark Runs

Spark API Examples
- The Spark API
- RDD API example
- DataFrame API example
- MLlib example

1 The Spark API

Spark is built on the concept of distributed datasets, which can contain arbitrary Java or Python objects. You construct a dataset from external data and then apply parallel operations to it. The basic building block of the Spark API is the RDD API. On top of the RDD API, Spark provides higher-level APIs such as the DataFrame API and the machine learning API. These higher-level APIs offer concise methods for specific kinds of data operations. This part uses a few examples to show the simplest Spark applications and what Spark can do.

In all snippets, sc is an existing SparkContext (JavaSparkContext in Java) and sqlContext (jsql in the last Java example) is an existing SQLContext. Several snippets use Spark 1.x-era names (sqlContext, the Java DataFrame class, model.weights); from Spark 2.0 on the corresponding names are SparkSession, Dataset<Row>, and coefficients.

2 RDD API example

Word Count: build a dataset of (String, Int) pairs and save it.

Java:

    JavaRDD<String> textFile = sc.textFile("hdfs://...");
    JavaPairRDD<String, Integer> counts = textFile
        // Split each line into words.
        .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
        // Pair each word with an initial count of 1.
        .mapToPair(word -> new Tuple2<>(word, 1))
        // Sum the counts per word.
        .reduceByKey((a, b) -> a + b);
    counts.saveAsTextFile("hdfs://...");

Scala:

    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

Python:

    text_file = sc.textFile("hdfs://...")
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")

3 DataFrame API example

A DataFrame is a distributed dataset whose data is organized into named columns. The DataFrame API makes it easy to express relational operations, and programs written against it are optimized automatically by Spark's built-in optimizer.

Example: text search (Python)

    from pyspark.sql import Row
    from pyspark.sql.functions import col

    textFile = sc.textFile("hdfs://...")
    # Creates a DataFrame having a single column named "line"
    df = textFile.map(lambda r: Row(r)).toDF(["line"])
    errors = df.filter(col("line").like("%ERROR%"))
    # Counts all the errors
    errors.count()
    # Counts errors mentioning MySQL
    errors.filter(col("line").like("%MySQL%")).count()
    # Fetches the MySQL errors as an array of strings
    errors.filter(col("line").like("%MySQL%")).collect()

Example: text search (Scala)

    import org.apache.spark.sql.functions.col
    import sqlContext.implicits._  // needed for toDF on an RDD

    val textFile = sc.textFile("hdfs://...")
    // Creates a DataFrame having a single column named "line"
    val df = textFile.toDF("line")
    val errors = df.filter(col("line").like("%ERROR%"))
    // Counts all the errors
    errors.count()
    // Counts errors mentioning MySQL
    errors.filter(col("line").like("%MySQL%")).count()
    // Fetches the MySQL errors as an array of strings
    errors.filter(col("line").like("%MySQL%")).collect()

Example: text search (Java)

    import static org.apache.spark.sql.functions.col;

    // Creates a DataFrame having a single column named "line"
    JavaRDD<String> textFile = sc.textFile("hdfs://...");
    JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
    List<StructField> fields = Arrays.asList(
        DataTypes.createStructField("line", DataTypes.StringType, true));
    StructType schema = DataTypes.createStructType(fields);
    DataFrame df = sqlContext.createDataFrame(rowRDD, schema);

    DataFrame errors = df.filter(col("line").like("%ERROR%"));
    // Counts all the errors
    errors.count();
    // Counts errors mentioning MySQL
    errors.filter(col("line").like("%MySQL%")).count();
    // Fetches the MySQL errors as an array of strings
    errors.filter(col("line").like("%MySQL%")).collect();

4 MLlib example

Machine learning: MLlib (Spark's machine learning library) provides many distributed ML algorithms. They cover feature extraction, classification, regression, clustering, recommendation, and more. MLlib also provides tools such as ML Pipelines for building workflows, CrossValidator for tuning parameters, and model persistence for saving and loading models.

Example: Prediction with Logistic Regression. This example takes a dataset of labels and feature vectors and uses the logistic regression algorithm to learn to predict the labels from the feature vectors. In each snippet, data is assumed to be an existing collection of (label, features) records.

Python:

    from pyspark.ml.classification import LogisticRegression

    # Every record of this DataFrame contains the label and
    # features represented by a vector.
    df = sqlContext.createDataFrame(data, ["label", "features"])
    # Set parameters for the algorithm.
    # Here, we limit the number of iterations to 10.
    lr = LogisticRegression(maxIter=10)
    # Fit the model to the data.
    model = lr.fit(df)
    # Given a dataset, predict each point's label, and show the results.
    model.transform(df).show()

Scala:

    import org.apache.spark.ml.classification.LogisticRegression

    // Every record of this DataFrame contains the label and
    // features represented by a vector.
    val df = sqlContext.createDataFrame(data).toDF("label", "features")
    // Set parameters for the algorithm.
    // Here, we limit the number of iterations to 10.
    val lr = new LogisticRegression().setMaxIter(10)
    // Fit the model to the data.
    val model = lr.fit(df)
    // Inspect the model: get the feature weights.
    val weights = model.weights
    // Given a dataset, predict each point's label, and show the results.
    model.transform(df).show()

Java:

    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.classification.LogisticRegressionModel;

    // Every record of this DataFrame contains the label and
    // features represented by a vector.
    StructType schema = new StructType(new StructField[]{
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty()),
    });
    DataFrame df = jsql.createDataFrame(data, schema);
    // Set parameters for the algorithm.
    // Here, we limit the number of iterations to 10.
    LogisticRegression lr = new LogisticRegression().setMaxIter(10);
    // Fit the model to the data.
    LogisticRegressionModel model = lr.fit(df);
    // Inspect the model: get the feature weights.
    Vector weights = model.weights();
    // Given a dataset, predict each point's label, and show the results.
    model.transform(df).show();

Installing and Using Spark
- Installing Spark
- Scala development environment and steps

1 Installing Spark

Recommendation: install Spark on a Linux system. The steps for installing Apache Spark are as follows.

Step 1: Verify that Java is installed

    $ java -version

If the Java version information is printed, Java is installed correctly. Otherwise, install Java and set the relevant paths.

Step 2: Verify that Scala is installed

    $ scala -version

If the Scala version information is printed, Scala is installed correctly and you can skip Steps 3 and 4. Otherwise, install Scala.

Step 3: Download Scala

Download the latest version of Scala. This walkthrough uses scala-2.11.6.

Step 4: Install Scala

Extract the tar file:

    $ tar xvf scala-2.11.6.tgz

Move the extracted files to /usr/local/scala (or another directory of your choice):

    $ su
    Password:
    # cd /home/Hadoop/Downloads/
    # mv scala-2.11.6 /usr/local/scala
    # exit

Set the Scala path (note: no spaces around the = sign):

    $ export PATH=$PATH:/usr/local/scala/bin

Verify the installation:

    $ scala -version

If the Scala version information is printed, the installation succeeded.

Step 5: Download Apache Spark

Download Spark from the official Apache Spark website. This walkthrough uses spark-1.3.1-bin-hadoop2.6.

Step 6: Install Spark

Set Spark up in three stages: extract the archive, move it to its target directory, and set the environment variable.

    $ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
    $ su
    Password:
    # cd /home/Hadoop/Downloads/
    # mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
    # exit

Set the environment variable: open the ~/.bashrc file and append the following line to the end of it:

    export PATH=$PATH:/usr/local/spark/bin

Source the ~/.bashrc file so that the setting takes effect:

    $ source ~/.bashrc
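The verification steps above (java -version, scala -version, and the PATH entries for Scala and Spark) can also be checked from a small script. The following is a minimal sketch, assuming Python 3 is available on the machine. The names REQUIRED_TOOLS, find_tools, and missing_tools are introduced here for illustration only and are not part of the courseware or of Spark. It only checks that the commands resolve on PATH, not that they run correctly.

```python
import shutil

# Commands the installation steps rely on (hypothetical list for this sketch).
# spark-shell is the launcher found in Spark's bin directory.
REQUIRED_TOOLS = ["java", "scala", "spark-shell"]


def find_tools(names, path=None):
    """Map each command name to its resolved location, or None if not found.

    `path` is an optional PATH-style string; None means the current $PATH.
    """
    return {name: shutil.which(name, path=path) for name in names}


def missing_tools(names, path=None):
    """Return the commands that could not be found, in the order given."""
    found = find_tools(names, path)
    return [name for name in names if found[name] is None]


if __name__ == "__main__":
    for name, location in find_tools(REQUIRED_TOOLS).items():
        print(f"{name}: {location or 'not found on PATH'}")
```

If spark-shell is reported as not found after Step 6, the likely cause is that ~/.bashrc has not been sourced in the current shell, or that the export line contains spaces around the = sign.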
