Spark Big Data Hands-on Course (five architects, real enterprise application scenarios), Lesson 3: API Walkthrough

Static view: the RDD itself, together with its transformations and actions.

Dynamic view: the life of a job. Each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages in turn become tasks that are distributed across the cluster.

Transformations: one of the two kinds of operations in the Spark API. A transformation's return value is still an RDD.

Actions: an action's return value is not an RDD but a Scala collection or a plain value.

All transformations are lazy: submitting a transformation by itself does not execute any computation; computation is only triggered when an action is submitted.
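A minimal sketch of this lazy behavior, assuming an existing SparkContext named sc and a placeholder input path (neither is given in the course material):

    val lines   = sc.textFile("hdfs:///tmp/input.txt")   // transformation: nothing is read yet
    val lengths = lines.map(line => line.length)         // transformation: still no computation
    val total   = lengths.reduce(_ + _)                  // action: the job is built, split into stages/tasks, and executed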

Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
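A hedged word-count style sketch that chains several of the transformations above (sc, the input path and the data are illustrative assumptions):

    val lines  = sc.textFile("hdfs:///tmp/access.log")
    val words  = lines.flatMap(_.split("\\s+"))                    // flatMap: one line -> many words
    val counts = words.filter(_.nonEmpty)                          // drop empty tokens
                      .map(w => (w.toLowerCase, 1))                // build (K, V) pairs
                      .reduceByKey(_ + _)                          // aggregate per key
    val byFreq = counts.map(_.swap).sortByKey(ascending = false)   // sort by count, descending
    // All of the above are lazy; no job has run yet.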

Actions

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
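Continuing the sketch, the most common actions look like this (counts is the (word, count) RDD from the transformations example; the output path is a placeholder):

    val distinctWords = counts.count()                                       // number of distinct words
    val top10         = counts.map(_.swap).sortByKey(ascending = false).take(10)
    val totalWords    = counts.map(_._2).reduce(_ + _)                       // sum of all counts
    counts.saveAsTextFile("hdfs:///tmp/word-counts")                         // one part file per partition
    counts.foreach(println)                                                  // runs on the executors; output goes to executor stdout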

The first thing a Spark program does is build a SparkConf (application name and master URL) and create a SparkContext from it:

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)

parallelize: an RDD can be created by calling SparkContext's parallelize method on an existing collection in the driver program.

textFile: text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation:
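Here is a minimal sketch of such an invocation, assuming the sc created above and a placeholder file name data.txt:

    val distData = sc.parallelize(Array(1, 2, 3, 4, 5))   // parallelize: RDD from a driver-side collection
    val distFile = sc.textFile("data.txt")                // textFile: one RDD element per line of the file
    val totalLen = distFile.map(_.length).reduce(_ + _)   // e.g. total number of characters in the file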

SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.
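A brief sketch of wholeTextFiles (the directory path is a placeholder):

    val files      = sc.wholeTextFiles("hdfs:///data/small-files")            // (file name, whole file content) pairs
    val lineCounts = files.mapValues(content => content.split("\n").length)   // lines per file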

For SequenceFiles, use SparkContext's sequenceFile[K, V] method, where K and V are the types of the keys and values in the file. These should be subclasses of Hadoop's Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
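A hedged round-trip sketch with a SequenceFile (path and data are placeholders):

    val pairs  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    pairs.saveAsSequenceFile("hdfs:///tmp/seq-demo")                  // keys/values converted to Writables implicitly
    val loaded = sc.sequenceFile[Int, String]("hdfs:///tmp/seq-demo") // read back with native types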

For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
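A hedged sketch of newAPIHadoopRDD reading plain text through the new-API TextInputFormat (the input path is a placeholder):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    val job = Job.getInstance()                                       // carries the Hadoop configuration
    FileInputFormat.addInputPath(job, new Path("hdfs:///data/logs"))  // set the input the Hadoop way
    val records = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    val lines = records.map { case (_, text) => text.toString }       // drop the byte-offset key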

RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
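A minimal round-trip sketch with object files (the path is a placeholder):

    val nums     = sc.parallelize(1 to 100)
    nums.saveAsObjectFile("hdfs:///tmp/nums-obj")              // written with Java serialization
    val reloaded = sc.objectFile[Int]("hdfs:///tmp/nums-obj")  // loaded back as an RDD[Int]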

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
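A short sketch of saveAsTextFile (the output directory is a placeholder):

    val squares = sc.parallelize(1 to 10).map(n => s"$n,${n * n}")   // each element becomes one line of text
    squares.saveAsTextFile("hdfs:///tmp/squares")                    // one part file per partition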
