Spark Big Data Hands-on Course (five architects, real enterprise application scenarios), Lesson 3: API Walkthrough

Static view: the RDD itself, together with its transformations and actions.

Dynamic view: the life of a job. Each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages in turn become tasks that are distributed across the cluster.

Transformations: one of the two kinds of operations in the Spark API. A transformation's return value is still an RDD.

Actions: an action's return value is not an RDD but a Scala collection or a plain value.

All transformations are lazy: submitting a transformation by itself does not execute any computation; computation is only triggered when an action is submitted.
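A minimal sketch of this lazy behavior, assuming an existing SparkContext named sc and a placeholder input path (neither is given in the course material):

    val lines   = sc.textFile("hdfs:///tmp/input.txt")   // transformation: nothing is read yet
    val lengths = lines.map(line => line.length)         // transformation: still no computation
    val total   = lengths.reduce(_ + _)                  // action: the job is built, split into stages/tasks, and executed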

Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
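A hedged word-count style sketch that chains several of the transformations above (sc, the input path and the data are illustrative assumptions):

    val lines  = sc.textFile("hdfs:///tmp/access.log")
    val words  = lines.flatMap(_.split("\\s+"))                    // flatMap: one line -> many words
    val counts = words.filter(_.nonEmpty)                          // drop empty tokens
                      .map(w => (w.toLowerCase, 1))                // build (K, V) pairs
                      .reduceByKey(_ + _)                          // aggregate per key
    val byFreq = counts.map(_.swap).sortByKey(ascending = false)   // sort by count, descending
    // All of the above are lazy; no job has run yet.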

Actions

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
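Continuing the sketch, the most common actions look like this (counts is the (word, count) RDD from the transformations example; the output path is a placeholder):

    val distinctWords = counts.count()                                       // number of distinct words
    val top10         = counts.map(_.swap).sortByKey(ascending = false).take(10)
    val totalWords    = counts.map(_._2).reduce(_ + _)                       // sum of all counts
    counts.saveAsTextFile("hdfs:///tmp/word-counts")                         // one part file per partition
    counts.foreach(println)                                                  // runs on the executors; output goes to executor stdout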

The first thing a Spark program does is build a SparkConf (application name and master URL) and create a SparkContext from it:

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)

parallelize: an RDD can be created by calling SparkContext's parallelize method on an existing collection in the driver program.

textFile: text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation:
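Here is a minimal sketch of such an invocation, assuming the sc created above and a placeholder file name data.txt:

    val distData = sc.parallelize(Array(1, 2, 3, 4, 5))   // parallelize: RDD from a driver-side collection
    val distFile = sc.textFile("data.txt")                // textFile: one RDD element per line of the file
    val totalLen = distFile.map(_.length).reduce(_ + _)   // e.g. total number of characters in the file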

SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.
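A brief sketch of wholeTextFiles (the directory path is a placeholder):

    val files      = sc.wholeTextFiles("hdfs:///data/small-files")            // (file name, whole file content) pairs
    val lineCounts = files.mapValues(content => content.split("\n").length)   // lines per file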

For SequenceFiles, use SparkContext's sequenceFile[K, V] method, where K and V are the types of the keys and values in the file. These should be subclasses of Hadoop's Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
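A hedged round-trip sketch with a SequenceFile (path and data are placeholders):

    val pairs  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    pairs.saveAsSequenceFile("hdfs:///tmp/seq-demo")                  // keys/values converted to Writables implicitly
    val loaded = sc.sequenceFile[Int, String]("hdfs:///tmp/seq-demo") // read back with native types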

For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
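A hedged sketch of newAPIHadoopRDD reading plain text through the new-API TextInputFormat (the input path is a placeholder):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    val job = Job.getInstance()                                       // carries the Hadoop configuration
    FileInputFormat.addInputPath(job, new Path("hdfs:///data/logs"))  // set the input the Hadoop way
    val records = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    val lines = records.map { case (_, text) => text.toString }       // drop the byte-offset key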

RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
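A minimal round-trip sketch with object files (the path is a placeholder):

    val nums     = sc.parallelize(1 to 100)
    nums.saveAsObjectFile("hdfs:///tmp/nums-obj")              // written with Java serialization
    val reloaded = sc.objectFile[Int]("hdfs:///tmp/nums-obj")  // loaded back as an RDD[Int]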

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
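A short sketch of saveAsTextFile (the output directory is a placeholder):

    val squares = sc.parallelize(1 to 10).map(n => s"$n,${n * n}")   // each element becomes one line of text
    squares.saveAsTextFile("hdfs:///tmp/squares")                    // one part file per partition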
