Static view: the RDD itself, together with its transformations and actions.

Dynamic view: the life of a job. Each job is divided into multiple stages, each stage can contain multiple RDDs and their transformations, and these stages are in turn broken into tasks that are distributed across the cluster.

Transformations: one of the two kinds of operations in the Spark API. A transformation returns another RDD.

Actions: an action returns not an RDD but a Scala collection or value.

All transformations are lazy: submitting a transformation alone performs no computation; computation is triggered only when an action is submitted (a minimal sketch of this behavior follows the first few transformations below).

Transformations:

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
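A minimal sketch of the lazy-evaluation behavior described above, assuming an existing SparkContext named sc; the data and variable names are illustrative:

    // map and filter are transformations: nothing is computed here,
    // Spark only records the lineage of the new RDDs.
    val nums    = sc.parallelize(1 to 10)
    val squares = nums.map(x => x * x)
    val evens   = squares.filter(_ % 2 == 0)
    // count() is an action: only now is the whole chain executed.
    val n = evens.count()   // 5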
mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance (see the sketch below). Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
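An illustrative comparison of groupByKey and reduceByKey for a per-key sum (sc and the sample data are assumed). reduceByKey pre-aggregates within each partition before the shuffle, which is why the note above recommends it:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // groupByKey shuffles every value, then we sum on the reduce side.
    val sums1 = pairs.groupByKey().mapValues(_.sum)   // RDD[(String, Int)]
    // reduceByKey combines values map-side before shuffling.
    val sums2 = pairs.reduceByKey(_ + _)              // RDD[(String, Int)]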
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
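A sketch of aggregateByKey computing a per-key average, where the aggregated type (sum, count) differs from the input value type; sc and the data are assumed:

    val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 5.0)))
    // zeroValue (0.0, 0) is the neutral (sum, count) per key.
    val sumCount = scores.aggregateByKey((0.0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),      // seqOp: fold one value in
      (a, b)   => (a._1 + b._1, a._2 + b._2)     // combOp: merge partials
    )
    val avg = sumCount.mapValues { case (sum, cnt) => sum / cnt }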
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin (see the sketch below).

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
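A small sketch of join on two (K, V) datasets; sc and the data are assumed:

    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "mug")))
    // Inner join keeps only keys present in both sides:
    // (1, ("alice", "book")) and (1, ("alice", "pen")).
    val joined = users.join(orders)         // RDD[(Int, (String, String))]
    // leftOuterJoin keeps all users; missing order values become None.
    val left   = users.leftOuterJoin(orders)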
cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
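A sketch contrasting the two entries above (sc and the data are assumed): coalesce can shrink the partition count without a full shuffle, while repartition always shuffles:

    val big      = sc.parallelize(1 to 1000000, numSlices = 100)
    val filtered = big.filter(_ % 1000 == 0)   // few elements left per partition
    val packed   = filtered.coalesce(4)        // fewer partitions, no full shuffle
    val balanced = filtered.repartition(16)    // full shuffle, evenly rebalanced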
Actions:

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

first(): Return the first element of the dataset (similar to take(1)).

take(n): Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
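A few of the actions above applied to one small dataset (sc and the data are assumed); unlike transformations, each of these calls triggers a computation:

    val nums  = sc.parallelize(Seq(5, 1, 4, 2, 3))
    val total = nums.reduce(_ + _)    // 15; func is commutative and associative
    val all   = nums.collect()        // Array(5, 1, 4, 2, 3) at the driver
    val c     = nums.count()          // 5
    val head  = nums.first()          // 5, same as take(1).head
    val top3  = nums.takeOrdered(3)   // Array(1, 2, 3), natural order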
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
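A sketch of saveAsTextFile and reading the result back; the output path is an illustrative assumption, and note that path names a directory of part files, not a single file:

    val lines = sc.parallelize(Seq("spark", "rdd", "actions"))
    // Each element is converted with toString and written as one line.
    lines.saveAsTextFile("hdfs:///tmp/demo-output")
    // Reading the directory back yields an RDD[String] of the same lines.
    val back = sc.textFile("hdfs:///tmp/demo-output")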
countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
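A sketch of the Accumulator pattern mentioned for foreach, using the Spark 2.x longAccumulator API (older versions used sc.accumulator); sc and the data are assumed. A plain local counter updated inside foreach would not behave correctly on a cluster:

    val acc  = sc.longAccumulator("elementCount")   // driver-side accumulator
    val data = sc.parallelize(1 to 100)
    // Per-element side effect; only accumulator updates are safe here.
    data.foreach(x => acc.add(1))
    println(acc.value)                              // 100, read on the driver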
Creating a SparkContext:

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    new SparkContext(conf)

RDDs can then be created in two ways: parallelize and textFile.

textFile: Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation:
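The example invocation appears to have been lost from this copy; a typical one, shown alongside a parallelize call for comparison (the paths and data are illustrative assumptions):

    // parallelize: build an RDD from a local collection.
    val data     = sc.parallelize(Array(1, 2, 3, 4, 5))
    // textFile: read a file as an RDD of lines.
    val distFile = sc.textFile("hdfs:///data/README.md")
    // Local paths and s3n:// URIs work the same way:
    val local    = sc.textFile("/tmp/data.txt")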
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs.

For SequenceFiles, use SparkContext's sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop's Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
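A sketch of both forms of sequenceFile described above; the path is an illustrative assumption:

    import org.apache.hadoop.io.{IntWritable, Text}
    // Explicit Hadoop Writable key/value types:
    val asWritables = sc.sequenceFile[IntWritable, Text]("hdfs:///data/pairs")
    // Native-type shorthand: Spark converts IntWritable/Text automatically.
    val asNatives   = sc.sequenceFile[Int, String]("hdfs:///data/pairs")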
For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce).
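A sketch of the "new" MapReduce API via the related SparkContext.newAPIHadoopFile helper, which takes a path directly (newAPIHadoopRDD itself takes a Configuration instead); the path is an illustrative assumption:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    // TextInputFormat yields (byte offset, line) pairs.
    val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      "hdfs:///data/input")
    val lines = raw.map(_._2.toString)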
RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
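A round-trip sketch with saveAsObjectFile and SparkContext.objectFile; the element type and path are illustrative assumptions:

    case class Point(x: Double, y: Double)   // case classes are Serializable
    val pts = sc.parallelize(Seq(Point(0, 1), Point(2, 3)))
    // Java serialization: simple, but slower than formats like Avro.
    pts.saveAsObjectFile("hdfs:///tmp/points")
    // Load back, supplying the element type explicitly.
    val loaded = sc.objectFile[Point]("hdfs:///tmp/points")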