版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
序列模式序列模式简介GSP算法
PrefixSpan算法本讲内容序列模式简介序列模式简介序列模式简介项目集(Itemset):是各种项目组成的集合序列(Sequence):是不同项目集(ItemSet)的有序排列,序列s可以表示为s=<s1s2…sl>,sj(1<=j<=l)为项目集(Itemset),也称为序列s的元素序列的长度:一个序列所包含的项目集(ItemSet)的个数。序列模式简介设a
=<a1a2…an>,b
=<b1b2…bm>,如果存在整数
1<=j1
<j2
<…<jn
<=m,使得a1
˝
bj1,a2
˝
bj2,…,an˝
bjn,则称序列a为序列b的子序列,又称序列b包含序列a,记为a
˝
b例如,序列<(3)(4
5)(8)>是序列<(7)(3
8)(9)(4
5
6)(8)>的子序列。但,<(3)(5)>不是<(3
5)>的子序列,反之亦然。给定一个序列集合,如果序列s不包含于任何一个其它的序列中,则称s是最大的(Maximal
Sequence)。项集与序列一条项集数据是一个集合,如{a,b,d},没有顺序关系一条序列数据是由多个项集依次排成的序列,项集之间有顺序关系如<a(abc)(ac)d(cf)>表示<{a}{a,
b,c}{a,
c}{d}{c,
f}>多元项集需加上括号,无括号的字母视作单项集子序列若一条序列T的每个字母都能在S中依次找到对应字母则称T是S的子序列如:<(ab)c>是序列10和30的子序列<dc>是序列10、20、30的子序列<abc>是序列10、40的子序列<cg>不是任何序列的子序列小练习<(ab)d>是序列
的子序列<cc>是序列
的子序列给定最小支持度是50%<aa>,支持度是
,
(是or
不是)频繁序列<(ad)e>,支持度是
,(是or
不是)频繁序列频繁序列若一条序列T在序列数据集中作为子序列的频率≥α则称序列T是序列数据集的频繁序列如:α称为阈值,可以人为设定*简单起见,在接下来的所有内容中,α都设为50%<(ab)c>,50%(10、30),是频繁序列<ac>,100%(10、20、30、40),是频繁序列<(cf)>,25%(10),不是频繁序列<cab>,0%,不是频繁序列SIDsequence10<a(abc)(ac)d(cf)>20<(ad)c(bc)(ae)>30<(ef)(ab)(df)cb>40<eg(af)cbc>寻找频繁序列用途:
货架摆放内容推送搜索联想……序列模式简介Customer-sequenceAll
the
transactions
of
a
customer,
ordered
byincreasing
transaction-time,
corresponds
to
asequence.序列a在序列数据库S中的支持数为序列数据库S中包含序列a的序列个数,记为Support(a)序列模式挖掘:找出所有最大的频繁子序列。itemset的支持度:itemset作为一个整体,在整个数据库中出现的比例,比如,单次购物时同时购买该itemset中所有项目的顾客占全部顾客的百分比itemset的支持度等于1-sequence的支持度litemset
(Large
itemset):A
itemsetwith
minimum
supportCandidate
sequenceLarge
sequence序列模式简介ExampleQ.
How
to
find
the
sequential
patterns?Example以Customer_Id
及TransactionTime
排序TransactionItemItemsetExample
(cont.)Sequence3-Sequence<(30)
(90)>
is
supported
by
customer
1
and
4<30
(40
70)>
is
supported
by
customer
2
and
4With
minimum
support
of
2
customers:The
large
itemset
(litemset):(30),
(40),
(70),
(40
70),
(90)GSP算法GSP算法Sequence
phaseFive
phasesSort
phaseLarge
itemset
phaseTransformation
phaseAprioriMaximal
phaseSort
the
database
with
customer-id
asthe
major
key
and
transaction-time
asthe
minor
keySortphaseFind
the
large
itemset.Itemsets
mappingLitemset
phaseReason:
可把litemset当作一个整体对待。比较两个litemsets是否相等用的时间是恒定的,在判断某个序列是否包含于另一个序列时可以省时间。Transformation
phaseDeleting
non-large
itemsetsMapping
large
itemsets
to
integersSequence
phaseUse
the
set
of
litemsets
to
find
thelarge
sequence
(frequent
sequence).MaximalphaseFind
the
maximum
sequences
amongthe
set
of
large
sequences.In
some
algorithms,
this
phase
iscombined
with
the
sequence
phase.MaximalphaseAlgorithm:S
thesetof
all
large
sequencesn thelength
of
the
longest
sequencefor
(k
=
n;
k
>
1;
k--)
doforeach
k-sequence
sk
doDelete
from
S
all
subsequences
of
skAprioriAll算法The
basic
method
to
mine
sequentialpatternsBased
on
the
Apriori
algorithm.Count
all
the
large
sequences,
includingnon-maximal
sequences.Use
Apriori-generate
function
togenerate
candidate
sequence.Apriori
Candidate
Generationgenerate
candidates
for
pass
using
onlythe
large
sequences
found
in
theprevious
pass
and
then
makes
a
passover
the
data
to
find
their
support.Algorithm:Lk-1thesetof
all
large
(k-1)-sequencesApriori
CandidateGenerationCk
the
set
of
candidate
k-sequencesJoin
Phase:insert
intoCkselect
p.litemset1,
p.litemset2,…,
p.litemsetk-1,
q.litemsetk-1from
Lk-1
p,
Lk-1
qwhere
p.litemset1=q.litemset1,…,
p.litemsetk-2=q.litemsetk-2;Prung
Phase:forall
sequences
c˛
Ck
doforall
(k-1)-subsequences
s
of
c
doif
(sˇLk-1)
thendelete
cfrom
Ck;Example关联规则与序列模式生成候选集时的区别join
序列模式需要关注顺序例如,有如下三条序列100
<(10)
(20)>200<(20)
(10)>200<(20)
(10)>假设minsupp=1,(10)(20)都是litemset,分别map为1,2则生成C2时需要同时包含<12>和<21>可以发现,序列<(10)(20)>和<(20)(10)>都是所需要的序列模式。如果join时不考虑顺序的话,将丢掉<(20)(10)>。AprioriAll
(cont.)L1
=
{large
1-sequences};
//
Result
of
litemsetphasefor(
k=2;
Lk-1≠Φ;
k++)
dobeginCk
=
New
candidate
generate
from
Lk-1foreach
customer-sequence
c
in
the
database
doIncrementthe
countofallcandidatesin
Ck
that
are
contained
in
cLk
=
Candidates
in
Ck
withminimumsupport.EndAnswer=Maximal
Sequences
in
Lk;Example关联规则与序列模式对itmeset计数时的区别如果相同的项目集合在同一个序列里出现多次的话,只记一次例如,有如下三条序列100<(10)
(20)
(10)>200
<(20)>200<(20)
(30)>假设minsuff=2,则(20)是litemset,而(10)只能算出现了一次,因而不是litemset.Apriori
Candidate
GenerationExample:
(Customer
Sequences)<{1
5}{2}{3}{4}><{1}{3}{4}{3
5}><{1}{2}{3}{4}><{1}{3}{5}><{4}{5}>With
minimum
set
to
25%next
step:
find
the
large
1-sequencesnext
step:
find
the
large
2-sequences1111ExampleLarge
1-Sequence<{
5}{2}{3}{4}><{ }{3}{4}{3
5}><{
}{2}{3}{4}><{
}{3}{5}><{4}{5}>SequenceSupport<1>4<2>2<3>4<4>4<5>4Example<12><21><13><31><14><41><15><51><23><32><24><42><25><52><34><43><35><53><45><54>L1joinL2SequenceSupport<1>4<2>2<3>4<4>4<5>4<12>
pr<21><13><31><14><41><15><51><23><32><24><42><25><52><34><43><35><53><45><54>ungSequenceSupport<1
2>2<1
3>4<1
4>3<1
5>3<2
3>2<2
4>2<3
4>3<3
5>2<4
5>2<{1
5}{2}{3}{4}><{1}{3}{4}{3
5}><{1}{2}{3}{4}><{1}{3}{5}><{4}{5}><123><132><124><142><125><152><134><143><135><153><145><154><234><243><345><354><123><132><124><142><125><152><134><143><135><153><145><154><234><243><345><354><1
2>2<1
3>4<1
4>3<1
5>3<2
3>2<2
4>2<3
4>3<3
5>2<4
5>2L2joinL3prung<{1
5}{2}{3}{4}><{1}{3}{4}{3
5}><{1}{2}{3}{4}><{1}{3}{5}><{4}{5}><1
2
3>2<1
2
4>2<1
3
4>3<1
3
5>2<2
3
4>2<1234><1243><1345><1354>1234<1
2
3>2<1
2
4>2<1
3
4>3<1
3
5>2<2
3
4>2L3joinL4prung<{1
5}{2}{3}{4}><{1}{3}{4}{3
5}><{1}{2}{3}{4}><{1}{3}{5}><{4}{5}><1
2
3
4>
2<1
2
3
4>SequenceSupport2ExampleSequenceSupport<1>4<2>2<3>4<4>4<5>2<4
5>SequenceSupport<1
2>2<1
3>4<1
4>3<1
5>3<2
3>2<2
4>2<3
4>3<3
5>22<1
3
5>SequenceSupport<1
2
3>2<1
2
4>2<1
3
4>32<2
3
4>2Find
the
maximal
large
sequencesjudgmentWaste
too
much
time
in
countingnon-maximal
sequence,
which
isimpossible
to
be
a
sequential
pattern.AprioriSomeIt
generates
candidates
for
a
pass
using
only
the
large
sequences
found
inthe
previous
pass.Divided
into
2
phase:forward
vs.
backwardAdvantage:Reduce
counting
time
wasted
in
counting
non-maximal
ones.Nextfunctionhits
=|
Lk
|
/
|Ck
|AprioriSome
(cont.)//Forward
PhaseL1
=
{Large
1-sequences};
//
Result
of
the
litemset
phaseC1
=
L1;
//
so
that
we
have
a
nice
loop
conditionLast
=1;
//
we
last
counted
Clastfor
(k=2;
Ck-1≠φ
and
Llast≠φ;k++)
dobeginif
(Lk-1
known)
thenCk=New
candidates
generated
from
Lk-1;ElseCk
=New
candidates
generated
from
Ck-1;if
(k==next(last))
then
beginForeach
customer-sequence
c
in
the
database
doIncrement
the
count
of
all
candidates
in
Ck
that
are
contained
in
cLk=Candidates
in
Ck
with
minimum
support.Last=k;endendAprioriSome
(cont.)//Backward
Phasefor
(k--;
k>=1;
k--
)
doif
(Lk
was
not
determined
in
the
forward
phase)
then
beginDelete
all
sequences
in
Ck
contained
in
some
Li,
i>k;foreach
customer-sequence
c
in
DT
doIncrement
the
count
of
all
candidates
in
Ck
that
are
contained
in
c.Lk=Candidates
in
Ck
with
minimum
support.endelse
begin
//
Lk
already
knownDelete
all
sequences
in
Lk
contained
in
some
Li,
i>k;endendAnswer=UkLk;CkLkC1L1f(k)=2kExample
-
AprioriSomeC2---L3L2C3C5C4L4So
far,
wecoveredRead-basedWrite-basedPoint-basedAssociationMiningApriori[AgSr94]VIPER[SHSB00]FPGrowth
[HaPe00]Sub-GraphMiningAGM[PKDD’00]FSG[ICDM’01]Mofa[ICDM’02]gSpan[ICDM’02]Sequential
PatternDiscoveryGSP[AgSr96]PrefixSpan
[PHPC01]Iceberg
CubeApriori[AgSr94]BUC[BeRa99],H-cubing[HPDW01]References[AgSr94]
R.
Agrawal,
R.
Srikant,
“Fast
Algorithms
for
Mining
AssociationRules”,
Proc.of
the
20th
Int'l
Conference
on
Very
Large
Databases,
Santiago,
Chile,
Sept.
1994.[AgSr96]
R.
Srikant,
R.
Agrawal:
"Mining
Sequential
Patterns:Generalizations
and
Performance
Improvements",
to
appear
in
Proc.
of
the
Fifth
Int'l
Conference
onExtending
DatabaseTechnulogy
(EDBT),
Avignon,
France,
March
1996.[BeRa99]
Kevin
S.
Beyerand
Raghu
Ramakrishnan.
“Bottom-UpComputation
ofSparse
and
Iceberg
CUBEs”.
In
Proceedings
of
the
ACM
SIGMOD
International
Conference
onManagement
ofData,
pages
359-370,June
1999.[HPDW01]J.Han,J.Pei,G.Dong,
andK.Wang,
“Efficient
Computation
of
Iceberg
Cubes
with
Complex
Measures”,
Proc.
2001
ACM-SIGMOD
Int.
Conf.
on
Managementof
Data
(SIGMOD'01),
Santa
Barbara,
CA,
May2001.[HPY00]
J.
Han,
J.
Pei,
and
Y.
Yin,
``
Mining
Frequent
Patterns
withoutCandidate
Generation'',,
Proc.
2000
ACM-SIGMOD
Int.
Conf.
on
Management
of
Data(SIGMOD'00),
Dallas,
TX,
May2000.[HaPe00]
J.
Han
and
J.
Pei``Mining
Frequent
Patterns
by
Pattern-Growth:
Methodology
and
Implications
'',
ACM
SIGKDD
Explorations
(Special
Issue
on
ScalebleDataMiningAlgorithms),
2(2),
December
2000.[PHPC01]J.Pei,J.Han,H.Pinto,
Q.
Chen,
U.
Dayal,
and
M.-C.
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026黑龙江哈尔滨“丁香人才周”(春季)事业单位引才招聘考试备考试题及答案解析
- 2026年河南黄金叶投资管理有限公司所属企业大学生招聘29人(第一批次)考试备考试题及答案解析
- 2026川投(达州)燃气发电有限公司招聘3人笔试模拟试题及答案解析
- 2026广东惠州市大亚湾开发区霞涌街道办事处工作人员招聘5人考试备考试题及答案解析
- 2026年中国烟草总公司四川省公司校园招聘笔试备考题库及答案解析
- 2026重庆幼儿师范高等专科学校招聘8人笔试参考题库及答案解析
- 2026年江西铜业集团建设有限公司春季社会招聘3人考试备考试题及答案解析
- 2026年石家庄能源投资集团有限公司校园招聘笔试备考题库及答案解析
- 2026四川成都师范学院考核招聘高层次人才29人笔试模拟试题及答案解析
- 东星印刷厂印刷工程合同模板合同三篇
- 物业设备巡检计划方案(3篇)
- 快递业安全生产培训课件
- 化工工艺设计培训
- 2025年血透室血传播疾病阴转阳的应急演练脚本
- 应急管理通论(第二版)课件 第9章 应急沟通职能
- 乙酰半胱氨酸的用药护理
- 要素式民事起诉状(侵害著作权及邻接权纠纷)
- 2025年新疆中考化学真题(原卷版)
- 2025年内江市中考地理试题(含答案解析)
- 皮肤外科进修汇报
- 2025年贵州省中考英语一模试题无答案
评论
0/150
提交评论