版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
DataStreamMiningandQueryingSlidestakenfromanexcellentTutorialonDataStreamMiningandQueryingbyMinosGarofalakis,
JohannesGehrkeandRajeevRastogi
AndbyMino’slecturesslidesfromthere:/cs286sp07/ProcessingDataStreams:MotivationAgrowingnumberofapplicationsgeneratestreamsofdataPerformancemeasurementsinnetworkmonitoringandtrafficmanagementCalldetailrecordsintelecommunicationsTransactionsinretailchains,ATMoperationsinbanksLogrecordsgeneratedbyWebServersSensornetworkdataApplicationcharacteristicsMassivevolumesofdata(severalterabytes)RecordsarriveatarapidrateGoal:Minepatterns,processqueriesandcomputestatisticsondatastreamsinreal-timeDataStreams:ComputationModelAdatastreamisa(massive)sequenceofelements:
StreamprocessingrequirementsSinglepass:EachrecordisexaminedatmostonceBoundedstorage:LimitedMemory(M)forstoringsynopsisReal-time:Perrecordprocessingtime(tomaintainsynopsis)mustbelowStreamProcessingEngine(Approximate)AnswerSynopsisinMemoryDataStreamsDataStreamProcessingAlgorithmsGenerally,algorithmscomputeapproximateanswersDifficulttocomputeanswersaccuratelywithlimitedmemoryApproximateanswers-DeterministicboundsAlgorithmsonlycomputeanapproximateanswer,butboundsonerrorApproximateanswers-ProbabilisticboundsAlgorithmscomputeanapproximateanswerwithhighprobabilityWithprobabilityatleast,thecomputedansweriswithinafactoroftheactualanswerSingle-passalgorithmsforprocessingstreamsalsoapplicableto(massive)terabytedatabases!
Sampling:BasicsIdea:AsmallrandomsampleSofthedataoftenwell-representsallthedataForafastapproxanswer,apply“modified”querytoSExample:selectaggfromRwhereR.eisodd
(n=12)
Ifaggisavg,returnaverageofoddelementsinSIfaggiscount,returnaverageoverallelementseinSofnifeisodd0ifeisevenUnbiased:Forexpressionsinvolvingcount,sum,avg:theestimatorisunbiased,i.e.,theexpectedvalueoftheansweristheactualanswerDatastream:935271658491SampleS:9518answer:5answer:12*3/4=9ProbabilisticGuaranteesExample:Actualansweriswithin5±1withprob0.9UseTailInequalitiestogiveprobabilisticboundsonreturnedanswerMarkovInequalityChebyshev’sInequalityHoeffding’sInequalityChernoffBoundTailInequalitiesGeneralboundsontailprobabilityofarandomvariable(thatis,probabilitythatarandomvariabledeviatesfarfromitsexpectation)
BasicInequalities:LetXbearandomvariablewithexpectationandvarianceVar[X].ThenforanyProbabilitydistributionTailprobabilityMarkov:Chebyshev:TailInequalitiesforSumsPossibletoderivestrongerboundsontailprobabilitiesforthesumofindependentrandomvariablesHoeffding’sInequality:LetX1,...,Xmbeindependentrandomvariableswith0<=Xi<=r.Letandbetheexpectationof.Then,foranyApplicationtoavgqueries:missizeofsubsetofsampleSsatisfyingpredicate(3inexample)risrangeofelementvaluesinsample(8inexample)TailInequalitiesforSums(Contd.)PossibletoderiveevenstrongerboundsontailprobabilitiesforthesumofindependentBernoullitrialsChernoffBound:LetX1,...,XmbeindependentBernoullitrialssuchthatPr[Xi=1]=p(Pr[Xi=0]=1-p).Letandbetheexpectationof.Then,forany,Applicationtocountqueries:missizeofsampleS(4inexample)pisfractionofoddelementsinstream(2/3inexample)
Remark:ChernoffboundresultsintighterboundsforcountqueriescomparedtoHoeffding’sinequalityComputingStreamSampleReservoirSampling[Vit85]:
MaintainsasampleSofafixed-sizeMAddeachnewelementtoSwithprobabilityM/n,wherenisthecurrentnumberofstreamelementsIfaddanelement,evictarandomelementfromSInsteadofflippingacoinforeachelement,determinethenumberofelementstoskipbeforethenexttobeaddedtoSConcisesampling[GM98]:DuplicatesinsampleSstoredas<value,count>pairs(thus,potentiallyboostingactualsamplesize)AddeachnewelementtoSwithprobability1/T(simplyincrementcountifelementalreadyinS)IfsamplesizeexceedsMSelectnewthresholdT’>TEvicteachelement(decrementcount)fromSwithprobability1-T/T’AddsubsequentelementstoSwithprobability1/T’StreamingModel:SpecialCases
Time-SeriesModelOnlyj-thupdateupdatesA[j](i.e.,A[j]:=c[j])
Cash-RegisterModelc[j]isalways>=0(i.e.,increment-only)Typically,c[j]=1,soweseeamulti-setofitemsinonepass
TurnstileModelMostgeneralstreamingmodelc[j]canbe>0or<0(i.e.,incrementordecrement)
ProblemdifficultyvariesdependingonthemodelE.g.,MIN/MAXinTime-Seriesvs.Turnstile!Linear-Projection(akaAMS)SketchSynopsesGoal:Buildsmall-spacesummaryfordistributionvectorf(i)(i=1,...,N)seenasastreamofi-valuesBasicConstruct:
RandomizedLinearProjectionoff()=projectontoinner/dotproductoff-vectorSimpletocomputeoverthestream:Addwheneverthei-thvalueisseenGenerate‘sinsmall(logN)spaceusingpseudo-randomgeneratorsTunableprobabilisticguaranteesonapproximationerrorDelete-Proof:Justsubtracttodeleteani-thvalueoccurrenceComposable:Simplyaddindependently-builtprojectionsDatastream:3,1,2,4,2,3,5,...Datastream:3,1,2,4,2,3,5,...f(1)f(2)f(3)f(4)f(5)11122where=vectorofrandomvaluesfromanappropriatedistributionExample:Binary-JoinCOUNTQueryProblem:ComputeanswerforthequeryCOUNT(RAS)Example:Exactsolution:tooexpensive,requiresO(N)space!N=sizeof(domain(A))DatastreamR.A:41241412032134DatastreamS.A:31242412213421=10(2+2+0+6)BasicAMSSketchingTechnique[AMS96]KeyIntuition:Userandomizedlinearprojectionsoff()todefinerandomvariableXsuchthatXiseasilycomputedoverthestream(insmallspace)E[X]=COUNT(RAS)Var[X]issmall
BasicIdea:Defineafamilyof4-wiseindependent{-1,+1}randomvariables
Pr[=+1]=Pr[=-1]=1/2Expectedvalueofeach,E[]=0Variablesare4-wiseindependentExpectedvalueofproductof4distinct=0Variablescanbegeneratedusingpseudo-randomgeneratorusingonlyO(logN)space(forseeding)!Probabilisticerrorguarantees(e.g.,actualansweris10±1withprobability0.9)AMSSketchConstructionComputerandomvariables:andSimplyaddtoXR(XS)wheneverthei-thvalueisobservedintheR.A(S.A)streamDefineX=XRXStobeestimateofCOUNTqueryExample:DatastreamR.A:412414DatastreamS.A:3124241202134122134213Binary-JoinAMSSketchingAnalysisExpectedvalueofX=COUNT(RAS)
Using4-wiseindependence,possibletoshowthat
isself-joinsizeofR(second/L2moment)01BoostingAccuracyChebyshev’sInequality:
BoostaccuracytobyaveragingoverseveralindependentcopiesofX(reducesvariance)
ByChebyshev:xxxAverageyBoostingConfidenceBoostconfidencetobytakingmedianof2log(1/)independentcopiesofYEachY=BernoulliTrial
Pr[|median(Y)-COUNT|COUNT](ByChernoffBound)=Pr[#failuresin2log(1/)trials>=log(1/)]yyycopiesmedian2log(1/)“FAILURE”:SummaryofBinary-JoinAMSSketchingStep1:Computerandomvariables:andStep2:DefineX=XRXSSteps3&4:AverageindependentcopiesofX;Returnmedianofaverages
MainTheorem(AGMS99):SketchingapproximatesCOUNTtowithinarelativeerrorofwithprobabilityusingspace
Remember:O(logN)spacefor“seeding”theconstructionofeachX
xxxAverageyxxxAverageyxxxAverageycopiescopiesmedian2log(1/)ASpecialCase:Self-joinSizeEstimateCOUNT(RAR)(originalAMSpaper)Second(L2)momentofdatadistribution,Giniindexofheterogeneity,measureofskewinthedata
Inthiscase,COUNT=SJ(R),sowegetan(e,d)-estimateusingspaceonly
Best-caseforAMSstreamingjoin-sizeestimationDistinctValueEstimationProblem:Findthenumberofdistinctvaluesinastreamofvalueswithdomain[0,...,N-1]Zerothfrequencymoment,L0(Hamming)streamnormStatistics:numberofspeciesorclassesinapopulationImportantforqueryoptimizersNetworkmonitoring:distinctdestinationIPaddresses,source/destinationpairs,requestedURLs,etc.Example(N=64)Datastream:305301751037Numberofdistinctvalues:5Assumeahashfunctionh(x)thatmapsincomingvaluesxin[0,…,N-1]uniformlyacross[0,…,2^L-1],whereL=O(logN)Letlsb(y)denotethepositionoftheleast-significant1bitinthebinaryrepresentationofyAvaluexismappedtolsb(h(x))MaintainHashSketch=BITMAParrayofLbits,initializedto0Foreachincomingvaluex,setBITMAP[lsb(h(x))]=1Hash(akaFM)SketchesforDistinctValueEstimation[FM85]x=5h(x)=101100lsb(h(x))=2000001BITMAP543210Hash(akaFM)SketchesforDistinctValueEstimation[FM85]Byuniformitythroughh(x):Prob[BITMAP[k]=1]=Prob[]=Assumingddistinctvalues:expectd/2tomaptoBITMAP[0],d/4tomaptoBITMAP[1],...LetR=positionofrightmostzeroinBITMAPUseasindicatoroflog(d)[FM85]provethatE[R]=,whereEstimated=Averageseveraliidinstances(differenthashfunctions)toreduceestimatorvariancefringeof0/1saroundlog(d)000001BITMAP000111111111position<<log(d)position>>log(d)0L-1[FM85]assume“ideal”hashfunctionsh(x)(N-wiseindependence) [AMS96]:pairwiseindependenceissufficienth(x)=,wherea,barerandombinaryvectorsin[0,…,2^L-1]Small-spaceestimatesfordistinctvaluesproposedbasedonFMideasDelete-Proof:Justusecountersinsteadofbitsinthesketchlocations+1forinserts,-1fordeletesComposable:Compon
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 身边的人身边的事演讲稿参考7篇
- 下肢静脉曲张的护理查房
- 信息技术2.0教学设计案例
- 急性淤胆型乙型病毒性肝炎的个案护理
- 冲压件加工协议
- 河南省焦作市2023-2024学年八年级上学期期末英语试题(含听力)
- 河北省沧州市沧州十校2023-2024学年高二下学期3月月考英语试题含听力
- 电梯维修合同范本
- 英语七年级下册第一单元测试题
- 教案模板(教案模板)
- 德育教育(教育学专业)PPT完整全套教学课件
- 配料作业指导书
- GB/T 23961-2023低碳脂肪胺含量的测定气相色谱法
- 选矿厂施工方案
- 小学语文多文本阅读教学策略
- 单台套物流方案
- 制作竹蜻蜓(教案)二年级上册综合实践活动通用版
- 2023年同等学力申硕-同等学力(教育学)考试历年高频考点试题含答案
- 课堂游戏惩罚-课件
- 西昌志能实业有限责任公司大陆槽稀土矿③号矿体开采技改工程环境影响报告
评论
0/150
提交评论