Mining of Massive Datasets
Leskovec, Rajaraman, and Ullman
Stanford University

- For cosine distance, there is a technique analogous to minhashing for generating a (d1, d2, (1 - d1/180), (1 - d2/180))-sensitive family for any d1 and d2.
- Called random hyperplanes.

- Each vector v determines a hash function h_v with two buckets.
- h_v(x) = +1 if v.x > 0; = -1 if v.x < 0.
- LS-family H = set of all functions derived from any vector.
- Claim: Prob[h(x) = h(y)] = 1 - (angle between x and y divided by 180).

[Figure: look in the plane of x and y, with θ the angle between them. Hyperplanes (normal to v) for which h(x) ≠ h(y) are the red case, and Prob[red case] = θ/180. The remaining hyperplanes are those for which h(x) = h(y).]

- Pick some number of vectors, and hash your data for each vector.
- The result is a signature (sketch) of +1's and -1's that can be used for LSH like the minhash signatures for Jaccard distance.
- But you don't have to think this way.
- The existence of the LSH-family is sufficient for amplification by AND/OR.

- We need not pick from among all possible vectors v to form a component of a sketch.
- It suffices to consider only vectors v consisting of +1 and -1 components.

- Simple idea: hash functions correspond to lines.
- Partition the line into buckets of size a.
- Hash each point to the bucket containing its projection onto the line.
- Nearby points are always close; distant points are rarely in the same bucket.

[Figure: two points at distance d, projected onto a randomly chosen line with bucket width a; the projected distance is d cos θ. If d << a, then the chance the points are in the same bucket is at least 1 - d/a. If d >> a, θ must be close to 90° for there to be any chance the points go to the same bucket.]

- If points are distance > 2a apart, then 60° < θ < 90° for there to be a chance that the points go in the same bucket.
  - I.e., at most 1/3 probability.
- If points are distance < a/2, then there is at least 1/2 chance they share a bucket.
- Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.

- For previous distance measures, we could start with a (d, e, p, q)-sensitive family for any d < e, and drive p and q to 1 and 0 by AND/OR constructions.
- Here, we seem to need e > 4d.

- But as long as d < e, the probability of points at distance d falling in the same bucket is greater than the probability of points at distance e doing so.
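The two hash families described above can be tried out with a short Python sketch. It is an illustration under stated assumptions, not code from the slides: the function names (`hyperplane_sketch`, `estimate_angle`, `line_bucket`, `same_bucket_rate`) are hypothetical, the random vectors use Gaussian components rather than +1/-1 components so that the angle estimate behaves well in low dimension, and the random `offset` that shifts the bucket boundaries is an added assumption so that the points' position relative to the boundaries is random.

```python
import math
import random


def hyperplane_sketch(x, vectors):
    """Return the +1/-1 sketch of x, one component per vector v."""
    # h_v(x) = +1 if v.x > 0, otherwise -1. Sending v.x == 0 to -1 is an
    # arbitrary tie-break (assumption); the slides do not define that case.
    return [1 if sum(vi * xi for vi, xi in zip(v, x)) > 0 else -1
            for v in vectors]


def estimate_angle(sketch_x, sketch_y):
    """Estimate the angle in degrees from the fraction of disagreeing components."""
    disagree = sum(1 for a, b in zip(sketch_x, sketch_y) if a != b)
    return 180.0 * disagree / len(sketch_x)


def line_bucket(point, direction, a, offset=0.0):
    """Bucket index of the projection of point onto a line (unit direction)."""
    projection = sum(p * u for p, u in zip(point, direction))
    return math.floor((projection + offset) / a)


def same_bucket_rate(p, q, a, trials, rng):
    """Fraction of random lines in the plane that put p and q in one bucket."""
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, math.pi)        # random line direction
        direction = (math.cos(theta), math.sin(theta))
        offset = rng.uniform(0.0, a)             # random boundary shift (assumption)
        if line_bucket(p, direction, a, offset) == line_bucket(q, direction, a, offset):
            hits += 1
    return hits / trials


rng = random.Random(0)

# Cosine distance: x and y are 45 degrees apart.
x, y = (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2000)]
est_angle = estimate_angle(hyperplane_sketch(x, vectors),
                           hyperplane_sketch(y, vectors))
print("estimated angle (degrees):", est_angle)

# Euclidean distance with bucket width a = 1.
near_rate = same_bucket_rate((0.0, 0.0), (0.4, 0.0), 1.0, 5000, rng)  # d < a/2
far_rate = same_bucket_rate((0.0, 0.0), (2.5, 0.0), 1.0, 5000, rng)   # d > 2a
print("same-bucket rate, d < a/2:", near_rate)
print("same-bucket rate, d > 2a:", far_rate)
```

With enough random vectors the estimated angle should land near 45 degrees, and the two same-bucket rates should fall on the expected sides of the 1/2 and 1/3 bounds of the (a/2, 2a, 1/2, 1/3)-sensitive family.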