




Mining of Massive Datasets
Leskovec, Rajaraman, and Ullman
Stanford University

- For cosine distance, there is a technique analogous to minhashing for generating a $(d_1, d_2, 1 - d_1/180, 1 - d_2/180)$-sensitive family for any $d_1$ and $d_2$.
- It is called random hyperplanes.

- Each vector $v$ determines a hash function $h_v$ with two buckets.
- $h_v(x) = +1$ if $v \cdot x > 0$; $h_v(x) = -1$ if $v \cdot x < 0$.
- LS-family $H$ = set of all functions derived from any vector.
- Claim: $\Pr[h(x) = h(y)] = 1 - \theta/180$, where $\theta$ is the angle between $x$ and $y$ in degrees.

[Figure: the plane of $x$ and $y$, with angle $\theta$ between them. Hyperplanes (normal to $v$) that separate $x$ from $y$ give $h(x) \ne h(y)$; this is the red case, with $\Pr[\text{red case}] = \theta/180$. All other hyperplanes give $h(x) = h(y)$.]

- Pick some number of vectors, and hash your data for each vector.
- The result is a signature (sketch) of +1s and -1s that can be used for LSH like the minhash signatures for Jaccard distance.
- But you don't have to think this way.
- The existence of the LSH-family is sufficient for amplification by AND/OR.

- We need not pick from among all possible vectors $v$ to form a component of a sketch.
- It suffices to consider only vectors $v$ consisting of +1 and -1 components.

- Simple idea: hash functions correspond to lines.
- Partition the line into buckets of size $a$.
- Hash each point to the bucket containing its projection onto the line.
- Nearby points are always close; distant points are rarely in the same bucket.

[Figure: two points at distance $d$ projected onto a randomly chosen line with bucket width $a$. The line makes angle $\theta$ with the segment joining the points, so the projections are $d \cos\theta$ apart.]

- If $d \ll a$, then the chance the points are in the same bucket is at least $1 - d/a$.
- If $d \gg a$, $\theta$ must be close to 90° for there to be any chance the points go to the same bucket.

- If points are distance $> 2a$ apart, then $60° \le \theta \le 90°$ for there to be a chance that the points go in the same bucket.
  - I.e., at most 1/3 probability.
- If points are distance $< a/2$, then there is at least a 1/2 chance they share a bucket.
- Yields an $(a/2, 2a, 1/2, 1/3)$-sensitive family of hash functions.

- For previous distance measures, we could start with a $(d, e, p, q)$-sensitive family for any $d < e$, and drive $p$ and $q$ to 1 and 0 by AND/OR constructions.
- Here, we seem to need $e \ge 4d$.

- But as long as $d < e$, the probability of points at distance $d$ falling in the same bucket is greater than the probability of points at distance $e$ doing so.
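
The two families above can be tried out numerically. What follows is a minimal sketch, not the authors' code: all function names are hypothetical, the random vectors use Gaussian components (one common way to get a uniformly random direction), a dot product of exactly 0 is sent to the -1 bucket (the slides leave that case undefined), and the bucket boundaries on the line are shifted by a random offset, which is an assumption added so that the same-bucket frequency can be measured for fixed points.

```python
import math
import random


def gaussian_vectors(k, dim, rng):
    """Draw k random vectors with independent N(0, 1) components (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hyperplane_sketch(x, vectors):
    """One +1/-1 component per vector v: the sign of v.x (0 is mapped to -1)."""
    return [1 if sum(vi * xi for vi, xi in zip(v, x)) > 0 else -1
            for v in vectors]


def estimated_angle(sig_x, sig_y):
    """Estimate the angle in degrees as 180 * (fraction of disagreeing components)."""
    disagreements = sum(1 for s, t in zip(sig_x, sig_y) if s != t)
    return 180.0 * disagreements / len(sig_x)


def line_bucket(point, direction, a, offset):
    """Project point onto a unit direction; return the index of its width-a bucket."""
    projection = sum(p * d for p, d in zip(point, direction))
    return math.floor((projection + offset) / a)


def same_bucket_rate(p, q, a, trials, rng):
    """Fraction of random lines (2-D only) on which p and q share a bucket."""
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, math.pi)        # random line direction
        direction = (math.cos(theta), math.sin(theta))
        offset = rng.uniform(0.0, a)             # random bucket alignment (assumption)
        if line_bucket(p, direction, a, offset) == line_bucket(q, direction, a, offset):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = gaussian_vectors(2000, 2, rng)
    x, y = (1.0, 0.0), (0.0, 1.0)                # true angle: 90 degrees
    print("estimated angle:",
          estimated_angle(hyperplane_sketch(x, vectors), hyperplane_sketch(y, vectors)))
    a = 1.0
    print("d < a/2:", same_bucket_rate((0.0, 0.0), (0.4, 0.0), a, 5000, rng))
    print("d > 2a: ", same_bucket_rate((0.0, 0.0), (2.5, 0.0), a, 5000, rng))
```

With enough vectors the estimated angle should land near the true one, and the two measured rates should fall on the correct sides of the 1/2 and 1/3 bounds stated for the $(a/2, 2a, 1/2, 1/3)$-sensitive family.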