
Social Network Platforms and Technologies in the Cloud Computing Era
Ed Chang (张智威)
Deputy Director, Research Institute, Google China; Professor, Department of Electrical Engineering, University of California
9/24/2020, Ed Chang

The China Opportunity
[Table: China vs. U.S. across Population, Internet users, Broadband users, Mobile phones, and Engineering graduates; extracted figures: 180 million (25%), 208 million (3%), 60 million (90%), 60 million (29%), 500 million, 180 million, 600 k]

[...]

Memory-Based Model (Resnik et al. 1994; Konstan et al. 1997)
Pros:
- Simplicity; avoids a model-building stage
Cons:
- Memory- and time-consuming: consults the entire database every time a prediction is made
- Cannot make a prediction if the user has no items in common with other users

Model-Based Model (Breese et al. 1998; Hofmann 1999; Blei et al. 2004)
Pros:
- Scalability: the model is much smaller than the actual dataset
- Faster prediction: query the model instead of the entire dataset
Cons:
- Model building takes time

Algorithm Selection Criteria
- Near-real-time recommendation
- Scalable training
- Incremental training is desirable
- Can deal with data scarcity
- Cloud computing!
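To make the memory-based pros and cons concrete, here is a minimal user-based collaborative-filtering sketch (the ratings matrix and the cosine-over-co-rated-items weighting are illustrative choices, not taken from the slides): every prediction scans the whole matrix, and a user with no co-rated items gets no prediction.

```python
import numpy as np

# Toy user-item ratings, 0 = unrated (values are illustrative).
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

def predict(R, user, item):
    """Similarity-weighted average over every other user who rated `item`.
    The entire matrix is consulted on each query (the memory/time drawback),
    and None is returned when no neighbor shares rated items with `user`."""
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == user or R[v, item] == 0.0:
            continue
        both = (R[user] > 0) & (R[v] > 0)       # co-rated items only
        if not both.any():
            continue                             # no items in common
        sim = float(R[user][both] @ R[v][both]) / (
            np.linalg.norm(R[user][both]) * np.linalg.norm(R[v][both]))
        num += sim * R[v, item]
        den += abs(sim)
    return num / den if den else None

print(predict(R, user=1, item=1))
```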

Model-Based Prior Work
- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)

Latent Semantic Analysis (LSA) (Deerwester et al. 1990)
- Maps high-dimensional count vectors to a lower-dimensional representation called the latent semantic space
- By SVD decomposition: A = U Σ V^T
  - A: word-document co-occurrence matrix
  - U_ij: how likely word i belongs to topic j
  - Σ_jj: how significant topic j is
  - (V^T)_ij: how likely topic i belongs to document j

Latent Semantic Analysis (cont.)
- LSA keeps the k largest singular values
  - A low-rank approximation A_k to the original matrix
  - Saves space, de-noises, and reduces sparsity
- Make recommendations using
  - Word-word similarity: A_k A_k^T
  - Doc-doc similarity: A_k^T A_k
  - Word-document relationship: A_k
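The SVD recipe on the two LSA slides can be sketched with NumPy on a toy word-document count matrix (the matrix values and the choice k = 2 are invented for illustration):

```python
import numpy as np

# Toy word-document co-occurrence matrix A (4 words x 3 docs, illustrative).
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# SVD decomposition: A = U diag(s) V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values -> rank-k approximation A_k
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

word_word = A_k @ A_k.T   # word-word similarity
doc_doc = A_k.T @ A_k     # doc-doc similarity

# Eckart-Young: the Frobenius error of A_k equals the energy of the
# discarded singular values, here just s[2].
err = np.linalg.norm(A - A_k, "fro")
print(err, s[2])
```

By the Eckart-Young theorem, A_k is the best rank-k approximation in Frobenius norm, which is why keeping only the k largest singular values both saves space and de-noises.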

Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 1999; Hofmann 2004)
- A document is viewed as a bag of words
- A latent semantic layer is constructed between documents and words:
  P(w, d) = P(d) P(w|d) = P(d) Σ_z P(w|z) P(z|d)
- Probabilities carry explicit meaning: P(w|w'), P(d|d'), P(d, w)
- Model learning via the EM algorithm
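A compact sketch of the EM updates for the PLSA decomposition above, run on a toy count matrix (the data and the aspect count K are illustrative; this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-word count matrix n(d, w): 4 docs x 6 words (illustrative).
N = np.array([
    [4, 3, 0, 0, 1, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 1, 4, 3, 2],
    [0, 1, 0, 3, 4, 3],
], dtype=float)
D, W = N.shape
K = 2  # latent aspects z

# Random row-normalized initialization of P(w|z) and P(z|d).
Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
Pz_d = rng.random((D, K)); Pz_d /= Pz_d.sum(axis=1, keepdims=True)

def loglik(N, Pw_z, Pz_d):
    """Sum of n(d, w) log P(w|d), dropping the constant P(d) factor."""
    return float((N * np.log(Pz_d @ Pw_z + 1e-12)).sum())

ll0 = prev = loglik(N, Pw_z, Pz_d)
for _ in range(50):
    # E-step: responsibilities P(z|d, w) proportional to P(w|z) P(z|d).
    Rz = Pz_d[:, :, None] * Pw_z[None, :, :]        # shape D x K x W
    Rz /= Rz.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate the multinomials from expected counts.
    C = N[:, None, :] * Rz                          # expected counts
    Pw_z = C.sum(axis=0); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    Pz_d = C.sum(axis=2); Pz_d /= Pz_d.sum(axis=1, keepdims=True)
    cur = loglik(N, Pw_z, Pz_d)
    assert cur >= prev - 1e-6   # EM never decreases the likelihood
    prev = cur
```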

PLSA Extensions
- PHITS (Cohn & Chang 2000): models document-citation co-occurrence
- A linear combination of PLSA and PHITS (Cohn & Hofmann 2001): models the contents (words) and the inter-connectivity of documents
- LDA (Blei et al. 2003): provides a complete generative model with a Dirichlet prior
- AT (Griffiths & Steyvers 2004): includes authorship information; a document is categorized by authors and topics
- ART (McCallum et al. 2004): includes email recipients as additional information; an email is categorized by author, recipients, and topics

Combinational Collaborative Filtering (CCF)
- Fuses multiple sources of information
  - Alleviates the information-sparsity problem
- Hybrid training scheme
  - Gibbs sampling as initialization for the EM algorithm
- Parallelization
  - Achieves linear speedup with the number of machines

Notations
- Given a collection of co-occurrence data:
  - Community: C = {c_1, c_2, ..., c_N}
  - User: U = {u_1, u_2, ..., u_M}
  - Description: D = {d_1, d_2, ..., d_V}
  - Latent aspect: Z = {z_1, z_2, ..., z_K}
- Models
  - Baseline models: the Community-User (C-U) model and the Community-Description (C-D) model
  - CCF: Combinational Collaborative Filtering, which combines both baseline models

Baseline Models
Community-User (C-U) model
- A community is viewed as a bag of users
- c and u are rendered conditionally independent by introducing z
- Generative process, for each user u:
  1. A community c is chosen uniformly
  2. A topic z is selected from P(z|c)
  3. A user u is generated from P(u|z)
Community-Description (C-D) model
- A community is viewed as a bag of words
- c and d are rendered conditionally independent by introducing z
- Generative process, for each word d:
  1. A community c is chosen uniformly
  2. A topic z is selected from P(z|c)
  3. A word d is generated from P(d|z)
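The three-step C-U generative process can be exercised with a toy ancestral sampler; the probability tables below are hypothetical, chosen so that each community concentrates on different users:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical model tables for 2 communities, 2 topics, 4 users.
P_z_given_c = np.array([[0.9, 0.1],      # row = community c
                        [0.2, 0.8]])
P_u_given_z = np.array([[0.50, 0.40, 0.05, 0.05],   # row = topic z
                        [0.05, 0.05, 0.50, 0.40]])

def sample_pair():
    """One draw of the C-U process: uniform c, then z ~ P(z|c), u ~ P(u|z)."""
    c = int(rng.integers(2))                  # 1. community chosen uniformly
    z = int(rng.choice(2, p=P_z_given_c[c]))  # 2. topic from P(z|c)
    u = int(rng.choice(4, p=P_u_given_z[z]))  # 3. user from P(u|z)
    return c, u

pairs = [sample_pair() for _ in range(2000)]
# Community 0 should mostly generate users 0 and 1
# (probability 0.9*0.9 + 0.1*0.1 = 0.82 under these tables).
share = float(np.mean([u < 2 for c, u in pairs if c == 0]))
print(round(share, 3))
```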

Baseline Models (cont.)
Community-User (C-U) model
- Pros:
  1. Personalized community suggestion
- Cons:
  1. The C-U matrix is sparse and may suffer from the information-sparsity problem
  2. Cannot take advantage of content similarity between communities
Community-Description (C-D) model
- Pros:
  1. Clusters communities based on community content (description words)
- Cons:
  1. No personalized recommendation
  2. Does not consider the users overlapping between communities

CCF Model
Combinational Collaborative Filtering (CCF) model
- CCF combines both baseline models: a community is viewed as a bag of users AND a bag of words
- By adding C-U, CCF can perform personalized recommendation, which C-D alone cannot
- By adding C-D, CCF can perform better personalized recommendation than C-U alone, which may suffer from sparsity
- Things CCF can do that C-U and C-D cannot: P(d|u), relating users to words — useful for targeting ads at users

Algorithm Requirements
- Near-real-time recommendation
- Scalable training
- Incremental training is desirable

Parallelizing CCF
(Details omitted)

Industry Trend: The Arrival of the Cloud Computing Era
(1) Data lives in the cloud: no fear of loss, no need for backups
(2) Software lives in the cloud: no downloads, automatic upgrades
(3) Ubiquitous cloud computing: from any device, log in and it is yours
(4) Boundlessly powerful cloud computing: unlimited storage, unlimited speed

Experiments on the Orkut Dataset
- Data description
  - Collected on July 26, 2007
  - Two types of data were extracted: community-user and community-description
  - 312,385 users; 109,987 communities; 191,034 unique English words
- Tasks
  - Community recommendation
  - Community similarity/clustering
  - User similarity
  - Speedup

Community Recommendation
- Evaluation method
  - No ground truth and no user clicks available
  - Leave-one-out: randomly delete one community for each user, then test whether the deleted community can be recovered
- Evaluation metrics: precision and recall
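The leave-one-out protocol is easy to state in code. The sketch below plugs in a naive popularity recommender as a stand-in (a real evaluation would rank with CCF or C-U) over invented membership data; with exactly one held-out community per user, recall@k is the hit rate and precision@k is hits over all k recommended slots:

```python
# Hypothetical membership data: user -> set of joined communities.
memberships = {
    "u1": {"c1", "c2", "c3"},
    "u2": {"c1", "c3", "c4"},
    "u3": {"c2", "c3", "c5"},
}

def recommend(target, data, k):
    """Stand-in recommender: rank unseen communities by popularity among
    the other users (a real evaluation would use CCF or C-U here)."""
    counts = {}
    for user, cs in data.items():
        if user == target:
            continue
        for c in cs:
            counts[c] = counts.get(c, 0) + 1
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return [c for c in ranked if c not in data[target]][:k]

def leave_one_out(k):
    """Delete one community per user; count how often top-k recovers it."""
    hits = 0
    for user, joined in memberships.items():
        held_out = sorted(joined)[0]              # deterministic for the sketch
        data = dict(memberships)
        data[user] = joined - {held_out}          # hide it from the recommender
        hits += held_out in recommend(user, data, k)
    n = len(memberships)
    return hits / (n * k), hits / n               # precision@k, recall@k

precision, recall = leave_one_out(k=2)
print(precision, recall)
```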

Results
Observations:
- CCF outperforms C-U
- At top 20, CCF's precision and recall are roughly twice those of C-U
- The more communities a user has joined, the better CCF and C-U can predict

Runtime Speedup
- The Orkut dataset enjoys a linear speedup when the number of machines is up to 100
- Training time is reduced from one day to less than 14 minutes
- But what makes the speedup slow down beyond 100 machines?

Runtime Speedup (cont.)
- Training time consists of two parts:
  - Computation time (Comp)
  - Communication time (Comm)
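The Comp + Comm decomposition explains the flattening: if computation shrinks as 1/m while communication grows with the machine count m, the speedup T(1)/T(m) peaks and then declines. The constants below are illustrative, not the measured Orkut figures:

```python
def training_time(m, t_comp=86400.0, t_comm=1.0):
    """Hypothetical cost model: one day of serial computation split across
    m machines, plus per-machine communication overhead that grows with m.
    Constants are illustrative, not measurements."""
    return t_comp / m + t_comm * m

def speedup(m):
    return training_time(1) / training_time(m)

# Speedup climbs, peaks near m = sqrt(t_comp / t_comm), then declines
# as communication dominates computation.
for m in (1, 10, 100, 300, 1000):
    print(m, round(speedup(m), 1))
```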

CCF Summary
- Combinational Collaborative Filtering fuses bag-of-words and bag-of-users information
- Hybrid training provides better initializations for EM than random seeding
- Parallelized to handle large-scale datasets

China's Contributions to Cloud Computing
- Parallel CCF
- Parallel SVMs (kernel machines)
- Parallel SVD
- Parallel spectral clustering
- Parallel expectation-maximization
- Parallel association mining
- Parallel LDA

Speeding up SVMs (NIPS 2007)
- Approximate matrix factorization
- Parallelization
- Open source: 350+ downloads since December 2007
- A task that takes 7 days on 1 machine takes 1 hour on 500 machines

Incomplete Cholesky Factorization (ICF)
- Factor the n × n matrix into an n × p factor times its p × n transpose
- With p ≪ n, this conserves storage

Matrix Product
- (p × n) × (n × p) = (p × p)
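A small pivoted incomplete-Cholesky sketch shows the storage point: the n × n PSD matrix is replaced by an n × p factor G with K ≈ G Gᵀ, and downstream products such as Gᵀ G are only p × p (the toy kernel matrix below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def icf(K, p, tol=1e-10):
    """Pivoted incomplete Cholesky: approximate the n x n PSD matrix K by
    G @ G.T with G of shape n x p, storing n*p numbers instead of n*n."""
    n = K.shape[0]
    G = np.zeros((n, p))
    d = np.diag(K).astype(float).copy()      # residual diagonal
    for j in range(p):
        i = int(np.argmax(d))                # greedy pivot choice
        if d[i] <= tol:
            return G[:, :j]                  # residual already exhausted
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

# Toy low-rank "kernel" matrix: 50 x 50 but of rank 3 (illustrative data).
B = rng.standard_normal((50, 3))
K = B @ B.T

G = icf(K, p=3)
print(G.shape, (G.T @ G).shape, bool(np.allclose(K, G @ G.T)))
```

With p ≪ n, the p × p product on the next slide replaces the n × n one, which is what makes the parallel SVM training above tractable.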

Organizing the World's Information, Socially
- 社区平台 (Social Platform)
- 云运算 (Cloud Computing)
- 结论与前瞻 (Concluding Remarks)

Web With People
[Figure: a web of interlinked documents (.htm, .jpg, .doc, .xls, .msg files) connected with people]

What Next for Web Search?
- Personalization
  - Return query results considering personal preferences
  - Example: disambiguate a synonym like "fuji"
- Oops: several have tried; the problem is hard
  - Training data are difficult to collect in sufficient quantity (for collaborative filtering)
  - Supporting personalization is computationally intensive (e.g., personalizing PageRank)
  - User profiles may be incomplete or erroneous

Personalized Search, Intelligent Search
- A search for "富士" (Fuji) can return: Mount Fuji, Fuji apples, Fuji cameras

Organizing the World's Information, Socially
- The Web is a collection of documents and people
- Recommendation is a personalized, push model of search
- Collaborative filtering requires dense information to be effective
- Cloud computing is essential

References
[1] Alexa Internet.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the 17th International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.

References (cont.)
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. ReferralWeb: Combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Record, 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the 14th Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.

References (cont.)
[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1):143-177, 2004.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. B.
