Social Network Platforms and Technologies in the Cloud Computing Era (Google China, .ppt)

2/5/2019, Ed Chang

Social Network Platforms and Technologies in the Cloud Computing Era
Ed Chang (张智威)
Deputy Director, Research Institute, Google China
Professor, Department of Electrical Engineering, University of California

China Opportunity: China & U.S. in 2006-07

                         China                U.S.
Internet population      180 million (25%)    208 million (3%)
Broadband users          60 million (90%)     60 million (29%)
Mobile phones            500 million          180 million
Engineering graduates    600 k                72 k

Google China
  • Size (700)
      • 200 engineers
      • 400 other employees
      • Almost 100 interns
  • Locations
      • Beijing (2005)
      • Taipei (2006)
      • Shanghai (2007)

Organizing the World's Information, Socially
  • Social Platform
  • Cloud Computing
  • Concluding Remarks

Web 1.0
[Figure: .htm, .jpg, .doc, .msg files]

Web with People (2.0)
[Figure: .htm, .jpg, .doc, .xls, .msg files]

+ Social Platforms
[Figure: .htm, .jpg, .doc, .xls, .msg files plus apps (gadgets)]

Open Social Platform
[Figure labels: Open Social Platform; Social Platform]

Social Graph

What Users Want?
  • People care about other people
      • Care about people they know
      • Connect to people they do not know
  • Discover interesting information based on other people
      • About who other people are
      • About what other people are doing

Information Overflow Challenge
  • Too many people, too many choices of forums and apps
  • "I soon need to hire a full-time [person] to manage my online social networks"
  • Users want a social network recommendation system

Recommendation System
  • Friend recommendation
  • Community/forum recommendation
  • Application suggestion
  • Ads matching

Organizing the World's Information, Socially: Social Platform, Cloud Computing, Concluding Remarks
Industry Trend: The Cloud Computing Era Arrives
  • (1) Data lives in the cloud: no fear of loss, no need for backups
  • (2) Software lives in the cloud: no downloads, automatic upgrades
  • (3) Ubiquitous cloud computing: any device becomes yours once you log in
  • (4) Infinitely powerful cloud computing: unlimited space, unlimited speed

Internet Search: An Example of Cloud Computing
  • 1. The user enters query keywords
  • 2. Data is preprocessed in a distributed fashion to serve searches
      • Google infrastructure (thousands of commodity servers around the world)
      • MapReduce for mass data processing
      • Google File System
  • 3. Search results are returned

Collaborative Filtering
  • Given a matrix that "encodes" data
  • Many applications (collaborative filtering): user-community, user-user, ads-user, ads-community, etc.
[Figure: matrix with axes "users" and "communities"]

Collaborative Filtering (CF) (Breese, Heckerman and Kadie 1998)
  • Memory-based
      • Given user u, find "similar" users (k nearest neighbors)
      • Similar users bought similar items, saw similar movies, have similar profiles, etc.
      • Different similarity measures yield different techniques
      • Make predictions based on the preferences of these "similar" users
  • Model-based
      • Build a model of the relationship between subject matters
      • Make predictions based on the constructed model

Memory-Based Model (Goldberg et al. 1992; Resnick et al. 1994; Konstan et al. 1997)
  • Pros
      • Simplicity: avoids the model-building stage
  • Cons
      • Memory and time consuming: uses the entire database every time to make a prediction
      • Cannot make a prediction if the user has no items in common with other users

Model-Based Model (Breese et al. 1998; Hofmann 1999; Blei et al. 2004)
  • Pros
      • Scalability: the model is much smaller than the actual dataset
      • Faster prediction: query the model instead of the entire dataset
  • Cons
      • Model building takes time

Algorithm Selection Criteria
  • Near-real-time recommendation
  • Scalable training
  • Incremental training is desirable
  • Can deal with data scarcity
  • Cloud computing!

Model-Based Prior Work
  • Latent Semantic Analysis (LSA)
  • Probabilistic LSA (PLSA)
  • Latent Dirichlet Allocation (LDA)

Latent Semantic Analysis (LSA) (Deerwester et al. 1990)
  • Map high-dimensional count vectors to a lower-dimensional representation, called the latent semantic space, by singular value decomposition: $A = U \Sigma V^T$
      • $A$ = word-document co-occurrence matrix
      • $U_{ij}$ = how likely word $i$ belongs to topic $j$
      • $\Sigma_{jj}$ = how significant topic $j$ is
      • $V^T_{ij}$ = how likely topic $i$ belongs to document $j$

Latent Semantic Analysis (cont.)
  • LSA keeps the $k$ largest singular values, giving a low-rank approximation $\hat{A}$ to the original matrix
  • This saves space, removes noise and reduces sparsity
  • Make recommendations using $\hat{A}$
      • Word-word similarity: $\hat{A}\hat{A}^T$
      • Doc-doc similarity: $\hat{A}^T\hat{A}$
      • Word-doc relationship: $\hat{A}$

Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 1999; Hofmann 2004)
  • A document is viewed as a bag of words
  • A latent semantic layer is constructed between documents and words
  • $P(w, d) = P(d)\,P(w|d) = P(d) \sum_z P(w|z)\,P(z|d)$
  • Probabilities deliver explicit meaning: $P(w|w)$, $P(d|d)$, $P(d, w)$
  • Model learning via the EM algorithm

PLSA Extensions
  • PHITS (Cohn & Chang 2000)
      • Models document-citation co-occurrence
  • A linear combination of PLSA and PHITS (Cohn & Hofmann 2001)
      • Models contents (words) and inter-connectivity of documents
  • LDA (Blei et al. 2003)
      • Provides a complete generative model with a Dirichlet prior
  • AT (Griffiths & Steyvers 2004)
      • Includes authorship information
      • A document is categorized by authors and topics
  • ART (McCallum 2004)
      • Includes the email recipient as additional information
      • An email is categorized by author, recipients and topics

Combinational Collaborative Filtering (CCF)
  • Fuse multiple sources of information
      • Alleviates the information sparsity problem
  • Hybrid training scheme
      • Gibbs sampling provides initializations for the EM algorithm
  • Parallelization
      • Achieves linear speedup with the number of machines

Notations
  • Given a collection of co-occurrence data
      • Community: $C = \{c_1, c_2, \ldots, c_N\}$
      • User: $U = \{u_1, u_2, \ldots, u_M\}$
      • Description: $D = \{d_1, d_2, \ldots, d_V\}$
      • Latent aspect: $Z = \{z_1, z_2, \ldots, z_K\}$
  • Models
      • Baseline models: the Community-User (C-U) model and the Community-Description (C-D) model
      • CCF: Combinational Collaborative Filtering, which combines both baseline models

Baseline Models
  • Community-User (C-U) model
      • A community is viewed as a bag of users
      • $c$ and $u$ are rendered conditionally independent by introducing $z$
      • Generative process, for each user $u$:
          1. A community $c$ is chosen uniformly
          2. A topic $z$ is selected from $P(z|c)$
          3. A user $u$ is generated from $P(u|z)$
  • Community-Description (C-D) model
      • A community is viewed as a bag of words
      • $c$ and $d$ are rendered conditionally independent by introducing $z$
      • Generative process, for each word $d$:
          1. A community $c$ is chosen uniformly
          2. A topic $z$ is selected from $P(z|c)$
          3. A word $d$ is generated from $P(d|z)$

Baseline Models (cont.)
  • Community-User (C-U) model
      • Pros
          1. Personalized community suggestion
      • Cons
          1. The C-U matrix is sparse, so the model may suffer from the information sparsity problem
          2. Cannot take advantage of content similarity between communities
  • Community-Description (C-D) model
      • Pros
          1. Clusters communities based on community content (description words)
      • Cons
          1. No personalized recommendation
          2. Does not consider the users that communities have in common

CCF Model
  • Combinational Collaborative Filtering (CCF) combines both baseline models
  • A community is viewed as both a bag of users and a bag of words
  • By adding C-U, CCF can perform personalized recommendation, which C-D alone cannot
  • By adding C-D, CCF can perform better personalized recommendation than C-U alone, which may suffer from sparsity
  • Things CCF can do that C-U and C-D cannot
      • $P(d|u)$ relates a user to a word
      • Useful for user-targeted ads

Algorithm Requirements
  • Near-real-time recommendation
  • Scalable training
  • Incremental training is desirable

Parallelizing CCF
  • Details omitted

Experiments on Orkut Dataset
  • Data description
      • Collected on July 26, 2007
      • Two types of data were extracted: community-user and community-description
      • 312,385 users
      • 109,987 communities
      • 191,034 unique English words
  • Experiments
      • Community recommendation
      • Community similarity/clustering
      • User similarity
      • Speedup

Community Recommendation
  • Evaluation method
      • No ground truth and no user clicks are available
      • Leave-one-out: randomly delete one community for each user
      • Test whether the deleted community can be recovered
  • Evaluation metric
      • Precision and recall

Results
  • Observations:
      • CCF outperforms C-U
      • For the top 20, the precision and recall of CCF are twice as high as those of C-U
      • The more communities a user has joined, the better CCF and C-U can predict

Runtime Speedup
  • Training on the Orkut dataset achieves linear speedup up to 100 machines
  • This reduces the training time from one day to less than 14 minutes
  • But what makes the speedup slow down after 100 machines?

Runtime Speedup (cont.)
  • Training time consists of two parts:
      • Computation time (Comp)
      • Communication time (Comm)
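The split of training time into computation and communication can be illustrated with a toy cost model. This is only a sketch under stated assumptions: it assumes computation divides evenly across machines and that communication cost grows linearly with the number of extra machines. The function names, the linear communication term, and the 0.01-minute constant are hypothetical choices for illustration, not measurements or formulas from the talk; only the one-day single-machine training time comes from the slides.

```python
# Toy model of training time = computation time + communication time.
# ASSUMPTIONS (not from the talk): computation divides evenly across machines,
# and communication adds a fixed cost for every machine beyond the first.

ONE_DAY_MINUTES = 24 * 60          # single-machine training time stated in the slides
COMM_MINUTES_PER_MACHINE = 0.01    # hypothetical constant, chosen only for illustration


def training_time(machines, comp=ONE_DAY_MINUTES, comm=COMM_MINUTES_PER_MACHINE):
    """Return modeled training time in minutes on `machines` machines."""
    computation = comp / machines            # shrinks as machines are added
    communication = comm * (machines - 1)    # grows as machines are added
    return computation + communication


def speedup(machines, comp=ONE_DAY_MINUTES, comm=COMM_MINUTES_PER_MACHINE):
    """Return speedup relative to one machine under the toy model."""
    return training_time(1, comp, comm) / training_time(machines, comp, comm)


if __name__ == "__main__":
    for n in (1, 10, 100, 200, 400, 800):
        print(f"{n:4d} machines: {training_time(n):8.2f} min, speedup {speedup(n):6.1f}x")
```

Under these assumed constants the speedup stays close to linear while the computation term dominates, then flattens and eventually falls once the communication term catches up, which is one plausible reading of the slowdown the slide asks about.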
CCF Summary
  • Combinational Collaborative Filtering
      • Fuses bag-of-words and bag-of-users information
      • Hybrid training provides better initializations for EM than random seeding
      • Parallelized to handle large-scale datasets

China's Contributions on/to Cloud Computing
  • Parallel CCF
  • Parallel SVMs (kernel machines)
  • Parallel SVD
  • Parallel spectral clustering
  • Parallel expectation maximization
  • Parallel association mining
  • Parallel LDA

Speeding Up SVMs (NIPS 2007)
  • Approximate matrix factorization
  • Parallelization
  • Open source
      • /p/psvm
      • 350+ downloads since December 07
  • A task that takes 7 days on 1 machine takes 1 hour on 500 machines

Incomplete Cholesky Factorization (ICF)
  • An $n \times n$ matrix is approximated by the product of an $n \times p$ matrix and a $p \times n$ matrix
  • $p \ll n$: conserves storage

Matrix Product
  • $(p \times n)(n \times p) = p \times p$

Organizing the World's Information, Socially: Social Platform, Cloud Computing, Concluding Remarks

Web with People
[Figure: .htm, .jpg, .doc, .xls, .msg files]

What Next for Web Search?
  • Personalization
      • Return query results that take personal preferences into account
      • Example: disambiguate an ambiguous term such as "Fuji"
  • Oops: several have tried, and the problem is hard
      • It is difficult to collect enough training data (for collaborative filtering)
      • Personalization is computationally intensive (e.g., personalizing PageRank)
      • User profiles may be incomplete or erroneous

Personal Search, Intelligent Search
  • A search for "富士" (Fuji) can return
      • Mount Fuji
      • Fuji apples
      • Fuji cameras

Organizing the World's Information, Socially
  • The web is a collection of documents and people
  • Recommendation is a personalized, push model of search
  • Collaborative filtering requires dense information to be effective
  • Cloud computing is essential

References
[1] Alexa Internet. /.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. Referral Web: combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommenda
