Lecture 34: The Maximum-Entropy Stewpot
600.465 - Intro to NLP - J. Eisner
Slide 2: Probability is useful
We love probability distributions! We've learned how to define and use p() functions.
- Pick the best output text t from a set of candidates: speech recognition (HW2); machine translation; OCR; spelling correction. Maximize p1(t) for some appropriate distribution p1.
- Pick the best annotation t for a fixed input i: text categorization; parsing; part-of-speech tagging. Maximize p(t | i), or equivalently the joint probability p(i, t). Often we define p(i, t) by a noisy channel: p(i, t) = p(t) * p(i | t). (See the sketch after these two slides.) Speech recognition and the other tasks above are cases of this too: we're maximizing an appropriate p1(t) defined by p(t | i).
- Pick the best probability distribution (a meta-problem!). Really, pick the best parameters θ: train HMMs, PCFGs, n-grams, clusters. Maximum likelihood; smoothing; EM if unsupervised (incomplete data). Bayesian smoothing: max p(θ | data) = max p(θ, data) = p(θ) p(data | θ).
This is a summary of half of the course (the statistics half).

Slide 3: Probability is flexible
We love probability distributions! We've learned how to define and use p() functions.
- We want p() to define the probability of linguistic objects:
  - trees of (non)terminals (PCFGs; CKY, Earley, pruning, inside-outside)
  - sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs; regex compilation, best paths, forward-backward, collocations)
  - vectors (decision lists, Gaussians, naive Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff: syntactic features, semantics, morphology, Gold. Could it be stochasticized? Some methods are quantitative and data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking.
- But probabilities have wormed their way into most things: p() has to capture our intuitions about the linguistic data.
This is a summary of the other half of the course (the linguistics half).
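A minimal Python sketch (not from the slides) of the "pick the best output t" pattern under the noisy-channel decomposition p(i, t) = p(t) * p(i | t); the candidate strings and probabilities are made up for illustration.

```python
import math

# Hypothetical language model p(t) and channel model p(i | t) for one
# observed input i (say, a noisy acoustic signal or a misspelled phrase).
p_t = {"their cat": 0.004, "there cat": 0.0001, "they're cat": 0.00005}
p_i_given_t = {"their cat": 0.3, "there cat": 0.4, "they're cat": 0.35}

# Pick the candidate t maximizing p(i, t) = p(t) * p(i | t),
# i.e. maximizing log p(t) + log p(i | t).
best = max(p_t, key=lambda t: math.log(p_t[t]) + math.log(p_i_given_t[t]))
print(best)   # "their cat": the stronger language model term wins here
```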

Slide 4: An alternative tradition
Old AI hacking technique: possible parses (or whatever) have scores. Pick the one with the best score. How do you define the score?
- Completely ad hoc! Throw anything you want into the stew.
- Add a bonus for this, a penalty for that, etc.
- "Learns" over time as you adjust the bonuses and penalties by hand to improve performance.
- Total kludge, but totally flexible too: you can throw in any intuitions you might have.
Really so alternative?

Slide 5: An alternative tradition (continued)
The same slide again, now with a mock news item: "Exposé at 9: Probabilistic revolution not really a revolution, critics say. Log-probabilities no more than scores in disguise. 'We're just adding stuff up like the old corrupt regime did,' admits spokesperson."

Slide 6: Nuthin' but adding weights
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: log p(source) + log p(data | source)
- Cascade of FSTs: log p(A) + log p(B | A) + log p(C | B) + ...
- Naive Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
Note: today we'll use +logprob, not -logprob; i.e., bigger weights are better.

Slide 7: Nuthin' but adding weights
The same sums as the previous slide, viewed differently:
- We can regard any linguistic object as a collection of features (here, a tree = a collection of context-free rules).
- Weight of the object = total weight of its features. (See the sketch after this slide.)
- Our weights have always been conditional log-probs (≤ 0), but that is going to change in a few minutes!
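The following is a minimal Python sketch (not from the slides) of the "object = bag of features, weight = total feature weight" view, using a toy PCFG-style weight table; the rule names and numbers are made up for illustration.

```python
from collections import Counter

# Hypothetical feature weights (conditional log-probabilities, so all <= 0 for now).
weights = {
    ("S", "NP VP"): -0.5,
    ("NP", "Papa"): -2.3,
    ("VP", "VP PP"): -1.9,
    ("VP", "ate"): -3.0,
    ("PP", "with the spoon"): -2.0,
}

def score(features):
    """Weight of an object = total weight of its features.
    `features` is a bag (multiset) of feature names."""
    return sum(weights[f] * count for f, count in Counter(features).items())

# A tree is just the bag of context-free rules used to build it.
tree = [("S", "NP VP"), ("NP", "Papa"), ("VP", "VP PP"),
        ("VP", "ate"), ("PP", "with the spoon")]
print(score(tree))   # total log-probability of the tree: -9.7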

Slide 8: Probabilists rally behind paradigm
".2, .4, .6, .8! We're not gonna take your bait!"
1. We can estimate our parameters automatically, e.g., log p(t7 | t5, t6) (a trigram tag probability), from supervised or unsupervised data.
2. Our results are more meaningful: we can use probabilities to place bets and quantify risk, e.g., how sure are we that this is the correct parse?
3. Our results can be meaningfully combined (modularity!): multiply independent conditional probs, which are normalized, unlike scores:
   p(English text) * p(English phonemes | English text) * p(Japanese phonemes | English phonemes) * p(Japanese text | Japanese phonemes)
   p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)

Slide 9: Probabilists regret being bound by principle
But the ad-hoc approach does have one advantage. Consider naive Bayes for text categorization of a message like:
"Buy this supercalifragilistic Ginsu knife set for only $39 today ..."
Some useful features:
- contains "Buy"
- contains "supercalifragilistic"
- contains a dollar amount under $100
- contains an imperative sentence
- reading level = 8th grade
- mentions money (use word classes and/or regexps to detect this)
Naive Bayes: pick the class c maximizing p(c) * p(feat 1 | c) * ...
What assumption does naive Bayes make? Is it true here?

Slide 10: Probabilists regret being bound by principle
Same setup, focusing on two of the features, with c in {spam, ling}:
- contains a dollar amount under $100: p(feat | spam) = .5, p(feat | ling) = .02
- mentions money: p(feat | spam) = .9, p(feat | ling) = .1
Naive Bayes claims that .5 * .9 = 45% of spam has both features, which would make such messages 25 * 9 = 225x more likely in spam than in ling:
- 50% of spam has the first feature (25x more likely than in ling);
- 90% of spam has the second feature (9x more likely than in ling).
But the emails that actually have both features are only 25x more likely in spam! (The features overlap: a dollar amount under $100 already mentions money.)

Slide 11: Probabilists regret being bound by principle
But the ad-hoc approach can adjust scores to compensate for feature overlap. For the same two features of this message:
- contains a dollar amount under $100: prob .5 (spam) vs .02 (ling); log prob (base 2) -1 vs -5.6; adjusted weight -.85 vs -2.3
- mentions money: prob .9 (spam) vs .1 (ling); log prob (base 2) -.15 vs -3.3; adjusted weight -.15 vs -3.3
The adjusted weights are chosen by hand so that the summed weights for a message with both features reflect the true ratio of about 25x rather than 225x. (A numeric sketch of the overlap problem follows below.)
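A small Python sketch (not part of the slides) of the overlap problem above, under the natural assumption that every message with a dollar amount under $100 also mentions money:

```python
# Conditional feature probabilities from the slide.
p_dollar = {"spam": 0.5, "ling": 0.02}   # contains a dollar amount under $100
p_money  = {"spam": 0.9, "ling": 0.1}    # mentions money

# Naive Bayes pretends the two features are independent given the class:
nb_both = {c: p_dollar[c] * p_money[c] for c in ("spam", "ling")}
print(nb_both["spam"] / nb_both["ling"])          # 225.0 -> "225x more likely in spam"

# But "dollar amount under $100" implies "mentions money" (an assumption
# about how the features were defined), so having both features is the
# same event as having the first one:
actual_both = {c: p_dollar[c] for c in ("spam", "ling")}
print(actual_both["spam"] / actual_both["ling"])  # 25.0 -> only 25x, as the slide says
```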

Slide 12: Revolution corrupted by bourgeois values
Naive Bayes needs its features to be independent even when they overlap, and it's not clear how to restructure these features like that:
- contains "Buy"
- contains "supercalifragilistic"
- contains a dollar amount under $100
- contains an imperative sentence
- reading level = 7th grade
- mentions money (use word classes and/or regexps to detect this)
Boy, we'd like to be able to throw all that useful stuff in without worrying about feature overlap/independence. Well, maybe we can add up scores and pretend we got a log probability.

Slide 13: Revolution corrupted by bourgeois values
Same feature list as the previous slide. Maybe we can add up scores and pretend we got a log probability: give each feature an ad-hoc score (the slide shows scores like +4, +0.2, +1, +2, -3, +5) and add them up, giving a total of, say,
  log p(feats | spam) = 5.77
Oops: then p(feats | spam) = exp 5.77 = 320.5, which is not a probability.

Slide 14: Renormalize by 1/Z to get a log-linear model
  p(m | spam) = (1/Z(λ)) exp Σ_i λ_i f_i(m)
where
- m is the email message;
- λ_i is the weight of feature i;
- f_i(m) ∈ {0, 1} according to whether m has feature i (more generally, allow f_i(m) = the count or strength of the feature);
- 1/Z(λ) is a normalizing factor making Σ_m p(m | spam) = 1 (summed over all possible messages m; hard to find!).
The weights we add up are basically arbitrary: they don't have to mean anything, so long as they give us a good probability. Why is it called "log-linear"? (Because log p(m | spam) is a linear function of the feature values f_i(m), up to the constant log Z.)
Before, we had p(feats | spam) = exp 5.77 = 320.5; the 1/Z(λ) factor scales everything down so that each probability is ≤ 1 and they sum to 1. (A sketch follows below.)
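Here is a minimal, hypothetical Python sketch (not from the slides) of the log-linear model p(m | spam) = (1/Z(λ)) exp Σ_i λ_i f_i(m), using a tiny enumerable space of messages so that Z(λ) can actually be computed by brute force; in reality the sum over all possible emails is intractable, which is the slide's point.

```python
import math
from itertools import product

# Two binary features; the weights λ_i are arbitrary made-up numbers.
LAMBDA = {"dollar_under_100": 1.0, "mentions_money": 2.2}

def features(m):
    """f_i(m) in {0, 1}: which features the 'message' m has.
    Here a message is just a dict of binary attributes (a toy stand-in)."""
    return {name: float(m[name]) for name in LAMBDA}

def raw_score(m):
    # exp of the summed weights, before normalization
    return math.exp(sum(LAMBDA[name] * f for name, f in features(m).items()))

# Toy message space: all 4 combinations of the two binary features.
# (Real message spaces are astronomically large; Z would be intractable.)
MESSAGES = [{"dollar_under_100": a, "mentions_money": b}
            for a, b in product([0, 1], repeat=2)]

Z = sum(raw_score(m) for m in MESSAGES)          # the normalizer Z(λ)
p = {tuple(m.values()): raw_score(m) / Z for m in MESSAGES}

print(p)                      # each value is <= 1 ...
print(sum(p.values()))        # ... and they sum to 1.0
```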

Slide 15: Why bother?
- It gives us probs, not just scores: we can use them to place bets, or combine them with other probs.
- We can now learn weights from data! Choose weights λ_i that maximize the log-probability of the labeled training data = log Π_j p(c_j) p(m_j | c_j), where c_j ∈ {ling, spam} is the classification of message m_j and p(m_j | c_j) is the log-linear model from the previous slide.
- It's a convex function, so it's easy to maximize! (Why?)
- But: computing p(m_j | c_j) for a given λ requires Z(λ): hard!

Slide 16: Attempt to cancel out Z
Set the weights to maximize Π_j p(c_j) p(m_j | c_j), where p(m | spam) = (1/Z(λ)) exp Σ_i λ_i f_i(m). But the normalizer Z(λ) is an awful sum over all possible emails. So instead, maximize Π_j p(c_j | m_j):
- this doesn't model the emails m_j, only their classifications c_j;
- it makes more sense anyway, given our feature set.
  p(spam | m) = p(spam) p(m | spam) / (p(spam) p(m | spam) + p(ling) p(m | ling))
Z appears in both the numerator and the denominator. Alas, it doesn't cancel out, because Z differs for the spam and ling models. But we can fix this ...
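A small numeric sketch (not from the slides) of why Z doesn't cancel here: with two separately normalized models p(m | spam) and p(m | ling), the normalizers generally differ, so both survive in p(spam | m). The toy message space and weights are made up.

```python
import math
from itertools import product

# Toy message space: all combinations of two binary features (in reality,
# the space of all emails, which is why Z is "awful" to compute).
MESSAGES = [dict(zip(("dollar", "money"), bits)) for bits in product([0, 1], repeat=2)]

def Z(weights):
    return sum(math.exp(sum(w * m[f] for f, w in weights.items())) for m in MESSAGES)

def p_m_given_c(m, weights):
    return math.exp(sum(w * m[f] for f, w in weights.items())) / Z(weights)

# Separate per-class weight vectors lead to separate normalizers.
w_spam = {"dollar": 1.0, "money": 2.2}
w_ling = {"dollar": -2.0, "money": -1.5}
print(Z(w_spam), Z(w_ling))   # different values, so Z does not cancel in
                              # p(spam|m) = p(spam)p(m|spam) / (p(spam)p(m|spam) + p(ling)p(m|ling))
```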

Slide 17: So: modify setup a bit
Instead of having separate models p(m | spam) * p(spam) vs. p(m | ling) * p(ling), have just one joint model p(m, c) that gives us both p(m, spam) and p(m, ling). This is equivalent to changing the feature set to:
- spam
- spam and contains "Buy"
- spam and contains "supercalifragilistic"
- ...
- ling
- ling and contains "Buy"
- ling and contains "supercalifragilistic"
- ...
No real change, but the two categories now share a single feature set and a single value of Z(λ):
- the weight of the "spam" feature is log p(spam) + a constant, and the weight of the "ling" feature is log p(ling) + a constant;
- the weight of "spam and contains Buy" is the old spam model's weight for "contains Buy", and the weight of "ling and contains Buy" is the old ling model's weight for "contains Buy".

Slide 18: Now we can cancel out Z
Now p(m, c) = (1/Z(λ)) exp Σ_i λ_i f_i(m, c), where c ∈ {ling, spam}.
- Old: choose weights λ_i that maximize the probability of the labeled training data = Π_j p(m_j, c_j).
- New: choose weights λ_i that maximize the probability of the labels given the messages = Π_j p(c_j | m_j).
Now Z cancels out of the conditional probability!
  p(spam | m) = p(m, spam) / (p(m, spam) + p(m, ling))
              = exp Σ_i λ_i f_i(m, spam) / (exp Σ_i λ_i f_i(m, spam) + exp Σ_i λ_i f_i(m, ling))
This is easy to compute now, and Π_j p(c_j | m_j) is still convex, so it's easy to maximize too. (A small training sketch follows below.)
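Below is a minimal Python sketch, not from the slides, of the conditional model p(c | m) above and of maximizing Σ_j log p(c_j | m_j) by plain gradient ascent; the slides don't specify an optimizer, and the feature names, training data, and learning rate here are hypothetical.

```python
import math

CLASSES = ("spam", "ling")

def features(m, c):
    """Joint features f_i(m, c): a class indicator plus class-specific copies
    of each message feature, as on the 'modify setup a bit' slide.
    m is a set of message-feature names; all names here are made up."""
    feats = {c: 1.0}                        # the bare "spam" / "ling" feature
    feats.update({(c, f): 1.0 for f in m})  # "spam and contains Buy", etc.
    return feats

def score(weights, m, c):
    return sum(weights.get(f, 0.0) * v for f, v in features(m, c).items())

def p_class(weights, m):
    """p(c | m): Z cancels, leaving a softmax over the classes."""
    exps = {c: math.exp(score(weights, m, c)) for c in CLASSES}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def train(data, steps=200, lr=0.5):
    """Gradient ascent on sum_j log p(c_j | m_j).
    Gradient for feature i = observed count of f_i minus its expected count under p(c | m)."""
    weights = {}
    for _ in range(steps):
        grad = {}
        for m, c in data:
            probs = p_class(weights, m)
            for c2 in CLASSES:
                coeff = (1.0 if c2 == c else 0.0) - probs[c2]
                for f, v in features(m, c2).items():
                    grad[f] = grad.get(f, 0.0) + coeff * v
        for f, g in grad.items():
            weights[f] = weights.get(f, 0.0) + lr * g
    return weights

# Tiny made-up training set: (message features, label).
data = [({"contains_buy", "mentions_money"}, "spam"),
        ({"contains_buy"}, "spam"),
        ({"mentions_grammar"}, "ling"),
        ({"mentions_syntax", "mentions_grammar"}, "ling")]

w = train(data)
print(p_class(w, {"contains_buy"}))       # heavily favors "spam"
print(p_class(w, {"mentions_grammar"}))   # heavily favors "ling"
```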

Slide 19: Maximum entropy
Suppose there are 10 classes, A through J. I don't give you any other information.
- Question: given message m, what is your guess for p(c | m)?
Suppose I tell you that 55% of all messages are in class A.
- Question: now what is your guess for p(c | m)?
Suppose I also tell you that 10% of all messages contain "Buy", and 80% of these are in class A or C.
- Question: now what is your guess for p(c | m), if m contains "Buy"? Ouch!

Slides 20-22: Maximum entropy
A joint distribution over (contains "Buy"?, class) that satisfies the constraints above:

           A      B      C      D      E      F      G      H      I      J
  Buy    .051   .0025  .029   .0025  .0025  .0025  .0025  .0025  .0025  .0025
  Other  .499   .0446  .0446  .0446  .0446  .0446  .0446  .0446  .0446  .0446

- Column A sums to 0.55 ("55% of all messages are in class A").
- Row Buy sums to 0.1 ("10% of all messages contain Buy").
(A small sketch checking these constraints follows below. The document preview ends here, partway through slide 22; the remaining slides are not included.)
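A short Python sketch (not part of the slides) that encodes the table above, checks the stated constraints, and reads off the conditional guess p(c | m contains "Buy") that slide 19 asks about:

```python
CLASSES = "ABCDEFGHIJ"

# The joint table p(row, class) from the slide.
p_buy   = dict(zip(CLASSES, [.051, .0025, .029, .0025, .0025,
                             .0025, .0025, .0025, .0025, .0025]))
p_other = dict(zip(CLASSES, [.499] + [.0446] * 9))

# Constraint checks from the slides:
print(p_buy["A"] + p_other["A"])   # ~0.55 -> 55% of messages are in class A
print(sum(p_buy.values()))         # ~0.1  -> 10% of messages contain "Buy"
print(p_buy["A"] + p_buy["C"])     # ~0.08 -> 80% of the "Buy" messages are in A or C

# The conditional distribution slide 19 asks about: p(c | m contains "Buy").
z = sum(p_buy.values())
p_class_given_buy = {c: p_buy[c] / z for c in CLASSES}
print(p_class_given_buy["A"], p_class_given_buy["C"])   # ~0.51 and ~0.29
```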
