Natural Language Processing with Deep Learning
CS224N / Ling284
Christopher Manning
Lecture 6: Simple and LSTM Recurrent Neural Networks
Lecture Plan
1. RNN Language Models, continued (20 mins)
2. Other uses of RNNs (10 mins)
3. Exploding and vanishing gradients (15 mins)
4. LSTMs (20 mins)
5. Bidirectional and multi-layer RNNs (15 mins)

Final Projects
- Next Thursday: a lecture about choosing final projects
- It's fine to delay thinking about projects until next week
- But if you're already thinking about projects, you can view some info/inspiration on the website. It's still last year's information at present!
- It's great if you can line up your own mentor; we are also lining up some mentors
Overview
- Last lecture we learned:
  - Language models, n-gram language models, and Recurrent Neural Networks (RNNs)
- Today we'll learn how to get RNNs to work for you:
  - Training and generating from RNNs
  - Uses of RNNs
  - Problems with RNNs (exploding and vanishing gradients) and how to fix them
  - These problems motivate a more sophisticated RNN architecture: LSTMs
  - And other more complex RNN options: bidirectional RNNs and multi-layer RNNs
- Next lecture we'll learn:
  - How we can do Neural Machine Translation (NMT) using an RNN-based architecture called sequence-to-sequence with attention (which is Assignment 4!)
1. The Simple RNN Language Model
[Diagram: the words/one-hot vectors "the students opened their" are mapped to word embeddings e^(1), ..., e^(4) via the embedding matrix E; hidden states h^(0), h^(1), ..., h^(4) are computed in sequence (h^(0) is the initial hidden state); the final hidden state, through U, produces an output distribution over next words: books, laptops, ..., a, zoo.]
Note: this input sequence could be much longer now!
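The slide's forward pass can be sketched in a few lines of numpy. This is a minimal illustration with hypothetical small dimensions (hidden size 4, embedding size 3, vocabulary size 5) and random weights, not the lecture's actual implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Randomly initialized parameters, just to show the shapes involved.
rng = np.random.default_rng(0)
W_h, W_e = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
b1, U, b2 = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=5)

h_prev = np.zeros(4)                          # h^(0): initial hidden state
e_t = rng.normal(size=3)                      # embedding of the current word
h_t = np.tanh(W_h @ h_prev + W_e @ e_t + b1)  # hidden state update
y_hat = softmax(U @ h_t + b2)                 # output distribution over vocab
print(y_hat.sum())                            # probabilities sum to 1
```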
Training the parameters of RNNs: Backpropagation for RNNs
[Diagram: the same weight matrix W_h appears (is "equal") at every time step h^(0), ..., h^(t-2), ..., h^(t); the gradient of the loss with respect to W_h is the sum of the gradient contributions from each appearance.]
Question: How do we calculate this?
Answer: Backpropagate over timesteps i = t, ..., 0, summing gradients as you go. This algorithm is called "backpropagation through time" [Werbos, P.G., 1988, Neural Networks 1, and others]
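The "sum gradients as you go" idea can be checked on a toy example. The sketch below uses a hypothetical scalar linear RNN h_t = w·h_{t-1} + x_t with loss J = h_T (much simpler than the lecture's model), and verifies the backpropagation-through-time gradient against finite differences:

```python
import numpy as np

def forward(w, xs, h0=0.0):
    # Scalar linear RNN: h_t = w * h_{t-1} + x_t; all hidden states returned.
    h, hs = h0, [h0]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs

def bptt_grad(w, xs):
    # Backpropagation through time: walk backwards over timesteps,
    # summing the gradient contribution of w at each step.
    hs = forward(w, xs)
    dJ_dh = 1.0          # loss J = h_T, so dJ/dh_T = 1
    grad = 0.0
    for t in range(len(xs), 0, -1):
        grad += dJ_dh * hs[t - 1]   # local contribution: d(w*h_{t-1})/dw = h_{t-1}
        dJ_dh *= w                  # chain rule back through h_t = w*h_{t-1} + x_t
    return grad

xs, w = [0.5, -1.0, 2.0], 0.9
g = bptt_grad(w, xs)
eps = 1e-6               # finite-difference check
g_fd = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(abs(g - g_fd) < 1e-6)  # True
```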
Generating text with an RNN Language Model
Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. Sampled output becomes next step's input.
[Diagram: starting from h^(0) and the input word "my", each step samples a word from the output distribution ("favorite", "season", "is", "spring"); the sampled word is embedded and fed in as the next input, producing "my favorite season is spring".]
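The sampling loop can be sketched as follows. The vocabulary and all parameters here are hypothetical and randomly initialized (so the output is gibberish); the point is the loop structure, where the sampled word becomes the next step's input:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A toy vocabulary and random RNN-LM parameters, purely for illustration.
vocab = ["<START>", "my", "favorite", "season", "is", "spring"]
V, d = len(vocab), 8
E = rng.normal(size=(V, d))          # word embeddings
W_h = rng.normal(size=(d, d)) * 0.1
W_e = rng.normal(size=(d, d)) * 0.1
U = rng.normal(size=(V, d)) * 0.1    # output projection

def generate(n_words):
    h, idx, out = np.zeros(d), 0, []  # start from <START>
    for _ in range(n_words):
        h = np.tanh(W_h @ h + W_e @ E[idx])  # hidden state update
        p = softmax(U @ h)                   # output distribution
        idx = rng.choice(V, p=p)             # sample; becomes next input
        out.append(vocab[idx])
    return out

print(generate(5))
```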
Generating text with an RNN Language Model
Let's have some fun!
- You can train an RNN-LM on any kind of text, then generate text in that style.
- RNN-LM trained on Obama speeches. Source: /@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0
- RNN-LM trained on Harry Potter. Source: /deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6
- RNN-LM trained on recipes. Source: /nylki/1efbaa36635956d35bcc
- RNN-LM trained on paint color names. This is an example of a character-level RNN-LM (predicts what character comes next). Source: /post/160776374467/new-paint-colors-invented-by-neural-network
Evaluating Language Models
- The standard evaluation metric for Language Models is perplexity: the inverse probability of the corpus according to the Language Model, normalized by the number of words:
  $\text{perplexity} = \prod_{t=1}^{T}\left(\frac{1}{P_{LM}\left(x^{(t+1)}\mid x^{(t)},\dots,x^{(1)}\right)}\right)^{1/T}$
- This is equal to the exponential of the cross-entropy loss: $\exp(J(\theta))$
- Lower perplexity is better!

RNNs have greatly improved perplexity
[Table: perplexity improves (lower is better) moving from an n-gram model to increasingly complex RNNs.]
Source: /building-an-efficient-neural-language-model-over-a-billion-words/
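The equivalence between the two definitions above is easy to verify numerically. A minimal sketch (the probability list is made up for illustration):

```python
import numpy as np

def perplexity(probs):
    # probs: the model's probability for each actual next word in the corpus.
    # Perplexity is the exponential of the average negative log-probability,
    # i.e., the exponential of the cross-entropy loss.
    probs = np.asarray(probs)
    return float(np.exp(-np.mean(np.log(probs))))

# A uniform model over a 10-word vocabulary assigns p = 0.1 everywhere,
# so its perplexity is exactly the vocabulary size.
print(round(perplexity([0.1] * 5), 6))  # 10.0
```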
Recap
- Language Model: A system that predicts the next word
- Recurrent Neural Network: A family of neural networks that:
  - Take sequential input of any length
  - Apply the same weights on each step
  - Can optionally produce output on each step
- Recurrent Neural Network ≠ Language Model
- We've shown that RNNs are a great way to build a LM
- But RNNs are useful for much more!

Terminology and a look forward
The RNN we've seen so far = simple/vanilla/Elman RNN.
Later today: You will learn about other RNN flavors, like GRU and LSTM, and multi-layer RNNs.
By the end of the course: You will understand phrases like "stacked bidirectional LSTM with residual connections and self-attention".
2. Other RNN uses: RNNs can be used for sequence tagging
e.g., part-of-speech tagging, named entity recognition
[Diagram: each hidden state outputs a tag — DT JJ NN VBN IN DT NN — for "the startled cat knocked over the vase".]
RNNs can be used for sentence classification
e.g., sentiment classification
[Diagram: an RNN runs over "overall I enjoyed the movie a lot"; a sentence encoding is computed from the hidden states and classified as positive.]
How to compute the sentence encoding?
- Basic way: Use the final hidden state
- Usually better: Take element-wise max or mean of all hidden states
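The two pooling options above amount to a one-liner each. A sketch with random stand-ins for real RNN hidden states (7 timesteps, hidden dimension 4):

```python
import numpy as np

# Hidden states from an RNN run over a 7-word sentence, dimension 4
# (random numbers stand in for real RNN outputs).
rng = np.random.default_rng(0)
H = rng.normal(size=(7, 4))    # one row per timestep

final_state = H[-1]            # basic way: final hidden state
mean_pool = H.mean(axis=0)     # usually better: element-wise mean ...
max_pool = H.max(axis=0)       # ... or element-wise max over all steps

print(final_state.shape, mean_pool.shape, max_pool.shape)  # (4,) (4,) (4,)
```

Each option yields a fixed-size sentence encoding regardless of sentence length, which is what the downstream classifier needs.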
RNNs can be used as a language encoder module
e.g., question answering, machine translation, many other tasks!
Example — Question: what nationality was Beethoven? Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure ... Answer: German
Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system, with lots of further neural architecture on top.
RNN-LMs can be used to generate text
e.g., speech recognition, machine translation, summarization
[Diagram: an RNN-LM, conditioned on the input (audio), generates "what's the weather" word by word starting from <START>.]
This is an example of a conditional language model. We'll see Machine Translation in much more detail next class.
3. Problems with RNNs: Vanishing and Exploding Gradients

Vanishing gradient intuition
[Diagram: the loss J^(4)(θ) at step 4; its gradient with respect to an early hidden state h^(1) is built up by the chain rule through h^(2), h^(3), h^(4):]
$\frac{\partial J^{(4)}}{\partial h^{(1)}} = \frac{\partial J^{(4)}}{\partial h^{(4)}}\,\frac{\partial h^{(4)}}{\partial h^{(3)}}\,\frac{\partial h^{(3)}}{\partial h^{(2)}}\,\frac{\partial h^{(2)}}{\partial h^{(1)}}$
What happens if these are small?
Vanishing gradient problem: When these are small, the gradient signal gets smaller and smaller as it backpropagates further.
Vanishing gradient proof sketch (linear case)
- Recall: $h^{(t)} = \sigma\left(W_h h^{(t-1)} + W_x x^{(t)} + b_1\right)$
- What if $\sigma$ were the identity function, $\sigma(x) = x$? Then (chain rule):
  $\frac{\partial h^{(t)}}{\partial h^{(t-1)}} = \mathrm{diag}\left(\sigma'\left(W_h h^{(t-1)} + W_x x^{(t)} + b_1\right)\right) W_h = I\,W_h = W_h$
- Consider the gradient of the loss $J^{(i)}(\theta)$ on step $i$, with respect to the hidden state $h^{(j)}$ on some previous step $j$. Let $\ell = i - j$. Then (chain rule, then substituting the value of $\partial h^{(t)}/\partial h^{(t-1)}$):
  $\frac{\partial J^{(i)}(\theta)}{\partial h^{(j)}} = \frac{\partial J^{(i)}(\theta)}{\partial h^{(i)}} \prod_{j < t \le i} \frac{\partial h^{(t)}}{\partial h^{(t-1)}} = \frac{\partial J^{(i)}(\theta)}{\partial h^{(i)}}\, W_h^{\ell}$
- If $W_h$ is "small", then this term gets exponentially problematic as $\ell$ becomes large

Vanishing gradient proof sketch (linear case), continued
- What's wrong with $W_h^{\ell}$?
- Consider if the eigenvalues of $W_h$ are all less than 1 (sufficient but not necessary):
  $\lambda_1, \lambda_2, \dots, \lambda_n < 1$, with eigenvectors $q_1, q_2, \dots, q_n$
- We can write $\frac{\partial J^{(i)}(\theta)}{\partial h^{(i)}}$ using the eigenvectors of $W_h$ as a basis:
  $\frac{\partial J^{(i)}(\theta)}{\partial h^{(j)}} = \sum_{k=1}^{n} c_k\, \lambda_k^{\ell}\, q_k$
  Each $\lambda_k^{\ell}$ approaches 0 as $\ell$ grows, so the gradient vanishes
- What about nonlinear activations $\sigma$ (i.e., what we use)?
  - Pretty much the same thing, except the proof requires $\lambda_i < \gamma$ for some $\gamma$ dependent on dimensionality and $\sigma$
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
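The eigenvalue argument is easy to see numerically. A sketch with a random recurrent matrix rescaled so all eigenvalue magnitudes are below 1 (a hypothetical setup, not the paper's experiment): repeated multiplication shrinks the gradient geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random recurrent matrix, rescaled so its largest eigenvalue magnitude is 0.9.
W = rng.normal(size=(4, 4))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

g = rng.normal(size=4)     # stand-in for the gradient at the loss step
norms = []
for _ in range(50):
    g = g @ W              # one step of backpropagation through time
    norms.append(np.linalg.norm(g))

# The gradient norm decays roughly geometrically (~0.9 per step): it vanishes.
print(f"norm shrank by a factor of about {norms[0] / norms[-1]:.0f} over 50 steps")
```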
Why is vanishing gradient a problem?
Gradient signal from far away is lost because it's much smaller than gradient signal from close-by.
So, model weights are updated only with respect to near effects, not long-term effects.

Effect of vanishing gradient on RNN-LM
- LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ______"
- To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
- But if the gradient is small, the model can't learn this dependency
- So, the model is unable to predict similar long-distance dependencies at test time
Why is exploding gradient a problem?
- If the gradient becomes too big, then the SGD update step becomes too big:
  $\theta^{new} = \theta^{old} - \alpha\, \nabla_\theta J(\theta)$   (with learning rate $\alpha$ and gradient $\nabla_\theta J(\theta)$)
- This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
- You think you've found a hill to climb, but suddenly you're in Iowa
- In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

Gradient clipping: solution for exploding gradient
- Gradient clipping: if the norm of the gradient $\hat g$ is greater than some threshold, scale it down before applying the SGD update:
  if $\lVert \hat g \rVert \ge \text{threshold}$, then $\hat g \leftarrow \frac{\text{threshold}}{\lVert \hat g \rVert}\, \hat g$
- Intuition: take a step in the same direction, but a smaller step
- In practice, remembering to clip gradients is important, but exploding gradients are an easy problem to solve
Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
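Clipping by norm is a few lines; a minimal sketch of the rule above:

```python
import numpy as np

def clip_by_norm(grad, threshold):
    # If the gradient norm exceeds the threshold, rescale the gradient so its
    # norm equals the threshold; the direction is unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])                 # norm 5
clipped = clip_by_norm(g, 1.0)           # rescaled to norm 1, same direction
print(clipped, np.linalg.norm(clipped))
```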
How to fix the vanishing gradient problem?
- The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
- In a vanilla RNN, the hidden state is constantly being rewritten:
  $h^{(t)} = \sigma\left(W_h h^{(t-1)} + W_x x^{(t)} + b\right)$
- How about an RNN with separate memory which is added to?
4. Long Short-Term Memory RNNs (LSTMs)
- A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem. Everyone cites that paper, but really a crucial part of the modern LSTM is from Gers et al. (2000)
- On step t, there is a hidden state h^(t) and a cell state c^(t)
  - Both are vectors of length n
  - The cell stores long-term information
  - The LSTM can read, erase, and write information from the cell
    - The cell becomes conceptually rather like RAM in a computer
- The selection of which information is erased/written/read is controlled by three corresponding gates
  - The gates are also vectors of length n
  - On each timestep, each element of the gates can be open (1), closed (0), or somewhere in-between
  - The gates are dynamic: their value is computed based on the current context
"Long short-term memory", Hochreiter and Schmidhuber, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
"Learning to Forget: Continual Prediction with LSTM", Gers, Schmidhuber, and Cummins, 2000. /doi/10.1162/089976600300015015
Long Short-Term Memory (LSTM)
We have a sequence of inputs x^(t), and we will compute a sequence of hidden states h^(t) and cell states c^(t). On timestep t:
- Forget gate: controls what is kept vs forgotten, from previous cell state
  $f^{(t)} = \sigma\left(W_f h^{(t-1)} + U_f x^{(t)} + b_f\right)$
- Input gate: controls what parts of the new cell content are written to cell
  $i^{(t)} = \sigma\left(W_i h^{(t-1)} + U_i x^{(t)} + b_i\right)$
- Output gate: controls what parts of cell are output to hidden state
  $o^{(t)} = \sigma\left(W_o h^{(t-1)} + U_o x^{(t)} + b_o\right)$
- New cell content: this is the new content to be written to the cell
  $\tilde{c}^{(t)} = \tanh\left(W_c h^{(t-1)} + U_c x^{(t)} + b_c\right)$
- Cell state: erase ("forget") some content from last cell state, and write ("input") some new cell content
  $c^{(t)} = f^{(t)} \circ c^{(t-1)} + i^{(t)} \circ \tilde{c}^{(t)}$
- Hidden state: read ("output") some content from the cell
  $h^{(t)} = o^{(t)} \circ \tanh c^{(t)}$
Sigmoid function $\sigma$: all gate values are between 0 and 1. Gates are applied using the element-wise (or Hadamard) product $\circ$. All these are vectors of the same length n.
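The equations above transcribe directly into code. A minimal numpy sketch of one LSTM timestep, with hypothetical small dimensions and random weights (real implementations fuse the four weight matrices for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    # One LSTM timestep, following the equations above.
    # params holds (W, U, b) for the forget, input, output gates and cell content.
    Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wc, Uc, bc = params
    f = sigmoid(Wf @ h_prev + Uf @ x + bf)        # forget gate
    i = sigmoid(Wi @ h_prev + Ui @ x + bi)        # input gate
    o = sigmoid(Wo @ h_prev + Uo @ x + bo)        # output gate
    c_tilde = np.tanh(Wc @ h_prev + Uc @ x + bc)  # new cell content
    c = f * c_prev + i * c_tilde                  # cell state: forget + write
    h = o * np.tanh(c)                            # hidden state: read from cell
    return h, c

# Tiny random example: hidden size 3, input size 2.
rng = np.random.default_rng(0)
n, d = 3, 2
params = [rng.normal(size=s) for _ in range(4) for s in ((n, n), (n, d), (n,))]
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), params)
print(h.shape, c.shape)  # (3,) (3,)
```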
Long Short-Term Memory (LSTM)
You can think of the LSTM equations visually as a chain of repeating cells. On each step: compute the forget gate f_t, the input gate i_t, and the output gate o_t; compute the new cell content c̃_t; forget some cell content and write some new cell content to get c_t from c_{t-1}; and output some cell content to the hidden state h_t (passing h_{t-1} to h_t alongside).
The + sign (in the cell-state update) is the secret!
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gated Recurrent Units (GRU)
- Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM.
- On each timestep t we have input x^(t) and hidden state h^(t) (no cell state).
- Update gate: controls what parts of hidden state are updated vs preserved
  $u^{(t)} = \sigma\left(W_u h^{(t-1)} + U_u x^{(t)} + b_u\right)$
- Reset gate: controls what parts of previous hidden state are used to compute new content
  $r^{(t)} = \sigma\left(W_r h^{(t-1)} + U_r x^{(t)} + b_r\right)$
- New hidden state content: reset gate selects useful parts of prev hidden state. Use this and current input to compute new hidden content.
  $\tilde{h}^{(t)} = \tanh\left(W_h \left(r^{(t)} \circ h^{(t-1)}\right) + U_h x^{(t)} + b_h\right)$
- Hidden state: update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
  $h^{(t)} = \left(1 - u^{(t)}\right) \circ h^{(t-1)} + u^{(t)} \circ \tilde{h}^{(t)}$
How does this solve vanishing gradient? Like the LSTM, the GRU makes it easier to retain info long-term (e.g., by setting the update gate to 0).
"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, /pdf/1406.1078v3.pdf
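For comparison with the LSTM, here is a minimal numpy sketch of one GRU timestep (hypothetical small dimensions, random weights), again transcribing the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    # One GRU timestep: no cell state, just the hidden state.
    Wu, Uu, bu, Wr, Ur, br, Wh, Uh, bh = params
    u = sigmoid(Wu @ h_prev + Uu @ x + bu)              # update gate
    r = sigmoid(Wr @ h_prev + Ur @ x + br)              # reset gate
    h_tilde = np.tanh(Wh @ (r * h_prev) + Uh @ x + bh)  # new hidden content
    return (1 - u) * h_prev + u * h_tilde               # mix old state and new

rng = np.random.default_rng(0)
n, d = 3, 2
params = [rng.normal(size=s) for _ in range(3) for s in ((n, n), (n, d), (n,))]
h = gru_step(rng.normal(size=d), np.zeros(n), params)
print(h.shape)  # (3,)
```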
LSTM vs GRU
- Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
- Rule of thumb: LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data); switch to GRUs for speed and fewer parameters.
- Note: LSTMs can store unboundedly* large values in memory cell dimensions, and relatively easily learn to count (unlike GRUs). [Figure: a single LSTM cell dimension used as a counter; the GRU cannot do this.]
*bounded if assuming finite precision, but still, large. Source: "On the Practical Computational Power of Finite Precision RNNs for Language Recognition", Weiss et al., 2018. /pdf/1805.04908.pdf
How does the LSTM solve vanishing gradients?
- The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
  - e.g., if the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information of that cell is preserved indefinitely.
  - In contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix Wh that preserves info in the hidden state
  - In practice, you get about 100 timesteps rather than about 7
- The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies
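The "preserved indefinitely" claim is easy to demonstrate. A toy numpy sketch: with forget gate 1 and input gate 0, the additive cell update leaves the cell untouched for an arbitrary number of steps, while a vanilla-style multiplicative rewrite drives the state toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cell-state update c^(t) = f ∘ c^(t-1) + i ∘ c̃^(t): with f = 1 and i = 0,
# the cell is preserved exactly, no matter how many steps pass.
c = np.array([5.0, -2.0])
f, i = 1.0, 0.0
for _ in range(1000):
    c_tilde = np.tanh(rng.normal(size=2))  # arbitrary new cell content
    c = f * c + i * c_tilde
print(c)  # unchanged after 1000 steps

# A vanilla-style recurrence keeps rewriting the state; here it shrinks to ~0.
h = np.array([5.0, -2.0])
for _ in range(1000):
    h = np.tanh(0.5 * h)
print(h)  # nearly zero
```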
LSTMs: real-world success
- In 2013–2015, LSTMs started achieving state-of-the-art results
  - Successful tasks include handwriting recognition, speech recognition, machine translation, parsing, and image captioning, as well as language models
  - LSTMs became the dominant approach for most NLP tasks
- Now (2019–2022), other approaches (e.g., Transformers) have become dominant for many tasks
  - For example, in WMT (a Machine Translation conference + competition):
    - In WMT 2014, there were 0 neural machine translation systems (!)
    - In WMT 2016, the summary report contains "RNN" 44 times (and these systems won)
    - In WMT 2019: "RNN" 7 times, "Transformer" 105 times
Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, /wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, /wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, /wmt18/pdf/WMT028.pdf
Is vanishing/exploding gradient just a RNN problem?
- No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
  - Due to the chain rule / choice of nonlinearity function, the gradient can become vanishingly small as it backpropagates
  - Thus, lower layers are learned very slowly (hard to train)
- Solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)
For example:
- Residual connections aka "ResNet"
  - Also known as skip-connections
  - The identity connection (output = F(x) + x) preserves information by default
  - This makes deep networks much easier to train
"Deep Residual Learning for Image Recognition", He et al, 2015. /pdf/1512.03385.pdf
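A residual block is one addition. A minimal sketch with a hypothetical single ReLU layer as F: when the layer contributes nothing (all-zero weights), the identity path still passes the input through unchanged, which is why information is preserved by default.

```python
import numpy as np

def layer(x, W):
    # A hypothetical transformation F: a linear map followed by ReLU.
    return np.maximum(0.0, W @ x)

def residual_block(x, W):
    # Residual connection: output = F(x) + x. Even if F contributes nothing,
    # the identity path preserves x, so gradients can flow straight through.
    return layer(x, W) + x

x = np.array([1.0, -2.0, 3.0])
W_zero = np.zeros((3, 3))        # a "do-nothing" layer
print(residual_block(x, W_zero))  # the input passes through unchanged
```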
Other methods:
- Highway connections aka "HighwayNet"
  - Similar to residual connections, but the identity connection vs the transformation layer is controlled by a dynamic gate
  - Inspired by LSTMs, but applied to deep feedforward/convolutional networks
- Dense connections aka "DenseNet"
  - Directly connect each layer to all future layers!
- Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]
"Densely Connected Convolutional Networks", Huang et al, 2017. /pdf/1608.06993.pdf
"Highway Networks", Srivastava et al, 2015. /pdf/1505.00387.pdf
"Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
5. Bidirectional and Multi-layer RNNs: motivation
Task: Sentiment Classification, e.g., "the movie was terribly exciting!" → positive
[Diagram: the sentence encoding is the element-wise mean/max of the hidden states.]
- We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.
- These contextual representations only contain information about the left context (e.g., "the movie was"). What about right context?
- In this example, "exciting" is in the right context, and this modifies the meaning of "terribly" (from negative to positive)
Bidirectional RNNs
[Diagram: a Forward RNN and a Backward RNN run over "the movie was terribly exciting!"; their hidden states are concatenated at each step.]
This contextual representation of "terribly" has both left and right context!
Bidirectional RNNs
On timestep t:
- Forward RNN: $\overrightarrow{h}^{(t)} = \mathrm{RNN}_{FW}\left(\overrightarrow{h}^{(t-1)}, x^{(t)}\right)$
- Backward RNN: $\overleftarrow{h}^{(t)} = \mathrm{RNN}_{BW}\left(\overleftarrow{h}^{(t+1)}, x^{(t)}\right)$
- Concatenated hidden states: $h^{(t)} = \left[\overrightarrow{h}^{(t)};\ \overleftarrow{h}^{(t)}\right]$
Here RNN_FW is a general notation meaning "compute one forward step of the RNN" — it could be a simple, LSTM, or other (e.g., GRU) RNN computation. Generally, these two RNNs have separate weights. We regard the concatenation $h^{(t)}$ as "the hidden state" of a bidirectional RNN; this is what we pass on to the next parts of the network.

Bidirectional RNNs: simplified diagram
[Diagram: "the movie was terribly exciting!" with a two-way arrow over each hidden state.] The two-way arrows indicate bidirectionality.
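The forward/backward/concatenate scheme can be sketched as follows, using a hypothetical simple tanh RNN for both directions (in practice each direction would typically be an LSTM or GRU with its own weights):

```python
import numpy as np

def rnn_steps(xs, W, U, reverse=False):
    # Run a simple tanh RNN over the sequence; optionally right-to-left.
    seq = xs[::-1] if reverse else xs
    h, hs = np.zeros(W.shape[0]), []
    for x in seq:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    # Return states in left-to-right order so index t aligns across directions.
    return hs[::-1] if reverse else hs

rng = np.random.default_rng(0)
n, d, T = 3, 2, 5
xs = [rng.normal(size=d) for _ in range(T)]
# Forward and backward RNNs have separate weights.
fw = rnn_steps(xs, rng.normal(size=(n, n)) * 0.1, rng.normal(size=(n, d)) * 0.1)
bw = rnn_steps(xs, rng.normal(size=(n, n)) * 0.1, rng.normal(size=(n, d)) * 0.1,
               reverse=True)
# The bidirectional hidden state at step t concatenates both directions.
H = [np.concatenate([f, b]) for f, b in zip(fw, bw)]
print(len(H), H[0].shape)  # 5 (6,)
```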