Stanford CS224N (Deep Learning for NLP) lecture slides: cs224n-2022-lecture06-fancy-rnn

Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning

Lecture 6: Simple and LSTM Recurrent Neural Networks

Lecture Plan

1. RNN Language Models, continued (20 mins)
2. Other uses of RNNs (10 mins)
3. Exploding and vanishing gradients (15 mins)
4. LSTMs (20 mins)
5. Bidirectional and multi-layer RNNs (15 mins)

Final Projects

• Next Thursday: a lecture about choosing final projects
• It's fine to delay thinking about projects until next week
• But if you're already thinking about projects, you can view some info/inspiration on the website. It's still last year's information at present!
• It's great if you can line up your own mentor; we are also lining up some mentors

Overview

• Last lecture we learned:
  • Language models, n-gram language models, and Recurrent Neural Networks (RNNs)
• Today we'll learn how to get RNNs to work for you
  • Training and generating from RNNs
  • Uses of RNNs
  • Problems with RNNs (exploding and vanishing gradients) and how to fix them
  • These problems motivate a more sophisticated RNN architecture: LSTMs
  • And other more complex RNN options: bidirectional RNNs and multi-layer RNNs
• Next lecture we'll learn:
  • How we can do Neural Machine Translation (NMT) using an RNN-based architecture called sequence-to-sequence with attention (which is Ass4!)

1. The Simple RNN Language Model

[Figure: an unrolled RNN over the input "the students opened their". The words/one-hot vectors are mapped to word embeddings e(1)…e(4) (via the embedding matrix E), which feed the hidden states h(1)…h(4); h(0) is the initial hidden state. The final hidden state is multiplied by U to give the output distribution over next words, e.g. books, laptops, …, a, zoo. Note: this input sequence could be much longer now!]

Training the parameters of RNNs: Backpropagation for RNNs

Question: How do we calculate the derivative of the loss J^{(t)}(θ) with respect to the repeated weight matrix W_h?

Answer: The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears:

∂J^{(t)}/∂W_h = Σ_{i=1}^{t} ∂J^{(t)}/∂W_h |_{(i)}

Backpropagate over timesteps i = t, …, 0, summing gradients as you go. This algorithm is called "backpropagation through time" [Werbos, P.G., 1988, Neural Networks 1, and others]
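To make the "sum over timesteps" idea concrete, here is a minimal sketch (not from the lecture) using a scalar linear "RNN" h_t = w·h_{t-1} + x_t with loss J = h_T, where the per-step contribution of w at step i is h_{i-1}·w^{T−i}; the function names are my own for illustration.

```python
# Hedged sketch: backpropagation through time for a scalar linear recurrence
# h_t = w * h_{t-1} + x_t, with loss J = h_T.

def forward(w, xs, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs  # h_0 .. h_T

def bptt_grad(w, xs, h0=0.0):
    """dJ/dw with J = h_T: sum one contribution per timestep i where w is used."""
    hs = forward(w, xs, h0)
    T = len(xs)
    total = 0.0
    for i in range(1, T + 1):
        # use of w at step i contributes h_{i-1} * dh_T/dh_i = h_{i-1} * w**(T-i)
        total += hs[i - 1] * w ** (T - i)
    return total

w, xs = 0.9, [1.0, 0.5, -0.3, 2.0]
analytic = bptt_grad(w, xs)
eps = 1e-6
numeric = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(abs(analytic - numeric) < 1e-5)  # the summed gradient matches finite differences
```

The check against central finite differences confirms that summing per-timestep contributions recovers the total derivative.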

Generating text with an RNN Language Model

Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. Sampled output becomes next step's input.

[Figure: starting from h(0) and the input "my", each step samples a word from the output distribution (favorite, season, is, spring) and feeds it back in as the next input, generating "my favorite season is spring".]
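The repeated-sampling loop can be sketched as follows; the hand-made bigram table here is a hypothetical stand-in for the RNN-LM's next-word distribution, chosen so the loop can reproduce the slide's example sentence.

```python
# Hedged sketch of generation by repeated sampling. NEXT is an invented toy
# distribution, not the lecture's model: it stands in for the RNN-LM's output.
import random

NEXT = {
    "<START>": {"my": 1.0},
    "my": {"favorite": 0.8, "own": 0.2},
    "own": {"season": 1.0},
    "favorite": {"season": 1.0},
    "season": {"is": 1.0},
    "is": {"spring": 0.7, "fall": 0.3},
    "spring": {"<END>": 1.0},
    "fall": {"<END>": 1.0},
}

def sample_next(word, rng):
    words, probs = zip(*NEXT[word].items())
    return rng.choices(words, weights=probs, k=1)[0]

def generate(rng, max_len=10):
    out, word = [], "<START>"
    for _ in range(max_len):
        word = sample_next(word, rng)  # sampled output becomes next step's input
        if word == "<END>":
            break
        out.append(word)
    return " ".join(out)

print(generate(random.Random(0)))
```

With an actual RNN-LM, `sample_next` would run one step of the network and sample from its softmax output instead of reading a table.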

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:

Source: /@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Harry Potter:

Source: /deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on recipes:

Source: /nylki/1efbaa36635956d35bcc

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on paint color names:

This is an example of a character-level RNN-LM (predicts what character comes next)

Source: /post/160776374467/new-paint-colors-invented-by-neural-network

Evaluating Language Models

• The standard evaluation metric for Language Models is perplexity:

  perplexity = ∏_{t=1}^{T} ( 1 / P_LM(x^{(t+1)} | x^{(t)}, …, x^{(1)}) )^{1/T}

  i.e., the inverse probability of the corpus, according to the Language Model, normalized by the number of words.

• This is equal to the exponential of the cross-entropy loss J(θ): perplexity = exp(J(θ))

Lower perplexity is better!
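The equivalence between the two formulas above can be sketched numerically; the probabilities below are made up for illustration.

```python
# Hedged sketch: perplexity as the exponential of the average cross-entropy loss.
import math

def perplexity(word_probs):
    """word_probs: P_LM(x^{(t+1)} | x^{(t)},...,x^{(1)}) at each corpus position."""
    avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)  # cross-entropy J
    return math.exp(avg_nll)  # = inverse corpus probability, normalized by length

# a model that assigns probability 0.25 to every word has perplexity 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

Intuitively, perplexity 4 means the model is as uncertain as if it were choosing uniformly among 4 words at each step.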

RNNs have greatly improved perplexity

[Table from the source: perplexity of an n-gram model vs. increasingly complex RNNs. Perplexity improves (lower is better).]

Source: /building-an-efficient-neural-language-model-over-a-billion-words/

Recap

• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We've shown that RNNs are a great way to build a LM
• But RNNs are useful for much more!

Terminology and a look forward

The RNN we've seen so far = simple/vanilla/Elman RNN

Later today: You will learn about other RNN flavors, like GRU and LSTM, and multi-layer RNNs

By the end of the course: You will understand phrases like "stacked bidirectional LSTM with residual connections and self-attention"

2. Other RNN uses: RNNs can be used for sequence tagging

e.g., part-of-speech tagging, named entity recognition

DT   JJ        NN   VBN      IN    DT   NN
the  startled  cat  knocked  over  the  vase

RNNs can be used for sentence classification

e.g., sentiment classification: "overall I enjoyed the movie a lot" → positive

How to compute the sentence encoding from the RNN's hidden states?

• Basic way: use the final hidden state
• Usually better: take the element-wise max or mean of all hidden states
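The two encoding options above can be sketched directly; the hidden states here are plain lists of floats standing in for RNN outputs.

```python
# Hedged sketch: sentence encodings from per-step hidden states, matching the
# slide's "final hidden state" vs "element-wise max or mean" options.

def final_state(hiddens):
    return hiddens[-1]

def elementwise_mean(hiddens):
    n = len(hiddens)
    return [sum(h[d] for h in hiddens) / n for d in range(len(hiddens[0]))]

def elementwise_max(hiddens):
    return [max(h[d] for h in hiddens) for d in range(len(hiddens[0]))]

hs = [[0.1, -0.2], [0.4, 0.0], [0.3, 0.6]]  # h(1), h(2), h(3)
print(final_state(hs))      # [0.3, 0.6]
print(elementwise_max(hs))  # [0.4, 0.6]
print(elementwise_mean(hs)) # ≈ [0.267, 0.133]
```

The pooled encoding would then be fed to a classifier head (e.g. a softmax over sentiment labels).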

RNNs can be used as a language encoder module

e.g., question answering, machine translation, many other tasks!

Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
Question: what nationality was Beethoven?
Answer: German

Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system (with lots of neural architecture on top).

RNN-LMs can be used to generate text

e.g., speech recognition, machine translation, summarization

[Figure: the input (audio) conditions an RNN-LM, which generates "what's the weather" given "<START> what's the …".]

This is an example of a conditional language model. We'll see Machine Translation in much more detail next class.

3. Problems with RNNs: Vanishing and Exploding Gradients

Vanishing gradient intuition

[Figure: an unrolled RNN with hidden states h(1)…h(4), all using the same weights W, and loss J^{(4)}(θ) at step 4.]

By the chain rule:

∂J^{(4)}/∂h^{(1)} = ∂J^{(4)}/∂h^{(4)} · ∂h^{(4)}/∂h^{(3)} · ∂h^{(3)}/∂h^{(2)} · ∂h^{(2)}/∂h^{(1)}

What happens if these per-step factors are small?

Vanishing gradient problem: when these are small, the gradient signal gets smaller and smaller as it backpropagates further.

Vanishing gradient proof sketch (linear case)

• Recall: h^{(t)} = σ(W_h h^{(t-1)} + W_x x^{(t)} + b_1)
• What if σ were the identity function, σ(x) = x?

  ∂h^{(t)}/∂h^{(t-1)} = diag(σ′(W_h h^{(t-1)} + W_x x^{(t)} + b_1)) W_h = I W_h = W_h   (chain rule)

• Consider the gradient of the loss J^{(i)}(θ) on step i, with respect to the hidden state h^{(j)} on some previous step j. Let ℓ = i − j:

  ∂J^{(i)}/∂h^{(j)} = ∂J^{(i)}/∂h^{(i)} ∏_{j<t≤i} ∂h^{(t)}/∂h^{(t-1)} = ∂J^{(i)}/∂h^{(i)} W_h^ℓ   (chain rule; value of ∂h^{(t)}/∂h^{(t-1)})

  If W_h is "small", then this term gets exponentially problematic as ℓ becomes large.

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)

Vanishing gradient proof sketch (linear case)

• What's wrong with W_h^ℓ?
• Consider if the eigenvalues of W_h are all less than 1 (sufficient but not necessary):

  λ_1, λ_2, …, λ_n < 1, with eigenvectors q_1, q_2, …, q_n

• We can write ∂J^{(i)}/∂h^{(j)} using the eigenvectors of W_h as a basis:

  ∂J^{(i)}/∂h^{(j)} = Σ_{k=1}^{n} c_k λ_k^ℓ q_k  →  approaches 0 as ℓ grows, so the gradient vanishes

• What about nonlinear activations σ (i.e., what we use)?
  • Pretty much the same thing, except the proof requires |λ_k| < γ for some γ dependent on dimensionality and σ

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
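A small numeric illustration (not from the slides) of why W_h^ℓ vanishes when all eigenvalues are below 1: with a diagonal 2×2 W_h, repeated multiplication shrinks each entry like λ^ℓ.

```python
# Hedged sketch: powers of a 2x2 matrix with eigenvalues below 1 decay to zero.

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow2(w, ell):
    out = [[1.0, 0.0], [0.0, 1.0]]  # identity
    for _ in range(ell):
        out = matmul2(out, w)
    return out

W = [[0.9, 0.0], [0.0, 0.5]]  # diagonal, so its eigenvalues are 0.9 and 0.5
for ell in (1, 10, 50):
    p = matpow2(W, ell)
    print(ell, p[0][0], p[1][1])  # entries decay like 0.9**ell and 0.5**ell
```

By step 50, even the slower-decaying direction (λ = 0.9) has shrunk by a factor of roughly 200, which is the exponential decay the proof sketch describes.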

Why is vanishing gradient a problem?

Gradient signal from far away is lost because it's much smaller than gradient signal from close by.

So, model weights are updated only with respect to near effects, not long-term effects.

Effect of vanishing gradient on RNN-LM

• LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ___"
• To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
• But if the gradient is small, the model can't learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time

Why is exploding gradient a problem?

• If the gradient becomes too big, then the SGD update step becomes too big:

  θ^{new} = θ^{old} − α ∇_θ J(θ)   (α: learning rate; ∇_θ J(θ): gradient)

• This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
  • You think you've found a hill to climb, but suddenly you're in Iowa
• In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

Gradient clipping: solution for exploding gradient

• Gradient clipping: if the norm of the gradient g is greater than some threshold, scale it down before applying the SGD update:

  if ‖g‖ > threshold:  g ← (threshold / ‖g‖) g

• Intuition: take a step in the same direction, but a smaller step
• In practice, remembering to clip gradients is important, but exploding gradients are an easy problem to solve

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
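The clipping rule can be sketched directly; the gradient here is a plain list of floats for illustration.

```python
# Hedged sketch of norm-based gradient clipping: if ||g|| exceeds the threshold,
# rescale g to have norm exactly `threshold` before the SGD update.
import math

def clip_gradient(g, threshold):
    norm = math.sqrt(sum(x * x for x in g))
    if norm > threshold:
        scale = threshold / norm
        return [x * scale for x in g]  # same direction, smaller step
    return g

g = [3.0, 4.0]                 # norm 5
print(clip_gradient(g, 2.0))   # ≈ [1.2, 1.6], norm 2
print(clip_gradient(g, 10.0))  # below threshold, unchanged: [3.0, 4.0]
```

Deep-learning frameworks ship this as a utility (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), applied to all parameters' gradients jointly just before the optimizer step.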

How to fix the vanishing gradient problem?

• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• How about an RNN with separate memory which is added to?

4. Long Short-Term Memory RNNs (LSTMs)

• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem.
  • Everyone cites that paper but really a crucial part of the modern LSTM is from Gers et al. (2000)
• On step t, there is a hidden state h^{(t)} and a cell state c^{(t)}
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
    • The cell becomes conceptually rather like RAM in a computer
• The selection of which information is erased/written/read is controlled by three corresponding gates
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
  • The gates are dynamic: their value is computed based on the current context

"Long short-term memory", Hochreiter and Schmidhuber, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
"Learning to Forget: Continual Prediction with LSTM", Gers, Schmidhuber, and Cummins, 2000. /doi/10.1162/089976600300015015

Long Short-Term Memory (LSTM)

We have a sequence of inputs x^{(t)}, and we will compute a sequence of hidden states h^{(t)} and cell states c^{(t)}. On timestep t:

Forget gate (controls what is kept vs forgotten, from previous cell state):
  f^{(t)} = σ(W_f h^{(t-1)} + U_f x^{(t)} + b_f)
Input gate (controls what parts of the new cell content are written to cell):
  i^{(t)} = σ(W_i h^{(t-1)} + U_i x^{(t)} + b_i)
Output gate (controls what parts of cell are output to hidden state):
  o^{(t)} = σ(W_o h^{(t-1)} + U_o x^{(t)} + b_o)
New cell content (the new content to be written to the cell):
  c̃^{(t)} = tanh(W_c h^{(t-1)} + U_c x^{(t)} + b_c)
Cell state (erase ("forget") some content from the last cell state, and write ("input") some new cell content):
  c^{(t)} = f^{(t)} ∘ c^{(t-1)} + i^{(t)} ∘ c̃^{(t)}
Hidden state (read ("output") some content from the cell):
  h^{(t)} = o^{(t)} ∘ tanh(c^{(t)})

All of these are vectors of the same length n. The sigmoid function σ means all gate values are between 0 and 1. Gates are applied using the element-wise (or Hadamard) product ∘.
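As a concrete sketch of the equations above, here is one LSTM step with a single hidden dimension (n = 1), so every matrix reduces to a scalar; the weight values are arbitrary, chosen only for illustration.

```python
# Hedged sketch of one LSTM step for n = 1; weights are arbitrary scalars.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * h_prev + p["uf"] * x + p["bf"])        # forget gate
    i = sigmoid(p["wi"] * h_prev + p["ui"] * x + p["bi"])        # input gate
    o = sigmoid(p["wo"] * h_prev + p["uo"] * x + p["bo"])        # output gate
    c_new = math.tanh(p["wc"] * h_prev + p["uc"] * x + p["bc"])  # new cell content
    c = f * c_prev + i * c_new  # erase some old content, write some new
    h = o * math.tanh(c)        # read some content from the cell
    return h, c

params = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                           "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
print(h, c)

# forget gate ≈ 1 and input gate ≈ 0 (via extreme biases): cell state preserved
keep = dict(params, bf=100.0, bi=-100.0)
_, c_kept = lstm_step(0.3, 0.0, 0.7, keep)
print(abs(c_kept - 0.7) < 1e-6)  # True
```

The last two lines preview the point made later in the lecture: with the forget gate open and the input gate closed, the cell carries its value forward unchanged.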

Long Short-Term Memory (LSTM)

You can think of the LSTM equations visually like this:

[Diagram: the cell state c_{t-1} flows along the top of the cell to c_t. Compute the forget gate f_t and forget some cell content; compute the input gate i_t and the new cell content c̃_t and write some new cell content; compute the output gate o_t and output some cell content to the hidden state h_t. The + sign (in the cell-state update) is the secret!]

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Gated Recurrent Units (GRU)

• Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM.
• On each timestep t we have input x^{(t)} and hidden state h^{(t)} (no cell state).

Update gate (controls what parts of the hidden state are updated vs preserved):
  u^{(t)} = σ(W_u h^{(t-1)} + U_u x^{(t)} + b_u)
Reset gate (controls what parts of the previous hidden state are used to compute new content):
  r^{(t)} = σ(W_r h^{(t-1)} + U_r x^{(t)} + b_r)
New hidden state content (the reset gate selects useful parts of the previous hidden state; use this and the current input to compute the new hidden content):
  h̃^{(t)} = tanh(W_h (r^{(t)} ∘ h^{(t-1)}) + U_h x^{(t)} + b_h)
Hidden state (the update gate simultaneously controls what is kept from the previous hidden state, and what is updated to the new hidden state content):
  h^{(t)} = (1 − u^{(t)}) ∘ h^{(t-1)} + u^{(t)} ∘ h̃^{(t)}

How does this solve vanishing gradient? Like the LSTM, the GRU makes it easier to retain info long-term (e.g., by setting the update gate to 0).

"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, /pdf/1406.1078v3.pdf
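Mirroring the LSTM sketch, here is one GRU step for a single hidden dimension; the weights are again arbitrary scalars, and the final lines illustrate long-term retention by forcing the update gate toward 0.

```python
# Hedged sketch of one GRU step for a single hidden dimension; weights arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    u = sigmoid(p["wu"] * h_prev + p["uu"] * x + p["bu"])  # update gate
    r = sigmoid(p["wr"] * h_prev + p["ur"] * x + p["br"])  # reset gate
    h_new = math.tanh(p["wh"] * (r * h_prev) + p["uh"] * x + p["bh"])  # new content
    return (1.0 - u) * h_prev + u * h_new  # keep vs update, controlled by u

params = {k: 0.5 for k in ("wu", "uu", "bu", "wr", "ur", "br", "wh", "uh", "bh")}
print(gru_step(1.0, 0.0, params))

# update gate ≈ 0 (via an extreme bias): the hidden state is retained long-term
retain = dict(params, bu=-100.0)
print(gru_step(1.0, 0.42, retain))  # ≈ 0.42
```

Note that the GRU blends old and new hidden state with a single gate u, where the LSTM uses two independent gates (f and i) on a separate cell.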

LSTM vs GRU

• Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely used
• Rule of thumb: LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data); switch to GRUs for speed and fewer parameters
• Note: LSTMs can store unboundedly* large values in memory cell dimensions, and relatively easily learn to count. (Unlike GRUs.)

[Figure: GRU vs LSTM activations, with a single LSTM cell dimension used as a counter.]

*bounded if assuming finite precision, but still, large. Source: "On the Practical Computational Power of Finite Precision RNNs for Language Recognition", Weiss et al., 2018. /pdf/1805.04908.pdf

How does the LSTM solve vanishing gradients?

• The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
  • e.g., if the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information of that cell is preserved indefinitely.
  • In contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
  • In practice, you get about 100 timesteps rather than about 7
• The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies

LSTMs: real-world success

• In 2013–2015, LSTMs started achieving state-of-the-art results
  • Successful tasks include handwriting recognition, speech recognition, machine translation, parsing, and image captioning, as well as language models
  • LSTMs became the dominant approach for most NLP tasks
• Now (2019–2022), other approaches (e.g., Transformers) have become dominant for many tasks
  • For example, in WMT (a Machine Translation conference + competition):
  • In WMT 2014, there were 0 neural machine translation systems (!)
  • In WMT 2016, the summary report contains "RNN" 44 times (and these systems won)
  • In WMT 2019: "RNN" 7 times, "Transformer" 105 times

Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, /wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, /wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, /wmt18/pdf/WMT028.pdf

Is vanishing/exploding gradient just an RNN problem?

• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
  • Due to the chain rule / choice of nonlinearity function, the gradient can become vanishingly small as it backpropagates
  • Thus, lower layers are learned very slowly (hard to train)
• Solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)

For example:
• Residual connections aka "ResNet"
  • Also known as skip-connections
  • The identity connection preserves information by default
  • This makes deep networks much easier to train

"Deep Residual Learning for Image Recognition", He et al, 2015. /pdf/1512.03385.pdf
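A minimal sketch (not the paper's architecture) of the residual idea: the block outputs F(x) + x, so the identity path passes information through even when the transformation F contributes little.

```python
# Hedged sketch of a residual connection: output = F(x) + x.

def transformation(x):
    # stand-in for a learned layer; a fixed affine map chosen only for illustration
    return [0.1 * v + 0.05 for v in x]

def residual_block(x):
    fx = transformation(x)
    return [a + b for a, b in zip(fx, x)]  # identity connection preserves x

print(residual_block([1.0, -2.0]))  # ≈ [1.15, -2.15]
```

During backpropagation, the identity term contributes a direct gradient path of 1 around the transformation, which is why stacking many such blocks stays trainable.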

Is vanishing/exploding gradient just an RNN problem?

Other methods:
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the identity connection vs the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!

• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]

"Densely Connected Convolutional Networks", Huang et al, 2017. /pdf/1608.06993.pdf
"Highway Networks", Srivastava et al, 2015. /pdf/1505.00387.pdf
"Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf

5. Bidirectional and Multi-layer RNNs: motivation

Task: Sentiment Classification (e.g., "the movie was terribly exciting!" → positive)

[Figure: an RNN over "the movie was terribly exciting!", with the sentence encoding computed as the element-wise mean/max of the hidden states.]

We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. "the movie was"). What about right context?

In this example, "exciting" is in the right context and this modifies the meaning of "terribly" (from negative to positive).

Bidirectional RNNs

[Figure: a Forward RNN and a Backward RNN over "the movie was terribly exciting!", with their hidden states concatenated at each step.]

This contextual representation of "terribly" has both left and right context!

Bidirectional RNNs

On timestep t:

Forward RNN:  →h^{(t)} = RNN_FW(→h^{(t-1)}, x^{(t)})
Backward RNN: ←h^{(t)} = RNN_BW(←h^{(t+1)}, x^{(t)})
Concatenated hidden states: h^{(t)} = [→h^{(t)}; ←h^{(t)}]

Here RNN_FW is a general notation to mean "compute one forward step of the RNN" – it could be a simple RNN, LSTM, or other (e.g., GRU) RNN computation. Generally, these two RNNs have separate weights.

We regard the concatenation h^{(t)} as "the hidden state" of a bidirectional RNN. This is what we pass on to the next parts of the network.

Bidirectional RNNs: simplified diagram

[Figure: "the movie was terribly exciting!" with a single layer of two-way arrows; the two-way arrows indicate bidirectionality.]
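The two passes and the per-step concatenation can be sketched as follows; the hidden states are scalars and the weight pairs (w, u) are arbitrary values for illustration, with separate weights for the forward and backward directions as on the slide.

```python
# Hedged sketch of a bidirectional encoder: a left-to-right pass, a right-to-left
# pass with separate weights, and "concatenated" (here: paired) states per step.
import math

def rnn_step(h_prev, x, w, u):
    return math.tanh(w * h_prev + u * x)  # one simple-RNN step, scalar hidden state

def birnn_encode(xs, fw=(0.5, 1.0), bw=(0.3, 1.0)):
    h, forward = 0.0, []
    for x in xs:                    # forward RNN: left to right
        h = rnn_step(h, x, *fw)
        forward.append(h)
    h, backward = 0.0, []
    for x in reversed(xs):          # backward RNN: right to left, separate weights
        h = rnn_step(h, x, *bw)
        backward.append(h)
    backward.reverse()
    # per-timestep "hidden state" of the biRNN: [forward; backward]
    return list(zip(forward, backward))

states = birnn_encode([0.2, -0.4, 0.9])
print(len(states), states[1])  # each state carries both left and right context
```

Note the whole input must be available before the backward pass can run, which is why bidirectional RNNs suit encoding tasks but not language modeling.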
