Stanford CS224N (Deep Learning for NLP) lecture slides: cs224n-2022-lecture06-fancy-rnn

Natural Language Processing with Deep Learning

CS224N/Ling284

Christopher Manning

Lecture 6: Simple and LSTM Recurrent Neural Networks

Lecture Plan

1. RNN Language Models, continued (20 mins)
2. Other uses of RNNs (10 mins)
3. Exploding and vanishing gradients (15 mins)
4. LSTMs (20 mins)
5. Bidirectional and multi-layer RNNs (15 mins)

Final Projects

• Next Thursday: a lecture about choosing final projects
• It's fine to delay thinking about projects until next week
• But if you're already thinking about projects, you can view some info/inspiration on the website. It's still last year's information at present!
• It's great if you can line up your own mentor; we are also lining up some mentors

Overview

• Last lecture we learned:
  • Language models, n-gram language models, and Recurrent Neural Networks (RNNs)
• Today we'll learn how to get RNNs to work for you
  • Training and generating from RNNs
  • Uses of RNNs
  • Problems with RNNs (exploding and vanishing gradients) and how to fix them
  • These problems motivate a more sophisticated RNN architecture: LSTMs
  • And other more complex RNN options: bidirectional RNNs and multi-layer RNNs
• Next lecture we'll learn:
  • How we can do Neural Machine Translation (NMT) using an RNN-based architecture called sequence-to-sequence with attention (which is Ass4!)

1. The Simple RNN Language Model

[Figure: an unrolled RNN over the input "the students opened their". The words/one-hot vectors are mapped to word embeddings e(1)…e(4) (via the embedding matrix E), which feed the hidden states h(1)…h(4); h(0) is the initial hidden state. The final hidden state is multiplied by U to give the output distribution over next words, e.g. books, laptops, …, a, zoo. Note: this input sequence could be much longer now!]

Training the parameters of RNNs: Backpropagation for RNNs

Question: How do we calculate the derivative of the loss J^{(t)}(θ) with respect to the repeated weight matrix W_h?

Answer: The gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears:

∂J^{(t)}/∂W_h = Σ_{i=1}^{t} ∂J^{(t)}/∂W_h |_{(i)}

Backpropagate over timesteps i = t, …, 0, summing gradients as you go. This algorithm is called "backpropagation through time" [Werbos, P.G., 1988, Neural Networks 1, and others]
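To make the "sum over timesteps" idea concrete, here is a minimal sketch (not from the lecture) using a scalar linear "RNN" h_t = w·h_{t-1} + x_t with loss J = h_T, where the per-step contribution of w at step i is h_{i-1}·w^{T−i}; the function names are my own for illustration.

```python
# Hedged sketch: backpropagation through time for a scalar linear recurrence
# h_t = w * h_{t-1} + x_t, with loss J = h_T.

def forward(w, xs, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs  # h_0 .. h_T

def bptt_grad(w, xs, h0=0.0):
    """dJ/dw with J = h_T: sum one contribution per timestep i where w is used."""
    hs = forward(w, xs, h0)
    T = len(xs)
    total = 0.0
    for i in range(1, T + 1):
        # use of w at step i contributes h_{i-1} * dh_T/dh_i = h_{i-1} * w**(T-i)
        total += hs[i - 1] * w ** (T - i)
    return total

w, xs = 0.9, [1.0, 0.5, -0.3, 2.0]
analytic = bptt_grad(w, xs)
eps = 1e-6
numeric = (forward(w + eps, xs)[-1] - forward(w - eps, xs)[-1]) / (2 * eps)
print(abs(analytic - numeric) < 1e-5)  # the summed gradient matches finite differences
```

The check against central finite differences confirms that summing per-timestep contributions recovers the total derivative.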

Generating text with an RNN Language Model

Just like an n-gram Language Model, you can use an RNN Language Model to generate text by repeated sampling. Sampled output becomes next step's input.

[Figure: starting from h(0) and the input "my", each step samples a word from the output distribution (favorite, season, is, spring) and feeds it back in as the next input, generating "my favorite season is spring".]
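The repeated-sampling loop can be sketched as follows; the hand-made bigram table here is a hypothetical stand-in for the RNN-LM's next-word distribution, chosen so the loop can reproduce the slide's example sentence.

```python
# Hedged sketch of generation by repeated sampling. NEXT is an invented toy
# distribution, not the lecture's model: it stands in for the RNN-LM's output.
import random

NEXT = {
    "<START>": {"my": 1.0},
    "my": {"favorite": 0.8, "own": 0.2},
    "own": {"season": 1.0},
    "favorite": {"season": 1.0},
    "season": {"is": 1.0},
    "is": {"spring": 0.7, "fall": 0.3},
    "spring": {"<END>": 1.0},
    "fall": {"<END>": 1.0},
}

def sample_next(word, rng):
    words, probs = zip(*NEXT[word].items())
    return rng.choices(words, weights=probs, k=1)[0]

def generate(rng, max_len=10):
    out, word = [], "<START>"
    for _ in range(max_len):
        word = sample_next(word, rng)  # sampled output becomes next step's input
        if word == "<END>":
            break
        out.append(word)
    return " ".join(out)

print(generate(random.Random(0)))
```

With an actual RNN-LM, `sample_next` would run one step of the network and sample from its softmax output instead of reading a table.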

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Obama speeches:

Source: /@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on Harry Potter:

Source: /deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on recipes:

Source: /nylki/1efbaa36635956d35bcc

Generating text with an RNN Language Model

Let's have some fun!

• You can train an RNN-LM on any kind of text, then generate text in that style.
• RNN-LM trained on paint color names:

This is an example of a character-level RNN-LM (predicts what character comes next)

Source: /post/160776374467/new-paint-colors-invented-by-neural-network

Evaluating Language Models

• The standard evaluation metric for Language Models is perplexity:

  perplexity = ∏_{t=1}^{T} ( 1 / P_LM(x^{(t+1)} | x^{(t)}, …, x^{(1)}) )^{1/T}

  i.e., the inverse probability of the corpus, according to the Language Model, normalized by the number of words.

• This is equal to the exponential of the cross-entropy loss J(θ): perplexity = exp(J(θ))

Lower perplexity is better!
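The equivalence between the two formulas above can be sketched numerically; the probabilities below are made up for illustration.

```python
# Hedged sketch: perplexity as the exponential of the average cross-entropy loss.
import math

def perplexity(word_probs):
    """word_probs: P_LM(x^{(t+1)} | x^{(t)},...,x^{(1)}) at each corpus position."""
    avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)  # cross-entropy J
    return math.exp(avg_nll)  # = inverse corpus probability, normalized by length

# a model that assigns probability 0.25 to every word has perplexity 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

Intuitively, perplexity 4 means the model is as uncertain as if it were choosing uniformly among 4 words at each step.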

RNNs have greatly improved perplexity

[Table from the source: perplexity of an n-gram model vs. increasingly complex RNNs. Perplexity improves (lower is better).]

Source: /building-an-efficient-neural-language-model-over-a-billion-words/

Recap

• Language Model: A system that predicts the next word
• Recurrent Neural Network: A family of neural networks that:
  • Take sequential input of any length
  • Apply the same weights on each step
  • Can optionally produce output on each step
• Recurrent Neural Network ≠ Language Model
• We've shown that RNNs are a great way to build a LM
• But RNNs are useful for much more!

Terminology and a look forward

The RNN we've seen so far = simple/vanilla/Elman RNN

Later today: You will learn about other RNN flavors, like GRU and LSTM, and multi-layer RNNs

By the end of the course: You will understand phrases like "stacked bidirectional LSTM with residual connections and self-attention"

2. Other RNN uses: RNNs can be used for sequence tagging

e.g., part-of-speech tagging, named entity recognition

DT   JJ        NN   VBN      IN    DT   NN
the  startled  cat  knocked  over  the  vase

RNNs can be used for sentence classification

e.g., sentiment classification: "overall I enjoyed the movie a lot" → positive

How to compute the sentence encoding from the RNN's hidden states?

• Basic way: use the final hidden state
• Usually better: take the element-wise max or mean of all hidden states
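The two encoding options above can be sketched directly; the hidden states here are plain lists of floats standing in for RNN outputs.

```python
# Hedged sketch: sentence encodings from per-step hidden states, matching the
# slide's "final hidden state" vs "element-wise max or mean" options.

def final_state(hiddens):
    return hiddens[-1]

def elementwise_mean(hiddens):
    n = len(hiddens)
    return [sum(h[d] for h in hiddens) / n for d in range(len(hiddens[0]))]

def elementwise_max(hiddens):
    return [max(h[d] for h in hiddens) for d in range(len(hiddens[0]))]

hs = [[0.1, -0.2], [0.4, 0.0], [0.3, 0.6]]  # h(1), h(2), h(3)
print(final_state(hs))      # [0.3, 0.6]
print(elementwise_max(hs))  # [0.4, 0.6]
print(elementwise_mean(hs)) # ≈ [0.267, 0.133]
```

The pooled encoding would then be fed to a classifier head (e.g. a softmax over sentiment labels).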

RNNs can be used as a language encoder module

e.g., question answering, machine translation, many other tasks!

Context: Ludwig van Beethoven was a German composer and pianist. A crucial figure …
Question: what nationality was Beethoven?
Answer: German

Here the RNN acts as an encoder for the Question (the hidden states represent the Question). The encoder is part of a larger neural system (with lots of neural architecture on top).

RNN-LMs can be used to generate text

e.g., speech recognition, machine translation, summarization

[Figure: the input (audio) conditions an RNN-LM, which generates "what's the weather" given "<START> what's the …".]

This is an example of a conditional language model. We'll see Machine Translation in much more detail next class.

3. Problems with RNNs: Vanishing and Exploding Gradients

Vanishing gradient intuition

[Figure: an unrolled RNN with hidden states h(1)…h(4), all using the same weights W, and loss J^{(4)}(θ) at step 4.]

By the chain rule:

∂J^{(4)}/∂h^{(1)} = ∂J^{(4)}/∂h^{(4)} · ∂h^{(4)}/∂h^{(3)} · ∂h^{(3)}/∂h^{(2)} · ∂h^{(2)}/∂h^{(1)}

What happens if these per-step factors are small?

Vanishing gradient problem: when these are small, the gradient signal gets smaller and smaller as it backpropagates further.

Vanishing gradient proof sketch (linear case)

• Recall: h^{(t)} = σ(W_h h^{(t-1)} + W_x x^{(t)} + b_1)
• What if σ were the identity function, σ(x) = x?

  ∂h^{(t)}/∂h^{(t-1)} = diag(σ′(W_h h^{(t-1)} + W_x x^{(t)} + b_1)) W_h = I W_h = W_h   (chain rule)

• Consider the gradient of the loss J^{(i)}(θ) on step i, with respect to the hidden state h^{(j)} on some previous step j. Let ℓ = i − j:

  ∂J^{(i)}/∂h^{(j)} = ∂J^{(i)}/∂h^{(i)} ∏_{j<t≤i} ∂h^{(t)}/∂h^{(t-1)} = ∂J^{(i)}/∂h^{(i)} W_h^ℓ   (chain rule; value of ∂h^{(t)}/∂h^{(t-1)})

  If W_h is "small", then this term gets exponentially problematic as ℓ becomes large.

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)

Vanishing gradient proof sketch (linear case)

• What's wrong with W_h^ℓ?
• Consider if the eigenvalues of W_h are all less than 1 (sufficient but not necessary):

  λ_1, λ_2, …, λ_n < 1, with eigenvectors q_1, q_2, …, q_n

• We can write ∂J^{(i)}/∂h^{(j)} using the eigenvectors of W_h as a basis:

  ∂J^{(i)}/∂h^{(j)} = Σ_{k=1}^{n} c_k λ_k^ℓ q_k  →  approaches 0 as ℓ grows, so the gradient vanishes

• What about nonlinear activations σ (i.e., what we use)?
  • Pretty much the same thing, except the proof requires |λ_k| < γ for some γ dependent on dimensionality and σ

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf (and supplemental materials, at http://proceedings.mlr.press/v28/pascanu13-supp.pdf)
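A small numeric illustration (not from the slides) of why W_h^ℓ vanishes when all eigenvalues are below 1: with a diagonal 2×2 W_h, repeated multiplication shrinks each entry like λ^ℓ.

```python
# Hedged sketch: powers of a 2x2 matrix with eigenvalues below 1 decay to zero.

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow2(w, ell):
    out = [[1.0, 0.0], [0.0, 1.0]]  # identity
    for _ in range(ell):
        out = matmul2(out, w)
    return out

W = [[0.9, 0.0], [0.0, 0.5]]  # diagonal, so its eigenvalues are 0.9 and 0.5
for ell in (1, 10, 50):
    p = matpow2(W, ell)
    print(ell, p[0][0], p[1][1])  # entries decay like 0.9**ell and 0.5**ell
```

By step 50, even the slower-decaying direction (λ = 0.9) has shrunk by a factor of roughly 200, which is the exponential decay the proof sketch describes.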

Why is vanishing gradient a problem?

Gradient signal from far away is lost because it's much smaller than gradient signal from close by.

So, model weights are updated only with respect to near effects, not long-term effects.

Effect of vanishing gradient on RNN-LM

• LM task: "When she tried to print her tickets, she found that the printer was out of toner. She went to the stationery store to buy more toner. It was very overpriced. After installing the toner into the printer, she finally printed her ___"
• To learn from this training example, the RNN-LM needs to model the dependency between "tickets" on the 7th step and the target word "tickets" at the end.
• But if the gradient is small, the model can't learn this dependency
• So, the model is unable to predict similar long-distance dependencies at test time

Why is exploding gradient a problem?

• If the gradient becomes too big, then the SGD update step becomes too big:

  θ^{new} = θ^{old} − α ∇_θ J(θ)   (α: learning rate; ∇_θ J(θ): gradient)

• This can cause bad updates: we take too large a step and reach a weird and bad parameter configuration (with large loss)
  • You think you've found a hill to climb, but suddenly you're in Iowa
• In the worst case, this will result in Inf or NaN in your network (then you have to restart training from an earlier checkpoint)

Gradient clipping: solution for exploding gradient

• Gradient clipping: if the norm of the gradient g is greater than some threshold, scale it down before applying the SGD update:

  if ‖g‖ > threshold:  g ← (threshold / ‖g‖) g

• Intuition: take a step in the same direction, but a smaller step
• In practice, remembering to clip gradients is important, but exploding gradients are an easy problem to solve

Source: "On the difficulty of training recurrent neural networks", Pascanu et al, 2013. http://proceedings.mlr.press/v28/pascanu13.pdf
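The clipping rule can be sketched directly; the gradient here is a plain list of floats for illustration.

```python
# Hedged sketch of norm-based gradient clipping: if ||g|| exceeds the threshold,
# rescale g to have norm exactly `threshold` before the SGD update.
import math

def clip_gradient(g, threshold):
    norm = math.sqrt(sum(x * x for x in g))
    if norm > threshold:
        scale = threshold / norm
        return [x * scale for x in g]  # same direction, smaller step
    return g

g = [3.0, 4.0]                 # norm 5
print(clip_gradient(g, 2.0))   # ≈ [1.2, 1.6], norm 2
print(clip_gradient(g, 10.0))  # below threshold, unchanged: [3.0, 4.0]
```

Deep-learning frameworks ship this as a utility (e.g. PyTorch's `torch.nn.utils.clip_grad_norm_`), applied to all parameters' gradients jointly just before the optimizer step.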

How to fix the vanishing gradient problem?

• The main problem is that it's too difficult for the RNN to learn to preserve information over many timesteps.
• In a vanilla RNN, the hidden state is constantly being rewritten
• How about an RNN with separate memory which is added to?

4. Long Short-Term Memory RNNs (LSTMs)

• A type of RNN proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradients problem.
  • Everyone cites that paper but really a crucial part of the modern LSTM is from Gers et al. (2000)
• On step t, there is a hidden state h^{(t)} and a cell state c^{(t)}
  • Both are vectors of length n
  • The cell stores long-term information
  • The LSTM can read, erase, and write information from the cell
    • The cell becomes conceptually rather like RAM in a computer
• The selection of which information is erased/written/read is controlled by three corresponding gates
  • The gates are also vectors of length n
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
  • The gates are dynamic: their value is computed based on the current context

"Long short-term memory", Hochreiter and Schmidhuber, 1997. https://www.bioinf.jku.at/publications/older/2604.pdf
"Learning to Forget: Continual Prediction with LSTM", Gers, Schmidhuber, and Cummins, 2000. /doi/10.1162/089976600300015015

Long Short-Term Memory (LSTM)

We have a sequence of inputs x^{(t)}, and we will compute a sequence of hidden states h^{(t)} and cell states c^{(t)}. On timestep t:

Forget gate (controls what is kept vs forgotten, from previous cell state):
  f^{(t)} = σ(W_f h^{(t-1)} + U_f x^{(t)} + b_f)
Input gate (controls what parts of the new cell content are written to cell):
  i^{(t)} = σ(W_i h^{(t-1)} + U_i x^{(t)} + b_i)
Output gate (controls what parts of cell are output to hidden state):
  o^{(t)} = σ(W_o h^{(t-1)} + U_o x^{(t)} + b_o)
New cell content (the new content to be written to the cell):
  c̃^{(t)} = tanh(W_c h^{(t-1)} + U_c x^{(t)} + b_c)
Cell state (erase ("forget") some content from the last cell state, and write ("input") some new cell content):
  c^{(t)} = f^{(t)} ∘ c^{(t-1)} + i^{(t)} ∘ c̃^{(t)}
Hidden state (read ("output") some content from the cell):
  h^{(t)} = o^{(t)} ∘ tanh(c^{(t)})

All of these are vectors of the same length n. The sigmoid function σ means all gate values are between 0 and 1. Gates are applied using the element-wise (or Hadamard) product ∘.
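As a concrete sketch of the equations above, here is one LSTM step with a single hidden dimension (n = 1), so every matrix reduces to a scalar; the weight values are arbitrary, chosen only for illustration.

```python
# Hedged sketch of one LSTM step for n = 1; weights are arbitrary scalars.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * h_prev + p["uf"] * x + p["bf"])        # forget gate
    i = sigmoid(p["wi"] * h_prev + p["ui"] * x + p["bi"])        # input gate
    o = sigmoid(p["wo"] * h_prev + p["uo"] * x + p["bo"])        # output gate
    c_new = math.tanh(p["wc"] * h_prev + p["uc"] * x + p["bc"])  # new cell content
    c = f * c_prev + i * c_new  # erase some old content, write some new
    h = o * math.tanh(c)        # read some content from the cell
    return h, c

params = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                           "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
print(h, c)

# forget gate ≈ 1 and input gate ≈ 0 (via extreme biases): cell state preserved
keep = dict(params, bf=100.0, bi=-100.0)
_, c_kept = lstm_step(0.3, 0.0, 0.7, keep)
print(abs(c_kept - 0.7) < 1e-6)  # True
```

The last two lines preview the point made later in the lecture: with the forget gate open and the input gate closed, the cell carries its value forward unchanged.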

Long Short-Term Memory (LSTM)

You can think of the LSTM equations visually like this:

[Diagram: the cell state c_{t-1} flows along the top of the cell to c_t. Compute the forget gate f_t and forget some cell content; compute the input gate i_t and the new cell content c̃_t and write some new cell content; compute the output gate o_t and output some cell content to the hidden state h_t. The + sign (in the cell-state update) is the secret!]

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Gated Recurrent Units (GRU)

• Proposed by Cho et al. in 2014 as a simpler alternative to the LSTM.
• On each timestep t we have input x^{(t)} and hidden state h^{(t)} (no cell state).

Update gate (controls what parts of the hidden state are updated vs preserved):
  u^{(t)} = σ(W_u h^{(t-1)} + U_u x^{(t)} + b_u)
Reset gate (controls what parts of the previous hidden state are used to compute new content):
  r^{(t)} = σ(W_r h^{(t-1)} + U_r x^{(t)} + b_r)
New hidden state content (the reset gate selects useful parts of the previous hidden state; use this and the current input to compute the new hidden content):
  h̃^{(t)} = tanh(W_h (r^{(t)} ∘ h^{(t-1)}) + U_h x^{(t)} + b_h)
Hidden state (the update gate simultaneously controls what is kept from the previous hidden state, and what is updated to the new hidden state content):
  h^{(t)} = (1 − u^{(t)}) ∘ h^{(t-1)} + u^{(t)} ∘ h̃^{(t)}

How does this solve vanishing gradient? Like the LSTM, the GRU makes it easier to retain info long-term (e.g., by setting the update gate to 0).

"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, /pdf/1406.1078v3.pdf
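Mirroring the LSTM sketch, here is one GRU step for a single hidden dimension; the weights are again arbitrary scalars, and the final lines illustrate long-term retention by forcing the update gate toward 0.

```python
# Hedged sketch of one GRU step for a single hidden dimension; weights arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    u = sigmoid(p["wu"] * h_prev + p["uu"] * x + p["bu"])  # update gate
    r = sigmoid(p["wr"] * h_prev + p["ur"] * x + p["br"])  # reset gate
    h_new = math.tanh(p["wh"] * (r * h_prev) + p["uh"] * x + p["bh"])  # new content
    return (1.0 - u) * h_prev + u * h_new  # keep vs update, controlled by u

params = {k: 0.5 for k in ("wu", "uu", "bu", "wr", "ur", "br", "wh", "uh", "bh")}
print(gru_step(1.0, 0.0, params))

# update gate ≈ 0 (via an extreme bias): the hidden state is retained long-term
retain = dict(params, bu=-100.0)
print(gru_step(1.0, 0.42, retain))  # ≈ 0.42
```

Note that the GRU blends old and new hidden state with a single gate u, where the LSTM uses two independent gates (f and i) on a separate cell.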

LSTM vs GRU

• Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely used
• Rule of thumb: LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data); switch to GRUs for speed and fewer parameters
• Note: LSTMs can store unboundedly* large values in memory cell dimensions, and relatively easily learn to count. (Unlike GRUs.)

[Figure: GRU vs LSTM activations, with a single LSTM cell dimension used as a counter.]

*bounded if assuming finite precision, but still, large. Source: "On the Practical Computational Power of Finite Precision RNNs for Language Recognition", Weiss et al., 2018. /pdf/1805.04908.pdf

How does the LSTM solve vanishing gradients?

• The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
  • e.g., if the forget gate is set to 1 for a cell dimension and the input gate set to 0, then the information of that cell is preserved indefinitely.
  • In contrast, it's harder for a vanilla RNN to learn a recurrent weight matrix W_h that preserves info in the hidden state
  • In practice, you get about 100 timesteps rather than about 7
• The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies

LSTMs: real-world success

• In 2013–2015, LSTMs started achieving state-of-the-art results
  • Successful tasks include handwriting recognition, speech recognition, machine translation, parsing, and image captioning, as well as language models
  • LSTMs became the dominant approach for most NLP tasks
• Now (2019–2022), other approaches (e.g., Transformers) have become dominant for many tasks
  • For example, in WMT (a Machine Translation conference + competition):
  • In WMT 2014, there were 0 neural machine translation systems (!)
  • In WMT 2016, the summary report contains "RNN" 44 times (and these systems won)
  • In WMT 2019: "RNN" 7 times, "Transformer" 105 times

Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, /wmt16/pdf/W16-2301.pdf
Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, /wmt18/pdf/WMT028.pdf
Source: "Findings of the 2019 Conference on Machine Translation (WMT19)", Barrault et al. 2019, /wmt18/pdf/WMT028.pdf

Is vanishing/exploding gradient just an RNN problem?

• No! It can be a problem for all neural architectures (including feed-forward and convolutional), especially very deep ones.
  • Due to the chain rule / choice of nonlinearity function, the gradient can become vanishingly small as it backpropagates
  • Thus, lower layers are learned very slowly (hard to train)
• Solution: lots of new deep feedforward/convolutional architectures add more direct connections (thus allowing the gradient to flow)

For example:
• Residual connections aka "ResNet"
  • Also known as skip-connections
  • The identity connection preserves information by default
  • This makes deep networks much easier to train

"Deep Residual Learning for Image Recognition", He et al, 2015. /pdf/1512.03385.pdf
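A minimal sketch (not the paper's architecture) of the residual idea: the block outputs F(x) + x, so the identity path passes information through even when the transformation F contributes little.

```python
# Hedged sketch of a residual connection: output = F(x) + x.

def transformation(x):
    # stand-in for a learned layer; a fixed affine map chosen only for illustration
    return [0.1 * v + 0.05 for v in x]

def residual_block(x):
    fx = transformation(x)
    return [a + b for a, b in zip(fx, x)]  # identity connection preserves x

print(residual_block([1.0, -2.0]))  # ≈ [1.15, -2.15]
```

During backpropagation, the identity term contributes a direct gradient path of 1 around the transformation, which is why stacking many such blocks stays trainable.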

Is vanishing/exploding gradient just an RNN problem?

Other methods:
• Highway connections aka "HighwayNet"
  • Similar to residual connections, but the identity connection vs the transformation layer is controlled by a dynamic gate
  • Inspired by LSTMs, but applied to deep feedforward/convolutional networks
• Dense connections aka "DenseNet"
  • Directly connect each layer to all future layers!

• Conclusion: Though vanishing/exploding gradients are a general problem, RNNs are particularly unstable due to the repeated multiplication by the same weight matrix [Bengio et al, 1994]

"Densely Connected Convolutional Networks", Huang et al, 2017. /pdf/1608.06993.pdf
"Highway Networks", Srivastava et al, 2015. /pdf/1505.00387.pdf
"Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf

5. Bidirectional and Multi-layer RNNs: motivation

Task: Sentiment Classification (e.g., "the movie was terribly exciting!" → positive)

[Figure: an RNN over "the movie was terribly exciting!", with the sentence encoding computed as the element-wise mean/max of the hidden states.]

We can regard the hidden state at "terribly" as a representation of the word "terribly" in the context of this sentence. We call this a contextual representation.

These contextual representations only contain information about the left context (e.g. "the movie was"). What about right context?

In this example, "exciting" is in the right context and this modifies the meaning of "terribly" (from negative to positive).

Bidirectional RNNs

[Figure: a Forward RNN and a Backward RNN over "the movie was terribly exciting!", with their hidden states concatenated at each step.]

This contextual representation of "terribly" has both left and right context!

Bidirectional RNNs

On timestep t:

Forward RNN:  →h^{(t)} = RNN_FW(→h^{(t-1)}, x^{(t)})
Backward RNN: ←h^{(t)} = RNN_BW(←h^{(t+1)}, x^{(t)})
Concatenated hidden states: h^{(t)} = [→h^{(t)}; ←h^{(t)}]

Here RNN_FW is a general notation to mean "compute one forward step of the RNN" – it could be a simple RNN, LSTM, or other (e.g., GRU) RNN computation. Generally, these two RNNs have separate weights.

We regard the concatenation h^{(t)} as "the hidden state" of a bidirectional RNN. This is what we pass on to the next parts of the network.

Bidirectional RNNs: simplified diagram

[Figure: "the movie was terribly exciting!" with a single layer of two-way arrows; the two-way arrows indicate bidirectionality.]
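The two passes and the per-step concatenation can be sketched as follows; the hidden states are scalars and the weight pairs (w, u) are arbitrary values for illustration, with separate weights for the forward and backward directions as on the slide.

```python
# Hedged sketch of a bidirectional encoder: a left-to-right pass, a right-to-left
# pass with separate weights, and "concatenated" (here: paired) states per step.
import math

def rnn_step(h_prev, x, w, u):
    return math.tanh(w * h_prev + u * x)  # one simple-RNN step, scalar hidden state

def birnn_encode(xs, fw=(0.5, 1.0), bw=(0.3, 1.0)):
    h, forward = 0.0, []
    for x in xs:                    # forward RNN: left to right
        h = rnn_step(h, x, *fw)
        forward.append(h)
    h, backward = 0.0, []
    for x in reversed(xs):          # backward RNN: right to left, separate weights
        h = rnn_step(h, x, *bw)
        backward.append(h)
    backward.reverse()
    # per-timestep "hidden state" of the biRNN: [forward; backward]
    return list(zip(forward, backward))

states = birnn_encode([0.2, -0.4, 0.9])
print(len(states), states[1])  # each state carries both left and right context
```

Note the whole input must be available before the backward pass can run, which is why bidirectional RNNs suit encoding tasks but not language modeling.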
