
Natural Language Processing
with Deep Learning

CS224N / Ling284

John Hewitt

Lecture 10: Pretraining

Lecture Plan

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Reminders:

Assignment 5 is out today! It covers lecture 9 (Tuesday) and lecture 10 (today)! It has ~pedagogically relevant math~ so get started!

Word structure and subword models

Let's take a look at the assumptions we've made about a language's vocabulary.

We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK.

[Figure: word → vocab mapping → embedding]
  Common words:              hat, learn        → pizza (index), tasty (index)
  Variations, misspellings:  taaaaasty, laern  → UNK (index), UNK (index)
  Novel items:               Transformerify    → UNK (index)

Word structure and subword models

Finite vocabulary assumptions make even less sense in many languages.

• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)

Here's a small fraction of the conjugations for ambia – to tell.

[Wiktionary]

The byte-pair encoding algorithm

Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)

• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.

Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary:

1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a, b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained models.

[Sennrich et al., 2016, Wu et al., 2016]
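To make the merge loop concrete, here is a minimal sketch of BPE vocabulary learning over a toy word-frequency dictionary. The function name, the corpus, and the end-of-word string are illustrative, not from the lecture.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict mapping words to corpus frequencies."""
    # Start from characters plus an end-of-word symbol (step 1 above).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus (step 2).
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged subword (step 3).
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy usage: frequent pairs like ("t", "a") and ("s", "t") get merged first.
print(learn_bpe({"tasty": 5, "taste": 3, "taa": 2}, num_merges=4))
```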

Word structure and subword models

Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components.

In the worst case, words are split into as many subwords as they have characters.

[Figure: word → vocab mapping → embedding]
  hat            → hat                  [common word]
  learn          → learn                [common word]
  taaaaasty      → taa## aaa## sty      [variation]
  laern          → la## ern##           [misspelling]
  Transformerify → Transformer## ify    [novel item]

Outline

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Motivating word meaning and context

Recall the adage we mentioned at the beginning of the course:

"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

This quote is a summary of distributional semantics, and motivated word2vec. But:

"… the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J. R. Firth 1935)

Consider I record the record: the two instances of record mean different things.

[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]

Where we were: pretrained word embeddings

Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.

Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!

[Figure: a not-pretrained LSTM/Transformer on top of pretrained word embeddings, reading "… the movie was …".]

[Recall, movie gets the same word embedding, no matter what sentence it shows up in.]

Where we're going: pretraining whole models

In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.

This has been exceptionally effective at building strong:
• representations of language
• parameter initializations for strong NLP models
• probability distributions over language that we can sample from

[Figure: the whole network is pretrained jointly, reading "… the movie was …".]

[This model has learned how to represent entire sentences through pretraining.]

What can we learn from reconstructing the input?

• Stanford University is located in __________, California.
• I put ___ fork down on the table.
• The woman walked across the street, checking for traffic over ___ shoulder.
• I went to the ocean to see the fish, turtles, seals, and _____.
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

The Transformer Encoder-Decoder [Vaswani et al., 2017]

Looking back at the whole model, zooming in on an Encoder block:

[Figure: the input sequence enters Word Embeddings + Position Representations, then a stack of Transformer Encoder blocks, each consisting of Multi-Head Attention, Residual + LayerNorm, Feed-Forward, Residual + LayerNorm. The decoder attends to the encoder states.]

Looking back at the whole model, zooming in on a Decoder block:

[Figure: the output sequence enters Word Embeddings + Position Representations, then a stack of Transformer Decoder blocks, each consisting of Masked Multi-Head Self-Attention, Residual + LayerNorm, Multi-Head Cross-Attention, Residual + LayerNorm, Feed-Forward, Residual + LayerNorm, leading to the predictions.]

The only new part is attention from decoder to encoder. Like we saw last week!

Pretraining through language modeling [Dai and Le, 2015]

Recall the language modeling task:
• Model p_θ(w_t | w_{1:t−1}), the probability distribution over words given their past contexts.
• There's lots of data for this! (In English.)

Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.

[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]
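As a minimal sketch of what "train a neural network to perform language modeling" looks like in code, here is the per-batch loss for a decoder. The PyTorch-style `model` interface (token ids in, next-token logits out) is an assumption for illustration, not the lecture's implementation.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Cross-entropy for predicting each word from its past context.

    token_ids: LongTensor of shape (batch, seq_len).
    model(inputs) is assumed to return logits of shape (batch, seq_len - 1, vocab).
    """
    inputs = token_ids[:, :-1]   # w_1 ... w_{T-1}
    targets = token_ids[:, 1:]   # w_2 ... w_T  (each word predicted from its past)
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Pretraining loop (schematic): minimize this loss on a large amount of text,
# then save the network parameters for later finetuning.
```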

The Pretraining / Finetuning Paradigm

Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!
[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]

Step 2: Finetune (on your task). Not many labels; adapt to the task!
[Figure: the same Decoder reads "… the movie was …" and predicts a task label ☺/☹.]

Stochastic gradient descent and pretrain/finetune

Why should pretraining and finetuning help, from a "training neural nets" perspective?

• Consider: pretraining provides parameters θ̂ by approximating min_θ L_pretrain(θ).
  • (The pretraining loss.)
• Then, finetuning approximates min_θ L_finetune(θ), starting at θ̂.
  • (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
  • So, maybe the finetuning local minima near θ̂ tend to generalize well!
  • And/or, maybe the gradients of finetuning loss near θ̂ propagate nicely!
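Written out, the two-stage view above is just two optimization problems, the second initialized at the (approximate) solution of the first. This is a restatement of the bullets, not additional material:

```latex
\[
\hat{\theta} \;\approx\; \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta),
\qquad
\theta^{\ast} \;\approx\; \arg\min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta)
\quad \text{with SGD initialized at } \hat{\theta}.
\]
```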

Lecture Plan

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model p_θ(w_t | w_{1:t−1}).

We can finetune them by training a classifier on the last word's hidden state:

  h_1, …, h_T = Decoder(w_1, …, w_T)
  y ∼ A h_T + b

where A and b are randomly initialized and specified by the downstream task. Gradients backpropagate through the whole network.

[Figure: a linear layer (A, b) applied to the decoder's last hidden state; the input is w_1, …, w_T and the output is a task label ☺/☹.]

[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
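A minimal sketch of this finetuning setup, assuming a PyTorch-style pretrained decoder that returns its hidden states h_1, …, h_T; the class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Linear classifier on the last word's hidden state of a pretrained decoder."""

    def __init__(self, pretrained_decoder, hidden_dim, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder                      # parameters from pretraining
        self.classifier = nn.Linear(hidden_dim, num_classes)   # A, b: randomly initialized

    def forward(self, token_ids):
        hidden = self.decoder(token_ids)   # (batch, seq_len, hidden_dim) = h_1, ..., h_T
        h_last = hidden[:, -1, :]          # h_T, the last word's hidden state
        return self.classifier(h_last)     # logits for y ~ A h_T + b

# During finetuning, the task loss backpropagates through the classifier and
# the whole pretrained decoder (nothing is frozen by default).
```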

Pretraining decoders

It's natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t−1})!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)

  h_1, …, h_T = Decoder(w_1, …, w_T)
  w_t ∼ A h_{t−1} + b

where A, b were pretrained in the language model!

[Figure: the decoder reads w_1 … w_5 and predicts w_2 … w_6, one step ahead at each position.]

[Note how the linear layer has been pretrained.]
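As a sketch of the "use them as generators" part, here is greedy autoregressive decoding from such a model, taking the argmax of p_θ(w_t | w_{1:t−1}) at each step. The `model` interface (token ids in, one logit vector per position out) and the END-token handling are assumptions for illustration:

```python
import torch

@torch.no_grad()
def generate(model, prefix_ids, max_new_tokens, end_id):
    """Greedily extend a prefix, one word at a time, with the pretrained LM head."""
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))    # (1, len(ids), vocab)
        next_id = int(logits[0, -1].argmax())  # most likely next word given the past
        ids.append(next_id)
        if next_id == end_id:                  # stop once END is produced
            break
    return ids
```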

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

2018's GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges.
• Trained on BooksCorpus: over 7000 unique books.
  • Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

[Devlin et al., 2018]

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: label pairs of sentences as entailing / contradictory / neutral.

  Premise: The man is in the doorway
  Hypothesis: The person is near the door
  Label: entailment

Radford et al., 2018 evaluate on natural language inference. Here's roughly how the input was formatted, as a sequence of tokens for the decoder:

  [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token.
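A small sketch of that formatting step; the function name is illustrative, and the bracketed strings stand in for whatever special tokens the tokenizer actually uses:

```python
def format_nli_example(premise, hypothesis):
    """Pack an NLI sentence pair into one token sequence for the decoder, GPT-style."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli_example("The man is in the doorway", "The person is near the door")
# -> "[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]"
# The linear classifier is then applied to the decoder's representation of [EXTRACT].
```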

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

GPT results on various natural language inference datasets.

Increasingly convincing generations (GPT-2) [Radford et al., 2019]

We mentioned how pretrained decoders can be used in their capacities as language models. GPT-2, a larger version of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining encoders: what pretraining objective to use?

So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

  h_1, …, h_T = Encoder(w_1, …, w_T)
  y_i ∼ A h_i + b

Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM.

[Figure: the encoder reads "I [M] to the [M]" and, through a linear layer (A, b) on each h_i, predicts the masked words "went" and "store".]

[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers

Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT (a sketch of the corruption rule is given below):
• Predict a random 15% of (sub)word tokens.
  • Replace input word with [MASK] 80% of the time
  • Replace input word with a random token 10% of the time
  • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)

[Figure: the Transformer Encoder reads "I pizza to the [M]" (where "went" was replaced by "pizza", "to" was left unchanged, and "store" was masked) and is trained to predict "went", "to", "store" at those positions.]

[Devlin et al., 2018]
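Here is a minimal sketch of that 15% selection and 80/10/10 corruption rule applied to a list of token ids; the argument names and the use of Python's `random` module are illustrative:

```python
import random

def mask_for_bert(token_ids, mask_id, vocab_size, select_prob=0.15):
    """Return (corrupted_ids, target_positions) under BERT's Masked LM rule."""
    corrupted, targets = list(token_ids), []
    for i in range(len(token_ids)):
        if random.random() >= select_prob:
            continue                      # not selected: no loss at this position
        targets.append(i)                 # loss is only added at selected positions
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_id        # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)   # 10%: random token
        # remaining 10%: leave the input token unchanged (but still predict it)
    return corrupted, targets
```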

BERT: Bidirectional Encoder Representations from Transformers

• The pretraining input to BERT was two separate contiguous chunks of text.
• BERT was trained to predict whether one chunk follows the other or is randomly sampled.
• Later work has argued this "next sentence prediction" is not necessary.

[Devlin et al., 2018, Liu et al., 2019]

BERT: Bidirectional Encoder Representations from Transformers

Details about BERT:
• Two models were released:
  • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
  • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
  • BooksCorpus (800 million words)
  • English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
  • BERT was pretrained with 64 TPU chips for a total of 4 days.
  • (TPUs are special tensor operation acceleration hardware.)
• Finetuning is practical and common on a single GPU.
  • "Pretrain once, finetune many times."

[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers

BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-the-art results on a broad range of tasks.

• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft paraphrase corpus
• RTE: a small natural language inference corpus

[Devlin et al., 2018]

Limitations of pretrained encoders

Those results looked great! Why not use pretrained encoders for everything?

If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don't naturally lead to nice autoregressive (1-word-at-a-time) generation methods.

[Figure: a Pretrained Decoder reads "Iroh goes to make tasty tea" and generates "goes to make tasty tea END" one word at a time, while a Pretrained Encoder (BERT) reads "Iroh goes to [MASK] tasty tea" and predicts "make/brew/craft" only for the masked position.]

Extensions of BERT

You'll see a lot of BERT variants like RoBERTa, SpanBERT, +++

Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task.

[Figure: for "It's irresistibly good", BERT masks scattered subwords ("[MASK] irr## esi## sti## [MASK] good", predicting "It's" and "bly"), while SpanBERT masks a contiguous span ("It' [MASK] [MASK] [MASK] [MASK] good", predicting "irr## esi## sti## bly").]

[Liu et al., 2019; Joshi et al., 2020]

Extensions of BERT

A takeaway from the RoBERTa paper: more compute and more data can improve pretraining even when not changing the underlying Transformer encoder.

[Liu et al., 2019; Joshi et al., 2020]

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining encoder-decoders: what pretraining objective to use?

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:

  h_1, …, h_T = Encoder(w_1, …, w_T)
  h_{T+1}, …, h_{2T} = Decoder(w_{T+1}, …, w_{2T}, h_1, …, h_T)
  y_i ∼ A h_i + b,  i > T

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

[Figure: the encoder reads the prefix w_1, …, w_T; the decoder reads w_{T+1}, …, w_{2T} and predicts w_{T+2}, ….]

[Raffel et al., 2018]

Pretraining encoder-decoders: what pretraining objective to use?

What Raffel et al., 2018 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.
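A rough sketch of that preprocessing step on whitespace tokens. The sentinel strings (e.g. "<extra_id_0>") follow common T5 implementations, and the span-sampling details here are simplified assumptions, not the paper's exact procedure:

```python
import random

def span_corrupt(tokens, num_spans=2, max_span_len=3):
    """Replace a few random spans with unique sentinels; return (input_text, target_text)."""
    tokens = list(tokens)
    inputs, targets, sentinel = [], [], 0
    # Pick candidate span start positions up front (simplified; real T5 controls
    # the corruption rate and average span length more carefully).
    starts = sorted(random.sample(range(len(tokens)), k=min(num_spans, len(tokens))))
    i = 0
    while i < len(tokens):
        if starts and i == starts[0]:
            starts.pop(0)
            length = random.randint(1, max_span_len)
            placeholder = f"<extra_id_{sentinel}>"
            inputs.append(placeholder)            # unique placeholder stays in the input
            targets.append(placeholder)
            targets.extend(tokens[i:i + length])  # the decoder must emit the removed span
            sentinel += 1
            i += length
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

print(span_corrupt("Thank you for inviting me to your party last week".split()))
```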

Pretraining encoder-decoders: what pretraining objective to use?

Raffel et al., 2018 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.

NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA. All "open-domain" versions.

[Figure: results for T5 models with 220 million, 770 million, 3 billion, and 11 billion params.]

[Raffel et al., 2018]

Outline

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

What kinds of things does pretraining learn?

There's increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language. Taking our examples from the start of class:

• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman
