
Natural Language Processing
with Deep Learning

CS224N / Ling284

John Hewitt

Lecture 10: Pretraining

Lecture Plan

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Reminders:

Assignment 5 is out today! It covers lecture 9 (Tuesday) and lecture 10 (today)! It has ~pedagogically relevant math~ so get started!

Word structure and subword models

Let's take a look at the assumptions we've made about a language's vocabulary.

We assume a fixed vocab of tens of thousands of words, built from the training set. All novel words seen at test time are mapped to a single UNK.

[Figure: word → vocab mapping → embedding]
  Common words:              hat, learn        → pizza (index), tasty (index)
  Variations, misspellings:  taaaaasty, laern  → UNK (index), UNK (index)
  Novel items:               Transformerify    → UNK (index)

Word structure and subword models

Finite vocabulary assumptions make even less sense in many languages.

• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information. (Tense, mood, definiteness, negation, information about the object, ++)

Here's a small fraction of the conjugations for ambia – to tell.

[Wiktionary]

The byte-pair encoding algorithm

Subword modeling in NLP encompasses a wide range of methods for reasoning about structure below the word level. (Parts of words, characters, bytes.)

• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.

Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary:

1. Start with a vocabulary containing only characters and an "end-of-word" symbol.
2. Using a corpus of text, find the most common adjacent characters "a, b"; add "ab" as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained models.

[Sennrich et al., 2016, Wu et al., 2016]
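To make the merge loop concrete, here is a minimal sketch of BPE vocabulary learning over a toy word-frequency dictionary. The function name, the corpus, and the end-of-word string are illustrative, not from the lecture.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a dict mapping words to corpus frequencies."""
    # Start from characters plus an end-of-word symbol (step 1 above).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus (step 2).
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged subword (step 3).
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy usage: frequent pairs like ("t", "a") and ("s", "t") get merged first.
print(learn_bpe({"tasty": 5, "taste": 3, "taa": 2}, num_merges=4))
```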

Word structure and subword models

Common words end up being a part of the subword vocabulary, while rarer words are split into (sometimes intuitive, sometimes not) components.

In the worst case, words are split into as many subwords as they have characters.

[Figure: word → vocab mapping → embedding]
  hat            → hat                  [common word]
  learn          → learn                [common word]
  taaaaasty      → taa## aaa## sty      [variation]
  laern          → la## ern##           [misspelling]
  Transformerify → Transformer## ify    [novel item]

Outline

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Motivating word meaning and context

Recall the adage we mentioned at the beginning of the course:

"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)

This quote is a summary of distributional semantics, and motivated word2vec. But:

"… the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously." (J. R. Firth 1935)

Consider I record the record: the two instances of record mean different things.

[Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]

Where we were: pretrained word embeddings

Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.

Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!

[Figure: a not-pretrained LSTM/Transformer on top of pretrained word embeddings, reading "… the movie was …".]

[Recall, movie gets the same word embedding, no matter what sentence it shows up in.]

Where we're going: pretraining whole models

In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.

This has been exceptionally effective at building strong:
• representations of language
• parameter initializations for strong NLP models
• probability distributions over language that we can sample from

[Figure: the whole network is pretrained jointly, reading "… the movie was …".]

[This model has learned how to represent entire sentences through pretraining.]

What can we learn from reconstructing the input?

• Stanford University is located in __________, California.
• I put ___ fork down on the table.
• The woman walked across the street, checking for traffic over ___ shoulder.
• I went to the ocean to see the fish, turtles, seals, and _____.
• Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___.
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ______.
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____

The Transformer Encoder-Decoder [Vaswani et al., 2017]

Looking back at the whole model, zooming in on an Encoder block:

[Figure: the input sequence enters Word Embeddings + Position Representations, then a stack of Transformer Encoder blocks, each consisting of Multi-Head Attention, Residual + LayerNorm, Feed-Forward, Residual + LayerNorm. The decoder attends to the encoder states.]

Looking back at the whole model, zooming in on a Decoder block:

[Figure: the output sequence enters Word Embeddings + Position Representations, then a stack of Transformer Decoder blocks, each consisting of Masked Multi-Head Self-Attention, Residual + LayerNorm, Multi-Head Cross-Attention, Residual + LayerNorm, Feed-Forward, Residual + LayerNorm, leading to the predictions.]

The only new part is attention from decoder to encoder. Like we saw last week!

Pretraining through language modeling [Dai and Le, 2015]

Recall the language modeling task:
• Model p_θ(w_t | w_{1:t−1}), the probability distribution over words given their past contexts.
• There's lots of data for this! (In English.)

Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.

[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]
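As a minimal sketch of what "train a neural network to perform language modeling" looks like in code, here is the per-batch loss for a decoder. The PyTorch-style `model` interface (token ids in, next-token logits out) is an assumption for illustration, not the lecture's implementation.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Cross-entropy for predicting each word from its past context.

    token_ids: LongTensor of shape (batch, seq_len).
    model(inputs) is assumed to return logits of shape (batch, seq_len - 1, vocab).
    """
    inputs = token_ids[:, :-1]   # w_1 ... w_{T-1}
    targets = token_ids[:, 1:]   # w_2 ... w_T  (each word predicted from its past)
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Pretraining loop (schematic): minimize this loss on a large amount of text,
# then save the network parameters for later finetuning.
```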

The Pretraining / Finetuning Paradigm

Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!
[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]

Step 2: Finetune (on your task). Not many labels; adapt to the task!
[Figure: the same Decoder reads "… the movie was …" and predicts a task label ☺/☹.]

Stochastic gradient descent and pretrain/finetune

Why should pretraining and finetuning help, from a "training neural nets" perspective?

• Consider: pretraining provides parameters θ̂ by approximating min_θ L_pretrain(θ).
  • (The pretraining loss.)
• Then, finetuning approximates min_θ L_finetune(θ), starting at θ̂.
  • (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to θ̂ during finetuning.
  • So, maybe the finetuning local minima near θ̂ tend to generalize well!
  • And/or, maybe the gradients of finetuning loss near θ̂ propagate nicely!
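Written out, the two-stage view above is just two optimization problems, the second initialized at the (approximate) solution of the first. This is a restatement of the bullets, not additional material:

```latex
\[
\hat{\theta} \;\approx\; \arg\min_{\theta} \, \mathcal{L}_{\text{pretrain}}(\theta),
\qquad
\theta^{\ast} \;\approx\; \arg\min_{\theta} \, \mathcal{L}_{\text{finetune}}(\theta)
\quad \text{with SGD initialized at } \hat{\theta}.
\]
```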

Lecture Plan

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining decoders

When using language-model-pretrained decoders, we can ignore that they were trained to model p_θ(w_t | w_{1:t−1}).

We can finetune them by training a classifier on the last word's hidden state:

  h_1, …, h_T = Decoder(w_1, …, w_T)
  y ∼ A h_T + b

where A and b are randomly initialized and specified by the downstream task. Gradients backpropagate through the whole network.

[Figure: a linear layer (A, b) applied to the decoder's last hidden state; the input is w_1, …, w_T and the output is a task label ☺/☹.]

[Note how the linear layer hasn't been pretrained and must be learned from scratch.]
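A minimal sketch of this finetuning setup, assuming a PyTorch-style pretrained decoder that returns its hidden states h_1, …, h_T; the class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    """Linear classifier on the last word's hidden state of a pretrained decoder."""

    def __init__(self, pretrained_decoder, hidden_dim, num_classes):
        super().__init__()
        self.decoder = pretrained_decoder                      # parameters from pretraining
        self.classifier = nn.Linear(hidden_dim, num_classes)   # A, b: randomly initialized

    def forward(self, token_ids):
        hidden = self.decoder(token_ids)   # (batch, seq_len, hidden_dim) = h_1, ..., h_T
        h_last = hidden[:, -1, :]          # h_T, the last word's hidden state
        return self.classifier(h_last)     # logits for y ~ A h_T + b

# During finetuning, the task loss backpropagates through the classifier and
# the whole pretrained decoder (nothing is frozen by default).
```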

Pretraining decoders

It's natural to pretrain decoders as language models and then use them as generators, finetuning their p_θ(w_t | w_{1:t−1})!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)

  h_1, …, h_T = Decoder(w_1, …, w_T)
  w_t ∼ A h_{t−1} + b

where A, b were pretrained in the language model!

[Figure: the decoder reads w_1 … w_5 and predicts w_2 … w_6, one step ahead at each position.]

[Note how the linear layer has been pretrained.]
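As a sketch of the "use them as generators" part, here is greedy autoregressive decoding from such a model, taking the argmax of p_θ(w_t | w_{1:t−1}) at each step. The `model` interface (token ids in, one logit vector per position out) and the END-token handling are assumptions for illustration:

```python
import torch

@torch.no_grad()
def generate(model, prefix_ids, max_new_tokens, end_id):
    """Greedily extend a prefix, one word at a time, with the pretrained LM head."""
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))    # (1, len(ids), vocab)
        next_id = int(logits[0, -1].argmax())  # most likely next word given the past
        ids.append(next_id)
        if next_id == end_id:                  # stop once END is produced
            break
    return ids
```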

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

2018's GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges.
• Trained on BooksCorpus: over 7000 unique books.
  • Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym "GPT" never showed up in the original paper; it could stand for "Generative PreTraining" or "Generative Pretrained Transformer".

[Devlin et al., 2018]

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: label pairs of sentences as entailing / contradictory / neutral.

  Premise: The man is in the doorway
  Hypothesis: The person is near the door
  Label: entailment

Radford et al., 2018 evaluate on natural language inference. Here's roughly how the input was formatted, as a sequence of tokens for the decoder:

  [START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token.
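A small sketch of that formatting step; the function name is illustrative, and the bracketed strings stand in for whatever special tokens the tokenizer actually uses:

```python
def format_nli_example(premise, hypothesis):
    """Pack an NLI sentence pair into one token sequence for the decoder, GPT-style."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

text = format_nli_example("The man is in the doorway", "The person is near the door")
# -> "[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]"
# The linear classifier is then applied to the decoder's representation of [EXTRACT].
```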

Generative Pretrained Transformer (GPT) [Radford et al., 2018]

GPT results on various natural language inference datasets.

Increasingly convincing generations (GPT-2) [Radford et al., 2019]

We mentioned how pretrained decoders can be used in their capacities as language models. GPT-2, a larger version of GPT trained on more data, was shown to produce relatively convincing samples of natural language.

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining encoders: what pretraining objective to use?

So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

  h_1, …, h_T = Encoder(w_1, …, w_T)
  y_i ∼ A h_i + b

Only add loss terms from words that are "masked out." If x̃ is the masked version of x, we're learning p_θ(x | x̃). Called Masked LM.

[Figure: the encoder reads "I [M] to the [M]" and, through a linear layer (A, b) on each h_i, predicts the masked words "went" and "store".]

[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers

Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT (a sketch of the corruption rule is given below):
• Predict a random 15% of (sub)word tokens.
  • Replace input word with [MASK] 80% of the time
  • Replace input word with a random token 10% of the time
  • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)

[Figure: the Transformer Encoder reads "I pizza to the [M]" (where "went" was replaced by "pizza", "to" was left unchanged, and "store" was masked) and is trained to predict "went", "to", "store" at those positions.]

[Devlin et al., 2018]
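Here is a minimal sketch of that 15% selection and 80/10/10 corruption rule applied to a list of token ids; the argument names and the use of Python's `random` module are illustrative:

```python
import random

def mask_for_bert(token_ids, mask_id, vocab_size, select_prob=0.15):
    """Return (corrupted_ids, target_positions) under BERT's Masked LM rule."""
    corrupted, targets = list(token_ids), []
    for i in range(len(token_ids)):
        if random.random() >= select_prob:
            continue                      # not selected: no loss at this position
        targets.append(i)                 # loss is only added at selected positions
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_id        # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)   # 10%: random token
        # remaining 10%: leave the input token unchanged (but still predict it)
    return corrupted, targets
```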

BERT: Bidirectional Encoder Representations from Transformers

• The pretraining input to BERT was two separate contiguous chunks of text.
• BERT was trained to predict whether one chunk follows the other or is randomly sampled.
• Later work has argued this "next sentence prediction" is not necessary.

[Devlin et al., 2018, Liu et al., 2019]

BERT: Bidirectional Encoder Representations from Transformers

Details about BERT:
• Two models were released:
  • BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
  • BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
  • BooksCorpus (800 million words)
  • English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
  • BERT was pretrained with 64 TPU chips for a total of 4 days.
  • (TPUs are special tensor operation acceleration hardware.)
• Finetuning is practical and common on a single GPU.
  • "Pretrain once, finetune many times."

[Devlin et al., 2018]

BERT: Bidirectional Encoder Representations from Transformers

BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-the-art results on a broad range of tasks.

• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: corpus of linguistic acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft paraphrase corpus
• RTE: a small natural language inference corpus

[Devlin et al., 2018]

Limitations of pretrained encoders

Those results looked great! Why not use pretrained encoders for everything?

If your task involves generating sequences, consider using a pretrained decoder; BERT and other pretrained encoders don't naturally lead to nice autoregressive (1-word-at-a-time) generation methods.

[Figure: a Pretrained Decoder reads "Iroh goes to make tasty tea" and generates "goes to make tasty tea END" one word at a time, while a Pretrained Encoder (BERT) reads "Iroh goes to [MASK] tasty tea" and predicts "make/brew/craft" only for the masked position.]

Extensions of BERT

You'll see a lot of BERT variants like RoBERTa, SpanBERT, +++

Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task.

[Figure: for "It's irresistibly good", BERT masks scattered subwords ("[MASK] irr## esi## sti## [MASK] good", predicting "It's" and "bly"), while SpanBERT masks a contiguous span ("It' [MASK] [MASK] [MASK] [MASK] good", predicting "irr## esi## sti## bly").]

[Liu et al., 2019; Joshi et al., 2020]

Extensions of BERT

A takeaway from the RoBERTa paper: more compute and more data can improve pretraining even when not changing the underlying Transformer encoder.

[Liu et al., 2019; Joshi et al., 2020]

Pretraining for three types of architectures

The neural architecture influences the type of pretraining, and natural use cases.

Decoders
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.

Encoders
• Gets bidirectional context – can condition on future!
• Wait, how do we pretrain them?

Encoder-Decoders
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Pretraining encoder-decoders: what pretraining objective to use?

For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:

  h_1, …, h_T = Encoder(w_1, …, w_T)
  h_{T+1}, …, h_{2T} = Decoder(w_{T+1}, …, w_{2T}, h_1, …, h_T)
  y_i ∼ A h_i + b,  i > T

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

[Figure: the encoder reads the prefix w_1, …, w_T; the decoder reads w_{T+1}, …, w_{2T} and predicts w_{T+2}, ….]

[Raffel et al., 2018]

Pretraining encoder-decoders: what pretraining objective to use?

What Raffel et al., 2018 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.
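A rough sketch of that preprocessing step on whitespace tokens. The sentinel strings (e.g. "<extra_id_0>") follow common T5 implementations, and the span-sampling details here are simplified assumptions, not the paper's exact procedure:

```python
import random

def span_corrupt(tokens, num_spans=2, max_span_len=3):
    """Replace a few random spans with unique sentinels; return (input_text, target_text)."""
    tokens = list(tokens)
    inputs, targets, sentinel = [], [], 0
    # Pick candidate span start positions up front (simplified; real T5 controls
    # the corruption rate and average span length more carefully).
    starts = sorted(random.sample(range(len(tokens)), k=min(num_spans, len(tokens))))
    i = 0
    while i < len(tokens):
        if starts and i == starts[0]:
            starts.pop(0)
            length = random.randint(1, max_span_len)
            placeholder = f"<extra_id_{sentinel}>"
            inputs.append(placeholder)            # unique placeholder stays in the input
            targets.append(placeholder)
            targets.extend(tokens[i:i + length])  # the decoder must emit the removed span
            sentinel += 1
            i += length
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

print(span_corrupt("Thank you for inviting me to your party last week".split()))
```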

Pretraining encoder-decoders: what pretraining objective to use?

Raffel et al., 2018 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.

NQ: Natural Questions; WQ: WebQuestions; TQA: TriviaQA. All "open-domain" versions.

[Figure: results for T5 models with 220 million, 770 million, 3 billion, and 11 billion params.]

[Raffel et al., 2018]

Outline

1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
   1. Decoders
   2. Encoders
   3. Encoder-Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

What kinds of things does pretraining learn?

There's increasing evidence that pretrained models learn a wide variety of things about the statistical properties of language. Taking our examples from the start of class:

• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman
