
Large Language Models

Introduction to Large Language Models

Language models

• Remember the simple n-gram language model
• Assigns probabilities to sequences of words
• Generates text by sampling possible next words
• Is trained on counts computed from lots of text

• Large language models are similar and different:
• Assign probabilities to sequences of words
• Generate text by sampling possible next words
• Are trained by learning to guess the next word

Large language models

• Even though pretrained only to predict words
• Learn a lot of useful language knowledge
• Since they train on a lot of text

Three architectures for large language models

Decoders: GPT, Claude, Llama, Mixtral
Encoders: BERT family, HuBERT
Encoder-decoders: Flan-T5, Whisper

Encoders

Many varieties!
• Popular: Masked Language Models (MLMs)
• BERT family
• Trained by predicting words from surrounding words on both sides
• Are usually finetuned (trained on supervised data) for classification tasks.

Encoder-Decoders

• Trained to map from one sequence to another
• Very popular for:
• machine translation (map from one language to another)
• speech recognition (map from acoustics to words)

Large Language Models

Introduction to Large Language Models

Large Language Models

Large Language Models: What tasks can they do?

Big idea

Many tasks can be turned into tasks of predicting words!

This lecture: decoder-only models

Also called:
• Causal LLMs
• Autoregressive LLMs
• Left-to-right LLMs
• Predict words left to right

Conditional Generation: Generating text conditioned on previous text!

[Figure: Conditional generation with a decoder-only transformer. The prefix text "So long and thanks for" is turned into input encodings (E + position i), passed through the transformer blocks, and the language modeling head (unembedding layer U → logits → softmax) predicts the completion text word by word: "all", "the", ...]

Many practical NLP tasks can be cast as word prediction!

Sentiment analysis: "I like Jackie Chan"

1. We give the language model this string:
   The sentiment of the sentence "I like Jackie Chan" is:

2. And see what word it thinks comes next:
   P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
   P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
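As a concrete illustration (not from the slides), here is a minimal Python sketch of this framing. It assumes a hypothetical helper next_word_probs(prompt) that returns the model's probability for each candidate next word.

```python
# Hypothetical sketch: sentiment analysis as next-word prediction.
# `next_word_probs(prompt)` is an assumed helper returning a dict
# {candidate next word: probability} under some causal LM.

def classify_sentiment(sentence, next_word_probs):
    prompt = f'The sentiment of the sentence "{sentence}" is:'
    probs = next_word_probs(prompt)
    p_pos = probs.get(" positive", 0.0)   # P(positive | prompt)
    p_neg = probs.get(" negative", 0.0)   # P(negative | prompt)
    return "positive" if p_pos > p_neg else "negative"
```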

Framing lots of tasks as conditional generation

QA: "Who wrote The Origin of Species?"

1. We give the language model this string:
   Q: Who wrote the book "The Origin of Species"? A:

2. And see what word it thinks comes next:
   P(w | Q: Who wrote the book "The Origin of Species"? A:)

3. And iterate:
   P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
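The "iterate" step can be sketched as a simple greedy loop, again assuming the same hypothetical next_word_probs helper and an assumed end-of-sequence marker:

```python
# Hypothetical sketch: answer a question by repeatedly appending the
# most probable next word to the prompt (greedy decoding).

def answer(question, next_word_probs, max_words=20, eos="</s>"):
    prompt = f'Q: {question} A:'
    answer_words = []
    for _ in range(max_words):
        probs = next_word_probs(prompt)
        word = max(probs, key=probs.get)   # most probable next word
        if word == eos:                    # assumed end-of-sequence marker
            break
        prompt += word                     # iterate: condition on it next time
        answer_words.append(word)
    return "".join(answer_words).strip()
```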

Summarization

Original:
The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says.

But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowY. "We're in the business of expunging snow!"

His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.

According to B, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]

Summary:
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.

LLMs for summarization (using tl;dr)

[Figure: The original story ("The only ... idea was born."), followed by the delimiter "tl;dr", is given as the prefix; the model's LM head (U) then generates the summary token by token: "Kyle Waring will ..."]

Large Language Models

Large Language Models: What tasks can they do?

Large Language Models

Sampling for LLM Generation

Decoding and Sampling

This task of choosing a word to generate based on the model's probabilities is called decoding.

The most common method for decoding in LLMs: sampling. Sampling from a model's distribution over words:
• choose random words according to the probability the model assigns to them.

After each token, we'll sample the next word to generate according to its probability conditioned on our previous choices,
• and a transformer language model gives us exactly this probability.

Random sampling

i ← 1
wi ~ p(w)
while wi != EOS
    i ← i + 1
    wi ~ p(wi | w<i)
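A runnable Python version of this loop, assuming a hypothetical next_word_dist(tokens) function that returns the model's distribution p(wi | w<i) as a dict from word to probability:

```python
import numpy as np

# Random (ancestral) sampling: keep sampling the next word from the
# model's conditional distribution until EOS is produced.
def random_sample(next_word_dist, eos="</s>", max_len=100):
    tokens = []
    for _ in range(max_len):
        dist = next_word_dist(tokens)                   # p(w_i | w_<i), assumed helper
        words, probs = zip(*dist.items())
        w = np.random.choice(words, p=np.array(probs))  # w_i ~ p(w_i | w_<i)
        if w == eos:
            break
        tokens.append(str(w))
    return tokens
```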

Random sampling doesn't work very well

Even though random sampling mostly generates sensible, high-probability words,
there are many odd, low-probability words in the tail of the distribution.
Each one is low-probability, but added up they constitute a large portion of the distribution.
So they get picked often enough to generate weird sentences.

Factors in word sampling: quality and diversity

Emphasize high-probability words:
+ quality: more accurate, coherent, and factual
- diversity: boring, repetitive

Emphasize middle-probability words:
+ diversity: more creative, diverse
- quality: less factual, incoherent

Top-k sampling:

1. Choose # of words k
2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context p(wt | w<t)
3. Sort the words by likelihood, and keep only the top k most probable words.
4. Renormalize the scores of the k words to be a legitimate probability distribution.
5. Randomly sample a word from within these remaining k most-probable words according to its probability.
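A sketch of these five steps in Python, operating on a vector of next-word probabilities (one entry per word of the vocabulary); the interface is illustrative, not from the slides:

```python
import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]             # keep the k most probable word ids
    renorm = probs[top] / probs[top].sum()   # renormalize to a distribution
    return rng.choice(top, p=renorm)         # sample a word id from the top k
```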

Top-p sampling

Holtzman et al., 2020

Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations.

Idea: Instead, keep the top p percent of the probability mass:

    Σ_{w ∈ V(p)} P(w | w<t) ≥ p
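A corresponding sketch for top-p (nucleus) sampling: sort by probability, keep the smallest prefix whose cumulative mass reaches p, renormalize, and sample:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=np.random.default_rng()):
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # word ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1      # smallest prefix with mass >= p
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=renorm)
```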

Temperature sampling

Reshape the distribution instead of truncating it. Intuition from thermodynamics:
• a system at high temperature is flexible and can explore many possible states,
• a system at lower temperature is likely to explore a subset of lower-energy (better) states.

In low-temperature sampling (τ ≤ 1) we smoothly
• increase the probability of the most probable words
• decrease the probability of the rare words.

Temperature sampling

Divide the logit by a temperature parameter τ before passing it through the softmax.

Instead of y = softmax(u)
we do y = softmax(u/τ),    0 < τ ≤ 1

Temperature sampling

y = softmax(u/τ)

Why does this work?
• When τ is close to 1 the distribution doesn't change much.
• The lower τ is, the larger the scores being passed to the softmax.
• Softmax pushes high values toward 1 and low values toward 0.
• Large inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
• As τ approaches 0, the probability of the most likely word approaches 1.
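A small sketch of temperature sampling over a vector of logits u, computing y = softmax(u/τ) and sampling from it; the shift by the maximum is a standard numerical-stability trick, not something the slides discuss:

```python
import numpy as np

def temperature_sample(logits, tau=0.7, rng=np.random.default_rng()):
    u = np.asarray(logits, dtype=float) / tau   # divide logits by temperature
    u = u - u.max()                             # numerical stability only
    y = np.exp(u) / np.exp(u).sum()             # y = softmax(u / tau)
    return rng.choice(len(y), p=y)              # sample a word id from y
```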

Large Language Models

Sampling for LLM Generation

Large Language Models

Pretraining Large Language Models: Algorithm

Pretraining

The big idea that underlies all the amazing performance of language models:

First pretrain a transformer model on enormous amounts of text.
Then apply it to new tasks.

Self-supervised training algorithm

We just train them to predict the next word!

1. Take a corpus of text
2. At each time step t
   i. ask the model to predict the next word
   ii. train the model using gradient descent to minimize the error in this prediction

"Self-supervised" because it just uses the next word as the label!

Intuition of language model training: loss

• Same loss function: cross-entropy loss
• We want the model to assign a high probability to the true word w
• = want loss to be high if the model assigns too low a probability to w
• CE loss: the negative log probability that the model assigns to the true next word w
• If the model assigns too low a probability to w
• we move the model weights in the direction that assigns a higher probability to w

Cross-entropy loss for language modeling

    L_CE(ŷ_t, y_t) = − Σ_{w ∈ V} y_t[w] log ŷ_t[w]

The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for the others.

So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:

    L_CE(ŷ_t, y_t) = − log ŷ_t[w_{t+1}]

Teacher forcing

• At each token position t, the model sees the correct tokens w1:t,
• computes the loss (−log probability) for the next token wt+1.
• At the next token position t+1 we ignore what the model predicted for wt+1.
• Instead we take the correct word wt+1, add it to the context, and move on.
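A minimal PyTorch sketch of one teacher-forcing training step; it assumes `model` maps token ids of shape [batch, seq] to logits of shape [batch, seq, vocab] (that interface is an assumption, not from the slides):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    inputs  = token_ids[:, :-1]               # model sees the correct tokens w_1:t
    targets = token_ids[:, 1:]                # and must predict w_t+1 at each position
    logits = model(inputs)                    # [batch, seq-1, vocab]
    loss = F.cross_entropy(                   # -log p(correct next word), averaged
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                           # move weights toward higher p(correct word)
    optimizer.step()
    return loss.item()
```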

Training a transformer language model

[Figure: The input tokens "So long and thanks for" are converted to input encodings (E + position), passed through the stacked transformer blocks and the language modeling head (U → logits → softmax). At each position the loss is the negative log probability the model assigns to the true next token ("long", "and", "thanks", "for", "all"), e.g. −log y_and, −log y_thanks.]

Large Language Models

Pretraining Large Language Models: Algorithm

Large Language Models

Pretraining data for LLMs

LLMs are mainly trained on the web

Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages

Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered

What's in it? Mostly patent text documents, Wikipedia, and news sites

The Pile: a pretraining corpus

[Figure: composition of The Pile: academic text, web text, books, dialog.]

Filtering for quality and safety

Quality is subjective
• Many LLMs attempt to match Wikipedia, books, particular websites
• Need to remove boilerplate, adult content
• Deduplication at many levels (URLs, documents, even lines)

Safety is also subjective
• Toxicity detection is important, although that has mixed results
• Can mistakenly flag data written in dialects like African American English

What does a model learn from pretraining?

• There are canines everywhere! One dog in the front room, and two dogs
• It wasn't just big it was enormous
• The author of "A Room of One's Own" is Virginia Woolf
• The doctor told me that he
• The square root of 4 is 2

Big idea

Text contains enormous amounts of knowledge.
Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much.

But there are problems with scraping from the web

Copyright: much of the text in these datasets is copyrighted
• Not clear if the fair use doctrine in the US allows for this use
• This remains an open legal question

Data consent
• Website owners can indicate they don't want their site crawled

Privacy:
• Websites can contain private IP addresses and phone numbers

Large Language Models

Pretraining data for LLMs

Finetuning

Large Language Models

Finetuning for adaptation to new domains

What happens if we need our LLM to work well on a domain it didn't see in pretraining?

Perhaps some specific medical or legal domain?

Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?

Finetuning

[Figure: Pretraining Data → Pretraining → Pretrained LM; then Fine-tuning Data → Fine-tuning → Fine-tuned LM.]

"Finetuning"means4differentthings

We'lldiscuss1here,and3inlaterlectures

Inallfourcases,finetuningmeans:

takingapretrainedmodelandfurtheradaptingsomeorallofitsparameterstosomenewdata

1.Finetuningas"continuedpretraining"onnewdata

•Furthertrainalltheparametersofmodelonnewdata

•usingthesamemethod(wordprediction)andlossfunction(cross-entropyloss)asforpretraining.

•asifthenewdatawereatthetailendofthepretrainingdata

•Hencesometimescalledcontinuedpretraining
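A hedged sketch of continued pretraining under the same assumptions as the training-step sketch above (a PyTorch causal LM `model` and an iterable `domain_batches` of token-id tensors from the new domain); the point is that only the data changes, not the objective:

```python
import torch
import torch.nn.functional as F

def continue_pretraining(model, domain_batches, lr=1e-5, epochs=1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # all parameters
    model.train()
    for _ in range(epochs):
        for token_ids in domain_batches:          # new-domain text, tokenized
            logits = model(token_ids[:, :-1])     # same word-prediction task
            loss = F.cross_entropy(               # same cross-entropy loss
                logits.reshape(-1, logits.size(-1)),
                token_ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```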

Finetuning

Large Language Models

Large Language Models

Evaluating Large Language Models

Perplexity

Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text.

The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.

For a test set of n tokens w1:n the perplexity is:

    Perplexity_θ(w1:n) = P_θ(w1:n)^(−1/n)

Why perplexity instead of raw probability of the test set?

• Probability depends on the size of the test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)

Probability range is [0,1], perplexity range is [1,∞]

Perplexity

• The higher the probability of the word sequence, the lower the perplexity.
• Thus the lower the perplexity of a model on the data, the better the model.
• Minimizing perplexity is the same as maximizing probability

Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer.
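A small sketch of the computation: given the per-token probabilities p(wi | w<i) that a model assigns to the test set, perplexity is the exponential of the average negative log probability, i.e. P(w1:n)^(−1/n):

```python
import math

def perplexity(token_probs):
    # token_probs: p(w_i | w_<i) for each of the n test-set tokens
    n = len(token_probs)
    avg_neg_logprob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_logprob)     # = P(w_1:n) ** (-1/n)

print(perplexity([0.2, 0.5, 0.1]))       # ≈ 4.6; lower is better
```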

Many other factors that we evaluate, like:

Size: Big models take lots of GPUs and time to train, and memory to store.

Energy usage: Can measure kWh or kilograms of CO2 emitted.

Fairness: Benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.

Large Language Models

Dealing with Scale

Scaling Laws

LLM performance depends on:

• Model size: the number of parameters, not counting embeddings
• Dataset size: the amount of training data
• Compute: the amount of compute (in FLOPs, etc.)

Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations.

The performance of a large language model (the loss) scales as a power-law with each of these three factors.

Scaling Laws

Loss L as a function of number of parameters N, dataset size D, or compute budget C (if the other two are held constant):

    L(N) = (N_c / N)^α_N
    L(D) = (D_c / D)^α_D
    L(C) = (C_c / C)^α_C

Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.

Number of non-embedding parameters N:

    N ≈ 12 n_layer d²

Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
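The arithmetic as a two-line check of the slide's numbers:

```python
def non_embedding_params(n_layer, d_model):
    return 12 * n_layer * d_model ** 2        # N ≈ 12 · n_layer · d^2

print(non_embedding_params(96, 12288))        # 173,946,175,488 ≈ 175 billion
```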

KV Cache

In training, we can compute attention very efficiently in parallel:
but not at inference! We generate the next tokens one at a time!

For a new token x, we need to multiply by W_Q, W_K, and W_V to get its query, key, and value.
But we don't want to recompute the key and value vectors for all the prior tokens x<i.
Instead, we store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache.
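A minimal single-head numpy sketch of the idea (shapes and names are illustrative, not the slides' notation): cached keys and values are appended to as generation proceeds, and only the new token's query, key, and value are computed at each step.

```python
import numpy as np

class KVCache:
    """Keys/values for one attention head, one row per previously seen token."""
    def __init__(self, d):
        self.keys = np.zeros((0, d))
        self.values = np.zeros((0, d))

    def attend(self, x, W_Q, W_K, W_V):
        q, k, v = x @ W_Q, x @ W_K, x @ W_V        # only for the new token x
        self.keys = np.vstack([self.keys, k])       # append to the cache
        self.values = np.vstack([self.values, v])   # instead of recomputing
        scores = self.keys @ q / np.sqrt(len(q))    # q · k_j for all cached keys
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax over prior tokens
        return weights @ self.values                # attention output for x
```

attend is called once per generated token, with that token's embedding and the head's weight matrices.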

KV Cache

[Figure: The attention computation at inference: Q (N×d_k) times Kᵀ (d_k×N) gives the N×N matrix of scores q_i·k_j, which is masked and multiplied by V (N×d_v) to give A (N×d_v). The keys k_1..k_4 and values v_1..v_4 for earlier tokens do not change as new tokens are generated, so they are read from the KV cache rather than recomputed.]
