Large Language Models
Introduction to Large Language Models
Language models
• Remember the simple n-gram language model
• Assigns probabilities to sequences of words
• Generates text by sampling possible next words
• Is trained on counts computed from lots of text
• Large language models are similar and different:
• Assign probabilities to sequences of words
• Generate text by sampling possible next words
• Are trained by learning to guess the next word
Large language models
• Even though pretrained only to predict words
• They learn a lot of useful language knowledge
• Since they are trained on a lot of text
Three architectures for large language models
Decoders: GPT, Claude, Llama, Mixtral
Encoders: BERT family, HuBERT
Encoder-decoders: Flan-T5, Whisper
Encoders
Many varieties!
• Popular: Masked Language Models (MLMs)
• BERT family
• Trained by predicting words from surrounding words on both sides
• Are usually fine-tuned (trained on supervised data) for classification tasks.
Encoder-Decoders
• Trained to map from one sequence to another
• Very popular for:
• machine translation (map from one language to another)
• speech recognition (map from acoustics to words)
Large Language Models
Introduction to Large Language Models

Large Language Models
Large Language Models: What tasks can they do?
Big idea
Many tasks can be turned into tasks of predicting words!
This lecture: decoder-only models
Also called:
• Causal LLMs
• Autoregressive LLMs
• Left-to-right LLMs
• Predict words left to right
Conditional Generation: Generating text conditioned on previous text!
[Figure: conditional generation with a causal LM. The prefix text "So long and thanks for" is embedded (token embedding E plus position), passed through the transformer blocks and the language modeling head (unembedding layer U, logits, softmax); completion tokens ("all", "the", ...) are sampled one at a time and appended to the context.]
Many practical NLP tasks can be cast as word prediction!
Sentiment analysis: "I like Jackie Chan"
1. We give the language model this string:
The sentiment of the sentence "I like Jackie Chan" is:
2. And see what word it thinks comes next:
P(positive | The sentiment of the sentence "I like Jackie Chan" is:)
P(negative | The sentiment of the sentence "I like Jackie Chan" is:)
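A minimal sketch of this framing, assuming the Hugging Face transformers library and a small GPT-2 checkpoint (any causal LM would do); for brevity it compares only the first subword of each candidate label:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = 'The sentiment of the sentence "I like Jackie Chan" is:'
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]      # scores for the next token
next_probs = torch.softmax(next_logits, dim=-1)

for label in [" positive", " negative"]:
    label_id = tok.encode(label)[0]             # first subword of the candidate label
    print(label, float(next_probs[label_id]))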
Framing lots of tasks as conditional generation
QA: "Who wrote The Origin of Species?"
1. We give the language model this string:
Q: Who wrote the book "The Origin of Species"? A:
2. And see what word it thinks comes next:
P(w | Q: Who wrote the book "The Origin of Species"? A:)
3. And iterate:
P(w | Q: Who wrote the book "The Origin of Species"? A: Charles)
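Continuing the same sketch (reusing tok and model from above), iterating the next-word prediction a few times answers the question; greedy argmax decoding is used here only to keep the example short:

prompt = 'Q: Who wrote the book "The Origin of Species"? A:'
ids = tok(prompt, return_tensors="pt").input_ids
for _ in range(5):                                   # generate a few tokens
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]
    next_id = next_logits.argmax()                   # most probable next word
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))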
Summarization
Original:
The only thing crazier than a guy in snowbound Massachusetts boxing up the powdery white stuff and offering it for sale online? People are actually buying it. For $89, self-styled entrepreneur Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says.
But not if you live in New England or surrounding states. "We will not ship snow to any states in the northeast!" says Waring's website, ShipSnowYo. "We're in the business of expunging snow!"
His website and social media accounts claim to have filled more than 133 orders for snow – more than 30 on Tuesday alone, his busiest day yet. With more than 45 total inches, Boston has set a record this winter for the snowiest month in its history. Most residents see the huge piles of snow choking their yards and sidewalks as a nuisance, but Waring saw an opportunity.
According to Boston.com, it all started a few weeks ago, when Waring and his wife were shoveling deep snow from their yard in Manchester-by-the-Sea, a coastal suburb north of Boston. He joked about shipping the stuff to friends and family in warmer states, and an idea was born. [...]
Summary:
Kyle Waring will ship you 6 pounds of Boston-area snow in an insulated Styrofoam box – enough for 10 to 15 snowballs, he says. But not if you live in New England or surrounding states.
LLMs for summarization (using tl;dr)
[Figure: summarization as conditional generation. The original story ("The only ... was born.") is followed by a delimiter token (tl;dr); the model then generates the summary ("Kyle Waring will ...") one token at a time, each token predicted by the LM head from the full preceding context.]
Large Language Models
Large Language Models: What tasks can they do?

Large Language Models
Sampling for LLM Generation
Decoding and Sampling
This task of choosing a word to generate based on the model's probabilities is called decoding.
The most common method for decoding in LLMs: sampling. Sampling from a model's distribution over words:
• choose random words according to the probability assigned by the model.
After each token, we'll sample the next word to generate according to its probability conditioned on our previous choices.
• A transformer language model will give us this probability.
Random sampling
i ← 1
w_i ~ p(w)
while w_i != EOS:
    i ← i + 1
    w_i ~ p(w_i | w_<i)
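The same loop in runnable form: a minimal PyTorch sketch, assuming a Hugging Face-style causal LM whose forward call returns .logits and a tokenizer with an end-of-sequence id (as in the sketches above):

import torch

def random_sampling(model, tok, prompt, max_len=100):
    # Ancestral sampling: draw each next token from the model's full distribution.
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_len):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]              # scores for the next token
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # w_i ~ p(w_i | w_<i)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tok.eos_token_id:             # stop at EOS
            break
    return tok.decode(ids[0])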
Random sampling doesn't work very well
Even though random sampling mostly generates sensible, high-probability words,
There are many odd, low-probability words in the tail of the distribution
Each one is low-probability, but added up they constitute a large portion of the distribution
So they get picked often enough to generate weird sentences
Factors in word sampling: quality and diversity
Emphasize high-probability words:
+ quality: more accurate, coherent, and factual; – diversity: boring, repetitive
Emphasize middle-probability words:
+ diversity: more creative, diverse; – quality: less factual, incoherent
Top-k sampling:
1. Choose # of words k
2. For each word in the vocabulary V, use the language model to compute the likelihood of this word given the context p(w_t | w_<t)
3. Sort the words by likelihood, keep only the top k most probable words.
4. Renormalize the scores of the k words to be a legitimate probability distribution.
5. Randomly sample a word from within these remaining k most-probable words according to its probability.
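A sketch of a single top-k step on the next-token logits (the numbered comments follow the steps above); taking the softmax over only the kept logits is one way to do the renormalization:

import torch

def sample_top_k(logits, k):
    topk_logits, topk_ids = torch.topk(logits, k)       # steps 2-3: keep top k by likelihood
    probs = torch.softmax(topk_logits, dim=-1)          # step 4: renormalize over the k words
    choice = torch.multinomial(probs, num_samples=1)    # step 5: sample by probability
    return topk_ids[choice]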
Top-p sampling
(Holtzman et al., 2020)
Problem with top-k: k is fixed, so it may cover very different amounts of probability mass in different situations
Idea: Instead, keep the top p percent of the probability mass:
\sum_{w \in V^{(p)}} P(w \mid w_{<t}) \ge p
where V^(p) is the smallest set of words covering probability mass p
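A corresponding sketch of one nucleus (top-p) step, keeping the smallest set of words whose cumulative probability reaches p:

import torch

def sample_top_p(logits, p):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1  # words needed to reach mass p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()         # renormalize over V^(p)
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_ids[choice]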
Temperature sampling
Reshape the distribution instead of truncating it.
Intuition from thermodynamics:
• a system at high temperature is flexible and can explore many possible states,
• a system at lower temperature is likely to explore a subset of lower-energy (better) states.
In low-temperature sampling (τ ≤ 1) we smoothly
• increase the probability of the most probable words
• decrease the probability of the rare words.
Temperature sampling
Divide the logit by a temperature parameter τ before passing it through the softmax.
Instead of y = softmax(u)
We do y = softmax(u/τ)
with 0 < τ ≤ 1
Temperature sampling
y = softmax(u/τ)
Why does this work?
• When τ is close to 1 the distribution doesn't change much.
• The lower τ is, the larger the scores being passed to the softmax.
• Softmax pushes high values toward 1 and low values toward 0.
• Larger inputs push high-probability words higher and low-probability words lower, making the distribution more greedy.
• As τ approaches 0, the probability of the most likely word approaches 1.
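A short sketch of the same idea: rescale the logits u by 1/τ before the softmax (τ = 1 leaves the distribution unchanged; smaller τ makes it greedier):

import torch

def sample_with_temperature(logits, tau=0.7):
    probs = torch.softmax(logits / tau, dim=-1)      # y = softmax(u / τ)
    return torch.multinomial(probs, num_samples=1)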
Large Language Models
Sampling for LLM Generation

Large Language Models
Pretraining Large Language Models: Algorithm
Pretraining
The big idea that underlies all the amazing performance of language models:
First pretrain a transformer model on enormous amounts of text
Then apply it to new tasks.
Self-supervised training algorithm
We just train them to predict the next word!
1. Take a corpus of text
2. At each time step t
i. ask the model to predict the next word
ii. train the model using gradient descent to minimize the error in this prediction
"Self-supervised" because it just uses the next word as the label!
Intuition of language model training: loss
• Same loss function: cross-entropy loss
• We want the model to assign a high probability to the true word w
• = want loss to be high if the model assigns too low a probability to w
• CE loss: the negative log probability that the model assigns to the true next word w
• If the model assigns too low a probability to w
• We move the model weights in the direction that assigns a higher probability to w
Cross-entropy loss for language modeling
L_{CE}(\hat{y}_t, y_t) = -\sum_{w \in V} y_t[w] \log \hat{y}_t[w]
The correct distribution y_t knows the next word, so it is 1 for the actual next word and 0 for the others.
So in this sum, all terms get multiplied by zero except one: the log probability the model assigns to the correct next word, so:
L_{CE}(\hat{y}_t, y_t) = -\log \hat{y}_t[w_{t+1}]
Teacher forcing
• At each token position t, the model sees the correct tokens w_1:t
• Computes the loss (−log probability) for the next token w_t+1
• At the next token position t+1 we ignore what the model predicted for w_t+1
• Instead we take the correct word w_t+1, add it to the context, and move on
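Putting the last few slides together, a minimal sketch of one training step with teacher forcing, assuming a Hugging Face-style causal LM (forward returns .logits), a PyTorch optimizer, and a batch of token ids:

import torch
import torch.nn.functional as F

def lm_training_step(model, optimizer, token_ids):
    # Teacher forcing: condition on the correct prefix, score the correct next token.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs).logits                             # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))               # mean of -log p(w_{t+1} | w_{1:t})
    optimizer.zero_grad()
    loss.backward()                                           # gradient descent on the CE loss
    optimizer.step()
    return loss.item()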
Training a transformer language model
[Figure: training by teacher forcing. Input tokens "So long and thanks for" are embedded (token embedding E plus position), passed through the stacked transformer blocks and the language modeling head (unembedding U, logits); at each position the loss is −log of the probability the model assigns to the correct next token ("long", "and", "thanks", "for", "all").]
Large Language Models
Pretraining Large Language Models: Algorithm

Large Language Models
Pretraining data for LLMs
LLMs are mainly trained on the web
Common Crawl: snapshots of the entire web produced by the non-profit Common Crawl, with billions of pages
Colossal Clean Crawled Corpus (C4; Raffel et al. 2020): 156 billion tokens of English, filtered
What's in it? Mostly patent text documents, Wikipedia, and news sites
The Pile: a pretraining corpus
[Figure: composition of The Pile, drawn from academic text, the web, books, and dialog.]
Filtering for quality and safety
Quality is subjective
• Many LLMs attempt to match Wikipedia, books, particular websites
• Need to remove boilerplate, adult content
• Deduplication at many levels (URLs, documents, even lines)
Safety is also subjective
• Toxicity detection is important, although that has mixed results
• Can mistakenly flag data written in dialects like African American English
What does a model learn from pretraining?
• There are canines everywhere! One dog in the front room, and two dogs
• It wasn't just big it was enormous
• The author of "A Room of One's Own" is Virginia Woolf
• The doctor told me that he
• The square root of 4 is 2
Big idea
Text contains enormous amounts of knowledge.
Pretraining on lots of text with all that knowledge is what gives language models their ability to do so much.
But there are problems with scraping from the web
Copyright: much of the text in these datasets is copyrighted
• Not clear if the fair use doctrine in the US allows for this use
• This remains an open legal question
Data consent
• Website owners can indicate they don't want their site crawled
Privacy:
• Websites can contain private IP addresses and phone numbers
Large Language Models
Pretraining data for LLMs

Large Language Models
Finetuning

Finetuning for adaptation to new domains
What happens if we need our LLM to work well on a domain it didn't see in pretraining?
Perhaps some specific medical or legal domain?
Or maybe a multilingual LM needs to see more data on some language that was rare in pretraining?
Finetuning
[Figure: pretraining on the pretraining data produces a pretrained LM; fine-tuning on the (smaller) fine-tuning data produces a fine-tuned LM.]
"Finetuning"means4differentthings
We'lldiscuss1here,and3inlaterlectures
Inallfourcases,finetuningmeans:
takingapretrainedmodelandfurtheradaptingsomeorallofitsparameterstosomenewdata
1.Finetuningas"continuedpretraining"onnewdata
•Furthertrainalltheparametersofmodelonnewdata
•usingthesamemethod(wordprediction)andlossfunction(cross-entropyloss)asforpretraining.
•asifthenewdatawereatthetailendofthepretrainingdata
•Hencesometimescalledcontinuedpretraining
Large Language Models
Finetuning

Large Language Models
Evaluating Large Language Models
Perplexity
Just as for n-gram grammars, we use perplexity to measure how well the LM predicts unseen text.
The perplexity of a model θ on an unseen test set is the inverse probability that θ assigns to the test set, normalized by the test set length.
For a test set of n tokens w_1:n the perplexity is:
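Written out, the definition just described (inverse probability of the test set, normalized by its length n):

\mathrm{perplexity}_{\theta}(w_{1:n}) = P_{\theta}(w_{1:n})^{-\frac{1}{n}} = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P_{\theta}(w_i \mid w_{<i})}}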
Why perplexity instead of raw probability of the test set?
• Probability depends on the size of the test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words
(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Perplexity
• The higher the probability of the word sequence, the lower the perplexity.
• Thus the lower the perplexity of a model on the data, the better the model.
• Minimizing perplexity is the same as maximizing probability
Also: perplexity is sensitive to length/tokenization, so it is best used when comparing LMs that use the same tokenizer.
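Equivalently, perplexity is the exponential of the average per-token cross-entropy, which is a convenient way to compute it; a sketch assuming the same Hugging Face-style causal LM as above:

import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids):
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    with torch.no_grad():
        logits = model(inputs).logits
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1))           # average -log p per token
    return math.exp(nll.item())                          # exp(mean NLL) = P(w_1:n)^(-1/n)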
Many other factors that we evaluate, like:
Size: big models take lots of GPUs and time to train, and memory to store
Energy usage: can measure kWh or kilograms of CO2 emitted
Fairness: benchmarks measure gendered and racial stereotypes, or decreased performance for language from or about some groups.
Large Language Models
Dealing with Scale

Scaling Laws
LLM performance depends on
• Model size: the number of parameters not counting embeddings
• Dataset size: the amount of training data
• Compute: amount of compute (in FLOPs, etc.)
Can improve a model by adding parameters (more layers, wider contexts), more data, or training for more iterations
The performance of a large language model (the loss) scales as a power-law with each of these three
Scaling Laws
Loss L as a function of number of parameters N, dataset size D, compute budget C (if the other two are held constant)
Scaling laws can be used early in training to predict what the loss would be if we were to add more data or increase model size.
Number of non-embedding parameters N ≈ 12 n_layer d²
Thus GPT-3, with n_layer = 96 layers and dimensionality d = 12288, has 12 × 96 × 12288² ≈ 175 billion parameters.
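A quick check of that arithmetic (the layer count and dimensionality are the GPT-3 figures from the slide; the 12 n_layer d² approximation ignores embedding and bias parameters):

def approx_params(n_layer, d_model):
    # Non-embedding parameter count: N ≈ 12 * n_layer * d_model^2
    return 12 * n_layer * d_model ** 2

print(f"{approx_params(96, 12288):.3e}")   # ~1.74e+11, i.e. roughly 175 billion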
KV Cache
In training, we can compute attention very efficiently in parallel:
But not at inference! We generate the next tokens one at a time!
For a new token x, we need to multiply by W^Q, W^K, and W^V to get its query, key, and value
But we don't want to recompute the key and value vectors for all the prior tokens x_<i
Instead, store the key and value vectors in memory in the KV cache, and then we can just grab them from the cache
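A minimal single-head sketch of the idea (hypothetical names; real implementations keep one cache per layer and per head, and apply the causal mask in the batched case):

import torch

def attend_with_kv_cache(x_new, W_Q, W_K, W_V, cache):
    # x_new: (1, d_model) embedding of the newly generated token.
    q = x_new @ W_Q                                    # only the new token's query is computed
    cache["K"] = torch.cat([cache["K"], x_new @ W_K]) if "K" in cache else x_new @ W_K
    cache["V"] = torch.cat([cache["V"], x_new @ W_V]) if "V" in cache else x_new @ W_V
    scores = (q @ cache["K"].T) / cache["K"].size(-1) ** 0.5   # attend over all cached keys
    weights = torch.softmax(scores, dim=-1)
    return weights @ cache["V"]                        # weighted sum of cached values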
KV Cache
[Figure: the attention computation A = softmax(mask(QKᵀ))V, with Q (N × d_k), Kᵀ (d_k × N), and V (N × d_v); the QKᵀ matrix holds all the dot products q_i · k_j. At inference only the new query row is computed; the key and value rows of earlier tokens (k1 ... k4, v1 ... v4) are read from the KV cache rather than recomputed.]