




已阅读5页,还剩40页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
外文原文LETSBUILDACOMPILERByJackW.Crenshaw,Ph.D.PartVII:LEXICALSCANNINGINTRODUCTIONInthelastinstallment,IleftyouwithacompilerthatwouldALMOSTwork,exceptthatwewerestilllimitedtosingle-charactertokens.Thepurposeofthissessionistogetridofthatrestriction,onceandforall.Thismeansthatwemustdealwiththeconceptofthelexicalscanner.MaybeIshouldmentionwhyweneedalexicalscanneratall.afterall,wevebeenabletomanageallrightwithoutone,uptillnow,evenwhenweprovidedformulti-charactertokens.TheONLYreason,really,hastodowithkeywords.Itsafactofcomputerlifethatthesyntaxforakeywordhasthesameformasthatforanyotheridentifier.WecanttelluntilwegetthecompletewordwhetherornotitISakeyword.Forexample,thevariableIFILEandthekeywordIFlookjustalike,untilyougettothethirdcharacter.Intheexamplestodate,wewerealwaysabletomakeadecisionbaseduponthefirstcharacterofthetoken,butthatsnolongerpossiblewhenkeywordsarepresent.WeneedtoknowthatagivenstringisakeywordBEFOREwebegintoprocessit.Andthatswhyweneedascanner.Inthelastsession,Ialsopromisedthatwewouldbeabletoprovidefornormaltokenswithoutmakingwholesalechangestowhatwehavealreadydone.Ididntlie.wecan,asyouwillseelater.ButeverytimeIsetouttoinstalltheseelementsofthesoftwareintotheparserwehavealreadybuilt,Ihadbadfeelingsaboutit.Thewholethingfeltentirelytoomuchlikeaband-aid.Ifinallyfiguredoutwhatwascausingtheproblem:Iwasinstallinglexicalscanningsoftwarewithoutfirstexplainingtoyouwhatscanningisallabout,andwhatthealternativesare.Uptillnow,Ihavestudiouslyavoidedgivingyoualotoftheory,andcertainlynotalternatives.Igenerallydontrespondwelltothetextbooksthatgiveyoutwenty-fivedifferentwaystodosomething,butnoclueastowhichwaybestfitsyourneeds.IvetriedtoavoidthatpitfallbyjustshowingyouONEmethod,thatWORKS.Butthisisanimportantarea.Whilethelexicalscannerishardlythemostexcitingpartofacompiler,itoftenhasthemostprofoundeffectonthegenerallook&feelofthelanguage,sinceafterallitsthepartclosesttotheuser.IhaveaparticularstructureinmindforthescannertobeusedwithKISS.Itfitsthelook&feelthatIwantforthatlanguage.ButitmaynotworkatallforthelanguageYOUREcookingup,sointhisonecaseIfeelthatitsimportantforyoutoknowyouroptions.SoImgoingtodepart,again,frommyusualformat.Inthissessionwellbegettingmuchdeeperthanusualintothebasictheoryoflanguagesandgrammars.IllalsobetalkingaboutareasOTHERthancompilersinwhichlexicalscanningplaysanimportantrole.Finally,Iwillshowyousomealternativesforthestructureofthelexicalscanner.Then,andonlythen,willwegetbacktoourparserfromthelastinstallment.Bearwithme.Ithinkyoullfinditsworththewait.Infact,sincescannershavemanyapplicationsoutsideofcompilers,youmaywellfindthistobethemostusefulsessionforyou.LEXICALSCANNINGLexicalscanningistheprocessofscanningthestreamofinputcharactersandseparatingitintostringscalledtokens.Mostcompilertextsstarthere,anddevoteseveralchapterstodiscussingvariouswaystobuildscanners.Thisapproachhasitsplace,butasyouhavealreadyseen,thereisalotyoucandowithouteverevenaddressingtheissue,andinfactthescannerwellendupwithherewontlookmuchlikewhatthetextsdescribe.Thereason?Compilertheoryand,consequently,theprogramsresultingfromit,mustdealwiththemostgeneralkindofparsingrules.Wedont.Intherealworld,itispossibletospecifythelanguagesyntaxinsuchawaythataprettysimplescannerwillsuffice.Andasalways,KISSisourmotto.Typically,lexicalscanningisdoneinaseparatepartofthecompiler,sothattheparserperseseesonlyastreamofinputtokens.Now,theoreticallyitisnotnecessarytoseparatethisfunctionfromtherestoftheparser.Thereisonlyonesetofsyntaxequationsthatdefinethewholelanguage,sointheorywecouldwritethewholeparserinonemodule.Whytheseparation?Theanswerhasbothpracticalandtheoreticalbases.In1956,NoamChomskydefinedtheChomskyHierarchyofgrammars.Theyare:Type0:Unrestricted(e.g.,English)Type1:Context-SensitiveType2:Context-FreeType3:RegularAfewfeaturesofthetypicalprogramminglanguage(particularlytheolderones,suchasFORTRAN)areType1,butforthemostpartallmodernlanguagescanbedescribedusingonlythelasttwotypes,andthoseareallwellbedealingwithhere.Theneatpartaboutthesetwotypesisthatthereareveryspecificwaystoparsethem.Ithasbeenshownthatanyregulargrammarcanbeparsedusingaparticularformofabstractmachinecalledthestatemachine(finiteautomaton).Wehavealreadyimplementedstatemachinesinsomeofourrecognizers.Similarly,Type2(context-free)grammarscanalwaysbeparsedusingapush-downautomaton(astatemachineaugmentedbyastack).Wehavealsoimplementedthesemachines.Insteadofimplementingaliteralstack,wehavereliedonthebuilt-instackassociatedwithrecursivecodingtodothejob,andthatinfactisthepreferredapproachfortop-downparsing.Now,ithappensthatinreal,practicalgrammars,thepartsthatqualifyasregularexpressionstendtobethelower-levelparts,suchasthedefinitionofanidentifier:=|*Sinceittakesadifferentkindofabstractmachinetoparsethetwotypesofgrammars,itmakessensetoseparatetheselower-levelfunctionsintoaseparatemodule,thelexicalscanner,whichisbuiltaroundtheideaofastatemachine.Theideaistousethesimplestparsingtechniqueneededforthejob.Thereisanother,morepracticalreasonforseparatingscannerfromparser.Weliketothinkoftheinputsourcefileasastreamofcharacters,whichweprocessrighttoleftwithoutbacktracking.Inpracticethatisntpossible.AlmosteverylanguagehascertainkeywordssuchasIF,WHILE,andEND.AsImentionedearlier,wecantreallyknowwhetheragivencharacterstringisakeyword,untilwevereachedtheendofit,asdefinedbyaspaceorotherdelimiter.Sointhatsense,weMUSTsavethestringlongenoughtofindoutwhetherwehaveakeywordornot.Thatsalimitedformofbacktracking.Sothestructureofaconventionalcompilerinvolvessplittingupthefunctionsofthelower-levelandhigher-levelparsing.Thelexicalscannerdealswiththingsatthecharacterlevel,collectingcharactersintostrings,etc.,andpassingthemalongtotheparserproperasindivisibletokens.Itsalsoconsiderednormaltoletthescannerhavethejobofidentifyingkeywords.STATEMACHINESANDALTERNATIVESImentionedthattheregularexpressionscanbeparsedusingastatemachine.Inmostcompilertexts,andindeedinmostcompilersaswell,youwillfindthistakenliterally.Thereistypicallyarealimplementationofthestatemachine,withintegersusedtodefinethecurrentstate,andatableofactionstotakeforeachcombinationofcurrentstateandinputcharacter.IfyouwriteacompilerfrontendusingthepopularUnixtoolsLEXandYACC,thatswhatyoullget.TheoutputofLEXisastatemachineimplementedinC,plusatableofactionscorrespondingtotheinputgrammargiventoLEX.TheYACCoutputissimilar.acannedtable-drivenparser,plusthetablecorrespondingtothelanguagesyntax.Thatisnottheonlychoice,though.Inourpreviousinstallments,youhaveseenoverandoverthatitispossibletoimplementparserswithoutdealingspecificallywithtables,stacks,orstatevariables.Infact,inInstallmentVIwarnedyouthatifyoufindyourselfneedingthesethingsyoumightbedoingsomethingwrong,andnottakingadvantageofthepowerofPascal.Therearebasicallytwowaystodefineastatemachinesstate:explicitly,withastatenumberorcode,andimplicitly,simplybyvirtueofthefactthatImatacertainplaceinthecode(ifitsTuesday,thismustbeBelgium).Wevereliedheavilyontheimplicitapproachesbefore,andIthinkyoullfindthattheyworkwellhere,too.Inpractice,itmaynotevenbenecessarytoHAVEawell-definedlexicalscanner.Thisisntourfirstexperienceatdealingwithmulti-charactertokens.InInstallmentIII,weextendedourparsertoprovideforthem,andwedidntevenNEEDalexicalscanner.Thatwasbecauseinthatnarrowcontext,wecouldalwaystell,justbylookingatthesinglelookaheadcharacter,whetherweweredealingwithanumber,avariable,oranoperator.Ineffect,webuiltadistributedlexicalscanner,usingproceduresGetNameandGetNum.Withkeywordspresent,wecantknowanymorewhatweredealingwith,untiltheentiretokenisread.Thisleadsustoamorelocalizedscanner;although,asyouwillsee,theideaofadistributedscannerstillhasitsmerits.SOMEEXPERIMENTSINSCANNINGBeforegettingbacktoourcompiler,itwillbeusefultoexperimentabitwiththegeneralconcepts.Letsbeginwiththetwodefinitionsmostoftenseeninrealprogramminglanguages:=|*+(Remember,the*indicateszeroormoreoccurencesofthetermsinbrackets,andthe+,oneormore.)WehavealreadydealtwithsimilaritemsinInstallmentIII.Letsbegin(asusual)withabarecradle.Notsurprisingly,wearegoingtoneedanewrecognizer:RecognizeanAlphanumericCharacterfunctionIsAlNum(c:char):boolean;beginIsAlNum:=IsAlpha(c)orIsDigit(c);end;YoucaneasilyverifythattheseroutinesworkbycallingthemfromthemainprogramWriteLn(GetName);Thisprogramwillprintanylegalnametypedin(maximumeightcharacters,sincethatswhatwetoldGetName).Itwillrejectanythingelse.Testtheotherroutinesimilarly.WHITESPACEWealsohavedealtwithembeddedwhitespacebefore,usingthetworoutinesIsWhiteandSkipWhite.Makesurethattheseroutinesareinyourcurrentversionofthecradle,andaddthethelineSkipWhite;attheendofbothGetNameandGetNum.Now,letsdefinethenewprocedure:-LexicalScannerFunctionScan:string;beginifIsAlpha(Look)thenScan:=GetNameelseifIsDigit(Look)thenScan:=GetNumelsebeginScan:=Look;GetChar;end;SkipWhite;end;-(YouwillhavetoaddthedeclarationofthestringTokenatthebeginningoftheprogram.Makeitanyconvenientlength,say16characters.)Now,runtheprogram.Notehowtheinputstringis,indeed,separatedintodistincttokens.STATEMACHINESFortherecord,aparseroutinelikeGetNamedoesindeedimplementastatemachine.Thestateisimplicitinthecurrentpositioninthecode.Averyusefultrickforvisualizingwhatsgoingonisthesyntaxdiagram,orrailroad-trackdiagram.Itsalittledifficulttodrawoneinthismedium,soIllusethemverysparingly,butthefigurebelowshouldgiveyoutheidea:|-Other-Error|Start-Letter-Other-Finish|=,and,buttheycouldbedealtwithasspecialcases.Still,otherlanguageshavemulti-characteroperators,suchasthe:=ofPascalorthe+andofC.Sowhilewemaynotneedmulti-characteroperators,itsnicetoknowhowtogetthemifnecessary.Needlesstosay,wecanhandleoperatorsverymuchthesamewayastheothertokens.Letsstartwitharecognizer:-RecognizeAnyOperatorfunctionIsOp(c:char):boolean;beginIsOp:=cin+,-,*,/,:,=;end;-ItsimportanttonotethatweDONThavetoincludeeverypossibleoperatorinthislist.Forexample,theparethesesarentincluded,noristheterminatingperiod.ThecurrentversionofScanhandlessingle-characteroperatorsjustfineasitis.Thelistaboveincludesonlythosecharactersthatcanappearinmulti-characteroperators.(Forspecificlanguages,ofcourse,thelistcanalwaysbeedited.)Now,letsmodifyScantoread:-LexicalScannerFunctionScan:string;beginwhileLook=CRdoFin;ifIsAlpha(Look)thenScan:=GetNameelseifIsDigit(Look)thenScan:=GetNumelseifIsOp(Look)thenScan:=GetOpelsebeginScan:=Look;GetChar;end;SkipWhite;end;-Trytheprogramnow.Youwillfindthatanycodefragmentsyoucaretothrowatitwillbeneatlybrokenupintoindividualtokens.LISTS,COMMASANDCOMMANDLINESBeforegettingbacktothemainthrustofourstudy,Idliketogetonmysoapboxforamoment.Howmanytimeshaveyouworkedwithaprogramoroperatingsystemthathadrigidrulesabouthowyoumustseparateitemsinalist?(Try,thelasttimeyouusedMSDOS!)Someprogramsrequirespacesasdelimiters,andsomerequirecommas.Worstofall,somerequireboth,indifferentplaces.Mostareprettyunforgivingaboutviolationsoftheirrules.Ithinkthisisinexcusable.Itstooeasytowriteaparserthatwillhandlebothspacesandcommasinaflexibleway.Considerthefollowingprocedure:-SkipOveraCommaprocedureSkipComma;beginSkipWhite;ifLook=,thenbeginGetChar;SkipWhite;end;end;-Thiseight-lineprocedurewillskipoveradelimiterconsistingofanynumber(includingzero)ofspaces,withzerooronecommaembeddedinthestring.TEMPORARILY,changethecalltoSkipWhiteinScantoacalltoSkipComma,andtryinputtingsomelists.Worksnicely,eh?DontyouwishmoresoftwareauthorsknewaboutSkipComma?Fortherecord,IfoundthataddingtheequivalentofSkipCommatomyZ80assembler-languageprogramstookallof6(six)extrabytesofcode.Evenina64Kmachine,thatsnotaveryhighpricetopayforuser-friendliness!IthinkyoucanseewhereImgoinghere.Evenifyouneverwritealineofacompilercodeinyourlife,thereareplacesineveryprogramwhereyoucanusetheconceptsofparsing.Anyprogramthatprocessesacommandlineneedsthem.Infact,ifyouthinkaboutitforabit,youllhavetoconcludethatanytimeyouwriteaprogramthatprocessesuserinputs,youredefiningalanguage.Peoplecommunicatewithlanguages,andthesyntaximplicitinyourprogramdefinesthatlanguage.Therealquestionis:areyougoingtodefineitdeliberatelyandexplicitly,orjustletitturnouttobewhatevertheprogramendsupparsing?Iclaimthatyoullhaveabetter,moreuser-friendlyprogramifyoulltakethetimetodefinethesyntaxexplicitly.Writedownthesyntaxequationsordrawtherailroad-trackdiagrams,andcodetheparserusingthetechniquesIveshownyouhere.Youllendupwithabetterprogram,anditwillbeeasiertowrite,toboot.GETTINGFANCYOK,atthispointwehaveaprettynicelexicalscannerthatwillbreakaninputstreamupintotokens.Wecoulduseitasitstandsandhaveaservicablecompiler.Buttherearesomeotheraspectsoflexicalscanningthatweneedtocover.Themainconsiderationisefficiency.Rememberwhenweweredealingwithsingle-charactertokens,everytestwasacomparisonofasinglecharacter,Look,withabyteconstant.WealsousedtheCasestatementheavily.Withthemulti-charactertokensbeingreturnedbyScan,allthosetestsnowbecomestringcomparisons.Muchslower.Andnotonlyslower,butmoreawkward,sincethereisnostringequivalentoftheCasestatementinPascal.Itseemsespeciallywastefultotestforwhatusedtobesinglecharacters.the=,+,andotheroperators.usingstringcomparisons.Usingstringcomparisonisnotimpossible.RonCainusedjustthatapproachinwritingSmallC.SincewerestickingtotheKISSprinciplehere,wewouldbetrulyjustifiedinsettlingforthisapproach.ButthenIwouldhavefailedtotellyouaboutoneofthekeyapproachesusedinrealcompilers.Youhavetoremember:thelexicalscannerisgoingtobecalleda_LOT_!Onceforeverytokeninthewholesourceprogram,infact.Experimentshaveindicatedthattheaveragecompilerspendsanywherefrom20%to40%ofitstimeinthescannerroutines.Iftherewereeveraplacewhereefficiencydeservesrealconsideration,thisisit.Forthisreason,mostcompilerwritersaskthelexicalscannertodoalittlemorework,bytokenizingtheinputstream.Theideaistomatcheverytokenagainstalistofacceptablekeywordsandoperators,andreturnuniquecodesforeachonerecognized.Inthecaseofordinaryvariablenamesornumbers,wejustreturnacodethatsayswhatkindoftokentheyare,andsavetheactualstringsomewhereelse.Oneofthefirstthingsweregoingtoneedisawaytoidentifykeywords.WecanalwaysdoitwithsuccessiveIFtests,butitsurelywouldbeniceifwehadageneral-purposeroutinethatcouldcompareagivenstringwithatableofkeywords.(Bytheway,werealsogoingtoneedsucharoutinelater,fordealingwithsymboltables.)ThisusuallypresentsaprobleminPascal,becausestandardPascaldoesntallowforarraysofvariablelengths.Itsarealbothertohavetodeclareadifferentsearchroutineforeverytable.StandardPascalalsodoesntallowforinitializingarrays,soyoutendtoseecodelikeTable1:=IF;Table2:=ELSE;Tablen:=END;whichcangetprettyoldiftherearemanykeywords.Fortunately,TurboPascal4.0hasextensionsthateliminatebothoftheseproblems.ConstantarrayscanbedeclaredusingTPstypedconstantfacility,andthevariabledimensionscanbehandledwithitsC-likeextensionsforpointers.First,modifyyourdeclarationslikethis:-TypeDeclarationstypeSymbol=string8;SymTab=array1.1000ofSymbol;TabPtr=SymTab;-(ThedimensionusedinSymTabisnotreal.nostorageisallocatedbythedeclarationitself,andthenumberneedonlybebigenough.)Now,justbeneaththosedeclarations,addthefollowing:-DefinitionofKeywordsandTokenTypesconstKWlist:array1.4ofSymbol=(IF,ELSE,ENDIF,END);-Next,insertthefollowingnewfunction:-TableLookupIftheinputstringmatchesatableentry,returntheentryindex.Ifnot,returnazero.functionLookup(T:TabPtr;s:string;n:integer):integer;vari:integer;found:boolean;beginfound:=false;i:=n;while(i0)andnotfounddoifs=Tithenfound:=trueelsedec(i);Lookup:=i;end;-Totestit,youcantemporarilychangethemainprogramasfollows:-MainProgrambeginReadLn(Token);WriteLn(Lookup(Addr(KWList),Token,4);end.-NoticehowLookupiscalled:TheAddrfunctionsetsupapointertoKWList,whichgetspassedtoLookup.OK,givethisatry.SincewerebypassingScanhere,youllhavetotypethekeywordsinuppercasetogetanymatches.Nowthatwecanrecognizekeywords,thenextthingistoarrangetoreturncodesforthem.Sowhatkindofcodeshouldwereturn?Therearereallyonlytworeasonablechoices.ThisseemslikeanidealapplicationforthePascalenumeratedtype.Forexample,youcandefinesomethinglikeSymType=(IfSym,ElseSym,EndifSym,EndSym,Ident,Number,Operator);andarrangetoreturnavariableofthistype.Letsgiveitatry.Insertthelineaboveintoyourtypedefinitions.Now,addthetwovariabledeclarations:Token:Symtype;CurrentTokenValue:String16;StringTokenofLookModifythescannertoread:-LexicalScannerprocedureScan;vark:integer;beginwhileLook=CRdoFin;ifIsAlpha(Look)thenbeginValue:=GetName;k:=Lookup(Addr(KWlist),Value,4);ifk=0thenToken:=IdentelseToken:=SymType(k-1);endelseifIsDigit(Look)thenbeginValue:=GetNum;Token:=Number;endelseifIsOp(Look)thenbeginValue:=GetOp;Token:=Operator;endelsebeginValue:=Look;Token:=Operator;GetChar;end;SkipWhite;end;-(NoticethatScanisnowaprocedure,notafunction.)Finally,modifythemainprogramtoread:-M
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 养老服务消费市场现状及发展趋势探讨
- 环保交通工具在旅游碳减排中的角色
- 大数据分析在鞋服供应链运输调度中的应用
- 2025年医疗大数据在慢性病管理中的智能健康管理报告
- 2025年休闲农业与乡村旅游融合发展规划报告:乡村旅游与乡村旅游节庆活动
- 心理干预在提升学生专注力中的作用研究
- 校园体育活动的商业化运营策略-以篮球为例
- 以人为本医疗实验室自动化的用户体验设计研究报告
- 2025年中国拉手座行业投资前景及策略咨询研究报告
- 2025年中国塑料纱网行业投资前景及策略咨询研究报告
- 【高分复习资料】山东大学《244德语》历年考研真题汇编
- (新版)山东省物流工程师职称考试参考试题库-下(多选、判断题)
- 青年兴则国家兴青年强则国家强
- 全国行业职业技能竞赛(电力交易员)考试题库及答案
- DB50-T 1293-2022 松材线虫病疫木除治技术规范
- 山东省青岛市英语中考试题及解答参考(2025年)
- 多功能热洗车热洗清蜡QHSE作业指导书及操作规程
- 2024年北京中考地理试卷
- 《市政养护工程施工方案》
- 液化石油气站规章制度2024
- (安全生产)煤矿安全生产监管检查清单
评论
0/150
提交评论