




Business models for Dictionaries and NLP

Adam Kilgarriff

November

Abstract

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. To date, there has been only very limited collaboration. The two reasons for this are (a) dictionary publishers' concerns regarding intellectual property, and (b) the different languages that lexicographers and NLP researchers speak. In this paper I present a model for overcoming the first and suggest some strategies for the second.

Introduction

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. There is ample motivation for NLP to court dictionary publishers, and vice versa.

To date, NLP research has used dictionaries and dictionaries have used NLP, but the two processes have not been brought together. The NLP that has gone into making dictionaries has not been the NLP that was based on an earlier version of the same dictionary, or indeed of any published dictionaries. The NLP has either been dictionary-independent, if brought in from outside, or developed in-house. While NLP groups have made innumerable corrections, improvements, additions and extensions to the dictionary databases they have licensed from publishers, these changes have never been used by the publisher to improve the next printing or the next edition of the dictionary.

There have been two reasons for this failure of synergy. Firstly, dictionary publishers are concerned about a number of questions relating to intellectual property. Secondly, lexicographers and NLP researchers speak very different languages, so it is not straightforward to develop the atmosphere of trust in which the lexicographers fully understand what the NLP research has done, or believe that the NLP "enhancements" to their dictionary are useful to them.

Note: NLP (Natural Language Processing) is here taken to describe all those technologies that manipulate human language inputs by computer, examples being automatic part-of-speech tagging, parsing, concordancing, machine translation, information extraction and text generation. "Language engineering" and "computational linguistics" are near-synonyms.
Note: Sophisticated NLP software that has been used for English includes taggers (e.g. COBUILD used ENGCG, from Helsinki; Longman, OUP and Chambers Harrap used the BNC, which was POS-tagged by CLAWS, from Lancaster) and parsers (the experimental HECTOR project used the Fidditch parser (Hindle)). These programs do not use publishers' dictionaries. Nor do the statistics-based collocation-finders that have been quite widely used. An exception, among English dictionaries, is CIDE, where the development of tagging software was closely integrated with the dictionary production process.

In the remainder of the paper I first describe what the two sides have to gain from each other, then the history of the issue, and then present a mutually beneficial business model. I then comment on the language problem.

The paper may seem to replicate the Special Issue of this journal edited by Bran Boguraev (Boguraev). Indeed, the overall motivation is the same: to promote mutual understanding and collaboration between the two fields. But whereas the Special Issue set out to describe what benefits there could be, in this paper this is all but assumed, with the following two sections summarising the main themes. Rather, we look at the institutional and business reasons why more collaboration has not happened and consider how they can be overcome.

Dictionaries for NLP

NLP has cause to court the publishers because it needs lexical information for almost everything it does. Lexical information tends to be expensive and difficult to produce, so licensing it from those who have already invested in it (the dictionary publishers) makes good sense. This applies to the full range of lexical information: orthography, phonology, morphology, syntax, semantics, pragmatics, translations, domain, thesauri, collocation. All of these have, over the last twenty-five years, been extracted from electronic versions of dictionaries (on a typesetter's tape, or CD-ROM, or other electronic medium) and used in NLP: for research in all cases, for products in most. NLP has found dictionaries valuable for all these purposes in spite of the overheads of finding the relevant information in what is usually a poorly structured input, and the inevitable errors, inconsistencies and omissions. They would be much more enthusiastic if it were not for these failings.
NLP for better dictionaries

Dictionary publishers have cause to court NLP for a number of reasons. The most obvious is that there is money to be made from licensing arrangements. Large sums frequently change hands. This has been the dominant motivation for dictionary publishers approaching NLP groups.

A distinction must be made at this point between dictionary publishers and lexicographers. This motivation has been vivid in the minds of the publishers, but of little interest to the lexicographers.

The more intriguing reasons relate to the potential NLP has to improve dictionary quality. Of these, two varieties can be distinguished: those that act on a dictionary database, and those that act on text corpora.

Benefits of NLP applied to dictionary databases

The benefits are closely related to the benefits to the dictionary production process of, simply, having the dictionary in a database at all. Many errors can be detected automatically, and consistency of all manner of features checked. Superficially, these may seem to be simple database functions. However, many features can only be checked for consistency if the data model underlying the database is based on an understanding of lexicography. The skills required to model the data lie at the intersection of linguistics, lexicography and computation, viz. in NLP.

Moreover, there are always further checks and analyses which go beyond anything the database will provide directly. For example, one might want to check to what extent the words in the same thesaural category get corresponding definitions. Answering this will require NLP expertise, as well as both a well-structured database (so the relevant data is accessible) and lexicographic expertise (to work out where the non-correspondences are justified).

A common situation is that a good database model is available but the dictionary database does not fit it. What is then required is a process sometimes called "up-translation", of carrying the data over from the existing format into the new one. Standardly, much of this will be doable by simply translating mark-up, but in many cases the mapping will be one-to-many, or will be context-dependent, or distinctions will be required for the new data model which are implicit in the text or font-changes of the existing dictionary. In most such cases, a
very high percentage of the up-translation can be done automatically, but it is an NLP task to do it.

NLP applied to text corpora

The arrival of electronic text corpora is causing a revolution in lexicography. Previously, the primary source of evidence for a word's behaviour was the lexicographer's intuition. Now, wherever publishers have been able to assemble a large text corpus and have concordancing technology, it is the corpus. There is widespread agreement that this is a huge boon for lexicography.

Yet it clearly introduces many new challenges. Organising the lexicographer's working environment so that s/he has instant access to a concordance for the word s/he is working on is not easy, particularly where lexicographers work at home. Even the simplest concordancing program requires some minimal NLP input. Minimally, it needs to know which characters are punctuation characters, so should be ignored when defining what a word is: if a space is treated as the only word-delimiter, the search term "butter" will not match the corpus object "butter" with a punctuation mark attached. In English, the characters [...] and [...] present problems at this level.

Morphological analysis will make the corpus more useful, as then all forms of a word can be found with a simple search. For English, with its simple morphology, it is easy enough to search for all four forms separately, or with a regular expression. For Finnish, it is not. The corpus will be more useful again if it is part-of-speech tagged, word-sense disambiguated or parsed (see e.g. Tapanainen and Järvinen). Each level of annotation allows the lexicographer more control in searching the corpus for the linguistically interesting phenomena, so that, when looking through the concordance, noise is reduced and duplicating patterns do not need to be scanned exhaustively.

Beyond the concordance, there are corpus statistics and machine learning. Statistically organised collocation lists have proved their worth for all the dictionary projects that have had access to very large corpora. Church and Hanks, the paper that opened the current debate in NLP about collocation statistics, is a noteworthy example of lexicography-NLP collaboration. Acquiring lexical information from corpora is currently one of the most dynamic areas in NLP. My purpose in saying this is not to put fear of redundancy in the
hearts of lexicographers but to indicate how much more satisfactory their work is to become when the tools at their disposal are so much more powerful. The techniques tend to find many plausible hypotheses in a corpus, but are unable to sort the wheat from the chaff or, evidently, to assign meanings to the patterns they find. The lexicographer's task remains the same at heart but with less drudge.

History

The recent history of dictionary-NLP interaction begins with Amsler and Michiels. Each took a typesetter's tape for a dictionary. Amsler's agenda was to see whether the dictionary could be used as a source of general knowledge of the everyday world (knowledge that, for example, alsatians are dogs) for use with artificial intelligence programs. Michiels explored how the more specifically linguistic information might be used for NLP. Both of these agendas were pursued at length in the course of the [...]s, notably in the EU ACQUILEX project, at the Computer Research Laboratory at New Mexico State University, and at IBM in Yorktown Heights. Boguraev and Briscoe represents the activity at its high water mark, with Byrd et al. demonstrating the ascent, Wilks, Slator and Guthrie reviewing the whole, and Ide and Veronis offering a postmortem (their subtitle is "Have we wasted our time?").

As a research topic, the use of dictionary databases has now run its course. This has a number of interpretations. The most positive is that the NLP world has now developed a fair understanding of what information dictionary databases contain and what is involved in extracting it for NLP use, so that the topic has shifted from "research" to "development". This would offer a perspective on why almost all the research was on English dictionaries: they were used for testing out methods, and the methods could then be used for dictionaries for other languages.

The next interpretation is simply that the research fashion has changed, particularly to language corpora and methods for extracting lexical information from them.

Then there is the interpretation that motivates this paper. In the academic sector, work done to enhance a dictionary is usually work wasted if others are not permitted to use the enhanced resource. Publishers, anxious about intellectual property, have frequently not permitted it.
A number of NLP workers have explicitly chosen not to do any further work on dictionaries, except those in the public domain, for this reason: they suspect that any such work, however good, will be destined for oblivion.

In the event, the academic world's own product arrived. As of [...], WordNet has been available free, over the web and without constraint. WordNet has been the lexical resource of the [...]s, and various people have argued that WordNet senses are the de facto standard for NLP. WordNet was produced by linguists and psychologists, according to a psycholinguistic agenda, and its suitability for NLP research remains a lively topic of debate. But, as against the practicalities of getting hold of it, these purist concerns have carried little weight and everyone uses WordNet now.

WordNet addressed semantics; for syntax, the NLP community invested in COMLEX and, more recently, NOMLEX (Grishman, MacLeod and Meyers; MacLeod et al.). COMLEX and NOMLEX are designed from the outset to meet the needs of the NLP community, and there is currently work in progress on linking the COMLEX syntax patterns to the WordNet word senses.

Note: There are alternatives to COMLEX. Other lexicons such as XTAG (Group) and ANLT (Carroll and Grover) have also been developed by the NLP community. The ANLT lexicon was developed in the mid-eighties; it included material from a dictionary, and copyright issues prevented it being used widely for NLP until the mid-nineties.

The [...] Euralex conference in Liège may have inaugurated a new phase in the debate, with the impetus this time coming from lexicography. Whereas earlier EURALEXes have been resistant to computation or viewed it warily, at Liège, this author's perception was that it was universally accepted that NLP had a role to play in dictionary production. The question was no longer whether to use it, but how to use it well. Except in the "dictionary use" sessions, it was hard to find papers which did not assume the availability of corpora and concordancing software or more. One paper of particular salience to the argument was jointly presented by Ulrich Heid, an NLP academic, and Vincent Docherty, a dictionary publisher (Docherty and Heid): here, at last, the state of the art in NLP was being used to provide inputs to lexicographers for compiling a better dictionary for people.

Publishers' anxieties
The overt reason why publishers' dictionaries have not been more widely used is copyright. The publishers' argument is simple and direct: the publisher's trade is in intellectual property, so it is not reasonable to expect them to give the dictionary away or risk it falling into the public domain.

There are several threads to the case. The first is, simply, piracy. If a dictionary starts being copied freely or, worse still, starts being copied for a fee but where the fee does not go to the legal copyright owner, the publisher is the loser. The publisher wants to avoid this happening above all else. A licence agreement with an NLP research group provides an avenue by which a dictionary may find its way to being illegally copied and re-copied.

The other threads relate to the possible re-use or re-sale by the publisher of versions of their dictionary which have been upgraded in some way by an NLP research group.

The issues here concern "contamination". Where a dictionary publisher is the sole owner of the copyright for a dictionary, it would like to keep it that way. There will be no other parties to consider in future negotiations, and all the profits will come to the publisher. If a dictionary database has been enhanced by an NLP group, then, prima facie, a share of the intellectual property belongs to the NLP group. It is not straightforward to arrive at a model for how shares of the intellectual property should be allocated. The starting point would usually be the quantity of labour that each party had put in, or the fraction of the text that each produced. The latter cannot be applied at all: whatever the NLP enhancements may be, it certainly will not make sense to quantify them as a fraction of total text length. The former is also of little use: the NLP laboratory will probably have been using various pieces of software and expertise to aid with the enhancement, which will have been undertaken for internal use in other NLP applications or research projects, so the re-sale or re-licensing of the upgraded dictionary would be a side effect.

The publisher may well have doubts about the accuracy and consistency of the enhanced dictionary, but may not be well placed to evaluate, as this in itself may require NLP techniques and an understanding of the issues which are likely to be critical for NLP applications.
The enhanced dictionary may have potential uses for print products, for electronic products for the consumer market, or for licensing on for NLP use. A question in relation to the first two is addressed in section [...]. Regarding the third, the potential complexity of the arrangements is forbidding. Where a dictionary is licensed for NLP use, the licence may be, broadly, for research use or for product development. Research use is straightforward, provided the demarcation in the NLP group between research and development is clearly drawn. Where it is for product development, the publisher may receive a licence fee or a royalty for each product "containing" the dictionary, or some of each. "Containing" is in inverted commas because the dictionary will not generally exist in the product in any recognisable form, but as one of a number of inputs to the system's lexicon, which may well be compiled and unreadable. That product may itself be a consumer product, or may be a component of some larger application, so there may be many steps, each potentially involving licences and royalties, between the dictionary publisher and the end product. One has some sympathy with the publisher not wishing to have its negotiating hand constrained by undertakings to NLP groups.

Note: Our hosts' longstanding engagement with NLP may well have played a role.

Responses

Piracy is not a strong argument. There are many ways in which a dictionary may be pirated or may enter the public domain. It may be re-keyed; it may be OCR'd from the printed version; the contents of CD-ROMs may be unscrambled; hackers may hack into the publisher's system; tapes may be stolen. It will of course be the duty of licence-holders to ensure that people who should not have access to their version of the dictionary do not, and that it does not get copied outside the confines of the laboratory, and if they fail in this duty they will be culpable. But they are no more likely to fail in the duty than assorted other of the publisher's employees and agents who have access to the data.

Regarding copyright on an enhanced dictionary: firstly, many research groups in Universities will not want to claim a share of copyright. The original licence allowing them to use the dictionary could then contain a clause stipulating that lexicons for which the dictionary has
been an input can only be passed on to third parties by the publisher, and all the copyright in such lexicons shall be vested in the publisher.

Where NLP groups are not happy to hand over copyright in this way, there will be potentially complex negotiations required, as there will when a lexicon is a component of some other application at one or more further removes from the end product. These sorts of considerations are, however, becoming commonplace in the software and multimedia industries, so the publisher's concerns reduce to the following: are the gains to be had from synergy with NLP sufficient to merit the effort? Is the publisher willing to take on the challenge of the complex negotiations?

The Model

The primary concern of the publisher is to retain control of the resource and to maximise the flow of income it gives rise to. The primary concern of many NLP researchers is that their work is available for others to use and extend. Reputation and status are dependent on being cited, in the academic world. If people use your resource they will cite you. Licence income is a lesser consideration: firstly, because it is not expected; secondly, because it would probably go to the institution rather than the individual; and thirdly, simply because it is not the currency of academic status.

The appropriate model is for the publisher to encourage NLP researchers to use the dictionary, on the basis that any enhancements to the dictionary will be returned to the publisher for it to use itself if it so wishes (for dictionary revision and other improved consumer products) and to market and generally make available to other NLP groups. The agreement would put the publisher under an obligation to make the enhanced dictionary available (under similar terms again, and for a fee that would not be prohibitive), and this would be the benefit to the NLP group to counterbalance the fact that they would, in the simplest case, relinquish claims to a share in the copyright.

Limitations are required on this obligation of the publisher to publish. Firstly, the enhanced dictionary would have to be of adequate quality, and to this end, the publisher must acquire the expertise to assess the quality. This will demand some investment in NLP expertise on the part of the publisher, but then, the publisher should not expect to reap the
fruits of the new market without any investment.

Secondly, the publisher will want to strike different deals for the dictionary with different customers: one would not expect terms for Microsoft to be as for a University group or startup company. The terms of the agree