Business models for Dictionaries and NLP

Adam Kilgarriff

November

Abstract

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. To date, there has been only very limited collaboration. The two reasons for this are (a) dictionary publishers' concerns regarding intellectual property, and (b) the different languages that lexicographers and NLP researchers speak. In this paper I present a model for overcoming the first and suggest some strategies for the second.

Introduction

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. There is ample motivation for NLP to court dictionary publishers, and vice versa.

To date, NLP research has used dictionaries and dictionaries have used NLP, but the two processes have not been brought together. The NLP that has gone into making dictionaries has not been the NLP that was based on an earlier version of the same dictionary, or indeed of any published dictionaries. The NLP has either been dictionary-independent, if brought in from outside, or developed in-house. While NLP groups have made innumerable corrections, improvements, additions and extensions to the dictionary databases they have licensed from publishers, these changes have never been used by the publisher to improve the next printing or the next edition of the dictionary.

There have been two reasons for this failure of synergy. Firstly, dictionary publishers are concerned about a number of questions relating to intellectual property. Secondly, lexicographers and NLP researchers speak very different languages, so it is not straightforward to develop the atmosphere of trust in which the lexicographers fully understand what the NLP research has done, or believe that the NLP "enhancements" to their dictionary are useful to them.

[Footnote: "NLP" (Natural Language Processing) is here taken to describe all those technologies that manipulate human language inputs by computer, examples being automatic part-of-speech tagging, parsing, concordancing, machine translation, information extraction and text generation. "Language engineering" and "computational linguistics" are near-synonyms.]

[Footnote: Sophisticated NLP software that has been used for English includes taggers (e.g. COBUILD used ENGCG from Helsinki; Longman, OUP and Chambers Harrap used the BNC, which was POS-tagged by CLAWS from Lancaster) and parsers (the experimental HECTOR project used the Fidditch parser (Hindle)). These programs do not use publishers' dictionaries. Nor do the statistics-based collocation-finders that have been quite widely used. An exception among English dictionaries is CIDE, where the development of tagging software was closely integrated with the dictionary production process.]

In the remainder of the paper I first describe what the two sides have to gain from each other, then the history of the issue, and then present a mutually beneficial business model. I then comment on the language problem.

The paper may seem to replicate the Special Issue of this journal edited by Bran Boguraev (Boguraev). Indeed, the overall motivation is the same: to promote mutual understanding and collaboration between the two fields. But whereas the Special Issue set out to describe what benefits there could be, in this paper this is all but assumed, with the following two sections summarising the main themes. Rather, we look at the institutional and business reasons why more collaboration has not happened and consider how they can be overcome.

Dictionaries for NLP

NLP has cause to court the publishers because it needs lexical information for almost everything it does. Lexical information tends to be expensive and difficult to produce, so licensing it from those who have already invested in it (the dictionary publishers) makes good sense. This applies to the full range of lexical information: orthography, phonology, morphology, syntax, semantics, pragmatics, translations, domain, thesauri, collocation. All of these have, over the last twenty-five years, been extracted from electronic versions of dictionaries (on a typesetter's tape, or CD-ROM, or other electronic medium) and used in NLP: for research in all cases, for products in most. NLP has found dictionaries valuable for all these purposes in spite of the overheads of finding the relevant information in what is usually a poorly-structured input, and the inevitable errors, inconsistencies and omissions. They would be much more enthusiastic if it were not for these failings.

NLP for better dictionaries

Dictionary publishers have cause to court NLP for a number of reasons. The most obvious is that there is money to be made from licensing arrangements. Large sums frequently change hands. This has been the dominant motivation for dictionary publishers approaching NLP groups.

A distinction must be made at this point between dictionary publishers and lexicographers. This motivation has been vivid in the minds of the publishers, but of little interest to the lexicographers.

The more intriguing reasons relate to the potential NLP has to improve dictionary quality. Of these, two varieties can be distinguished: those that act on a dictionary database, and those that act on text corpora.

Benefits of NLP applied to dictionary databases

The benefits are closely related to the benefits to the dictionary production process of, simply, having the dictionary in a database at all. Many errors can be detected automatically, and consistency of all manner of features checked. Superficially, these may seem to be simple database functions. However, many features can only be checked for consistency if the data model underlying the database is based on an understanding of lexicography. The skills required to model the data lie at the intersection of linguistics, lexicography and computation, viz., in NLP.

Moreover, there are always further checks and analyses which go beyond anything the database will provide directly. For example, one might want to check to what extent the words in the same thesaural category get corresponding definitions. Answering this will require NLP expertise, as well as both a well-structured database (so the relevant data is accessible) and lexicographic expertise (to work out where the non-correspondences are justified).

A common situation is that a good database model is available but the dictionary database does not fit it. What is then required is a process sometimes called "up-translation", of carrying the data over from the existing format into the new one. Standardly, much of this will be doable by simply translating mark-up, but in many cases the mapping will be one-to-many, or will be context-dependent, or distinctions will be required for the new data model which are implicit in the text or font-changes of the existing dictionary. In most such cases a
very high percentage of the up-translation can be done automatically, but it is an NLP task to do it.

NLP applied to text corpora

The arrival of electronic text corpora is causing a revolution in lexicography. Previously, the primary source of evidence for a word's behaviour was the lexicographer's intuition. Now, wherever publishers have been able to assemble a large text corpus and have concordancing technology, it is the corpus. There is widespread agreement that this is a huge boon for lexicography.

Yet it clearly introduces many new challenges. Organising the lexicographer's working environment so that s/he has instant access to a concordance for the word s/he is working on is not easy, particularly where lexicographers work at home. Even the simplest concordancing program requires some minimal NLP input. Minimally, it needs to know which characters are punctuation characters, and so should be ignored when defining what a word is: if a space is treated as the only word-delimiter, the search term "butter" will not match a corpus object in which "butter" is immediately followed by a punctuation mark. In English, the characters [...] and [...] present problems at this level. Morphological analysis will make the corpus more useful, as then all forms of a word can be found with a simple search. For English, with its simple morphology, it is easy enough to search for all four forms separately, or with a regular expression. For Finnish, it is not. The corpus will be more useful again if it is part-of-speech tagged, word-sense disambiguated or parsed (see e.g. Tapanainen and Järvinen). Each level of annotation allows the lexicographer more control in searching the corpus for the linguistically interesting phenomena, so that, when looking through the concordance, noise is reduced and duplicating patterns do not need to be scanned exhaustively.

Beyond the concordance, there are corpus statistics and machine learning. Statistically-organised collocation lists have proved their worth for all the dictionary projects that have had access to very large corpora. Church and Hanks, the paper that opened the current debate in NLP about collocation statistics, is a noteworthy example of lexicography-NLP collaboration. Acquiring lexical information from corpora is currently one of the most dynamic areas in NLP. My purpose in saying this is not to put fear of redundancy in the
hearts of lexicographers but to indicate how much more satisfactory their work is to become when the tools at their disposal are so much more powerful. The techniques tend to find many plausible hypotheses in a corpus, but are unable to sort the wheat from the chaff or, evidently, to assign meanings to the patterns they find. The lexicographer's task remains the same at heart but with less drudgery.

History

The recent history of dictionary-NLP interaction begins with Amsler and Michiels. Each took a typesetter's tape for a dictionary. Amsler's agenda was to see whether the dictionary could be used as a source of general knowledge of the everyday world (knowledge that, for example, Alsatians are dogs) for use with artificial intelligence programs. Michiels explored how the more specifically linguistic information might be used for NLP. Both of these agendas were pursued at length in the course of the [...]s, notably in the EU ACQUILEX project, at the Computing Research Laboratory at New Mexico State University, and at IBM in Yorktown Heights. Boguraev and Briscoe represents the activity at its high water mark, with Byrd et al. demonstrating the ascent, Wilks, Slator and Guthrie reviewing the whole, and Ide and Veronis offering a post mortem (their subtitle is "Have we wasted our time?").

As a research topic, the use of dictionary databases has now run its course. This has a number of interpretations. The most positive is that the NLP world has now developed a fair understanding of what information dictionary databases contain and what is involved in extracting it for NLP use, so that the topic has shifted from "research" to "development". This would offer a perspective on why almost all the research was on English dictionaries: they were used for testing out methods, and the methods could then be used for dictionaries for other languages.

The next interpretation is simply that the research fashion has changed, particularly to language corpora, and methods for extracting lexical information from them.

Then there is the interpretation that motivates this paper. In the academic sector, work done to enhance a dictionary is usually work wasted if others are not permitted to use the enhanced resource. Publishers, anxious about intellectual property, have frequently not permitted it.
A number of NLP workers have explicitly chosen not to do any further work on dictionaries, except those in the public domain, for this reason: they suspect that any such work, however good, will be destined for oblivion.

In the event, the academic world's own product arrived. As of [...], WordNet has been available free, over the web and without constraint. WordNet has been the lexical resource of the [...]s, and various people have argued that WordNet senses are the de facto standard for NLP. WordNet was produced by linguists and psychologists, according to a psycholinguistic agenda, and its suitability for NLP research remains a lively topic of debate. But, as against the practicalities of getting hold of it, these purist concerns have carried little weight and everyone uses WordNet now.

WordNet addressed semantics; for syntax, the NLP community invested in COMLEX and, more recently, NOMLEX (Grishman, MacLeod and Meyers; MacLeod et al.). COMLEX and NOMLEX are designed from the outset to meet the needs of the NLP community, and there is currently work in progress on linking the COMLEX syntax patterns to the WordNet word senses.

[Footnote: There are alternatives to COMLEX. Other lexicons such as XTAG (XTAG Group) and ANLT (Carroll and Grover) have also been developed by the NLP community. The ANLT lexicon was developed in the mid-eighties; it included material from a dictionary, and copyright issues prevented it being used widely for NLP until the mid-nineties.]

The Euralex conference in Liège may have inaugurated a new phase in the debate, with the impetus this time coming from lexicography. Whereas earlier EURALEXes have been resistant to computation or viewed it warily, at Liège, this author's perception was that it was universally accepted that NLP had a role to play in dictionary production. The question was no longer whether to use it, but how to use it well. Except in the "dictionary use" sessions, it was hard to find papers which did not assume the availability of corpora and concordancing software or more. One paper of particular salience to the argument was jointly presented by Ulrich Heid, an NLP academic, and Vincent Docherty, a dictionary publisher (Docherty and Heid): here, at last, the state of the art in NLP was being used to provide inputs to lexicographers for compiling a better dictionary for people.

Publishers' anxieties

The overt reason why publishers' dictionaries have not been more widely used is copyright. The publishers' argument is simple and direct: the publisher's trade is in intellectual property, so it is not reasonable to expect them to give the dictionary away or risk it falling into the public domain.

There are several threads to the case. The first is, simply, piracy. If a dictionary starts being copied freely, or, worse still, starts being copied for a fee but where the fee does not go to the legal copyright owner, the publisher is the loser. The publisher wants to avoid this happening above all else. A licence agreement with an NLP research group provides an avenue by which a dictionary may find its way to being illegally copied and re-copied.

The other threads relate to the possible re-use or re-sale by the publisher of versions of their dictionary which have been upgraded in some way by an NLP research group.

The issues here concern contamination. Where a dictionary publisher is the sole owner of the copyright for a dictionary, it would like to keep it that way. There will be no other parties to consider in future negotiations, and all the profits will come to the publisher. If a dictionary database has been enhanced by an NLP group, then, prima facie, a share of the intellectual property belongs to the NLP group. It is not straightforward to arrive at a model for how shares of the intellectual property should be allocated. The starting point would usually be the quantity of labour that each party had put in, or the fraction of the text that each produced. The latter cannot be applied at all: whatever the NLP enhancements may be, it certainly will not make sense to quantify them as a fraction of total text-length. The former is also of little use: the NLP laboratory will probably have been using various pieces of software and expertise to aid with the enhancement, which will have been undertaken for internal use, in other NLP applications or research projects, so the re-sale or re-licensing of the upgraded dictionary would be a side-effect.

The publisher may well have doubts about the accuracy and consistency of the enhanced dictionary, but may not be well-placed to evaluate, as this in itself may require NLP techniques and an understanding of the issues which are likely to be critical for NLP applications.

The enhanced dictionary may have potential uses for print products, for electronic products for the consumer market, or for licensing on for NLP use. A question in relation to the first two is addressed in section [...]. Regarding the third, the potential complexity of the arrangements is forbidding. Where a dictionary is licensed for NLP use, the licence may be, broadly, for research use or for product development. Research use is straightforward, provided the demarcation in the NLP group between research and development is clearly drawn. Where it is for product development, the publisher may receive a licence fee or a royalty for each product "containing" the dictionary, or some of each. "Containing" is in inverted commas because the dictionary will not generally exist in the product in any recognisable form, but as one of a number of inputs to the system's lexicon, which may well be compiled and unreadable. That product may itself be a consumer product, or may be a component of some larger application, so there may be many steps, each potentially involving licences and royalties, between the dictionary publisher and the end-product. One has some sympathy with the publisher not wishing to have its negotiating hand constrained by undertakings to NLP groups.

[Footnote: Our hosts' longstanding engagement with NLP may well have played a role.]

Responses

Piracy is not a strong argument. There are many ways in which a dictionary may be pirated or may enter the public domain. It may be re-keyed, it may be OCR'd from the printed version, the contents of CD-ROMs may be unscrambled, hackers may hack into the publisher's system, tapes may be stolen. It will of course be the duty of licence-holders to ensure that people who should not have access to their version of the dictionary do not, and that it does not get copied outside the confines of the laboratory, and if they fail in this duty, they will be culpable. But they are no more likely to fail in the duty than assorted others of the publisher's employees and agents who have access to the data.

Regarding copyright on an enhanced dictionary, firstly, many research groups in Universities will not want to claim a share of copyright. The original licence allowing them to use the dictionary could then contain a clause stipulating that lexicons for which the dictionary has
been an input can only be passed on to third parties by the publisher, and all the copyright in such lexicons shall be vested in the publisher.

Where NLP groups are not happy to hand over copyright in this way, there will be potentially complex negotiations required, as there will when a lexicon is a component of some other application at one or more further removes from the end product. These sorts of considerations are, however, becoming commonplace in the software and multimedia industries, so the publisher's concerns reduce to the following: are the gains to be had from synergy with NLP sufficient to merit the effort? Is the publisher willing to take on the challenge of the complex negotiations?

The Model

The primary concern of the publisher is to retain control of the resource and to maximise the flow of income it gives rise to. The primary concern of many NLP researchers is that their work is available for others to use and extend. In the academic world, reputation and status are dependent on being cited. If people use your resource they will cite you. Licence income is a lesser consideration: firstly, because it is not expected; secondly, because it would probably go to the institution rather than the individual; and thirdly, simply because it is not the currency of academic status.

The appropriate model is for the publisher to encourage NLP researchers to use the dictionary, on the basis that any enhancements to the dictionary will be returned to the publisher for it to use itself if it so wishes (for dictionary revision and other improved consumer products) and to market and generally make available to other NLP groups. The agreement would put the publisher under an obligation to make the enhanced dictionary available, under similar terms again, and for a fee that would not be prohibitive, and this would be the benefit to the NLP group to counterbalance the fact that they would, in the simplest case, relinquish claims to a share in the copyright.

Limitations are required on this obligation of the publisher to publish. Firstly, the enhanced dictionary would have to be of adequate quality, and to this end, the publisher must acquire the expertise to assess the quality. This will demand some investment in NLP expertise on the part of the publisher, but then, the publisher should not expect to reap the
fruits of the new market without any investment.

Secondly, the publisher will want to strike different deals for the dictionary with different customers: one would not expect terms for Microsoft to be as for a University group or start-up company. The terms of the agree