Business models for Dictionaries and NLP

Adam Kilgarriff

November

Abstract

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. To date, there has been only very limited collaboration. The two reasons for this are (a) dictionary publishers' concerns regarding intellectual property, and (b) the different languages that lexicographers and NLP researchers speak. In this paper I present a model for overcoming the first and suggest some strategies for the second.

Introduction

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. There is ample motivation for NLP to court dictionary publishers, and vice versa.

To date, NLP research has used dictionaries and dictionaries have used NLP, but the two processes have not been brought together. The NLP that has gone into making dictionaries has not been the NLP that was based on an earlier version of the same dictionary, or indeed of any published dictionaries. The NLP has either been dictionary-independent if brought in from outside, or developed in house. While NLP groups have made innumerable corrections, improvements, additions and extensions to the dictionary databases they have licensed from publishers, these changes have never been used by the publisher to improve the next printing or the next edition of the dictionary.

There have been two reasons for this failure of synergy. Firstly, dictionary publishers are concerned about a number of questions relating to intellectual property. Secondly, lexicographers and NLP researchers speak very different languages, so it is not straightforward to develop the atmosphere of trust in which the lexicographers fully understand what the NLP research has done or believe that the NLP "enhancements" to their dictionary are useful to them.

Footnote: NLP (Natural Language Processing) is here taken to describe all those technologies that manipulate human language inputs by computer, examples being automatic part-of-speech tagging, parsing, concordancing, machine translation, information extraction and text generation. "Language engineering" and "computational linguistics" are near-synonyms.
Footnote: Sophisticated NLP software that has been used for English includes taggers (e.g. COBUILD used ENGCG from Helsinki; Longman, OUP and Chambers Harrap used the BNC, which was POS-tagged by CLAWS from Lancaster) and parsers (the experimental HECTOR project used the Fidditch parser (Hindle)). These programs do not use publishers' dictionaries. Nor do the statistics-based collocation-finders that have been quite widely used. An exception, among English dictionaries, is CIDE, where the development of tagging software was closely integrated with the dictionary production process.

In the remainder of the paper I first describe what the two sides have to gain from each other, then the history of the issue, and then present a mutually beneficial business model. I then comment on the language problem.

The paper may seem to replicate the Special Issue of this journal edited by Bran Boguraev (Boguraev). Indeed, the overall motivation is the same: to promote mutual understanding and collaboration between the two fields. But whereas the Special Issue set out to describe what benefits there could be, in this paper this is all but assumed, with the following two sections summarising the main themes. Rather, we look at the institutional and business reasons why more collaboration has not happened and consider how they can be overcome.

Dictionaries for NLP

NLP has cause to court the publishers because it needs lexical information for almost everything it does. Lexical information tends to be expensive and difficult to produce, so licensing it from those who have already invested in it (the dictionary publishers) makes good sense. This applies to the full range of lexical information: orthography, phonology, morphology, syntax, semantics, pragmatics, translations, domain, thesauri, collocation. All of these have, over the last twenty-five years, been extracted from electronic versions of dictionaries (on a typesetter's tape, or CD-ROM, or other electronic medium) and used in NLP, for research in all cases, for products in most. NLP has found dictionaries valuable for all these purposes in spite of the overheads of finding the relevant information in what is usually a poorly-structured input, and the inevitable errors, inconsistencies and omissions. They would be much more enthusiastic if it were not for these failings.
NLP for better dictionaries

Dictionary publishers have cause to court NLP for a number of reasons. The most obvious is that there is money to be made from licensing arrangements. Large sums frequently change hands. This has been the dominant motivation for dictionary publishers approaching NLP groups.

A distinction must be made at this point between dictionary publishers and lexicographers. This motivation has been vivid in the minds of the publishers, but of little interest to the lexicographers.

The more intriguing reasons relate to the potential NLP has to improve dictionary quality. Of these, two varieties can be distinguished: those that act on a dictionary database, and those that act on text corpora.

Benefits of NLP applied to dictionary databases

The benefits are closely related to the benefits to the dictionary production process of simply having the dictionary in a database at all. Many errors can be detected automatically, and consistency of all manner of features checked. Superficially, these may seem to be simple database functions. However, many features can only be checked for consistency if the data model underlying the database is based on an understanding of lexicography. The skills required to model the data lie at the intersection of linguistics, lexicography and computation, viz., in NLP.

Moreover, there are always further checks and analyses which go beyond anything the database will provide directly. For example, one might want to check to what extent the words in the same thesaural category get corresponding definitions. Answering this will require NLP expertise, as well as both a well-structured database (so the relevant data is accessible) and lexicographic expertise (to work out where the non-correspondences are justified).

A common situation is that a good database model is available but the dictionary database does not fit it. What is then required is a process sometimes called "up-translation", of carrying the data over from the existing format into the new one. Standardly, much of this will be doable by simply translating mark-up, but in many cases the mapping will be one-to-many, or will be context-dependent, or distinctions will be required for the new data model which are implicit in the text or font-changes of the existing dictionary. In most such cases, a
very high percentage of the up-translation can be done automatically, but it is an NLP task to do it.

NLP applied to text corpora

The arrival of electronic text corpora is causing a revolution in lexicography. Previously, the primary source of evidence for a word's behaviour was the lexicographer's intuition. Now, wherever publishers have been able to assemble a large text corpus and have concordancing technology, it is the corpus. There is widespread agreement that this is a huge boon for lexicography.

Yet it clearly introduces many new challenges. Organising the lexicographer's working environment so that s/he has instant access to a concordance for the word s/he is working on is not easy, particularly where lexicographers work at home. Even the simplest concordancing program requires some minimal NLP input. Minimally, it needs to know which characters are punctuation characters, so should be ignored when defining what a word is: if a space is treated as the only word-delimiter, the search term "butter" will not match a corpus object in which "butter" has a punctuation mark attached. In English, the characters […] and […] present problems at this level. Morphological analysis will make the corpus more useful, as then all forms of a word can be found with a simple search. For English, with its simple morphology, it is easy enough to search for all four forms separately, or with a regular expression. For Finnish, it is not. The corpus will be more useful again if it is part-of-speech tagged, word-sense disambiguated or parsed (see e.g. Tapanainen and Järvinen). Each level of annotation allows the lexicographer more control in searching the corpus for the linguistically interesting phenomena, so that, when looking through the concordance, noise is reduced and duplicating patterns do not need to be scanned exhaustively.

Beyond the concordance, there are corpus statistics and machine learning. Statistically organised collocation lists have proved their worth for all the dictionary projects that have had access to very large corpora. Church and Hanks, the paper that opened the current debate in NLP about collocation statistics, is a noteworthy example of lexicography-NLP collaboration. Acquiring lexical information from corpora is currently one of the most dynamic areas in NLP. My purpose in saying this is not to put fear of redundancy in the
hearts of lexicographers but to indicate how much more satisfactory their work is to become when the tools at their disposal are so much more powerful. The techniques tend to find many plausible hypotheses in a corpus, but are unable to sort the wheat from the chaff or, evidently, to assign meanings to the patterns they find. The lexicographer's task remains the same at heart but with less drudge.

History

The recent history of dictionary-NLP interaction begins with Amsler and Michiels. Each took a typesetter's tape for a dictionary. Amsler's agenda was to see whether the dictionary could be used as a source of general knowledge of the everyday world (knowledge that, for example, alsatians are dogs) for use with artificial intelligence programs. Michiels explored how the more specifically linguistic information might be used for NLP. Both of these agendas were pursued at length in the course of the […]s, notably in the EU ACQUILEX project, at the Computer Research Laboratory at New Mexico State University, and at IBM in Yorktown Heights. Boguraev and Briscoe represents the activity at its high water mark, with Byrd et al. demonstrating the ascent, Wilks, Slator and Guthrie reviewing the whole, and Ide and Veronis offering a post mortem: their subtitle is "Have we wasted our time?"

As a research topic, the use of dictionary databases has now run its course. This has a number of interpretations. The most positive is that the NLP world has now developed a fair understanding of what information dictionary databases contain and what is involved in extracting it for NLP use, so that the topic has shifted from "research" to "development". This would offer a perspective on why almost all the research was on English dictionaries: they were used for testing out methods, and the methods could then be used for dictionaries for other languages.

The next interpretation is simply that the research fashion has changed, particularly to language corpora, and methods for extracting lexical information from them.

Then there is the interpretation that motivates this paper. In the academic sector, work done to enhance a dictionary is usually work wasted if others are not permitted to use the enhanced resource. Publishers, anxious about intellectual property, have frequently not permitted it.
A number of NLP workers have explicitly chosen not to do any further work on dictionaries (except those in the public domain) for this reason: they suspect that any such work, however good, will be destined for oblivion.

In the event, the academic world's own product arrived. As of […], WordNet has been available free, over the web and without constraint. WordNet has been the lexical resource of the […]s, and various people have argued that WordNet senses are the de facto standard for NLP. WordNet was produced by linguists and psychologists, according to a psycholinguistic agenda, and its suitability for NLP research remains a lively topic of debate. But, as against the practicalities of getting hold of it, these purist concerns have carried little weight and everyone uses WordNet now.

WordNet addressed semantics; for syntax, the NLP community invested in COMLEX and, more recently, NOMLEX (Grishman, MacLeod and Meyers; MacLeod et al.). COMLEX and NOMLEX are designed from the outset to meet the needs of the NLP community and there is currently work in progress on linking the COMLEX syntax patterns to the WordNet word senses.

Footnote: There are alternatives to COMLEX. Other lexicons such as XTAG (Group) and ANLT (Carroll and Grover) have also been developed by the NLP community. The ANLT lexicon was developed in the mid-eighties; it included material from a dictionary and copyright issues prevented it being used widely for NLP until the mid-nineties.

The […] Euralex conference in Liège may have inaugurated a new phase in the debate, with the impetus this time coming from lexicography. Whereas earlier EURALEXes have been resistant to computation or viewed it warily, at Liège this author's perception was that it was universally accepted that NLP had a role to play in dictionary production. The question was no longer whether to use it, but how to use it well. Except in the "dictionary use" sessions, it was hard to find papers which did not assume the availability of corpora and concordancing software or more. One paper of particular salience to the argument was jointly presented by Ulrich Heid, an NLP academic, and Vincent Docherty, a dictionary publisher (Docherty and Heid): here, at last, the state of the art in NLP was being used to provide inputs to lexicographers for compiling a better dictionary for people.

Publishers' anxieties
The overt reason why publishers' dictionaries have not been more widely used is copyright. The publishers' argument is simple and direct: the publisher's trade is in intellectual property, so it is not reasonable to expect them to give the dictionary away or risk it falling into the public domain.

There are several threads to the case. The first is, simply, piracy. If a dictionary starts being copied freely or, worse still, starts being copied for a fee but where the fee does not go to the legal copyright owner, the publisher is the loser. The publisher wants to avoid this happening above all else. A licence agreement with an NLP research group provides an avenue by which a dictionary may find its way to being illegally copied and re-copied.

The other threads relate to the possible re-use or re-sale by the publisher of versions of their dictionary which have been upgraded in some way by an NLP research group.

The issues here concern contamination. Where a dictionary publisher is the sole owner of the copyright for a dictionary, it would like to keep it that way. There will be no other parties to consider in future negotiations, and all the profits will come to the publisher. If a dictionary database has been enhanced by an NLP group, then, prima facie, a share of the intellectual property belongs to the NLP group. It is not straightforward to arrive at a model for how shares of the intellectual property should be allocated. The starting point would usually be the quantity of labour that each party had put in, or the fraction of the text that each produced. The latter cannot be applied at all: whatever the NLP enhancements may be, it certainly will not make sense to quantify them as a fraction of total text-length. The former is also of little use: the NLP laboratory will probably have been using various pieces of software and expertise to aid with the enhancement, which will have been undertaken for internal use, in other NLP applications or research projects, so the re-sale or re-licensing of the upgraded dictionary would be a side-effect.

The publisher may well have doubts about the accuracy and consistency of the enhanced dictionary, but may not be well-placed to evaluate, as this in itself may require NLP techniques and an understanding of the issues which are likely to be critical for NLP applications.
The enhanced dictionary may have potential uses for print products, for electronic products for the consumer market, or for licensing on for NLP use. A question in relation to the first two is addressed in section […]. Regarding the third, the potential complexity of the arrangements is forbidding. Where a dictionary is licensed for NLP use, the licence may be, broadly, for research use or for product development. Research use is straightforward, provided the demarcation in the NLP group between research and development is clearly drawn. Where it is for product development, the publisher may receive a licence fee or a royalty for each product "containing" the dictionary, or some of each. "Containing" in inverted commas because the dictionary will not generally exist in the product in any recognisable form, but as one of a number of inputs to the system's lexicon, which may well be compiled and unreadable. That product may itself be a consumer product, or may be a component of some larger application, so there may be many steps, each potentially involving licences and royalties, between the dictionary publisher and the end-product. One has some sympathy with the publisher not wishing to have its negotiating hand constrained by undertakings to NLP groups.

Footnote: Our hosts' longstanding engagement with NLP may well have played a role.

Responses

Piracy is not a strong argument. There are many ways in which a dictionary may be pirated or may enter the public domain. It may be re-keyed; it may be OCR'd from the printed version; the contents of CD-ROMs may be unscrambled; hackers may hack into the publisher's system; tapes may be stolen. It will of course be the duty of licence-holders to ensure that people who should not have access to their version of the dictionary do not, and that it does not get copied outside the confines of the laboratory, and if they fail in this duty they will be culpable. But they are no more likely to fail in the duty than assorted others of the publisher's employees and agents who have access to the data.

Regarding copyright on an enhanced dictionary: firstly, many research groups in Universities will not want to claim a share of copyright. The original licence allowing them to use the dictionary could then contain a clause stipulating that lexicons for which the dictionary has
been an input can only be passed on to third parties by the publisher, and all the copyright in such lexicons shall be vested in the publisher.

Where NLP groups are not happy to hand over copyright in this way, there will be potentially complex negotiations required, as there will when a lexicon is a component of some other application at one or more further removes from the end product. These sorts of considerations are, however, becoming commonplace in the software and multimedia industries, so the publisher's concerns reduce to the following: are the gains to be had from synergy with NLP sufficient to merit the effort? Is the publisher willing to take on the challenge of the complex negotiations?

The Model

The primary concern of the publisher is to retain control of the resource and to maximise the flow of income it gives rise to. The primary concern of many NLP researchers is that their work is available for others to use and extend. Reputation and status are dependent on being cited, in the academic world. If people use your resource they will cite you. Licence income is a lesser consideration: firstly, because it is not expected; secondly, because it would probably go to the institution rather than the individual; and thirdly, simply because it is not the currency of academic status.

The appropriate model is for the publisher to encourage NLP researchers to use the dictionary, on the basis that any enhancements to the dictionary will be returned to the publisher for it to use itself if it so wishes (for dictionary revision and other improved consumer products) and to market and generally make available to other NLP groups. The agreement would put the publisher under an obligation to make the enhanced dictionary available, under similar terms again, and for a fee that would not be prohibitive, and this would be the benefit to the NLP group to counterbalance the fact that they would, in the simplest case, relinquish claims to a share in the copyright.

Limitations are required on this obligation of the publisher to publish. Firstly, the enhanced dictionary would have to be of adequate quality, and to this end, the publisher must acquire the expertise to assess the quality. This will demand some investment in NLP expertise on the part of the publisher, but then, the publisher should not expect to reap the
fruits of the new market without any investment.

Secondly, the publisher will want to strike different deals for the dictionary with different customers: one would not expect terms for Microsoft to be as for a University group or start-up company. The terms of the agree
