Chinese Segmentation and New Word Detection using Conditional Random Fields (译文)

Translator: Zhao Sendong (赵森栋)
Translated title: 中文分词和应用条件随机域的新词识别
Original title: Chinese Segmentation and New Word Detection using Conditional Random Fields
Source: COLING 2004

Chinese Segmentation and New Word Detection using Conditional Random Fields

Fuchun Peng, Fangfang Feng, Andrew McCallum
Computer Science Department, University of Massachusetts Amherst
140 Governors Drive, Amherst, MA, U.S.A. 01003
{fuchun, feng, mccallum}@cs.umass.edu

Abstract

Chinese word segmentation is a difficult, important, and widely studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We also present a probabilistic new word detection method, which further improves performance. Our system is evaluated on four datasets used in a recent comprehensive Chinese word segmentation competition. State-of-the-art performance is obtained.

1 Introduction

Unlike English and other western languages, many Asian languages, such as Chinese, Japanese, and Thai, do not delimit words by white space. Word segmentation is therefore a key precursor for language processing tasks in these languages. For Chinese, there has been significant research on finding word boundaries in unsegmented sequences (see (Sproat and Shih, 2002) for a review). Unfortunately, building a Chinese word segmentation system is complicated by the fact that there is no standard definition of word boundaries in Chinese.

Approaches to Chinese segmentation fall roughly into two categories: heuristic dictionary-based methods and statistical machine learning methods. In dictionary-based methods, a predefined dictionary is used along with hand-generated rules for segmenting the input sequence (Wu, 1999). However, these approaches have been limited by the impossibility of creating a lexicon that includes all possible Chinese words and by the lack of robust statistical inference in the rules. Machine learning approaches are more desirable and have been successful in both unsupervised learning (Peng and Schuurmans, 2001) and supervised learning (Teahan et al., 2000).

Many current approaches suffer from either a lack of exact inference over sequences or difficulty in incorporating domain knowledge effectively into segmentation. Domain knowledge is either not used, used in a limited way, or used in a complicated way spread across different components. For example, the N-gram generative language modeling approach of Teahan et al. (2000) does not use domain knowledge. Gao et al. (2003) use class-based language models for word segmentation, where some word category information can be incorporated. Zhang et al. (2003) use a hierarchical hidden Markov model to incorporate lexical knowledge. A recent advance in this area is Xue (2003), in which the author uses a sliding-window maximum entropy classifier to tag Chinese characters with one of four position tags, and then converts these tags into a segmentation using rules. Maximum entropy models give tremendous flexibility to incorporate arbitrary features. However, a traditional maximum entropy tagger, as used in Xue (2003), labels characters without considering the dependencies among the predicted segmentation labels that are inherent in the state transitions of finite-state sequence models.
Linear-chain conditional random fields (CRFs) (Lafferty et al., 2001) are models that address both issues above. Unlike heuristic methods, they are principled probabilistic finite state models on which exact inference over sequences can be efficiently performed. Unlike generative N-gram or hidden Markov models, they have the ability to straightforwardly combine rich domain knowledge, for example, in this paper, in the form of multiple readily available lexicons. Furthermore, they are discriminatively trained, and are often more accurate than generative models, even with the same features. In their most general form, CRFs are arbitrary undirected graphical models trained to maximize the conditional probability of the desired outputs given the corresponding inputs. In the linear-chain special case we use here, they can be roughly understood as discriminatively trained hidden Markov models with next-state transition functions represented by exponential models (as in maximum entropy classifiers), and with great flexibility to view the observation sequence in terms of arbitrary, overlapping features, with long-range dependencies, and at multiple levels of granularity. These beneficial properties suggest that CRFs are a promising approach for Chinese word segmentation.

New word detection is one of the most important problems in Chinese information processing. Many machine learning approaches have been proposed (Chen and Bai, 1998; Wu and Jiang, 2000; Nie et al., 1995). New word detection is normally considered a separate process from segmentation. However, integrating them would benefit both segmentation and new word detection. CRFs provide a convenient framework for doing this. They can produce not only a segmentation, but also confidence in local segmentation decisions, which can be used to find new, unfamiliar character sequences surrounded by high-confidence segmentations. Thus, our new word detection is not a stand-alone process, but an integral part of segmentation. Newly detected words are re-incorporated into our word lexicon and used to improve segmentation. Improved segmentation can then be further used to improve new word detection.

Comparing Chinese word segmentation accuracy across systems can be difficult because many research papers use different datasets and different ground rules. Some published results claim 98% or 99% segmentation precision and recall, but these either count only the words that occur in the lexicon, or use unrealistically simple data, lexicons with extremely small (or artificially non-existent) out-of-vocabulary rates, short sentences, or many numbers. A recent Chinese word segmentation competition (Sproat and Emerson, 2003) has made comparisons easier. The competition provided four datasets with significantly different segmentation guidelines and consistent train-test splits. The performance of participating systems varies significantly across the different datasets. Our system achieves top performance in two of the runs and state-of-the-art performance on average. This indicates that CRFs are a viable model for robust Chinese word segmentation.

2 Conditional Random Fields

Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2001). A common special-case graph structure is a linear chain, which corresponds to a finite state machine and is suitable for sequence labeling.
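Concretely, segmentation is cast as labeling each character of the input. The sketch below is not from the paper; it is a minimal illustration of this character-labeling view, using the START/NOTSTART label convention described next (START marks a word-initial character, NOTSTART a word-interior one).

```python
# Minimal sketch (not from the paper): segmentation as per-character sequence labeling.
# START marks a character that begins a word; NOTSTART marks a word-interior character.

def words_to_labels(words):
    """Convert a segmented sentence (a list of words) into per-character labels."""
    labels = []
    for word in words:
        labels.append("START")
        labels.extend("NOTSTART" for _ in word[1:])
    return labels

def labels_to_words(chars, labels):
    """Recover a segmentation from characters and their predicted labels."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "START" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

if __name__ == "__main__":
    segmented = ["我们", "喜欢", "自然", "语言", "处理"]  # an example segmented sentence
    chars = [c for w in segmented for c in w]
    labels = words_to_labels(segmented)
    assert labels_to_words(chars, labels) == segmented
    print(list(zip(chars, labels)))
```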
A linear-chain CRF with parameters Λ = {λ_1, λ_2, ...} defines a conditional probability for a state (label) sequence y = y_1 ... y_T (for example, labels indicating where words start or have their interior) given an input sequence x = x_1 ... x_T (for example, the characters of a Chinese sentence) to be

    P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)    (1)

where Z_x is the per-input normalization that makes the probabilities of all state sequences sum to one; f_k(y_{t-1}, y_t, x, t) is a feature function, which is often binary-valued but can be real-valued; and λ_k is a learned weight associated with feature f_k. The feature functions can measure any aspect of a state transition, y_{t-1} → y_t, and the entire observation sequence, x, centered at the current time step, t. For example, one feature function might have value 1 when y_{t-1} is the state START, y_t is the state NOTSTART, and x_t is a word appearing in a lexicon of people's first names. Large positive values for λ_k indicate a preference for such an event; large negative values make the event unlikely.

The most probable label sequence for an input x,

    y^* = \operatorname{argmax}_{y} P_\Lambda(y \mid x),

can be efficiently determined using the Viterbi algorithm (Rabiner, 1990). An N-best list of labeling sequences can also be obtained using a modified Viterbi algorithm and A* search (Schwartz and Chow, 1990).

The parameters can be estimated by maximum likelihood: maximizing the conditional probability of a set of label sequences, each given their corresponding input sequences. The log-likelihood of a training set {(x^{(j)}, y^{(j)})} is written

    L_\Lambda = \sum_{j} \log P_\Lambda(y^{(j)} \mid x^{(j)}).

Traditional maximum entropy learning algorithms, such as GIS and IIS (della Pietra et al., 1995), can be used to train CRFs. However, our implementation uses a quasi-Newton gradient climber, BFGS, for optimization, which has been shown to converge much faster (Malouf, 2002; Sha and Pereira, 2003). The gradient of the likelihood is the difference between the observed feature counts and the feature counts expected under the model:

    \frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{j} \Big( C_k(y^{(j)}, x^{(j)}) - \sum_{y} P_\Lambda(y \mid x^{(j)}) \, C_k(y, x^{(j)}) \Big), \quad \text{where } C_k(y, x) = \sum_{t} f_k(y_{t-1}, y_t, x, t).

CRFs share many of the advantageous properties of standard maximum entropy classifiers, including their convex likelihood function, which guarantees that the learning procedure converges to the global maximum.

2.1 Regularization in CRFs

To avoid over-fitting, the log-likelihood is usually penalized by some prior distribution over the parameters. A commonly used prior is a zero-mean Gaussian. With a Gaussian prior, the log-likelihood is penalized as follows:

    L_\Lambda = \sum_{j} \log P_\Lambda(y^{(j)} \mid x^{(j)}) - \sum_{k} \frac{\lambda_k^2}{2 \sigma_k^2},

where σ_k^2 is the variance for feature dimension k. The variance can be feature dependent; however, for simplicity, a constant variance is often used for all features. We experiment with an alternate version of the Gaussian prior in which the variance is feature dependent. We bin features by frequency in the training set and let the features in the same bin share the same variance. The discounted value is set to \lceil c_k / M \rceil, where c_k is the count of feature f_k, M is the bin size set by held-out validation, and \lceil \cdot \rceil is the ceiling function. See Peng and McCallum (2004) for more details and further experiments.

2.2 State transition features

Varying state-transition structures with different Markov orders can be specified by different CRF feature functions, as determined by the number of output labels y examined together in a feature function. We define four different state transition feature functions corresponding to different Markov orders. Higher-order features capture more long-range dependencies, but also cause more data sparseness problems and require more memory for training. The best Markov order for a particular application can be selected by held-out cross-validation.

1. First-order: Here the inputs are examined in the context of the current state only. The feature functions are represented as f(y_t, x). There are no separate parameters for state transitions.
2. First-order+transitions: Here we add parameters corresponding to state transitions. The feature functions used are f(y_t, x) and f(y_{t-1}, y_t).

3. Second-order: Here inputs are examined in the context of the current and previous states. Feature functions are represented as f(y_{t-1}, y_t, x).

4. Third-order: Here inputs are examined in the context of the current and two previous states. Feature functions are represented as f(y_{t-2}, y_{t-1}, y_t, x).
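To make the role of these state-transition structures concrete, the following is a minimal sketch (not the paper's implementation) of Viterbi decoding for a first-order-with-transitions model. It assumes the per-position scores Σ_k λ_k f_k(y_{t-1}, y_t, x, t) from equation (1) have already been computed into a hypothetical table `score[t][prev][cur]`; since the exponential in equation (1) is monotonic and Z_x does not depend on y, maximizing the summed scores yields the most probable label sequence.

```python
import math

# Minimal sketch (not the paper's code): Viterbi decoding for a linear-chain CRF.
# score[t][prev][cur] is assumed to hold sum_k lambda_k * f_k(y_{t-1}=prev, y_t=cur, x, t);
# at t == 0 the prev index is ignored (row 0 is used as a dummy start state).

def viterbi(score, num_labels):
    """Return the highest-scoring label sequence under the given score table."""
    T = len(score)
    delta = [[-math.inf] * num_labels for _ in range(T)]  # best score of a path ending in label y at time t
    backptr = [[0] * num_labels for _ in range(T)]

    for y in range(num_labels):  # initialization at t = 0
        delta[0][y] = score[0][0][y]

    for t in range(1, T):  # recursion over positions
        for y in range(num_labels):
            for y_prev in range(num_labels):
                s = delta[t - 1][y_prev] + score[t][y_prev][y]
                if s > delta[t][y]:
                    delta[t][y] = s
                    backptr[t][y] = y_prev

    best = max(range(num_labels), key=lambda y: delta[-1][y])  # best final label
    path = [best]
    for t in range(T - 1, 0, -1):  # follow back-pointers
        path.append(backptr[t][path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    # Toy example with 2 labels (0 = START, 1 = NOTSTART) and 3 positions.
    score = [
        [[5.0, 1.0], [5.0, 1.0]],  # t = 0 (prev index ignored)
        [[2.0, 3.0], [1.0, 0.5]],  # t = 1
        [[4.0, 0.0], [0.5, 2.0]],  # t = 2
    ]
    print(viterbi(score, num_labels=2))  # prints the best label sequence, e.g. [0, 0, 0]
```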
