/*
 * created by yzh 2004.5.12
 * Please keep this author notice when quoting this code. The code is open source
 * and may be used without restriction. You are also welcome to use the author's
 * JS implementation of dynamically draggable tables.
 * Chinese word segmentation code.
 * This code summarizes the author's many years of experience; VB and PB versions
 * were published previously.
 */
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Locale;
import java.util.TreeMap;
import java.util.TreeSet;

public class ChineseSegmenter {
    private static ChineseSegmenter segmenter = null;

    // private Hashtable zhwords;
    private TreeMap<String, String> zhwords;
    private TreeSet<String> cforeign, cnumbers;

    // Char form
    public final static int TRAD = 0;
    public final static int SIMP = 1;
    public final static int BOTH = 2;

    // Char form is TRAD, SIMP or BOTH
    private ChineseSegmenter(int charform, boolean loadwordfile) {
        cforeign = new TreeSet<String>();
        cnumbers = new TreeSet<String>();

        if (charform == SIMP) {
            loadset(cnumbers, "data/snumbers_u8.txt");
            loadset(cforeign, "data/sforeign_u8.txt");
        } else if (charform == TRAD) {
            loadset(cnumbers, "data/tnumbers_u8.txt");
            loadset(cforeign, "data/tforeign_u8.txt");
        } else { // BOTH
            loadset(cnumbers, "data/snumbers_u8.txt");
            loadset(cforeign, "data/sforeign_u8.txt");
            loadset(cnumbers, "data/tnumbers_u8.txt");
            loadset(cforeign, "data/tforeign_u8.txt");
        }

        // zhwords = new Hashtable();
        zhwords = new TreeMap<String, String>();

        if (!loadwordfile) {
            return;
        }

        String newword = null;
        try {
            InputStream worddata = null;
            if (charform == SIMP) {
                worddata = getClass().getResourceAsStream("simplexu8.txt");
            } else if (charform == TRAD) {
                worddata = getClass().getResourceAsStream("tradlexu8.txt");
            } else if (charform == BOTH) {
                worddata = getClass().getResourceAsStream("bothlexu8.txt");
            }
            BufferedReader in = new BufferedReader(new InputStreamReader(worddata, "UTF8"));
            while ((newword = in.readLine()) != null) {
                if ((newword.indexOf("#") == -1) && (newword.length() < ...
                // [Text lost in extraction. Missing here: the rest of this condition,
                //  the body of the word-loading loop, the end of the constructor, and
                //  the opening of the helper that reads `sourcefile` line by line into
                //  `targetset` (called above as loadset). The comparison operators
                //  around the gap are inferred. The text resumes inside that helper's
                //  read loop.]
                ... > -1) || (dataline.length() == 0)) {
                    continue;
                }
                targetset.add(dataline.intern());
            }
            in.close();
        } catch (Exception e) {
            System.err.println("Exception loading data file" + sourcefile + " " + e);
            e.printStackTrace();
        }
    }

    // True if every character of testword is in the number-character set.
    public boolean isNumber(String testword) {
        boolean result = true;
        for (int i = 0; i < testword.length(); i++) {
            if (cnumbers.contains(testword.substring(i, i + 1).intern()) == false) {
                result = false;
                break;
            }
        }
        return result;
    }

    // True if every character of testword is in the foreign-name character set.
    public boolean isAllForeign(String testword) {
        boolean result = true;
        for (int i = 0; i < testword.length(); i++) {
            if (cforeign.contains(testword.substring(i, i + 1).intern()) == false) {
                result = false;
                break;
            }
        }
        return result;
    }

    // True if testword contains no character from the CJK Unified Ideographs block.
    public boolean isNotCJK(String testword) {
        boolean result = true;
        for (int i = 0; i < testword.length(); i++) {
            if (Character.UnicodeBlock.of(testword.charAt(i)) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                result = false;
                break;
            }
        }
        return result;
    }

    public String segmentLine(String cline, String separator) {
        StringBuffer currentword = new StringBuffer();
        StringBuffer outline = new StringBuffer();
        int i, clength;
        char currentchar;
        // separator = " ";

        clength = cline.length();
        for (i = 0; i < ...
            // [Text lost in extraction. Missing here: the rest of the loop header,
            //  the assignment of currentchar, and the tests that lead into the
            //  branch below. The text resumes inside a condition.]
            ... > 0 && (Character.isWhitespace(cline.charAt(i - 1)) == false)) {
                outline.append(separator);
            }
            currentword.append(currentchar);
        } else if (zhwords.containsKey(new String(currentword.toString() + currentchar).intern()) == true
                && ((String) zhwords.get(new String(currentword.toString() + currentchar).intern())).equals("1") == true) {
            // word is in lexicon
            currentword.append(currentchar);
        } else if (isAllForeign(currentword.toString())
                && cforeign.contains(new String(new char[] { currentchar }).intern())
                && i + 2 < clength
                && (zhwords.containsKey(cline.substring(i, i + 2).intern()) == false)) {
            // Possible a transliteration of a foreign name
            currentword.append(currentchar);
        } else if (isNumber(currentword.toString())
                && cnumbers.contains(new String(new char[] { currentchar }).intern())
                /*
                 * && (i + 2 < clength) &&
                 * (zhwords.containsKey(cline.substring(i, i+2).intern()) ==
                 * false)
                 */) {
            // Put all consecutive number characters together
            currentword.append(currentchar);
        } else if (zhwords.containsKey(new String(currentword.toString() + currentchar).intern())
                && ((String) zhwords.get(new String(currentword.toString() + currentchar).intern())).equals("2") == true
                && i + 1 < ...
            // [Text lost in extraction. Missing here: the rest of this condition,
            //  the remaining branches for CJK characters, and the start of the
            //  branch below. The text resumes at the end of a condition.]
            ... > 0) {
                outline.append(currentword.toString());
                if (Character.isWhitespace(currentchar) == false) {
                    outline.append(separator);
                }
                currentword.setLength(0);
            }
            outline.append(currentchar);
        // [Closing braces of the enclosing blocks are not recoverable from the extract.]
        outline.append(currentword.toString());
        retu... // [The extracted source is truncated at this point.]
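The listing above grows `currentword` one character at a time for as long as the lexicon still supports the longer string. As a rough illustration of dictionary-driven segmentation, here is a minimal sketch of forward maximum matching, a related but simpler technique than the one in the listing. The class name `MaxMatchSketch`, the `maxLen` parameter, and the toy lexicon are assumptions made for this example; none of them come from the original source, which loads its lexicon from resource files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of the original ChineseSegmenter source.
class MaxMatchSketch {

    // Forward maximum matching: at each position take the longest lexicon
    // entry (at most maxLen chars) that starts there; if none matches,
    // emit a single character and move on.
    static String segment(String line, Set<String> lexicon, int maxLen, String separator) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            int end = Math.min(line.length(), i + maxLen);
            // Shrink the candidate until it is in the lexicon or is one char long.
            while (end > i + 1 && !lexicon.contains(line.substring(i, end))) {
                end--;
            }
            words.add(line.substring(i, end));
            i = end;
        }
        return String.join(separator, words);
    }

    public static void main(String[] args) {
        // Assumed toy lexicon, for illustration only.
        Set<String> lexicon = new HashSet<>(Arrays.asList("中文", "分词", "代码"));
        System.out.println(segment("中文分词代码", lexicon, 4, " ")); // prints: 中文 分词 代码
    }
}
```

Unlike the listing, this sketch has no special handling for numbers, foreign-name characters, or whitespace, and it indexes by UTF-16 `char`, so characters outside the Basic Multilingual Plane would need extra care.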