ICTCLAS with nutch 0.7.1.


吴志敏
hi all,
  I ran into a big problem when I integrated ICTCLAS with Nutch 0.7.1.
I followed the page
"http://www.nutchhacks.com/ftopic391.php&highlight=chinese",
but when I ran ant on Nutch, I got a lot of errors (the build output is
at the end of this post).

I've modified the files in the org.apache.nutch.analysis directory. My
question is: do I also need to modify Lucene, and how should I deal with
these errors?

Any reply will be appreciated.


I have integrated Nutch with an intelligent Chinese Lexical Analysis
System, so Nutch can now segment Chinese words effectively.

Following is my solution:

1. Modify NutchAnalysis.jj:

-| <#CJK: // non-alphabets
- [
- "\u3040"-"\u318f",
- "\u3300"-"\u337f",
- "\u3400"-"\u3d2d",
- "\u4e00"-"\u9fff",
- "\uf900"-"\ufaff"
- ]
- >

+| <#OTHER_CJK: //japanese and korean characters
+ [
+ "\u3040"-"\u318f",
+ "\u3300"-"\u337f",
+ "\u3400"-"\u3d2d",
+ "\uf900"-"\ufaff"
+ ]
+ >
+| <#CHINESE: //chinese characters
+ [
+ "\u4e00"-"\u9fff"
+ ]
+ >

-| <SIGRAM: <CJK> >

+| <SIGRAM: <OTHER_CJK> >
+| <CNWORD: (<CHINESE>)+ > //chinese words

- ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
+ ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)

This segments Chinese characters intelligently, while Japanese and
Korean characters keep the original single-gram segmentation.
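The grammar change above hinges entirely on Unicode ranges: '\u4e00'-'\u9fff' becomes the <CHINESE> token while the remaining CJK ranges stay in <OTHER_CJK>. A minimal sketch of that classification, just to make the ranges concrete (the class and method names here are illustrative, not part of Nutch):

```java
// Illustrative sketch of the character classification behind the
// modified grammar. CjkClassifier/classify are hypothetical names,
// not Nutch or JavaCC APIs.
public class CjkClassifier {
    public enum CjkClass { CHINESE, OTHER_CJK, OTHER }

    // Mirrors the ranges declared in the modified NutchAnalysis.jj
    public static CjkClass classify(char c) {
        if (c >= '\u4e00' && c <= '\u9fff')
            return CjkClass.CHINESE;              // <CHINESE> token
        if ((c >= '\u3040' && c <= '\u318f') ||
            (c >= '\u3300' && c <= '\u337f') ||
            (c >= '\u3400' && c <= '\u3d2d') ||
            (c >= '\uf900' && c <= '\ufaff'))
            return CjkClass.OTHER_CJK;            // <OTHER_CJK> token
        return CjkClass.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify('中'));   // CHINESE
        System.out.println(classify('の'));   // OTHER_CJK (hiragana)
        System.out.println(classify('a'));    // OTHER
    }
}
```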

2. Modify NutchDocumentTokenizer.java:

-case EOF: case WORD: case ACRONYM: case SIGRAM:
+case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:

3. Modify FastCharStream.java:

+private final static caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();
+private final int IO_BUFFER_SIZE = 2048;

-buffer = new char[2048];
+buffer = new char[IO_BUFFER_SIZE];

-int charsRead = input.read(buffer, newPosition, buffer.length - newPosition);
+int charsRead = readString(newPosition);

+// do intelligent Chinese word segmentation
+private int readString(int newPosition) throws java.io.IOException {
+  char[] tempBuffer = new char[IO_BUFFER_SIZE / 2]; // read from io
+  char[] hzBuffer = new char[IO_BUFFER_SIZE / 2];   // stores a run of Chinese characters
+  int len = input.read(tempBuffer, 0, IO_BUFFER_SIZE / 4);
+
+  int pos = -1; // position in buffer
+  if (len > 0) {
+    pos = 0;
+    int hzPos = 0; // position in hzBuffer
+    char c = ' ';
+    int value = -1;
+    for (int i = 0; i < len; i++) { // iterate over tempBuffer
+      hzPos = 0;
+      c = tempBuffer[i];
+      value = (int) c;
+      if (value < 19968 || value > 40959) { // non-Chinese character
+        buffer[pos + newPosition] = c;
+        pos++;
+      } else { // Chinese character, Unicode range '\u4e00'-'\u9fff'
+        hzBuffer[hzPos++] = ' ';
+        hzBuffer[hzPos++] = c;
+        i++;
+        while (i < len) {
+          c = tempBuffer[i];
+          value = (int) c;
+          // part of the Chinese character sequence: store it in hzBuffer
+          if (value >= 19968 && value <= 40959) {
+            hzBuffer[hzPos++] = c;
+            i++;
+          } else {
+            break; // have extracted a complete Chinese string
+          }
+        }
+        i--;
+        if (hzPos > 0) {
+          String str = new String(hzBuffer, 0, hzPos);
+          String str2 = spliter.segSentence(str); // perform Chinese word segmentation
+          if (str2 != null) {
+            while (str2.length() > buffer.length - newPosition) { // expand the buffer
+              char[] newBuffer = new char[buffer.length * 2];
+              System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+              buffer = newBuffer;
+            }
+            for (int j = 0; j < str2.length(); j++) {
+              buffer[pos + newPosition] = str2.charAt(j);
+              pos++;
+            }
+          } else { // segmentation failed: copy the raw run through
+            for (int j = 0; j < str.length(); j++) {
+              buffer[pos + newPosition] = str.charAt(j);
+              pos++;
+            }
+          }
+        }
+      }
+    }
+  }
+  return pos;
+}
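Stripped of the Nutch buffering details, the core of readString is: scan for maximal runs of characters in the Chinese range, hand each run to the segmenter, and copy everything else through unchanged. A simplified, self-contained sketch of that loop (Segmenter here is a hypothetical stand-in for caomo.ICTCLASCaller; the leading space before each run mirrors the hzBuffer behavior above):

```java
// Simplified sketch of the run-extraction loop in readString.
// Segmenter is a hypothetical stand-in for caomo.ICTCLASCaller.
public class ChineseRunExtractor {
    public interface Segmenter {
        String segSentence(String s); // returns words separated by spaces
    }

    // Copies non-Chinese characters through; each maximal run of
    // characters in '\u4e00'..'\u9fff' goes through the segmenter,
    // preceded by a space (as in the hzBuffer logic above).
    public static String process(String input, Segmenter seg) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c < '\u4e00' || c > '\u9fff') { // non-Chinese: pass through
                out.append(c);
                i++;
            } else {                            // collect a Chinese run
                int start = i;
                while (i < input.length()
                        && input.charAt(i) >= '\u4e00'
                        && input.charAt(i) <= '\u9fff')
                    i++;
                String run = input.substring(start, i);
                String segmented = seg.segSentence(run);
                out.append(' ').append(segmented != null ? segmented : run);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Identity segmenter, just to exercise the scanning logic.
        Segmenter identity = s -> s;
        System.out.println(process("abc中文def", identity)); // abc 中文def
    }
}
```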


I use ICTCLAS to perform the Chinese word segmentation. ICTCLAS doesn't
just perform simple bi-gram segmentation; it uses an approach based on a
multi-layer HMM, and its segmentation precision is 97.58%.
ICTCLAS is free for researchers; see:
http://www.nlp.org.cn/project/project.php?proj_id=6
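For contrast, plain bi-gram segmentation, the baseline approach that ICTCLAS improves on, simply emits every overlapping pair of adjacent characters with no linguistic knowledge. A quick sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of plain bi-gram segmentation: the naive baseline that
// ICTCLAS's HMM-based segmenter is an improvement over.
public class BigramSegmenter {
    public static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        if (s.length() == 1) {      // single character: emit as-is
            out.add(s);
            return out;
        }
        for (int i = 0; i + 1 < s.length(); i++)
            out.add(s.substring(i, i + 2)); // every overlapping pair
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("我是学生")); // [我是, 是学, 学生]
    }
}
```

Note how bi-grams cut across the real word boundary (我 / 是 / 学生), which is exactly the imprecision a dictionary- and HMM-based segmenter avoids.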

4. Modify Summarizer.java:

+//reset startOffset and endOffset of tokens
+private void resetTokenOffset(Token[] tokens, String text) {
+  String text3 = text.toLowerCase();
+  char[] textArray = text3.toCharArray();
+  int tokenStart = 0;
+  char[] tokenArray = null;
+  int j;
+  Token preToken = new Token(" ", 0, 1);
+  Token curToken = new Token(" ", 0, 1);
+  Token nextToken = null;
+  int startSearch = 0;
+  while (true) {
+    tokenArray = null;
+    for (int i = startSearch; i < textArray.length; i++) {
+      if (tokenStart == tokens.length)
+        break;
+      if (tokenArray == null) {
+        tokenArray = tokens[tokenStart].termText().toCharArray();
+        preToken = curToken;
+        curToken = tokens[tokenStart];
+        nextToken = null;
+      }
+      // deals with the following situation (common grams):
+      //   text:           about buaa a welcome from buaa president
+      //   token sequence: about buaa buaa-a a a-welcome welcome from buaa president
+      if ((preToken.termText().charAt(0) == curToken.termText().charAt(0)) &&
+          (preToken.termText().length() < curToken.termText().length())) {
+        if (curToken.termText().startsWith(preToken.termText() + "-")) { // buaa-a starts with buaa-
+          if (tokenStart + 1 < tokens.length) {
+            nextToken = tokens[tokenStart + 1];
+            if (curToken.termText().endsWith("-" + nextToken.termText())) { // meets buaa buaa-a a
+              int curTokenLength = curToken.endOffset() - curToken.startOffset();
+              curToken.setStartOffset(preToken.startOffset());
+              curToken.setEndOffset(preToken.startOffset() + curTokenLength);
+              tokenStart++;
+              tokenArray = null;
+              i = preToken.startOffset();
+              startSearch = i; // start position in textArray for the next turn, if needed
+              continue;
+            }
+          }
+        }
+      }
+      //------------------------
+      j = 0;
+      if (textArray[i] == tokenArray[j]) {
+        if (i + tokenArray.length - 1 >= textArray.length) {
+          // do nothing
+        } else {
+          int k = i + 1;
+          for (j = 1; j < tokenArray.length; j++) {
+            if (textArray[k++] != tokenArray[j])
+              break; // no match
+          }
+          if (j >= tokenArray.length) { // match
+            curToken.setStartOffset(i);
+            curToken.setEndOffset(i + tokenArray.length);
+            i = i + tokenArray.length - 1;
+            tokenStart++;
+            startSearch = i; // start position in textArray for the next turn, if needed
+            tokenArray = null;
+          }
+        }
+      }
+    }
+    if (tokenStart == tokens.length)
+      break; // have reset all tokens
+    if (tokenStart < tokens.length) { // next turn
+      curToken.setStartOffset(preToken.startOffset());
+      curToken.setEndOffset(preToken.endOffset());
+      tokenStart++; // skip this token
+    }
+  } // end of while (true)
+}

Then, under the line Token[] tokens = getTokens(text) in
getSummary(String text, Query query), add:

+resetTokenOffset(tokens, text);

I perform Chinese word segmentation after the tokenizer and insert a
space between every two Chinese words, so I need to reset each token's
startOffset and endOffset in Summarizer.java. To do this, I added the
method resetTokenOffset(Token[] tokens, String text) to Summarizer.java,
and I had to add two methods, setStartOffset(int start) and
setEndOffset(int end), to Lucene's Token.java.
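The offset-reset idea reduces to: walk the tokens in order and locate each token's text in the (lower-cased) original string, searching forward from the previous match. A minimal sketch under that simplification, ignoring the common-grams special case handled above (SimpleToken is a stand-in for Lucene's Token plus the two added setters):

```java
// Minimal sketch of the offset-reset idea: find each token's text in
// the original string, searching forward from the previous match.
// SimpleToken stands in for Lucene's Token plus the two added setters.
public class OffsetReset {
    public static class SimpleToken {
        public final String text;
        public int start, end;
        public SimpleToken(String text) { this.text = text; }
        public void setStartOffset(int s) { start = s; }
        public void setEndOffset(int e) { end = e; }
    }

    public static void resetTokenOffsets(SimpleToken[] tokens, String text) {
        String lower = text.toLowerCase();
        int from = 0;
        for (SimpleToken t : tokens) {
            int pos = lower.indexOf(t.text, from);
            if (pos < 0) continue;  // token not found: leave offsets alone
            t.setStartOffset(pos);
            t.setEndOffset(pos + t.text.length());
            from = pos + t.text.length();
        }
    }

    public static void main(String[] args) {
        SimpleToken[] toks = { new SimpleToken("hello"), new SimpleToken("world") };
        resetTokenOffsets(toks, "Hello brave world");
        System.out.println(toks[0].start + "-" + toks[0].end); // 0-5
        System.out.println(toks[1].start + "-" + toks[1].end); // 12-17
    }
}
```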



With the above four steps, Nutch can search Chinese web sites nearly
perfectly. You can try it. I did get Nutch to do it, but my solution is
not perfect.

If the Chinese word segmentation could be done in NutchAnalysis.jj
before the tokenizer, we wouldn't need to reset the token offsets in
Summarizer.java, and everything would be clean. But it seems very
difficult, maybe even impossible, to perform intelligent Chinese word
segmentation inside NutchAnalysis.jj.


Any suggestions?



Buildfile: build.xml

init:

compile-core:
    [javac] Compiling 247 source files to E:\search\new\nutch-0.7.1\build\classes
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Query.java:408: unreported exception org.apache.nutch.analysis.ParseException; must be caught or declared to be thrown
    [javac]     return fixup(NutchAnalysis.parseQuery(queryString));
    [javac]                                          ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:140: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                              curToken.setStartOffset(preToken.startOffset());
    [javac]                                       ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:141: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                              curToken.setEndOffset(preToken.startOffset() + curTokenLength);
    [javac]                                       ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:164: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                                 curToken.setStartOffset(i);
    [javac]                                          ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:165: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                                 curToken.setEndOffset(i + tokenArray.length);
    [javac]                                          ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:179: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                   curToken.setStartOffset(preToken.startOffset());
    [javac]                            ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:180: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                   curToken.setEndOffset(preToken.endOffset());
    [javac]                            ^
    [javac] Note: * uses or overrides a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 7 errors

BUILD FAILED
E:\search\new\nutch-0.7.1\build.xml:70: Compile failed; see the compiler error output for details.

Total time: 39 seconds


--
www.babatu.com