NutchAnalysis and CJK

8 messages

Jack.Tang
Hi All

I have spent a long time thinking about embedding improved CJK
analysis into NutchAnalysis. So far I have nothing but failed
experiments, which I share here in case you can hack on them (well, I
am not going to give up).

I have written several Chinese word segmenters. Some are dictionary
based, such as Forward Maximum Matching (FMM) and Backward Maximum
Matching (BMM), and some are automatic, such as bi-gram. They work
fine on pure Chinese text (not on text mixing Chinese with other
languages).
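For readers unfamiliar with FMM, the idea fits in a few lines of Java. This is a minimal sketch under my own assumptions (class name, tiny dictionary, fall-back to a single character when nothing matches), not Jack's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Minimal Forward Maximum Matching (FMM) sketch: at each position, take the
 *  longest dictionary phrase starting there; fall back to a single character. */
public class FmmSegmenter {
    private final Set<String> dict;   // known phrases
    private final int maxLen;         // longest phrase length in the dictionary

    public FmmSegmenter(Set<String> dict, int maxLen) {
        this.dict = dict;
        this.maxLen = maxLen;
    }

    public List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            // shrink the window until it hits a dictionary phrase
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));  // single char if nothing matched
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        FmmSegmenter s = new FmmSegmenter(Set.of("中华", "人民"), 2);
        System.out.println(s.segment("中华人民"));  // [中华, 人民]
    }
}
```

BMM is the mirror image, scanning from the right end; the two can disagree on ambiguous runs, which is why both are often tried and compared.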

Why do I aim only at pure Chinese text? In NutchAnalysis.jj:

<orig>

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

</orig>

<modified>

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >

</modified>

A SIGRAM token contains only CJK characters, which is why I target pure Chinese text.

Well, I am not very familiar with JavaCC, so a big puzzle stalls me.
As you know:

  // basic word -- lowercase it
<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
  { matchedToken.image = matchedToken.image.toLowerCase(); }
 
This action means that when the input matches the WORD rule, the
matchedToken object carries the extracted word: *ONE* word is
extracted per match.

So the term() production is simple:

/** Parse a single term. */
String term() :
{
  Token token;
}
{
  ( token=<WORD> | token=<ACRONYM> )  // I don't think it is reasonable to put "token=<SIGRAM>" here

  { return token.image; }
}

For CJK it is quite different: we have to extract *MANY* words in one match.

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
{
// parsing <CJK>+ should generate many words (tokens) here!
}

My approach is to build a token list that holds these tokens.
The action sketch looks like:

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
  {
    // split the matched CJK run into overlapping bi-grams
    String image = matchedToken.image;
    for (int i = 0; i + 1 < image.length(); i++) {
      Token token = new Token();               // kind/image are public fields
      token.kind = SIGRAM;                     // in JavaCC's generated Token
      token.image = image.substring(i, i + 2); // one bi-gram
      tokenList.add(token);
    }
  }

Accordingly, the term() production should return an ArrayList:

/** .... **/
ArrayList term() :
{
  Token token;
}
{
  ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> )
  {
    return tokenList;
  }
}

After these modifications, running NutchAnalysis.class gives an odd result.
Say I input three Chinese characters, C1C2C3;
the result is: "C1C2 C2C3" (NOTE the quotation marks).
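As a side note, overlapping bi-grams of a three-character run C1C2C3 are exactly C1C2 and C2C3, so the token content above matches what bi-gram segmentation should produce; only the quoting is surprising. A tiny standalone check (hypothetical helper, not Nutch code):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramCheck {
    /** Emit every overlapping character pair of the input. */
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABC"));  // [AB, BC] -- i.e. C1C2 and C2C3
    }
}
```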

Am I going in the wrong direction? Or will someone share any thoughts
on NutchAnalysis.jj? Thanks.



Regards
/Jack

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NutchAnalysis and CJK

Transbuerg Tian
Hi Jack Tang,

I am in the same situation as you. Could you share your complete
NutchAnalysis.jj file here? I am not using Nutch, but Lucene.

Good luck.

http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx



Re: NutchAnalysis and CJK

Bin Shi
In reply to this post by Jack.Tang
Hi Jack,
I have tried "weblucene", written by Che Dong. Weblucene supports CJK
analysis, that is, bigram Chinese character indexing. I have
successfully used it to index and search local documents, but for
Internet-scale indexing and search I have no idea. Maybe what
weblucene has done can help you.
The website is http://sourceforge.net/projects/weblucene/.
Thanks

Regards
ShiBin

Re: NutchAnalysis and CJK

Jack.Tang
Hi ShiBin,

Thanks for your post.
I have known about weblucene since 2003. It was said that weblucene
used FMM (dictionary-based) segmentation, but so far I have found
nothing in the weblucene CVS.

To make my point clear: there is no difficulty in making a CJK plugin
available; what I want is for NutchAnalysis.jj to support CJK
segmentation by default.

Thoughts?

Regards
/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NutchAnalysis and CJK

Jack.Tang
In reply to this post by Transbuerg Tian
Hi Transbuerg

Could you please describe your solution in detail? Appreciate your time.

Regards
/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NutchAnalysis and CJK

Transbuerg Tian
Hi,
weblucene does not use dictionary-based segmentation; on the contrary,
it uses bi-gram segmentation. You can get more information at
http://www.chedong.com, or search for 车东 and lucene.

At the moment I am trying to use dictionary-based segmentation; you
can visit my blog:
http://blog.csdn.net/accesine960/category/35308.aspx

I have written a dictionary-based segmentation Java program, but it is
still under test.
I have met two problems:
1. My dictionary holds about 150,000 Chinese phrases, so I put it into
a HashMap when segmenting.
2. With this Java program, building the index works very well, but
when searching, my server's CPUs are always about 99% busy. (My
server: 4 GB memory, 4 CPUs, and an index file of about 2.2 GB.)

These days I am striving to solve these two problems.

Good luck.
If you are Chinese, we could use Chinese for further exchange.
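On problem 1: if the 150,000-phrase table is consulted only for membership ("is this a phrase?"), a HashSet rather than a HashMap may be the natural structure. A minimal sketch, assuming a one-phrase-per-line dictionary format (my assumption, not Transbuerg's actual format):

```java
import java.util.HashSet;
import java.util.Set;

/** Phrase dictionary as a HashSet: segmentation only asks "is this a phrase?",
 *  so a set gives O(1) membership tests without HashMap values.
 *  (Assumed format: one phrase per line, UTF-8.) */
public class PhraseDictionary {
    public static Set<String> parse(String content) {
        Set<String> dict = new HashSet<>();
        for (String line : content.split("\\R")) {  // \R matches any line break
            line = line.trim();
            if (!line.isEmpty()) {
                dict.add(line);
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        Set<String> dict = parse("中华\n人民\n");
        System.out.println(dict.size());  // 2
    }
}
```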


Re: NutchAnalysis and CJK

Jack.Tang
Hi Transbuerg,

First, could you please explain "same condition with u"? Thanks.

I have no constructive suggestions for you, because your dictionary is
really huge. If you can share it with me, I will appreciate that.

Here are my tips on dictionary structure. Optimization? I don't think so.
1. I split the whole dictionary into small pieces according to Chinese
pronunciation.
2. Use lazy loading of the dictionary when indexing.
3. Load pieces on demand when searching, so you can control how many
pieces stay in memory.
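The tips above can be sketched as a lazily loaded, partitioned dictionary. Everything here (class name, loader interface, first-character partition key) is hypothetical, standing in for Jack's pronunciation-based split:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Partitioned dictionary: each piece is loaded only when first consulted. */
public class LazyDictionary {
    private final Map<String, Set<String>> loaded = new HashMap<>();
    private final Function<String, Set<String>> loader;  // reads one piece, e.g. from disk

    public LazyDictionary(Function<String, Set<String>> loader) {
        this.loader = loader;
    }

    /** Partition key for a phrase (first character as a stand-in for pronunciation). */
    private static String pieceKey(String phrase) {
        return phrase.substring(0, 1);
    }

    /** Load-on-demand: fetch a piece the first time it is needed. */
    public boolean contains(String phrase) {
        return loaded.computeIfAbsent(pieceKey(phrase), loader).contains(phrase);
    }

    /** Evict pieces to cap how many stay in memory. */
    public void evict(String key) {
        loaded.remove(key);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pieces = Map.of("中", Set.of("中华"));
        LazyDictionary d = new LazyDictionary(k -> pieces.getOrDefault(k, Set.of()));
        System.out.println(d.contains("中华"));  // true
    }
}
```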
   
Thoughts?
     
/Jack



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NutchAnalysis and CJK

Transbuerg Tian
Hi Jack,
You can get my Chinese phrase dictionary and Chinese sentence
segmentation program at the links below:
[sandbox] Two experimental Chinese word segmentation modules for Lucene:
http://www.grass.org.cn/blog/archives/2005/07/sandboxluceneae.html
(gRaSSland development diary: http://www.grass.org.cn/blog/)

And now I have solved the problems I mentioned in my previous mail.
Good luck.

tian
