add CJKTokenizer to solr


add CJKTokenizer to solr

Xuesong Luo
Hi,

I got the error below after adding CJKTokenizer to schema.xml. I
checked the constructor of CJKTokenizer and it requires a Reader parameter;
I guess that's why I get this error. I searched the email archive and it
seems to work for other users. Does anyone know what the problem is?

 

Thanks

Xuesong

 

 

2007-06-18 17:09:29,369 ERROR [STDERR] Jun 18, 2007 5:09:29 PM org.apache.solr.core.SolrException log
SEVERE: org.apache.solr.core.SolrException: Error instantiating class class org.apache.lucene.analysis.cjk.CJKTokenizer
            at org.apache.solr.core.Config.newInstance(Config.java:229)
            at org.apache.solr.schema.IndexSchema.readTokenizerFactory(IndexSchema.java:619)
            at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:593)
            at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:331)
            at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:71)

 

 

 

Schema.xml

    <fieldtype name="itext" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="org.apache.lucene.analysis.cjk.CJKTokenizer"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

 


Re: add CJKTokenizer to solr

Chris Hostetter-3

: I got the error below after adding CJKTokenizer to schema.xml.  I
: checked the constructor of CJKTokenizer, it requires a Reader parameter,
: I guess that's why I get this error, I searched the email archive, it
: seems working for other users. Does anyone know what is the problem?

You can use any Lucene "Analyzer" that has a default constructor as is by
declaring it in the <analyzer> declaration (the example schema.xml shows
this using the GreekAnalyzer), so you could use the CJKAnalyzer directly
... if you want to use a Lucene "Tokenizer" you need a simple Solr
"TokenizerFactory" to generate instances of it.  Writing a
TokenizerFactory is easy, they can be simple -- really, REALLY simple ...
most of the ones in the Solr code base have more lines of License text
than they do of code...

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/LowerCaseTokenizerFactory.java?view=markup

http://wiki.apache.org/solr/SolrPlugins#head-718653697f60b44092280c8c506077e0933e3668
http://lucene.apache.org/solr/api/org/apache/solr/analysis/TokenizerFactory.html
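
To illustrate why a class whose only constructor takes a Reader cannot be created from a bare class name, while a factory with a default constructor can, here is a minimal sketch. It uses only the JDK. ReaderOnlyTokenizer and ReaderOnlyTokenizerFactory are hypothetical stand-ins, not Lucene or Solr classes, and the no-argument reflective call is an assumption about what the loader in the stack trace needs, not a copy of Solr's code.

```java
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical stand-in for a tokenizer: its only constructor needs a Reader. */
class ReaderOnlyTokenizer {
    final Reader input;

    ReaderOnlyTokenizer(Reader input) {
        this.input = input;
    }
}

/** Hypothetical stand-in for a factory: default constructor, builds tokenizers on demand. */
class ReaderOnlyTokenizerFactory {
    ReaderOnlyTokenizerFactory() {
    }

    ReaderOnlyTokenizer create(Reader input) {
        return new ReaderOnlyTokenizer(input);
    }
}

class NoArgInstantiationDemo {
    /** Returns true if the class can be instantiated reflectively with no arguments. */
    static boolean canInstantiateWithoutArgs(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // NoSuchMethodException lands here when no no-arg constructor exists.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiateWithoutArgs(ReaderOnlyTokenizer.class));
        System.out.println(canInstantiateWithoutArgs(ReaderOnlyTokenizerFactory.class));
        // The factory is created without arguments, then supplies the Reader later.
        ReaderOnlyTokenizer t = new ReaderOnlyTokenizerFactory().create(new StringReader("text"));
        System.out.println(t.input != null);
    }
}
```

The factory exists only to defer the Reader argument until there is input to analyze, which is why such classes can be so short.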


-Hoss


Re: add CJKTokenizer to solr

Toru Matsuzawa
In reply to this post by Xuesong Luo
> I got the error below after adding CJKTokenizer to schema.xml.  I
> checked the constructor of CJKTokenizer, it requires a Reader parameter,
> I guess that's why I get this error, I searched the email archive, it
> seems working for other users. Does anyone know what is the problem?


The CJKTokenizerFactory that I am using is attached.

--
Toru Matsuzawa

Re: add CJKTokenizer to solr

Toru Matsuzawa
I'm sorry. The attachment did not go through,
so I am sending it again.

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
>
>
> CJKTokenizerFactory that I am using is appended.
>
--
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer( input );
  }

}


--
Toru Matsuzawa



Re: add CJKTokenizer to solr

Mike Klaas

On 18-Jun-07, at 10:28 PM, Toru Matsuzawa wrote:

> I'm sorry. Because it was not possible to append it,
> it sends it again.
>
>>> I got the error below after adding CJKTokenizer to schema.xml.  I
>>> checked the constructor of CJKTokenizer, it requires a Reader  
>>> parameter,
>>> I guess that's why I get this error, I searched the email  
>>> archive, it
>>> seems working for other users. Does anyone know what is the problem?
>>
>>
>> CJKTokenizerFactory that I am using is appended.

Would you be interested in contributing this class to solr?

-Mike

Re: add CJKTokenizer to solr

dma_bamboo
Hi

Well, creating a Factory for each new Tokenizer we want to add means
replicating the same code again and again just to bind the Factory (Solr
interface) to the Tokenizer (Lucene interface).

Instead of that, why don't we create an UbberFactory that takes the Tokenizer
class as a parameter and instantiates the proper Tokenizer?

It could be done simply, but it would affect schema.xml and its associated
parsers and config classes. Still, I think it would make things simpler.

What do you think about it?

A code example follows:

public class UbberTokenizerFactory extends BaseTokenizerFactory {
    /**
     * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
     */
    public TokenStream create(Reader input) {
        String tokenizerClassName = ""; // get the tokenizer class name from the config
        try {
            return (TokenStream) (Class.forName(tokenizerClassName)
                    .getConstructor(new Class[]{Reader.class})
                    .newInstance(input));
        } catch (Exception e) {
            throw new IllegalArgumentException("It wasn't possible to instantiate the Factory. "
                    + "Verify if the tokenizer class name \"" + tokenizerClassName
                    + "\" is correct and is available in the classpath.", e);
        }
        // return new CJKTokenizer(input);
    }
}

Regards,
Daniel Alheiros



http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
                                       

Re: add CJKTokenizer to solr

Chris Hostetter-3

: Why instead of that we don't create an UbberFactory that takes the Tokenizer
: class as a parameter and instantiates the proper Tokenizer?

The idea has come up before ... and there's really no reason why it
wouldn't be okay to include a reflection based factory like this in Solr
-- it just hasn't been done yet.

One of the reasons is that there are some performance costs associated
with the reflection, so we wouldn't want to completely replace the existing
"configuration via factory name" model with a "configure via class name
and an uber factory does the reflection quietly in the background" model
because it's the kind of approach that would really only make sense for
simple prototypes -- in any system where you are really concerned about
performance, reflection on every analyzer call would probably be pretty
expensive.  (although i'd love to see benchmarks prove me wrong)

Another question in my mind is "why doesn't solr provide an optional jar
with factories for every tokenizer/tokenfilter in the lucene contribs?"
... the only answer to that is that no one has bothered to crank out a
patch that does it.

http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
http://www.nabble.com/foo-tf1737025.html#a4720545


-Hoss


Re: add CJKTokenizer to solr

Otis Gospodnetic-2
In reply to this post by Xuesong Luo
Eh, I was looking at these Factories just the other day and wondering about the same thing as Daniel.
Regarding reflection - even if reflection is slower, and I'm sure it is, though I don't know exactly how much slower, couldn't we cache the instantiated instances keyed by name?  Such instances would have to be thread-safe, but I imagine most/all Tokenizers already are thread-safe.

Daniel, I suggest you take that UbberTokenizerFactory code, slap ASL 2.0 on top of it, add simple instance caching as mentioned above, and post the code to JIRA.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


Re: add CJKTokenizer to solr

Chris Hostetter-3

: Regarding reflection - even if reflection is slower, and I'm sure it is,
: I just don't know exactly how much slower it is, couldn't we cache the
: instantiated instances keyed off by name?  Such instances would have to
: be thread-safe, but I imagine most/all Tokenizers already are
: thread-safe.

most instances of Tokenizer and TokenFilter aren't threadsafe -- i'm not
sure how they could be given that the only real method they have is
"next()" ... every implementation i know of is constructed using a
Reader or TokenStream (depending on whether it's a Tokenizer or
TokenFilter) ... so reuse with new input is a bit hard

as i mentioned in one of the threads i linked to, the best we can probably
do is resolve the classname into a Class object in the init methods of
a ReflectionTokenFilterFactory or ReflectionTokenizerFactory class, but a
new instance really needs to be constructed every time the create() methods
are called.
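
A minimal sketch of that split, resolving the class once at init time and constructing a fresh instance on every create() call, could look like the following. It uses only the JDK so that it runs on its own: ReflectionReaderFactory is a hypothetical name, java.io.Reader stands in for TokenStream, and init(String) stands in for however Solr would hand the class name to a factory. None of this is existing Solr code.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Constructor;

/** Hypothetical reflection-based factory; Reader stands in for TokenStream. */
class ReflectionReaderFactory {
    private Constructor<? extends Reader> constructor;

    /** Resolve the class name and its (Reader) constructor once, up front. */
    void init(String className) {
        try {
            Class<? extends Reader> type = Class.forName(className).asSubclass(Reader.class);
            constructor = type.getConstructor(Reader.class);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot use class \"" + className + "\"", e);
        }
    }

    /** Build a new instance for each input; instances are never cached or shared. */
    Reader create(Reader input) {
        if (constructor == null) {
            throw new IllegalStateException("init() has not been called");
        }
        try {
            return constructor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + constructor.getDeclaringClass(), e);
        }
    }
}

class ReflectionReaderFactoryDemo {
    public static void main(String[] args) {
        ReflectionReaderFactory factory = new ReflectionReaderFactory();
        factory.init("java.io.BufferedReader"); // any class with a public (Reader) constructor
        Reader first = factory.create(new StringReader("first input"));
        Reader second = factory.create(new StringReader("second input"));
        System.out.println(first.getClass().getName());
        System.out.println(first != second);
    }
}
```

With this shape the class lookup is paid once per factory, and only the constructor invocation remains on the per-call path.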

like i said though: i'm in favor of factories like this ... i just don't
think we should do anything to hide their use and make referring to
Tokenizer or TokenFilter class names directly use reflection magically.


: http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
: http://www.nabble.com/foo-tf1737025.html#a4720545



-Hoss


RE: add CJKTokenizer to solr

Xuesong Luo
In reply to this post by Xuesong Luo
Thanks, Toru and Chris,
I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected highlight results when I tested with German. The field value I searched is "Ein Mann beißt den Hund".  The search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 separate terms (bei and ß); the highlight result is:
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>

When using CJKTokenizer, beißt is treated as 3 separate terms; the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as a word; the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>


I understand why the standard tokenizer treats beißt as a word, but I don't know how CJKAnalyzer and CJKTokenizer work. Could anyone explain a little bit?


Thanks
Xuesong


Re: add CJKTokenizer to solr

Otis Gospodnetic-2
In reply to this post by Xuesong Luo
I'm jumping in the middle of the thread here.
CJK = Chinese, Japanese, Korean
German = etwas ganz anderes
Why are you trying to use CJKAnalyzer+Tokenizer for German?  Have you tried German Analyzer from Lucene contrib?
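
For readers wondering how a CJK-style tokenizer could split beißt the way Xuesong reported, here is a deliberately simplified sketch of overlapping-bigram tokenization. It is not Lucene's CJKTokenizer: the rules (runs of ASCII letters and digits become one token, runs of other letters become overlapping two-character tokens, a lone such letter becomes its own token) are assumptions chosen because they reproduce the three terms in the report. BigramSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified illustration of bigram tokenization; not Lucene's implementation. */
class BigramSketch {
    private static boolean isAsciiWordChar(char c) {
        return c < 128 && Character.isLetterOrDigit(c);
    }

    private static boolean isNonAsciiLetter(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            if (isAsciiWordChar(c)) {
                // ASCII run: emitted as a single token.
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) i++;
                tokens.add(text.substring(start, i));
            } else if (isNonAsciiLetter(c)) {
                // Non-ASCII run: overlapping bigrams, or the single letter if the run has length 1.
                while (i < text.length() && isNonAsciiLetter(text.charAt(i))) i++;
                if (i - start == 1) {
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 1 < i; j++) tokens.add(text.substring(j, j + 2));
                }
            } else {
                i++; // whitespace and punctuation only separate tokens
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00DF" is the German sharp s; the second input is three CJK characters.
        System.out.println(tokenize("Ein Mann bei\u00DFt den Hund"));
        System.out.println(tokenize("\u65E5\u672C\u8A9E"));
    }
}
```

Under these assumed rules the ß falls outside the ASCII run, so beißt splits into bei, ß and t, whereas an analyzer aimed at German would keep the word whole.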

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


Re: add CJKTokenizer to solr

dma_bamboo
In reply to this post by Chris Hostetter-3
Hi Hoss.

I've done a few tests using reflection to instantiate a simple object, and
the results vary a lot depending on the JVM. Because the JVM optimizes code
as it executes, they also vary with usage, but I think we have
something to consider:

I ran 1,000 samples (5 clean runs x a loop of 200), each sample creating
100,000 objects, and the results were:

With reflection:
    - Average                      : 0.0005418
    - Worst (first clean execution): 0.0007760

Without reflection:
    - Average                      : 0.0000469
    - Worst (first clean execution): 0.0002140

So comparing these numbers, I can see that using reflection in the average
case costs roughly 11.5 times more (0.0005418 / 0.0000469) than creating
the object without reflection.
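
A comparison of this kind can be sketched as below. This is not the code Daniel ran: ReflectionCostSketch is a hypothetical name, StringBuilder is an arbitrary "simple object", and the Constructor is looked up once outside the timed loop. Results from such a loop depend heavily on the JVM and on warm-up, so treat any numbers it prints as rough indications only.

```java
import java.lang.reflect.Constructor;

/** Rough timing sketch: direct construction versus Constructor.newInstance(). */
class ReflectionCostSketch {
    /** Creates count objects with a plain constructor call; returns elapsed nanoseconds. */
    static long timeDirect(int count) {
        int checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            checksum += new StringBuilder().length(); // touch each object
        }
        long elapsed = System.nanoTime() - start;
        if (checksum != 0) throw new IllegalStateException("unexpected checksum");
        return elapsed;
    }

    /** Creates count objects through reflection; returns elapsed nanoseconds. */
    static long timeReflective(int count) {
        try {
            // Look the constructor up once; only newInstance() is inside the timed loop.
            Constructor<StringBuilder> ctor = StringBuilder.class.getConstructor();
            int checksum = 0;
            long start = System.nanoTime();
            for (int i = 0; i < count; i++) {
                checksum += ctor.newInstance().length();
            }
            long elapsed = System.nanoTime() - start;
            if (checksum != 0) throw new IllegalStateException("unexpected checksum");
            return elapsed;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int count = 100_000;
        for (int round = 0; round < 5; round++) { // early rounds include JIT warm-up
            System.out.printf("direct: %d ns, reflective: %d ns%n",
                    timeDirect(count), timeReflective(count));
        }
    }
}
```

A dedicated benchmarking harness would give more trustworthy figures than a hand-written loop like this one.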

But my question is: do we need to create factories that frequently, or are
they created just once and re-used (are they thread safe)? The term Factory made
me think of a class that is responsible for building instances of other classes, so
usually they can be singletons... If they don't need to be created all the
time, the cost will not really matter, and reflection will give extra flexibility in terms of
incorporating new Tokenizers (it would make it easier to keep Solr and Lucene
versions less coupled).

Environment:
java version "1.5.0_07"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
Heap size: 256M
Running on a PowerPC - Mac OS/X 10.4.9 with 1.5Gb RAM

Regards,
Daniel



Re: add CJKTokenizer to solr

dma_bamboo
Sorry, I've confused things a bit... Thread safety has to be
considered only for the Tokenizers, not for the factories. So are the
Tokenizers thread safe?

Regards,
Daniel



Re: add CJKTokenizer to solr

Otis Gospodnetic-2
In reply to this post by Xuesong Luo
Tokenizers are not thread safe (I made a mistake yesterday saying they are - I don't know what I was thinking).
This is why:

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;                                   <---- oops :(
  ...

public abstract class CharTokenizer extends Tokenizer {
  public CharTokenizer(Reader input) {
    super(input);
  }
  ...
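
The consequence of holding the Reader as instance state can be shown with a small JDK-only sketch. OneShotWordStream is a hypothetical stand-in, not a Lucene class: like the Tokenizer above it is bound to one Reader at construction and exposes a next() method, so two callers sharing an instance split the tokens between them, and a consumed instance cannot be pointed at new input.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a tokenizer that keeps its Reader as instance state. */
class OneShotWordStream {
    private final Reader input; // bound once, at construction

    OneShotWordStream(Reader input) {
        this.input = input;
    }

    /** Returns the next whitespace-delimited word, or null once the input is exhausted. */
    String next() {
        try {
            StringBuilder word = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (word.length() > 0) break; // end of the current word
                } else {
                    word.append((char) c);
                }
            }
            return word.length() > 0 ? word.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class OneShotWordStreamDemo {
    public static void main(String[] args) {
        OneShotWordStream shared = new OneShotWordStream(new StringReader("ein mann"));
        // Two callers sharing one instance each see only part of the text.
        System.out.println("caller A got: " + shared.next());
        System.out.println("caller B got: " + shared.next());
        // The instance is now exhausted; new input needs a new instance.
        System.out.println("after end: " + shared.next());
        OneShotWordStream fresh = new OneShotWordStream(new StringReader("den hund"));
        System.out.println("fresh instance: " + fresh.next());
    }
}
```

This is the reason a factory has to construct a new instance per input rather than cache one per class name.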

Otis
 
--
Lucene Consulting -- http://lucene-consulting.com/



RE: add CJKTokenizer to solr

Xuesong Luo
In reply to this post by Xuesong Luo
Thanks, Otis. I didn't know CJK is only used for Asian languages. I'll try the German Analyzer.

-----Original Message-----
From: Otis Gospodnetic [mailto:[hidden email]]
Sent: Friday, June 22, 2007 3:18 AM
To: [hidden email]
Subject: Re: add CJKTokenizer to solr

I'm jumping in the middle of the thread here.
CJK = Chinese, Japanese, Korean
German = etwas ganz anderes
Why are you trying to use CJKAnalyzer+Tokenizer for German?  Have you tried German Analyzer from Lucene contrib?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Xuesong Luo <[hidden email]>
To: [hidden email]
Sent: Friday, June 22, 2007 8:54:37 AM
Subject: RE: add CJKTokenizer to solr

Thanks, Toru and Chris,
I tried both the CJKTokenizer and the CJKAnalyzer. Both return some unexpected highlight results when I tested with German. The field value I searched is "Ein Mann beißt den Hund". The search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 single terms (bei and ß); the highlight result is:
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>

When using CJKTokenizer, beißt is treated as 3 single terms, the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using standard tokenizer, beißt is treated as a word, the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>


I understand why the standard tokenizer treats beißt as a word, but I don't know how CJKAnalyzer and CJKTokenizer work. Could anyone explain a little bit?


Thanks
Xuesong
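
The split reported above (bei, ß, t) is consistent with a tokenizer that keeps only Basic Latin characters together as ordinary words and handles every other letter as if it were a CJK character. The following is a simplified sketch of that idea, not the Lucene source: the class name CjkSplitSketch and the method split are hypothetical, and the real CJKTokenizer additionally pairs adjacent CJK characters into overlapping bigrams, which is left out here. If CJKAnalyzer's stop list contains "t" (an assumption worth checking against the Lucene source), that would also account for the analyzer highlighting only bei and ß.

```java
import java.util.ArrayList;
import java.util.List;

public class CjkSplitSketch {

    // True for characters in the Basic Latin block (plain ASCII letters, digits, ...).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    // Runs of Basic Latin letters/digits become one lowercased token.
    // Any other letter ends the current run and is emitted on its own,
    // the way a CJK character would be (bigram pairing is omitted).
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder latin = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isBasicLatin(c) && Character.isLetterOrDigit(c)) {
                latin.append(Character.toLowerCase(c));
                continue;
            }
            if (latin.length() > 0) {
                tokens.add(latin.toString());
                latin.setLength(0);
            }
            if (Character.isLetter(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        if (latin.length() > 0) {
            tokens.add(latin.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u00df is the sharp s, written as an escape to avoid source-encoding issues.
        System.out.println(split("Ein Mann bei\u00dft den Hund"));
    }
}
```

Because ß lies outside Basic Latin (it is in the Latin-1 Supplement block), it interrupts the run "bei" and leaves a dangling "t", which matches the three highlighted fragments.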

-----Original Message-----
From: Toru Matsuzawa [mailto:[hidden email]]
Sent: Monday, June 18, 2007 10:29 PM
To: [hidden email]
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment did not go through,
so I am sending it again inline.

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
>
>
> CJKTokenizerFactory that I am using is appended.
>
--
package org.apache.solr.analysis.ja;

import java.io.Reader;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer( input );
  }

}


--
Toru Matsuzawa









Re: add CJKTokenizer to solr

Chris Hostetter-3
In reply to this post by dma_bamboo

: Sorry I've confused things a bit... Thread safety has to be
: considered only for the Tokenizers, not for the factories. So are the
: Tokenizers thread safe?

nope ... they are constructed using Readers and maintain state about the
text they are processing ... the only api is a "next()" method.

: > But my question is: Do we need to create factories so frequently or are they
: > just created once and re-used (are they thread safe)? The term Factory made
: > me think of a class that is responsible for building other instances, so
: > usually they can be singletons... If they don't need to be created all the

just to be clear, the Factories are reused, but if we wanted one
"UberFactory" class to be able to return any arbitrary Tokenizer specified
in the config, the reflection would have to be for the Tokenizer classes

the factories aren't singletons, because you might want to use them for
multiple fields with different configurations.
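
The division of labour described here can be sketched without Lucene on the classpath. In this minimal sketch all names (FactoryReuseSketch, WordTokens, WordTokensFactory, maxLen) are hypothetical stand-ins, not Solr or Lucene classes: the factory holds only immutable per-field configuration and can be shared, while each tokenizer wraps one Reader, carries read position as state, and is therefore created fresh for every piece of text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class FactoryReuseSketch {

    // Stateful: bound to one Reader, so it must not be shared between threads.
    static class WordTokens {
        private final Reader input;
        private final int maxLen;

        WordTokens(Reader input, int maxLen) {
            this.input = input;
            this.maxLen = maxLen;
        }

        // Returns the next whitespace-separated token (truncated to maxLen),
        // or null once the Reader is exhausted.
        String next() {
            StringBuilder sb = new StringBuilder();
            try {
                int c;
                while ((c = input.read()) != -1) {
                    if (Character.isWhitespace(c)) {
                        if (sb.length() > 0) break;
                    } else if (sb.length() < maxLen) {
                        sb.append((char) c);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return sb.length() == 0 ? null : sb.toString();
        }
    }

    // Reusable: holds only immutable configuration, one instance per field type.
    static class WordTokensFactory {
        private final int maxLen;

        WordTokensFactory(int maxLen) {
            this.maxLen = maxLen;
        }

        WordTokens create(Reader input) {
            return new WordTokens(input, maxLen);
        }
    }

    public static void main(String[] args) {
        // Two differently configured factories, as for two field types.
        WordTokensFactory titleField = new WordTokensFactory(3);
        WordTokensFactory bodyField = new WordTokensFactory(100);

        WordTokens t = titleField.create(new StringReader("hello world"));
        System.out.println(t.next() + " " + t.next());

        WordTokens b = bodyField.create(new StringReader("hello world"));
        System.out.println(b.next() + " " + b.next());
    }
}
```

Two factory instances with different settings are the reason a singleton would not fit: the same factory class may back several field types at once.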



-Hoss


Re: add CJKTokenizer to solr

Mike Klaas
In reply to this post by Chris Hostetter-3
On 21-Jun-07, at 10:22 PM, Chris Hostetter wrote:

>
> like i said though: i'm in favor of factories like this ... i just don't
> think we should do anything to hide their use and make referring to
> Tokenizer or TokenFilter class names directly use reflection magically.

What would be the best way to not hide their use?

<filter class="...CJKTokenFilter" autofactory="true"/>

<filter autoclass="...CJKTokenFilter" />

<autofilter class="...CJKTokenFilter" />

Re: add CJKTokenizer to solr

Chris Hostetter-3

: What would be the best way to not hide their use?
:
: <filter class="...CJKTokenFilter" autofactory="true"/>

How about just...

 <filter class="solr.ReflectionFilterFactory" impl="...CJKTokenFilter" />
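
No ReflectionFilterFactory exists in this thread beyond the proposal, so the following is only a sketch of the mechanism it implies: look up a class by name and call its one-argument constructor reflectively instead of the no-argument constructor that the original error tripped over. To stay runnable without Lucene, java.io.BufferedReader (which has a public constructor taking a Reader) stands in for a Tokenizer; the class name ReflectionFactorySketch and its methods are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Constructor;

public class ReflectionFactorySketch {

    // Resolved once when the factory is configured, then reused for every create().
    private final Constructor<? extends Reader> ctor;

    // implClassName plays the role of the proposed impl="..." attribute.
    ReflectionFactorySketch(String implClassName) {
        Constructor<? extends Reader> found;
        try {
            found = Class.forName(implClassName)
                    .asSubclass(Reader.class)
                    .getConstructor(Reader.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot use " + implClassName, e);
        }
        this.ctor = found;
    }

    // Builds a fresh instance around the given input on every call.
    Reader create(Reader input) {
        try {
            return ctor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ReflectionFactorySketch factory = new ReflectionFactorySketch("java.io.BufferedReader");
        Reader wrapped = factory.create(new StringReader("some text"));
        System.out.println(wrapped.getClass().getName());
    }
}
```

Resolving the constructor in the factory's own constructor means a misconfigured class name fails at startup rather than on the first document.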



-Hoss


Re: add CJKTokenizer to solr

dma_bamboo
Or

<filter factory="solr.ReflectionFilterFactory" class="...CJKTokenFilter" />

I think this way, the config terms are a bit clearer... What do you think?

Regards,
Daniel

On 22/6/07 20:45, "Chris Hostetter" <[hidden email]> wrote:

>
> : What would be the best way to not hide their use?
> :
> : <filter class="...CJKTokenFilter" autofactory="true"/>
>
> How about just...
>
>  <filter class="solr.ReflectionFilterFactory" impl="...CJKTokenFilter" />
>
>
>
> -Hoss
>



Re: add CJKTokenizer to solr

Chris Hostetter-3

: <filter factory="solr.ReflectionFilterFactory" class="...CJKTokenFilter" />
:
: I think this way, the config terms are a bit clearer... What do you think?

in general, do i think it would be better if the <filter> and <tokenizer>
declarations used "factory" as the attribute instead of "class"? ...yes.

Do i think it makes sense to change this now? ... i don't know.

the backward compatibility issues are tricky ... not from an implementation
standpoint, but from a clarity standpoint.

we could always make the schema.xml parsing code say that <filter> and
<tokenizer> declarations will first be checked for a "factory" attribute,
and if it's found use that class, if it's not found then revert to the
"legacy" behavior of looking for a "class" attribute ... but that means
that as people with existing schemas start to take advantage of newer
factories like the ReflectionFactory, and maybe cut/paste examples
from other configs, they'll start to have a hodgepodge of syntax...

  <filter class="solr.LowerCaseFilterFactory" />
  <filter factory="solr.SomeOtherFilterFactory" blahblah="true"
          yadayadaydad="false" numOption="42" />
  <filter class="solr.YetAnotherFilterFactory" />
  <filter factory="solr.ReflectionFilterFactory"
          class="org.apache.lucene.contrib.FooFilter" />
  <filter class="solr.OneMoreFilterFactory" />

...i don't know that the "clarity" a "factory" attribute would add for new
users would balance out the confusion this might cause to existing users.
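
The lookup order described above (prefer a "factory" attribute, otherwise fall back to the legacy "class" attribute) is small enough to sketch. This is an illustration only, not Solr's schema parsing code: FactoryAttributeSketch and resolveFactoryName are hypothetical names, and the attributes of one declaration are assumed to arrive as a plain Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FactoryAttributeSketch {

    // Returns the factory class name for one <filter> or <tokenizer> declaration.
    static String resolveFactoryName(Map<String, String> attrs) {
        String factory = attrs.get("factory");
        if (factory != null) {
            // New syntax: "class", if present, is just an argument for the factory.
            return factory;
        }
        String legacy = attrs.get("class");
        if (legacy == null) {
            throw new IllegalArgumentException("neither 'factory' nor 'class' is set");
        }
        // Legacy syntax: "class" names the factory itself.
        return legacy;
    }

    public static void main(String[] args) {
        Map<String, String> legacy = new LinkedHashMap<>();
        legacy.put("class", "solr.LowerCaseFilterFactory");

        Map<String, String> mixed = new LinkedHashMap<>();
        mixed.put("factory", "solr.ReflectionFilterFactory");
        mixed.put("class", "org.apache.lucene.contrib.FooFilter");

        System.out.println(resolveFactoryName(legacy));
        System.out.println(resolveFactoryName(mixed));
    }
}
```

The sketch also makes the clarity concern concrete: the meaning of "class" silently changes depending on whether "factory" is present on the same element.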


-Hoss
