Re: svn commit: r411882 - in /incubator/solr/trunk: CHANGES.txt src/java/org/apache/solr/analysis/KeywordTokenizerFactory.java src/test/org/apache/solr/BasicFunctionalityTest.java src/test/test-files/solr/conf/schema.xml

Yonik Seeley
Ah, thanks... I had been meaning to add that.

-Yonik

On 6/5/06, [hidden email] <[hidden email]> wrote:
> Author: hossman
> Date: Mon Jun  5 11:20:13 2006
> New Revision: 411882
>
> URL: http://svn.apache.org/viewvc?rev=411882&view=rev
> Log:
> adding new KeywordTokenizerFactory

Chris Hostetter-3

: Ah, thanks... I had been meaning to add that.

yeah .. the impetus was a coworker who wanted a string field that would
sort in a case insensitive way ... i thought about writing a new
SortComparatorSource to do this ... but then figured this would be easier
(and more generally useful).

I just wish it wasn't necessary to have all these Factories ... has
anyone done any serious benchmarking of the cost of reflection in a case
like this? ... if getting a Class by name is the expensive part, we can do
that once when the config is loaded -- it's just a question of how
"clazz.newInstance()" performs relative to "new Foo()"
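
One rough way to get a first number for that question is a loop like the one below. This is only a sketch: Foo and ReflectionCost are hypothetical names, Foo is a stand-in rather than a real Tokenizer, and the code uses Constructor.newInstance() because Class.newInstance() is deprecated in current JDKs. A hand-rolled timing loop like this is crude, so a proper harness such as JMH would be needed before drawing conclusions.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for a trivial tokenizer-like class; not a Lucene or Solr type.
class Foo {
    int value = 1;
}

class ReflectionCost {
    // Times n direct constructions. Summing a field gives each new object a use.
    static long timeDirect(int n) {
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sum += new Foo().value;
        }
        long elapsed = System.nanoTime() - start;
        if (sum != n) throw new IllegalStateException("unexpected sum " + sum);
        return elapsed;
    }

    // Times n reflective constructions. The lookup by name and the constructor
    // lookup happen once, before the timed loop, as they would at config load.
    static long timeReflective(String className, int n) {
        try {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor();
            long sum = 0;
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                sum += ((Foo) ctor.newInstance()).value;
            }
            long elapsed = System.nanoTime() - start;
            if (sum != n) throw new IllegalStateException("unexpected sum " + sum);
            return elapsed;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        String name = Foo.class.getName();
        // Warm-up passes so the JIT has a chance to compile both loops first.
        timeDirect(n);
        timeReflective(name, n);
        System.out.println("direct:     " + timeDirect(n) + " ns");
        System.out.println("reflective: " + timeReflective(name, n) + " ns");
    }
}
```

The numbers will vary by JVM and hardware, so they are worth comparing only against the cost of the tokenizing work itself.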



-Hoss


Yonik Seeley
On 6/5/06, Chris Hostetter <[hidden email]> wrote:
> I just wish it wasn't necessary to have all these Factories ... has
> anyone done any serious benchmarking of the cost of reflection in a case
> like this? ... if getting a Class by name is the expensive part, we can do
> that once when the config is loaded -- it's just a question of how
> "clazz.newInstance()" performs relative to "new Foo()"

A factory is certainly necessary sometimes.  You don't want to incur
setup time for creating a SynonymMap or a StopSet for every instance
you create.
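
A minimal sketch of that point, using hypothetical stand-in classes (SimpleStopFilter and SimpleStopFilterFactory are not Lucene or Solr types): the factory pays the setup cost once in its constructor, and every filter it creates shares the resulting set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a stop-word filter; not Lucene's StopFilter.
class SimpleStopFilter {
    private final Set<String> stopWords;

    SimpleStopFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns true when the token should be kept.
    boolean accept(String token) {
        return !stopWords.contains(token);
    }
}

// Hypothetical factory: builds the stop set once and reuses it for every filter.
class SimpleStopFilterFactory {
    int setBuilds = 0; // counts how often the setup work has run
    private final Set<String> stopWords;

    SimpleStopFilterFactory(String... words) {
        // Stands in for expensive setup such as reading and parsing a word file.
        this.stopWords = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words)));
        setBuilds++;
    }

    // Cheap per-instance creation: no set is rebuilt here.
    SimpleStopFilter create() {
        return new SimpleStopFilter(stopWords);
    }
}
```

Creating many filters from one factory leaves setBuilds at 1, which is the saving a per-instance constructor could not offer.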

For simpler tokenizers or token filters, it would be nice to be able
to use them directly.
Perhaps we could look at the class type, and if it's a lucene
Tokenizer or TokenFilter, try instantiating it directly with
newInstance().

At schema creation time, we should probably check if the specified
Tokenizer or TokenFilter has a default constructor.  If it doesn't we
should throw an error right then, not waiting for a confusing runtime
exception.
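
A fail-fast check of this kind could look roughly like the sketch below. ConstructorCheck and requireConstructor are hypothetical names, not existing Solr code, and the required signature is passed in as a parameter because the sketch makes no assumption about which constructor the schema loader would need.

```java
import java.util.Arrays;

class ConstructorCheck {
    // Throws immediately if cls has no public constructor with exactly these
    // parameter types, so the problem surfaces when the schema is loaded.
    static void requireConstructor(Class<?> cls, Class<?>... paramTypes) {
        try {
            cls.getConstructor(paramTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                "schema error: " + cls.getName()
                    + " has no public constructor taking " + Arrays.toString(paramTypes), e);
        }
    }
}
```

For instance, requireConstructor(java.io.BufferedReader.class, java.io.Reader.class) passes, while requireConstructor(java.io.BufferedReader.class) throws because BufferedReader has no no-arg constructor.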

-Yonik

Chris Hostetter-3

: A factory is certainly necessary sometimes.  You don't want to incur
: setup time for creating a SynonymMap or a StopSet for every instance
: you create.

Sure -- i'm not saying we should eliminate the Factories completely, just
that i wish it wasn't necessary to write a factory for every
simple Filter/Tokenizer.

: Perhaps we could look at the class type, and if it's a lucene
: Tokenizer or TokenFilter, try instantiating it directly with
: newInstance().

I was thinking of a "ReflectionFilterFactory" and a
"ReflectionTokenizerFactory" that take in the class name as an argument --
but if we want to make it automatic that's cool too ... my only concern is
that *if* the reflection performance is noticeable, we'd want to make it
explicit to people that they are using it .. so they don't use...

        <tokenizer class="solr.StandardTokenizer"/>
..instead of...
        <tokenizer class="solr.StandardTokenizerFactory"/>
...just because they don't understand the difference, and then complain
that Solr is really slow.
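
To make the explicit variant concrete, here is a rough sketch of what a "ReflectionTokenizerFactory" might look like. Nothing in it is real Solr or Lucene API: SketchTokenizer and WholeInputTokenizer are hypothetical stand-ins for Lucene's Tokenizer and a simple subclass, and the factory's shape is an assumption. The class and its one-arg Reader constructor are resolved once at init time, so only the newInstance() call is paid per tokenizer.

```java
import java.io.Reader;
import java.lang.reflect.Constructor;

// Hypothetical stand-in for Lucene's Tokenizer base class.
abstract class SketchTokenizer {
    protected final Reader input;

    protected SketchTokenizer(Reader input) {
        this.input = input;
    }
}

// Hypothetical simple tokenizer that has no factory of its own.
class WholeInputTokenizer extends SketchTokenizer {
    public WholeInputTokenizer(Reader input) {
        super(input);
    }
}

// Hypothetical factory that takes the tokenizer class name as an argument.
class ReflectionTokenizerFactory {
    private final Constructor<? extends SketchTokenizer> ctor;

    // Resolves the class and its (Reader) constructor once, at config load.
    ReflectionTokenizerFactory(String className) {
        try {
            Class<? extends SketchTokenizer> cls =
                Class.forName(className).asSubclass(SketchTokenizer.class);
            this.ctor = cls.getConstructor(Reader.class);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("cannot use " + className + " as a tokenizer", e);
        }
    }

    // Per-instance cost is a single reflective constructor call.
    SketchTokenizer create(Reader input) {
        try {
            return ctor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + ctor.getDeclaringClass().getName(), e);
        }
    }
}
```

Because the class name is an argument to a factory that has "Reflection" in its name, anyone configuring it can see that reflection is in play.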

: At schema creation time, we should probably check if the specified
: Tokenizer or TokenFilter has a default constructor.  If it doesn't we
: should throw an error right then, not waiting for a confusing runtime
: exception.

Why would a default constructor matter? isn't what really matters for all
of the Tokenizers and TokenFilters the one arg constructor?



-Hoss


Yonik Seeley
On 6/5/06, Chris Hostetter <[hidden email]> wrote:
> Why would a default constructor matter? isn't what really matters for all
> of the Tokenizers and TokenFilters the one arg constructor?

Oh, right.  I meant arguments other than the Reader or TokenStream...
things like a stopSet for instance.

-Yonik

Chris Hostetter-3

: Oh, right.  I meant arguments other than the Reader or TokenStream...
: things like a stopSet for instance.

Hmmm.. it seems like any situation where a Filter/Tokenizer needs
additional information beyond the TokenStream/Reader would necessitate a
Factory ... how would you know what order to pass the args from the config
to the constructor, or what datatypes to use?  in the case of a stopSet
how would you know to interpret the param as a filename (and not just a
really small set of words) ?


i think it's okay for the complicated cases to stay complicated.  I'm just
thinking about the little cases where simple TokenFilters get added to
lucene's contrib and the only way for people to use them is if they write
their own factory.


-Hoss