ClassicTokenizer

ClassicTokenizer

Rick Leir-2
Hi all
A while ago the default was changed from ClassicTokenizer to StandardTokenizer. The biggest difference seems to be that Classic does not break on hyphens. There is also a different character pr(mumble). I prefer Classic's behavior of not breaking on hyphens.

What was the reason for changing this default? If I understand this better I can avoid some pitfalls, perhaps.
Thanks -- Rick
--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
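To make the difference concrete, here is a rough, self-contained approximation of the two behaviors. This is not the actual Lucene tokenizer code (the real tokenizers are JFlex grammars); the class and method names are made up for the sketch, and regex splitting is only a crude stand-in for real tokenization rules:

```java
// Illustrative only: approximates how the post-3.1 StandardTokenizer breaks
// on hyphens while ClassicTokenizer keeps hyphenated words together.
public class HyphenDemo {
    static String[] standardLike(String text) {
        // break on anything that is not a letter or digit, including '-'
        return text.split("[^\\p{L}\\p{N}]+");
    }
    static String[] classicLike(String text) {
        // same, except '-' is allowed to stay inside a token
        return text.split("[^\\p{L}\\p{N}-]+");
    }
    public static void main(String[] args) {
        String text = "state-of-the-art wi-fi";
        System.out.println(String.join("|", standardLike(text)));
        // state|of|the|art|wi|fi
        System.out.println(String.join("|", classicLike(text)));
        // state-of-the-art|wi-fi
    }
}
```

The practical consequence: with the standard behavior, a search for "wi" matches a document containing "wi-fi"; with the classic behavior it does not, because "wi-fi" is indexed as a single token.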

Re: ClassicTokenizer

Shawn Heisey-2
On 1/9/2018 9:36 AM, Rick Leir wrote:
> A while ago the default was changed to StandardTokenizer from ClassicTokenizer. The biggest difference seems to be that Classic does not break on hyphens. There is also a different character pr(mumble). I prefer the Classic's non-break on hyphens.

To have any ability to research changes, we're going to need to know
precisely what you mean by "default" in that statement.

Are you talking about the example schemas, or some kind of inherent
default when an analysis chain is not specified?

Probably the reason for the change is an attempt to move into the modern
era, become more standardized, and stop using old/legacy
implementations.  The name of the new default contains the word
"Standard" which would fit in with that goal.

I can't locate any changes in the last couple of years that change the
classic tokenizer to standard.  Maybe I just don't know the right place
to look.

> What was the reason for changing this default? If I understand this better I can avoid some pitfalls, perhaps.

If you are talking about example schemas, then the following may apply:

Because you understand how analysis components work well enough to even
ask your question, I think you're probably the kind of admin who is
going to thoroughly customize the schema and not rely on the defaults
for TextField types that come with Solr.  You're free to continue using
the classic tokenizer in your schema if that meets your needs better
than whatever changes are made to the examples by the devs.  The
examples are only starting points; virtually all Solr installs require
customizing the schema.

Thanks,
Shawn
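As a sketch of that approach (the field type name and filter choice here are illustrative, not from the thread), a custom field type in schema.xml or managed-schema that keeps the classic tokenizer might look like:

```xml
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- pre-3.1 StandardTokenizer behavior: no break on hyphens -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Fields declared with this type keep hyphenated terms like "wi-fi" as single tokens, regardless of what the shipped examples use.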

Re: ClassicTokenizer

Rick Leir-2
Shawn
I did not express that clearly.
The reference guide says "The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous."

So I am curious: why was StandardTokenizer changed after 3.1 to break on hyphens, when the old behavior seems to me to work better?
Thanks
Rick

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: ClassicTokenizer

Shawn Heisey-2
On 1/10/2018 2:27 PM, Rick Leir wrote:
> I did not express that clearly.
> The reference guide says "The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. "
>
> So I am curious to know why they changed StandardTokenizer after 3.1 to break on hyphens, when it seems to me to work better the old way?

I really have no idea.  Those are Lucene classes, not Solr.  Maybe
someone who was around for whatever discussions happened on Lucene lists
back in those days will comment.

I wasn't able to find the issue where ClassicTokenizer was created, and
I couldn't find any information discussing the change.

If I had to guess why StandardTokenizer was updated this way, I think it
is to accommodate searches for a single word when that word appears in
the text only as part of a larger hyphenated term, and so wasn't being
found.  There was probably a discussion among the developers about what
a typical Lucene user would want, so they could decide what they would
have the standard tokenizer do.

Likely because there was a vocal segment of the community reliant on the
old behavior, they preserved that behavior in ClassicTokenizer, but
updated the standard one to do what they felt would be normal for a
typical user.

Obviously *your* needs do not fall in line with what was decided ... so
the standard tokenizer isn't going to work for you.

Thanks,
Shawn

Re: ClassicTokenizer

sarowe
Hi Rick,

Quoting Robert Muir’s comments on https://issues.apache.org/jira/browse/LUCENE-2167 (he’s referring to the word break rules in UAX#29[1] when he says “the standard”):
 
> i actually am of the opinion StandardTokenizer should follow unicode standard tokenization. then we can throw subjective decisions away, and stick with a standard.

> I think it would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex [....] Then its name would actually make sense.


[1] Unicode Standard Annex #29: Unicode Text Segmentation <http://unicode.org/reports/tr29/>

--
Steve
www.lucidworks.com
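For a concrete feel of the UAX#29 word-break rules Muir refers to, the JDK's `java.text.BreakIterator` implements word segmentation broadly in line with UAX#29. It is not Lucene's StandardTokenizer, but the boundary behavior on hyphens is similar: HYPHEN-MINUS is a word boundary, so "Wi-Fi" comes apart. A minimal sketch:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch: UAX#29-style word segmentation via the JDK's BreakIterator.
// Segments without any letter or digit (spaces, bare punctuation) are
// dropped, roughly the way a tokenizer discards non-token characters.
public class Uax29Demo {
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String seg = text.substring(start, end);
            if (seg.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(seg);
            }
        }
        return out;
    }
    public static void main(String[] args) {
        System.out.println(words("Wi-Fi"));  // [Wi, Fi] -- hyphen is a break
    }
}
```

Under these rules the hyphen case Rick cares about splits, which is exactly the pre-3.1 vs post-3.1 difference the thread is about.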
