StandardTokenizerFactory doesn't split on underscore


StandardTokenizerFactory doesn't split on underscore

Rahul Goswami
Hello,
I was recently debugging a problem on Solr 7.7.2 where a query wasn't
returning the desired results. It turned out that the indexed content
contained underscore-separated terms, but the query terms didn't. I was
under the impression that terms separated by an underscore are also split
by StandardTokenizerFactory, but it turns out that's not the case. E.g.,
'hello-world' is tokenized into 'hello' and 'world', but 'hello_world' is
treated as a single token.
Is this a bug or designed behavior?

If this is by design, it would be helpful to include this behavior in the
documentation, since it is similar to the already-documented behavior with
periods:

https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
"Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names. "

Thanks,
Rahul
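For anyone who wants a quick way to see the same distinction outside Solr, here is an illustrative sketch (Python, not Lucene, so this is an analogy rather than the actual StandardTokenizer): Python's regex `\w` class also counts the underscore as a word character, so a naive word tokenizer shows the same split-on-hyphen, keep-on-underscore behavior described above.

```python
import re

def naive_tokenize(text):
    # \w matches letters, digits, and underscore, so "_" stays inside
    # a token while "-" acts as a break. This mirrors the behavior
    # described above, but it is not Lucene's UAX #29 implementation.
    return re.findall(r"\w+", text)

print(naive_tokenize("hello-world"))  # ['hello', 'world']
print(naive_tokenize("hello_world"))  # ['hello_world']
```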

Re: StandardTokenizerFactory doesn't split on underscore

xiefengchang
Did you configure PatternReplaceFilterFactory?

Re: StandardTokenizerFactory doesn't split on underscore

Rahul Goswami
Nope. The underscore is preserved right after tokenization, even before it
reaches any filters. You can choose the type "text_general" and run an
index-time analysis through the "Analysis" page in the Solr Admin UI.

Thanks,
Rahul


Re: StandardTokenizerFactory doesn't split on underscore

Adam Walz
It is expected that the StandardTokenizer will not break on underscores.
The StandardTokenizer follows the Unicode UAX #29
<https://unicode.org/reports/tr29/#Word_Boundaries> word-boundary standard,
which assigns the underscore the "ExtendNumLet" property, and rule WB13a
<https://unicode.org/reports/tr29/#WB13a> says not to break from such
extenders. This is why xiefengchang suggested using a
PatternReplaceFilterFactory after the StandardTokenizer to further split on
underscores if that is your use case.
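One caveat worth noting: a token filter like PatternReplaceFilterFactory rewrites each token in place (e.g. stripping the underscore to yield 'helloworld') but does not emit multiple tokens. To actually get separate tokens, a common approach is a char filter that replaces underscores with spaces before the tokenizer runs, or WordDelimiterGraphFilterFactory after it. A minimal sketch of such a field type (the name "text_split_underscore" is made up here):

```xml
<!-- Hypothetical field type: replace underscores with spaces *before*
     tokenization, so StandardTokenizerFactory sees them as breaks. -->
<fieldType name="text_split_underscore" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer, 'hello_world' should come out as the tokens 'hello' and 'world'; the behavior can be checked on the "Analysis" page in the Solr Admin UI, as mentioned earlier in the thread.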



--
Adam Walz

Re: StandardTokenizerFactory doesn't split on underscore

Rahul Goswami
Ah ok! Thanks Adam and Xiefeng
