UAX29 URL Email Tokenizer not working as expected

UAX29 URL Email Tokenizer not working as expected

Tom Van Cuyck-2
Hi,

The UAX29 URL Email Tokenizer is not working as expected.
According to the documentation (
https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved."

So I expect "ABC-123" to remain "ABC-123"
However, the term is split into two separate tokens, "ABC" and "123".

Same for "AB12-CD34" --> "AB12" and "CD34" etc...

Is this behavior to be expected? Or is there a way to get the behavior I
expect?

Kind regards, Tom

Tom Van Cuyck
Software Engineer

<http://www.ontoforce.com>
ONTOFORCE

Re: UAX29 URL Email Tokenizer not working as expected

sarowe
Hi Tom,

The documentation is wrong. The sentence you quoted was inherited from the Classic Tokenizer's description. UAX29 URL Email Tokenizer is a specialization of Standard Tokenizer, and the 7.2 documentation for Standard Tokenizer says the following:

    Note that words are split at hyphens.

I've made an issue to fix the Solr ref guide: https://issues.apache.org/jira/browse/SOLR-13448

If you don't need the UAX#29 word break rules and identification of URLs and emails, you could switch to Classic Tokenizer, which handles hyphens like you want.

Alternatively, if you want to continue using UAX29 URL Email Tokenizer, you could use a (pre-tokenization) char filter to convert hyphens to something that won't trigger a word break, and then a (post-tokenization) token filter to convert it back to a hyphen. Something like the following (untested; "___" is an example of a string that is unlikely to occur in your data and that, consisting only of ExtendNumLet characters, should not trigger a word break between letters and digits[1]):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\d[A-Za-z]*)-([A-Za-z]*\d)" replacement="$1___$2"/>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="___" replacement="-"/>

(I'm guessing you'll need more than one PatternReplaceCharFilterFactory instance to handle all permutations.)
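
To see why more than one pattern is likely to be needed, the regular expression can be tried outside Solr. The following is only a rough sketch in Python, on the assumption that Python's re module (with re.ASCII, so that \d matches only 0-9 as in Java's default mode) treats this simple pattern the same way Java's regex engine does. The function name protect_hyphens and the "@@" placeholder are made up for the illustration and are not part of Solr.

```python
import re

# Same pattern as in the charFilter; re.ASCII keeps \d limited to 0-9.
HYPHEN_AFTER_AND_BEFORE_DIGIT = re.compile(r"(\d[A-Za-z]*)-([A-Za-z]*\d)", re.ASCII)


def protect_hyphens(text: str, placeholder: str) -> str:
    """Replace each matched hyphen with the placeholder (hypothetical helper).

    Equivalent in spirit to replacement="$1<placeholder>$2" in the char filter.
    """
    return HYPHEN_AFTER_AND_BEFORE_DIGIT.sub(
        lambda m: m.group(1) + placeholder + m.group(2), text
    )


if __name__ == "__main__":
    # "@@" is only a visible stand-in for whatever placeholder is configured.
    for term in ["AB12-CD34", "ABC-123", "123-ABC", "out-of-the-box"]:
        print(term, "->", protect_hyphens(term, "@@"))
```

In this sketch only "AB12-CD34" is rewritten. "ABC-123" has no digit on the left of the hyphen and "123-ABC" has none on the right, so both pass through unchanged and would still be split by the tokenizer. Covering them would take additional patterns.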

FYI the following note from UAX#29 explains why the default word break rules have hyphens trigger word breaks:

    The correct interpretation of hyphens in the context
    of word boundaries is challenging. It is quite common
    for separate words to be connected with a hyphen:
    “out-of-the-box,” “under-the-table,” “Italian-American,”
    and so on. A significant number are hyphenated names,
    such as “Smith-Hawkins.” When doing a Whole Word Search
    or query, users expect to find the word within those
    hyphens. While there are some cases where they are
    separate words (usually to resolve some ambiguity such
    as “re-sort” as opposed to “resort”), it is better
    overall to keep the hyphen out of the default
    definition. Hyphens include U+002D HYPHEN-MINUS,
    U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN,
    and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN.

Steve

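[1] To figure out which chars to use to not trigger a word break, look at rules WB6 through WB13b (https://unicode.org/reports/tr29/#WB6 etc.); "×" in these rules means "do not break". MidLetter and MidNumLet chars (https://unicode.org/reports/tr29/#MidLetter , https://unicode.org/reports/tr29/#MidNumLet) only suppress a break between two letters (WB6, WB7), and MidNum and MidNumLet chars only between two digits (WB11, WB12), so for terms that mix letters and digits around the hyphen an ExtendNumLet char such as "_" (WB13a, WB13b) is the safer bet.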
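
The effect of these rules on a candidate placeholder char can be approximated with a toy model. The sketch below is not Lucene's implementation and not a full UAX#29 segmenter. It covers only a handful of ASCII chars, implements only WB5 through WB13b, and ignores everything else (Extend/Format handling, Katakana, Hebrew letters, URLs and emails). All function names are hypothetical.

```python
import string


def word_break_class(ch: str) -> str:
    """Rough ASCII-only stand-in for the UAX#29 Word_Break property."""
    if ch in string.ascii_letters:
        return "ALetter"
    if ch in string.digits:
        return "Numeric"
    if ch == "_":
        return "ExtendNumLet"
    if ch in (".", "'"):
        return "MidNumLetQ"  # FULL STOP (MidNumLet) and APOSTROPHE (Single_Quote)
    if ch in (",", ";"):
        return "MidNum"
    return "Other"  # includes "-", which no rule below joins


def breaks_between(classes: list, i: int) -> bool:
    """True if the simplified rules allow a break between positions i-1 and i."""
    before, after = classes[i - 1], classes[i]
    before2 = classes[i - 2] if i >= 2 else None
    after2 = classes[i + 1] if i + 1 < len(classes) else None
    alnum = {"ALetter", "Numeric"}
    mid_num = {"MidNum", "MidNumLetQ"}
    if before in alnum and after in alnum:  # WB5, WB8, WB9, WB10
        return False
    if before == "ALetter" and after == "MidNumLetQ" and after2 == "ALetter":  # WB6
        return False
    if before2 == "ALetter" and before == "MidNumLetQ" and after == "ALetter":  # WB7
        return False
    if before2 == "Numeric" and before in mid_num and after == "Numeric":  # WB11
        return False
    if before == "Numeric" and after in mid_num and after2 == "Numeric":  # WB12
        return False
    if before in alnum | {"ExtendNumLet"} and after == "ExtendNumLet":  # WB13a
        return False
    if before == "ExtendNumLet" and after in alnum:  # WB13b
        return False
    return True  # WB999: otherwise break


def tokens(text: str) -> list:
    """Split at allowed breaks and keep only segments containing a letter or digit."""
    if not text:
        return []
    classes = [word_break_class(c) for c in text]
    segments, start = [], 0
    for i in range(1, len(text)):
        if breaks_between(classes, i):
            segments.append(text[start:i])
            start = i
    segments.append(text[start:])
    return [s for s in segments if any(c.isalnum() for c in s)]


if __name__ == "__main__":
    for term in ["ABC-123", "AB12_CD34", "AB12.CD34", "3.14", "can't"]:
        print(term, "->", tokens(term))
```

In this simplified model a hyphen always splits, "." keeps "3.14" together but not "AB12.CD34", and "_" keeps letters and digits together in any combination. A real analysis chain should still be checked with the Analysis screen in the Solr admin UI.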

> On May 6, 2019, at 7:22 AM, Tom Van Cuyck <[hidden email]> wrote:
>
> Hi,
>
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (
> https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
> at hyphens, unless there is a number in the word, in which case the token
> is not split and the numbers and hyphen(s) are preserved."
>
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
>
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
>
> Is this behavior to be expected? Or is there a way to get the behavior I
> expect?
>
> Kind regards, Tom