TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails


Ahmet Arslan
Hi,

I extracted e-mail addresses and URLs from certain TREC collections using UAX29URLEmailTokenizer combined with TypeTokenFilter.

The high-frequency terms reveal that
 * some e-mail addresses start with an apostrophe
 * some e-mail addresses or URLs end with a period.

I ran a few tests, and this behaviour occurs only if the entity is the first or last term in the text.
If the entity is in the middle of the text, UAXURLET strips apostrophes and dots.

For example, "Contact me at [hidden email]. or [hidden email]."
will produce [hidden email].  [hidden email]
Notice that the first email has a dot, while the second does not.

Why does UAXURLET behave differently for the first/last token? Could this be a bug?

It looks like dots and apostrophes are legal parts of these entities, but as a result
abbreviations such as W.Va., D-W.Va. and v.ye. are recognized as URLs.
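
One plausible reading of why these abbreviations qualify, sketched in plain Java: "va" and "ye" are country-code top-level domains, so a host pattern made of dot-separated labels, a known TLD and an optional trailing dot accepts them. The pattern, the three-entry TLD list and the looksLikeHost helper below are simplifications invented for illustration; they are not the grammar the tokenizer is generated from.

```java
import java.util.regex.Pattern;

class AbbreviationAsHost {
    // Hypothetical, heavily reduced TLD list; a real list is far longer.
    private static final String TLD = "(?:org|va|ye)";
    // A label: alphanumerics, optionally with inner hyphens (e.g. "D-W").
    private static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    // One or more labels, a TLD, then an optional trailing root dot.
    private static final Pattern HOST = Pattern.compile(
        "(?:" + LABEL + "\\.)+" + TLD + "\\.?", Pattern.CASE_INSENSITIVE);

    /** True if the whole string has the shape of a host name under the toy pattern. */
    static boolean looksLikeHost(String s) {
        return HOST.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"W.Va.", "D-W.Va.", "v.ye.", "www.apache.org", "e.g."}) {
            System.out.println(s + " -> " + looksLikeHost(s));
        }
    }
}
```

Under this toy pattern the three abbreviations and www.apache.org are accepted, while "e.g." is rejected because "g" is not in the TLD list.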

I created 8 test cases to get your opinions on this before creating a Jira issue.

  // "a" is assumed to be the test class's Analyzer that wraps UAX29URLEmailTokenizer.
  public void testURLEndingWithDot2() throws IOException {
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "My Web addresses are www.apache.org. and lucene.apache.org",
        new String[] {"My", "Web", "addresses", "are", "www.apache.org", "and", "lucene.apache.org"},
        new String[] {"<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<ALPHANUM>", "<URL>", "<ALPHANUM>", "<URL>"});
  }

  public void testEMailStartingWithApostrophe2() throws IOException {
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "'[hidden email] '[hidden email].",
        new String[] {"[hidden email]", "[hidden email]"},
        // One type per expected token: two e-mail addresses, so two <EMAIL> entries.
        new String[] {"<EMAIL>", "<EMAIL>"});
  }


P.S. I observed a somewhat similar phenomenon with the ICU tokenizer.
The ICU tokenizer sets the script attribute to Latin for words that consist of numbers.
But if the whole text is composed of words that consist of numbers, the script attribute is set to Common.

Thanks,
Ahmet


Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

Robert Muir
About what you see with ICU: it is correct; you have to make sure you
handle "Common":

https://github.com/apache/lucene-solr/blob/master/lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ScriptIterator.java

It mostly behaves like
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UScriptRun.html
as far as how it classifies runs of text, except for the differences
in the documentation.
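
The behaviour described in the P.S. can be pictured with a small sketch. This is not the ICU or Lucene code: it uses the JDK's Character.UnicodeScript and a made-up helper, scriptRuns, and it models only the one rule that matters here. Characters whose script is Common or Inherited (digits, spaces, punctuation) join whatever run they sit in, the first "real" script names the run, and a run with no real script at all stays Common.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

class ScriptRunSketch {
    /** Splits text into script runs, each formatted as "SCRIPT:text". */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        UnicodeScript runScript = UnicodeScript.COMMON; // no real script seen yet
        int runStart = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript sc = UnicodeScript.of(cp);
            boolean neutral = sc == UnicodeScript.COMMON || sc == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript == UnicodeScript.COMMON) {
                    runScript = sc; // first real script names the run
                } else if (sc != runScript) {
                    // A different real script closes the current run.
                    runs.add(runScript + ":" + text.substring(runStart, i));
                    runStart = i;
                    runScript = sc;
                }
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(runScript + ":" + text.substring(runStart));
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(scriptRuns("route 66 in 1926")); // [LATIN:route 66 in 1926]
        System.out.println(scriptRuns("1926 66"));          // [COMMON:1926 66]
    }
}
```

In the first input the numbers sit inside a Latin run, so a token cut from it would be reported as Latin; in the second there is no letter to borrow a script from, so the run remains Common. Code that consumes the script attribute therefore needs a branch for Common.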

On Sat, Aug 12, 2017 at 1:46 PM, Ahmet Arslan <[hidden email]> wrote:

> P.S. I observed somehow similar phenomena with ICU tokenizer.
> ICU tokenizer sets script attribute to Latin for words that consist of
> numbers.
> But if the whole text is composed of words that consist of numbers, script
> attribute is set to Common.



Re: TestUAX29URLEmailTokenizer inconsistent adding dots and apostrophes to URLs and Emails

sarowe
In reply to this post by Ahmet Arslan
Hi Arslan,

UAX29URLEmailTokenizerImpl.jflex includes ASCIITLD.jflex-macro, which has this at the end:

> ) "."?   // Accept trailing root (empty) domain

So trailing dots are recognized as part of domains that are included in URLs and email addresses.  But maybe they shouldn’t be?  (Except maybe in a URL that contains trailing elements: port/path/query/fragment; in that case the trailing dot should definitely be recognized.)
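
To make the effect of that optional trailing dot concrete, here is a rough sketch that uses java.util.regex rather than JFlex. The URL pattern, the two-entry TLD list and the extract helper are invented for illustration. The keepBareRootDot flag shows one possible policy, under the assumption that a dot directly after a bare host is sentence punctuation, while a dot followed by a port or path belongs to the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingDotSketch {
    // Labels, a reduced TLD list, the optional root dot (group 1),
    // then an optional port and/or path (group 2).
    private static final Pattern URL = Pattern.compile(
        "(?:[A-Za-z0-9-]+\\.)+(?:org|com)(?![A-Za-z0-9-])(\\.?)((?::\\d+)?(?:/\\S*)?)");

    /**
     * Extracts host-like tokens. When keepBareRootDot is false, a trailing
     * root dot is dropped unless a port or path follows it.
     */
    static List<String> extract(String text, boolean keepBareRootDot) {
        List<String> out = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            String token = m.group();
            boolean hasRootDot = !m.group(1).isEmpty();
            boolean bareHost = m.group(2).isEmpty();
            if (!keepBareRootDot && hasRootDot && bareHost) {
                token = token.substring(0, token.length() - 1); // drop the final "."
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "My Web addresses are www.apache.org. and lucene.apache.org";
        System.out.println(extract(text, true));  // [www.apache.org., lucene.apache.org]
        System.out.println(extract(text, false)); // [www.apache.org, lucene.apache.org]
    }
}
```

With the flag set to false, the sentence from the test case yields the tokens the test expects, and an input such as "www.apache.org.:8080/docs" still keeps its root dot.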

I’m not sure why apostrophe and trailing-dot recognition depends on where they occur.  This is not intentional IIRC.

--
Steve
www.lucidworks.com
