[jira] Created: (LUCENE-1438) StandardTokenizer splits host names with hyphens into multiple tokens


[jira] Created: (LUCENE-1438) StandardTokenizer splits host names with hyphens into multiple tokens

JIRA jira@apache.org
StandardTokenizer splits host names with hyphens into multiple tokens
---------------------------------------------------------------------

                 Key: LUCENE-1438
                 URL: https://issues.apache.org/jira/browse/LUCENE-1438
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis
    Affects Versions: 2.4
            Reporter: Robert Newson



StandardTokenizer does not recognize host names with hyphens as a single HOST token. Specifically "www.m-w.com" is tokenized as "www.m" and "w.com", both of "<HOST>" type.

StandardTokenizer should instead output a single HOST token for "www.m-w.com", since hyphens are a legitimate character in DNS host names.
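As a rough illustration of the host name syntax in question, here is a minimal sketch in plain Java (no Lucene involved). The class and method names are hypothetical, and the rules encoded are the usual DNS host name conventions as I understand them: labels of ASCII letters, digits, and hyphens, 1 to 63 characters each, with no hyphen at the start or end of a label.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene.
final class HostNameSyntax {
    // One label: starts and ends with a letter or digit, hyphens allowed
    // only in the interior, 1 to 63 characters in total.
    private static final Pattern LABEL =
            Pattern.compile("[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    private HostNameSyntax() {}

    /** Returns true if every dot-separated label of name is well formed. */
    static boolean isValidHostName(String name) {
        if (name == null || name.isEmpty() || name.length() > 253) {
            return false;
        }
        // A limit of -1 keeps empty labels, so "www..com" is rejected.
        for (String label : name.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions "www.m-w.com" passes the check, while a label such as "m-" (trailing hyphen) does not. The check is purely syntactic, so it says nothing about whether a name actually resolves.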

We have a local fix to the grammar file, which also required us to significantly simplify the NUM type to get the behavior we needed for host names.

Here's a JUnit test for the desired behavior:

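        // Needs java.io.StringReader, org.apache.lucene.analysis.Token and
        // org.apache.lucene.analysis.standard.StandardTokenizer on the import list.
        public void testWithHyphens() throws Exception {
                final String host = "www.m-w.com";
                final StandardTokenizer tokenizer = new StandardTokenizer(
                                new StringReader(host));
                // Use the returned Token; it is not guaranteed to be the instance passed in.
                final Token token = tokenizer.next(new Token());
                assertNotNull(token);
                assertEquals("<HOST>", token.type());
                assertEquals("www.m-w.com", token.term());
                // The whole host should be one token, so nothing may follow it.
                assertNull(tokenizer.next(new Token()));
        }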



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-1438) StandardTokenizer splits host names with hyphens into multiple tokens

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657143#action_12657143 ]

Greg Shackles commented on LUCENE-1438:
---------------------------------------

I ran into this same problem.  Some examples are:

www.1-800-flowers.com gets split into www.1-800 and flowers.com

1-800-flowers.com gets split into 1-800-flowers and com


Is there any chance of this being looked at sometime soon?



[jira] Commented: (LUCENE-1438) StandardTokenizer splits host names with hyphens into multiple tokens

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657161#action_12657161 ]

Shai Erera commented on LUCENE-1438:
------------------------------------

These two are not so simple to tackle. They are the result of several rules.

1-800-flowers.com is split that way because of the NUM rule and the HOST rule. The HOST rule requires the token to have some alphanumeric characters, followed by one or more ("." {ALPHANUM}) strings. Therefore this string is not detected as a HOST.
If we were to change the rule to recognize that string as a HOST, then we'd be wrong for strings like "file.pdf", which is clearly not a host. So I don't see how we can satisfy everyone.

As for the string www.1-800-flowers.com, it is split because "-" is not included in the HOST definition.

But, as was already stated in other threads about StandardTokenizer, it is just a default tokenizer that ships with Lucene. It is not meant to be *THE* tokenizer, and what makes sense to one user will not fit another.

Personally, I think that if you want to correctly identify hosts (or emails, or any other pattern), you should use a specially written annotator rather than rely on some rule that we can argue about for ages. (It would be interesting to see a contribution, if there isn't one yet, that integrates UIMA with Analyzer.) Or simply copy most of StandardTokenizer's grammar and change the HOST definition. I believe that will give more promising results than trying to change the HOST definition in StandardTokenizer now, since it depends on other definitions, like {NUM}, which can sometimes override other definitions.
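To make the HOST discussion above a bit more concrete, here is a small sketch that approximates the rule with java.util.regex instead of the real JFlex grammar. Everything in it is an assumption for illustration: the class name, the ASCII-only stand-in for {ALPHANUM}, and the hyphen-tolerant variant are not taken from Lucene, and the sketch ignores NUM and every other rule that competes with HOST in the actual tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex approximation of the HOST rule; not the Lucene grammar.
final class HostRuleSketch {
    // HOST as described above: alphanumerics followed by one or more
    // ("." alphanumerics) groups. ASCII only, which is a simplification.
    static final Pattern HOST_NO_HYPHEN =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    // Assumed variant: each label may also contain interior hyphens.
    static final Pattern HOST_WITH_HYPHEN = Pattern.compile(
            "[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:\\.[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)+");

    private HostRuleSketch() {}

    /** Collects every non-overlapping match of rule in text, left to right. */
    static List<String> scan(Pattern rule, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }
}
```

With the first pattern, scanning "www.m-w.com" yields "www.m" and "w.com", the same split as in the original report. With the hyphen-tolerant pattern it yields the single match "www.m-w.com". Both patterns also match "file.pdf" in full, so a rule of this shape cannot tell a file name from a host on its own, and the interaction with {NUM} would still have to be worked out in a copied grammar.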

