[jira] [Created] (LUCENE-3979) NGramTokenizer

[jira] [Created] (LUCENE-3979) NGramTokenizer

Markus Jelsma (Jira)
NGramTokenizer
--------------

                 Key: LUCENE-3979
                 URL: https://issues.apache.org/jira/browse/LUCENE-3979
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 3.0, 2.9.2
         Environment: n/a
            Reporter: David Mason
            Priority: Minor


org.apache.lucene.analysis.ngram.NGramTokenizer trims leading and trailing whitespace from its input, so searches for literal strings like " test" and "test " are equivalent to a search for "test". Searching with significant whitespace is sometimes desired, particularly where ngrams are used.

This could be fixed by either removing .trim() from the line shown below, or by providing a flag to specifically set trimming behaviour (keeping trim=true as the default so that existing code using this analyzer is not broken).

111: inStr = new String(chars).trim();  // remove any trailing empty strings
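
To illustrate the difference the proposed flag would make, here is a minimal stand-alone sketch. It is not the Lucene source and does not use the Lucene API: the class NGramTrimSketch, the method ngrams, and the boolean trim parameter are all hypothetical names chosen for this example, and the n-gram loop is a simplified assumption about how the tokenizer emits grams.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; NOT the Lucene NGramTokenizer implementation.
final class NGramTrimSketch {

    // Returns every character n-gram of length minGram..maxGram, shortest first.
    // The 'trim' flag is an assumed option that mirrors the proposal in this issue:
    // true reproduces the reported behaviour, false keeps surrounding whitespace.
    static List<String> ngrams(String input, int minGram, int maxGram, boolean trim) {
        String s = trim ? input.trim() : input;
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                grams.add(s.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With trimming, " test" and "test" produce the same bigrams.
        System.out.println(ngrams(" test", 2, 2, true));   // [te, es, st]
        // Without trimming, the leading space survives in the first bigram.
        System.out.println(ngrams(" test", 2, 2, false));  // [ t, te, es, st]
    }
}
```

Defaulting the flag to true, as suggested above, would leave existing callers unaffected while letting new callers opt in to whitespace-preserving grams.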


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] [Commented] (LUCENE-3979) NGramTokenizer

Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253086#comment-13253086 ]

David Mason commented on LUCENE-3979:
-------------------------------------

I'm happy to submit a patch for this, but I haven't done so for this or similar projects before, so it will take me a while to go through the wiki and get set up to make a patch.
               

> NGramTokenizer
> --------------
>
>                 Key: LUCENE-3979
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3979
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 2.9.2, 3.0
>         Environment: n/a
>            Reporter: David Mason
>            Priority: Minor
>              Labels: tokenizer, whitespace
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> org.apache.lucene.analysis.ngram.NGramTokenizer removes whitespace, making a search for literal strings like " test" and "test " equivalent to "test". Searching with relevant whitespace is sometimes desired, particularly where ngrams are used.
> This could be fixed by either removing .trim() from the line shown below, or by providing a flag to specifically set trimming behaviour (keeping trim=true as the default so that existing code using this analyzer is not broken).
> 111: inStr = new String(chars).trim();  // remove any trailing empty strings

[jira] [Updated] (LUCENE-3979) NGramTokenizer strips whitespace, with no option to keep leading and trailing whitespace

Markus Jelsma (Jira)
In reply to this post by Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Mason updated LUCENE-3979:
--------------------------------

    Summary: NGramTokenizer strips whitespace, with no option to keep leading and trailing whitespace  (was: NGramTokenizer)
   

