[jira] Created: (LUCENE-2788) Make CharFilter reusable


[jira] Created: (LUCENE-2788) Make CharFilter reusable

JIRA jira@apache.org
Make CharFilter reusable
------------------------

                 Key: LUCENE-2788
                 URL: https://issues.apache.org/jira/browse/LUCENE-2788
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Robert Muir
            Priority: Minor


The CharFilter API lets you wrap a Reader, altering the contents before the Tokenizer sees them.
It also allows you to correct the offsets so this is transparent to highlighting.

One problem is that the API isn't reusable; if you have a lot of short documents, it's going to be inefficient.
Additionally, there is some unnecessary wrapping in Tokenizer (see the CharReader.get in the ctor, but *not* in reset(Reader)!!!)
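
To make the "wrap a Reader and correct the offsets" idea concrete, here is a minimal sketch that uses only java.io. It is not Lucene's CharFilter: the class name HyphenStrippingReader, its correctOffset(int) method, and the demo class are hypothetical names chosen for illustration. The wrapper drops every '-' before a consumer sees the text and remembers where each surviving character came from, so an offset in the filtered text can be mapped back to the original.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter (not a Lucene class): drops every '-' from the wrapped Reader. */
class HyphenStrippingReader extends FilterReader {
    // Original-stream offset of each character emitted so far.
    private final List<Integer> originalOffsets = new ArrayList<>();
    // Number of characters consumed from the wrapped Reader.
    private int inputPos = 0;

    HyphenStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            int offset = inputPos++;
            if (c != '-') {
                originalOffsets.add(offset);
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Maps an offset in the filtered text back to an offset in the original text. */
    int correctOffset(int filteredOffset) {
        if (filteredOffset < originalOffsets.size()) {
            return originalOffsets.get(filteredOffset);
        }
        return inputPos; // at or past the end of what has been read so far
    }
}

class OffsetDemo {
    public static void main(String[] args) throws IOException {
        HyphenStrippingReader r = new HyphenStrippingReader(new StringReader("e-mail"));
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) sb.append((char) c);
        // 'm' is at offset 1 in "email" but at offset 2 in "e-mail".
        System.out.println(sb + " " + r.correctOffset(1)); // prints "email 2"
    }
}
```

A highlighter working on the original text would use the corrected offset (2) rather than the filtered one (1). The sketch also shows the reuse problem described above: the wrapper is bound to one Reader, so each new document needs a new wrapper object.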



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-2788) Make CharFilter reusable

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2788:
--------------------------------

    Attachment: LUCENE-2788.patch

Here's a very quick patch (all tests pass):
* Changed CharFilter to extend FilterReader, and removed CharStream and CharReader.
* Added reset(Reader) to CharFilter, so you can reset your entire charfilter chain with a new reader.
* Changed Solr to re-use the charfilter chain in its Analyzers.
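
The reset(Reader) idea in the list above can be sketched roughly as follows. This is not the code from LUCENE-2788.patch: ReusableFilter, ReplaceCharReader, onReset() and ReuseDemo are hypothetical names, and the sketch relies only on the fact that java.io.FilterReader keeps its wrapped Reader in the protected, non-final field `in`.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical base class (not from the patch): a FilterReader that can be re-pointed at new input. */
abstract class ReusableFilter extends FilterReader {
    ReusableFilter(Reader in) {
        super(in);
    }

    /** Resets this filter and every ReusableFilter beneath it to read from newInput. */
    void reset(Reader newInput) {
        if (in instanceof ReusableFilter) {
            ((ReusableFilter) in).reset(newInput); // forward down the chain
        } else {
            in = newInput; // bottom of the chain: swap the raw input
        }
        onReset();
    }

    /** Hook for subclasses that keep per-document state (offset maps, buffers). */
    protected void onReset() {}
}

/** Replaces one character with another; stateless, so onReset() is not overridden. */
class ReplaceCharReader extends ReusableFilter {
    private final char from;
    private final char to;

    ReplaceCharReader(Reader in, char from, char to) {
        super(in);
        this.from = from;
        this.to = to;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == from ? to : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            if (buf[i] == from) buf[i] = to;
        }
        return n;
    }
}

class ReuseDemo {
    static String drain(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Build the two-filter chain once...
        ReusableFilter chain = new ReplaceCharReader(
                new ReplaceCharReader(new StringReader("a-b_c"), '-', ' '), '_', ' ');
        System.out.println(drain(chain)); // prints "a b c"

        // ...then reuse it for the next short document instead of rebuilding it.
        chain.reset(new StringReader("x_y-z"));
        System.out.println(drain(chain)); // prints "x y z"
    }
}
```

The point of the design is that the filter objects are allocated once per analyzer rather than once per document, which is where the savings for many short documents would come from. A filter that tracks offset corrections would also need to clear that state in its reset hook.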

It would be nice to add specific reuse tests to some of these charfilters, and also to see
if there's a way we can do this with any backwards compatibility... I didn't worry about that when making the patch.




[jira] [Commented] (LUCENE-2788) Make CharFilter reusable

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234556#comment-13234556 ]

Michael McCandless commented on LUCENE-2788:
--------------------------------------------

+1

I really like the approach here (just using FilterReader instead of our own new class).

Since the back-compat is going to be tricky... maybe we should first commit this patch to trunk?
               


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



[jira] [Commented] (LUCENE-2788) Make CharFilter reusable

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234860#comment-13234860 ]

Robert Muir commented on LUCENE-2788:
-------------------------------------

The patch likely needs to be brought up to speed (probably not too bad, but maybe some work).

I'm gonna be focused on 3.6 for a while, so if anyone wants to take this, feel free!
               
