[jira] [Updated] (LUCENE-2309) Fully decouple IndexWriter from analyzers

Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Male updated LUCENE-2309:
-------------------------------

    Attachment: LUCENE-2309-analyzer-based.patch

Okay, I thought about this overnight and have tried to come up with a middle ground.  Again, this is very much a proof of concept.

- Analyzer moves away from exposing TokenStream (although I've left those methods in place) and instead returns an AttributeSource.
- Field.consume() becomes Field.consume(AttributeConsumer, Analyzer), where the Analyzer is the one passed into IW.  This lets each Field decide how it wants to expose its terms: the default implementation uses the Analyzer, but other implementations can do what they like.
- I've dropped the addition of Analyzer to FieldType, but it could still be exposed there as an expert option.

The overall idea is that the Fields now control how terms are given to DocInverter.
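
To make the shape of this concrete, here is a rough, self-contained sketch of the idea. It is not code from the patch: every type below (TermSource, Analyzer, AttributeConsumer, Field, PreAnalyzedField) is a hypothetical stand-in written for illustration, with TermSource playing the role of the AttributeSource and invert() playing the role of DocInverter. The real signatures in the patch may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins only; none of these are Lucene's real classes.
final class ConsumeSketch {

    /** Stand-in for an AttributeSource: a cursor over term values. */
    interface TermSource {
        boolean incrementToken(); // advance; returns false when exhausted
        String term();            // the current term
    }

    /** Stand-in for the Analyzer that is passed into IW. */
    interface Analyzer {
        TermSource analyze(String fieldName, String text);
    }

    /** Stand-in for AttributeConsumer: what the indexing side implements. */
    interface AttributeConsumer {
        void consume(String fieldName, TermSource source);
    }

    /** Default behaviour: the field hands its text to the supplied Analyzer. */
    static class Field {
        final String name;
        final String value;

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }

        void consume(AttributeConsumer consumer, Analyzer analyzer) {
            consumer.consume(name, analyzer.analyze(name, value));
        }
    }

    /** A field that ignores the Analyzer and supplies its own terms. */
    static class PreAnalyzedField extends Field {
        private final List<String> terms;

        PreAnalyzedField(String name, List<String> terms) {
            super(name, null);
            this.terms = terms;
        }

        @Override
        void consume(AttributeConsumer consumer, Analyzer analyzer) {
            consumer.consume(name, listSource(terms));
        }
    }

    /** Wraps a list of terms as a TermSource. */
    static TermSource listSource(List<String> terms) {
        final Iterator<String> it = terms.iterator();
        return new TermSource() {
            private String current;

            public boolean incrementToken() {
                if (!it.hasNext()) return false;
                current = it.next();
                return true;
            }

            public String term() {
                return current;
            }
        };
    }

    /** Toy analyzer: splits on whitespace and lowercases. */
    static final Analyzer WHITESPACE_LOWERCASE = (fieldName, text) ->
        listSource(Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));

    /** Stand-in for the inverter: it only ever sees term sources, never analyzers' internals. */
    static List<String> invert(List<Field> doc, Analyzer analyzer) {
        List<String> postings = new ArrayList<>();
        AttributeConsumer consumer = (fieldName, source) -> {
            while (source.incrementToken()) {
                postings.add(fieldName + ":" + source.term());
            }
        };
        for (Field f : doc) {
            f.consume(consumer, analyzer);
        }
        return postings;
    }

    public static void main(String[] args) {
        List<Field> doc = Arrays.asList(
            new Field("title", "Hello World"),
            new PreAnalyzedField("tags", Arrays.asList("A", "B")));
        System.out.println(invert(doc, WHITESPACE_LOWERCASE));
    }
}
```

In this sketch the first field goes through the analyzer while the second bypasses it entirely, and the consuming side cannot tell the difference, which is the decoupling being aimed at.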

> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2309-analyzer-based.patch, LUCENE-2309.patch
>
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (the field is not analyzed; it is analyzed but
> supplied as a Reader or String; it is pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).
