Michael McCandless commented on LUCENE-2309:
I think there are two rather separate ideas here?
First, IW should not have to "know" how to get a TokenStream from an
IndexableField; it should only ask the Field for the token stream and get that
back and iterate its tokens.
Under the hood (in the IndexableField impl) is where the logic for
tokenized or not, Reader vs String vs pre-created token stream,
etc. should live, instead of hardwired inside indexer. Maybe an app
has a fully custom way to make a token stream for the field...
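To make the first idea concrete, here is a minimal sketch in plain Java.
It does not use Lucene's real classes: SketchField, tokenized(),
keyword(), preAnalyzed() and invert() are hypothetical names invented
for illustration, and a token is reduced to a bare String. The point is
only that the indexer side asks the field for a token stream and
iterates it, while the "tokenized or not / pre-analyzed" decision lives
in the field impl.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class FieldOwnsTokenStream {

    // Hypothetical stand-in for IndexableField: all the indexer ever sees.
    interface SketchField {
        String name();
        Iterator<String> tokenStream();
    }

    // Tokenized String value: split on whitespace inside the field impl.
    static SketchField tokenized(final String name, final String text) {
        return new SketchField() {
            public String name() { return name; }
            public Iterator<String> tokenStream() {
                String trimmed = text.trim();
                if (trimmed.isEmpty()) {
                    return Collections.emptyIterator();
                }
                return Arrays.asList(trimmed.split("\\s+")).iterator();
            }
        };
    }

    // Not tokenized: the whole value becomes a single token.
    static SketchField keyword(final String name, final String value) {
        return new SketchField() {
            public String name() { return name; }
            public Iterator<String> tokenStream() {
                return Collections.singletonList(value).iterator();
            }
        };
    }

    // Pre-analyzed: the app hands over tokens it produced itself.
    static SketchField preAnalyzed(final String name, final List<String> tokens) {
        return new SketchField() {
            public String name() { return name; }
            public Iterator<String> tokenStream() { return tokens.iterator(); }
        };
    }

    // Stand-in for the indexer: it only asks for the stream and iterates it.
    static List<String> invert(SketchField field) {
        List<String> postings = new ArrayList<>();
        Iterator<String> tokens = field.tokenStream();
        while (tokens.hasNext()) {
            postings.add(field.name() + ":" + tokens.next());
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(invert(tokenized("body", "hello token stream")));
        System.out.println(invert(keyword("id", "doc 42")));
        System.out.println(invert(preAnalyzed("tags", Arrays.asList("x", "y"))));
    }
}
```

Note that invert() has no branches for the three kinds of field; a
fully custom field would only need to implement tokenStream().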
Likewise, for multi-valued fields, IW shouldn't "see" the separate
values; it should just receive a single token stream, and under the
hood (in Document/Field impl) it's concatenating separate token
streams, adding posIncr/offset gaps, etc. This too is now hardwired
in indexer but shouldn't be. Maybe an app wants to insert custom
"separator" tokens between the values...
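As a rough illustration of the multi-valued case, the sketch below
concatenates the per-value streams into one and adds a position
increment gap at each value boundary. It is plain Java, not Lucene
code: Token, analyze(), concat() and positions() are hypothetical
names, offsets are left out for brevity, and the optional separator
token is just one possible way an app might customize the boundary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MultiValuedConcat {

    // Hypothetical token: term text plus position increment
    // relative to the previous token.
    static final class Token {
        final String term;
        final int posIncr;
        Token(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
        @Override public String toString() { return term + "(+" + posIncr + ")"; }
    }

    // Toy analysis: whitespace split, every token advances by one position.
    static List<Token> analyze(String value) {
        List<Token> tokens = new ArrayList<>();
        for (String term : value.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                tokens.add(new Token(term, 1));
            }
        }
        return tokens;
    }

    // Concatenate all values into the single stream the indexer would see.
    // posIncrGap is added to the first token of every value after the first;
    // separator, if non-null, is emitted as an extra token between values.
    static List<Token> concat(List<String> values, int posIncrGap, String separator) {
        List<Token> out = new ArrayList<>();
        boolean first = true;
        for (String value : values) {
            List<Token> tokens = analyze(value);
            if (tokens.isEmpty()) {
                continue; // a value with no tokens adds no gap in this sketch
            }
            if (!first && separator != null) {
                out.add(new Token(separator, 1));
            }
            for (int i = 0; i < tokens.size(); i++) {
                Token t = tokens.get(i);
                int incr = (i == 0 && !first) ? t.posIncr + posIncrGap : t.posIncr;
                out.add(new Token(t.term, incr));
            }
            first = false;
        }
        return out;
    }

    // Absolute positions implied by the increments (first token at 0).
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posIncr;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> stream = concat(Arrays.asList("a b", "c"), 100, null);
        System.out.println(stream + " at " + positions(stream));
    }
}
```

The consumer of concat() sees one flat stream and never learns how many
values the field had.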
(And I agree: as a pre-req we need to fix Analyzer to not allow
non-reused token streams; else we can't concatenate w/o attr
If IW still receives analyzer and simply passes it through when asking
for the tokenStream I think that's fine for now. In the future, I
think IW should not receive analyzer (ie, it should be agnostic to how
the app creates token streams); rather, each FieldType would hold the
analyzer for that field. However, that sounds contentious, so let's
leave it for another day.
Second, this new idea to "invert" TokenStream into an AttrConsumer,
which I think is separate? I'm actually not sure I like such an
approach... it seems more confusing for simple usage? Ie, if I want
to analyze some text and iterate over the tokens... suddenly, instead
of a few lines of local code, I have to make a class instance with a
method that receives each token? It seems more convoluted? I
mean, for Lucene's limited internal usage of token stream, this is
fine, but for others who consume token streams... it seems more
Anyway, I think we should open a separate issue for "invert
TokenStream into AttrConsumer"?
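For comparison, here is a small sketch of the two consumption styles in
plain Java. Nothing here is the API proposed on the issue:
TokenConsumer, pullStream() and pushStream() are hypothetical, and
tokens are reduced to Strings. The pull version iterates in local code;
the push version has to hand over an object whose method receives each
token.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class PullVersusPush {

    // Hypothetical callback for the "inverted" style.
    interface TokenConsumer {
        void accept(String term);
    }

    // Toy analysis shared by both styles: whitespace split.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Pull style: the caller gets a stream and drives the iteration.
    static Iterator<String> pullStream(String text) {
        return split(text).iterator();
    }

    // Push style: the stream drives and calls back into the consumer.
    static void pushStream(String text, TokenConsumer consumer) {
        for (String term : split(text)) {
            consumer.accept(term);
        }
    }

    // Collect terms by pulling: a local loop.
    static List<String> collectByPull(String text) {
        List<String> terms = new ArrayList<>();
        Iterator<String> it = pullStream(text);
        while (it.hasNext()) {
            terms.add(it.next());
        }
        return terms;
    }

    // Collect terms by pushing: a class instance with a per-token method.
    static List<String> collectByPush(String text) {
        final List<String> terms = new ArrayList<>();
        pushStream(text, new TokenConsumer() {
            @Override public void accept(String term) {
                terms.add(term);
            }
        });
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(collectByPull("analyze some text"));
        System.out.println(collectByPush("analyze some text"));
    }
}
```

Both methods return the same terms; the difference is only in who owns
the loop.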
> Fully decouple IndexWriter from analyzers
> Key: LUCENE-2309
> URL: https://issues.apache.org/jira/browse/LUCENE-2309
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
> Attachments: LUCENE-2309-analyzer-based.patch, LUCENE-2309.patch
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (it's not analyzed; it is analyzed but it's a
> Reader or a String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields. This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there. (We'd still need existing IW code for back-compat).
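A minimal sketch of the "FieldType knows the analyzer" idea from the
issue description, in plain Java. Only the method name getAttrSource()
comes from the text above; its signature, SketchFieldType,
SketchAnalyzer and index() are assumptions made for illustration, and
an Iterator of Strings stands in for a real attribute source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

class FieldTypeOwnsAnalyzer {

    // Hypothetical analyzer: turns text into a list of terms.
    interface SketchAnalyzer {
        List<String> analyze(String text);
    }

    // Hypothetical field type that holds the analyzer for its fields.
    static final class SketchFieldType {
        private final SketchAnalyzer analyzer; // null means "not analyzed"

        SketchFieldType(SketchAnalyzer analyzer) {
            this.analyzer = analyzer;
        }

        // Name taken from the issue text; this signature is a guess.
        Iterator<String> getAttrSource(String value) {
            if (analyzer == null) {
                return Collections.singletonList(value).iterator();
            }
            return analyzer.analyze(value).iterator();
        }
    }

    // Toy analyzer: whitespace split plus lowercasing.
    static final SketchAnalyzer LOWERCASE_WHITESPACE = text -> {
        List<String> terms = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    };

    // Stand-in for IW: it never receives an analyzer, only the field type.
    static List<String> index(String fieldName, SketchFieldType type, String value) {
        List<String> postings = new ArrayList<>();
        Iterator<String> source = type.getAttrSource(value);
        while (source.hasNext()) {
            postings.add(fieldName + ":" + source.next());
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(index("title", new SketchFieldType(LOWERCASE_WHITESPACE), "Hello World"));
        System.out.println(index("title", new SketchFieldType(null), "Hello World"));
    }
}
```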