Chris Male commented on LUCENE-2309:
I just want to re-iterate the point that I'm just exploring options here. I'm really glad to be getting feedback but no direction has been set in stone.
I thought about this a lot, I'm not sure we need to minimize TokenStream, I think it might be fine! There is really nothing much more minimal than this, its just attributesource + incrementToken() + reset() + end() + close()...
Okay interesting. I admit I don't really have any concrete thoughts on an 'alternative' to TokenStream, but wouldn't you agree that exposing a more limited API to the Indexer is beneficial here? I agree we're only talking about a handful of methods currently.
Yeah but thats currently an expert option, not the normal case. In general if we are saying IndexWriter doesn't take Analyzer but has some schema that is a list of Fields, then QueryParser etc need to take this list of fields also, so that queryparsing is consistent with analysis. But i'm not sure I like this either: I think it only couples QueryParser with IndexWriter.
Two things that jump out at me here. Firstly, IndexWriter does take a list of Fields, the Fields it indexes in each Document. We've also identified that people might want to apply different analysis per field, thats why we have PerFieldAnalyzerWrapper isn't it?
Secondly, we now have multiple QueryParsing implementations consolidated together. I hope over time we'll add more / different implementations which do things differently. So while I totally agree with the sentiment that we shouldn't make this confusing for users and that searching and indexing should work together, I'm certainly open to exploring other ways to do that.
I've looked at doing e.g. the first of these and I know its a huge mondo pain in the ass, these "smaller" changes are really hard and a ton of work.
Independent of what's decided here, I'm definitely happy to do the heavy lifting on the improvements you've suggested.
> Fully decouple IndexWriter from analyzers
> Key: LUCENE-2309
> URL: https://issues.apache.org/jira/browse/LUCENE-2309 > Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
> Attachments: LUCENE-2309.patch
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzers, invokes analyzer.reusableTokenStream, has to deal with a
> wide variety (it's not analyzed; it is analyzed but it's a Reader,
> String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields. This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there. (We'd still need existing IW code for back-compat).