Refactor TokenStream implementations to receive configuration from AttributeSource


Adriano Crestani

The AttributeSource/Attribute API was recently added to Lucene. It allows dynamic communication between TokenStreams with better performance, avoiding unnecessary object creation and casting. The new query parser framework takes advantage of this feature by using Attributes to hold the query parser configuration. The user creates a QueryConfigHandler (which is an AttributeSource) and adds custom Attributes to it; then, at processing time, the query processors load this configuration from the QueryConfigHandler and act on it.

I propose a simple refactoring of all TokenStream implementations so that they load their configuration from Attributes. Today, for example, when you use the LengthFilter, you need to specify the min and max length in the constructor. That is fine, but when you create your own Analyzer containing N nested TokenStreams, all of the configuration becomes more or less hardcoded.

The TokenStream nesting inside an Analyzer resembles the QueryNodeProcessorPipeline in the new QP framework, which is also a pipeline of processors. However, when you assemble the processor pipeline, no configuration is specified: the user just needs to supply a QueryConfigHandler (an AttributeSource), from which all the processors pull their configuration at processing time. It may look like an overly complex design for a simple scenario, but it is quite useful when you have many different kinds of processors/TokenStreams assembled, each of which requires a lot of configuration data. With this design we separate the assembly of TokenStreams/processors from their configuration.
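
To make the idea concrete, here is a rough, Lucene-free sketch of the pattern. Everything in it is hypothetical and only for illustration: ConfigSource stands in for an AttributeSource/QueryConfigHandler, LengthConfig for a configuration Attribute, and Stage for a TokenStream or query processor. None of these names are existing Lucene classes, and a real refactoring would of course go through the actual Attribute API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ConfigFromSourceSketch {

    /** Hypothetical stand-in for an AttributeSource: one config object per class. */
    static final class ConfigSource {
        private final Map<Class<?>, Object> attributes = new HashMap<>();

        <T> void put(Class<T> type, T value) {
            attributes.put(type, value);
        }

        <T> T get(Class<T> type) {
            Object value = attributes.get(type);
            if (value == null) {
                throw new IllegalStateException("No configuration for " + type.getSimpleName());
            }
            return type.cast(value);
        }
    }

    /** Hypothetical configuration "attribute" holding the length bounds. */
    static final class LengthConfig {
        final int min;
        final int max;

        LengthConfig(int min, int max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Hypothetical pipeline stage; it receives the config source at processing time. */
    interface Stage {
        List<String> process(List<String> tokens, ConfigSource config);
    }

    /** Needs no configuration at all. */
    static final class LowerCaseStage implements Stage {
        public List<String> process(List<String> tokens, ConfigSource config) {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        }
    }

    /** No min/max in the constructor: the bounds are pulled from the config source. */
    static final class LengthStage implements Stage {
        public List<String> process(List<String> tokens, ConfigSource config) {
            LengthConfig length = config.get(LengthConfig.class);
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                if (token.length() >= length.min && token.length() <= length.max) {
                    out.add(token);
                }
            }
            return out;
        }
    }

    /** Assembly only: no configuration values appear here. */
    static List<Stage> assemblePipeline() {
        return Arrays.<Stage>asList(new LowerCaseStage(), new LengthStage());
    }

    /** Runs every stage in order, handing each one the same config source. */
    static List<String> run(List<Stage> pipeline, List<String> tokens, ConfigSource config) {
        List<String> current = tokens;
        for (Stage stage : pipeline) {
            current = stage.process(current, config);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Stage> pipeline = assemblePipeline();
        List<String> tokens = Arrays.asList("a", "Lucene", "Token", "IS", "fast");

        ConfigSource config = new ConfigSource();
        config.put(LengthConfig.class, new LengthConfig(3, 5));
        System.out.println(run(pipeline, tokens, config)); // prints [token, fast]

        // Same assembled pipeline, different configuration.
        config.put(LengthConfig.class, new LengthConfig(1, 2));
        System.out.println(run(pipeline, tokens, config)); // prints [a, is]
    }
}
```

The point of the sketch is that assemblePipeline() never mentions min or max, so the same assembled chain can be reused with different settings just by changing what is stored in the config source.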

Thoughts? Suggestions? Or does it sound like nonsense? :)

Best Regards,
Adriano Crestani