New flexible query parser

New flexible query parser

Michael Busch
Hello,

in my team at IBM we have used a different query parser than Lucene's in
our products for quite a while. Recently we spent a significant amount
of time in refactoring the code and designing a very generic
architecture, so that this query parser can be easily used for different
products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left
our team); most of the code was written by Luis Alves, who has been
somewhat active in Lucene in the past, and Adriano Campos, who joined our
team at IBM half a year ago. Adriano is an Apache committer and PMC member
on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current
Lucene query parser, and would therefore like to contribute it to
Lucene. I'd like to give a very brief architecture overview here,
Adriano and Luis can then answer more detailed questions as they're much
more familiar with the code than I am.
The goal was to separate syntax and semantics of a query. E.g. 'a AND
b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
We distinguish the semantics of the different query components, e.g.
whether and how to tokenize/lemmatize/normalize the different terms or
which Query objects to create for the terms. We wanted to be able to
write a parser with a new syntax, while reusing the underlying
semantics, as quickly as possible.
In fact, Adriano is currently working on a 100% Lucene-syntax compatible
implementation to make it easy for people who are using Lucene's query
parser to switch.

The query parser has three layers and its core is what we call the
QueryNodeTree. It is a tree that initially represents the syntax of the
original query, e.g. for 'a AND b':
   AND
  /   \
 A     B
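The tree above can be sketched in code. This is a toy illustration only; the class and method names below are invented for this sketch, not the actual contributed API:

```java
// Hypothetical sketch of a QueryNodeTree; the class and method names are
// invented for illustration and are not the actual contributed API.
import java.util.Arrays;
import java.util.List;

public class QueryNodeTreeSketch {

    // A node is either an operator with children or a term leaf.
    static class QueryNode {
        final String label;            // e.g. "AND" or a term such as "a"
        final List<QueryNode> children;

        QueryNode(String label, QueryNode... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        // Render the tree in prefix form, e.g. AND(a,b).
        String toPrefix() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder(label).append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).toPrefix());
            }
            return sb.append(')').toString();
        }
    }

    // The tree for 'a AND b' from the example above.
    static QueryNode aAndB() {
        return new QueryNode("AND", new QueryNode("a"), new QueryNode("b"));
    }
}
```

A syntax-only tree like this is what the processor and builder layers described below would then operate on.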

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the
query text string into a QueryNodeTree. Currently our implementations of
this layer use JavaCC.
2. The query node processors do most of the work. This layer is in fact
a configurable chain of processors. Each processor can walk the tree and
modify nodes or even the tree's structure, which makes it possible,
e.g., to optimize the query before it is executed or to tokenize terms.
3. The third layer is also a configurable chain of builders, which
transform the QueryNodeTree into Lucene Query objects.
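As a rough end-to-end illustration of how the three layers fit together, here is a toy pipeline. Everything below (the names, the flat token-list "tree", the stand-in builder output) is invented for this sketch and is not the contributed code:

```java
// Toy end-to-end illustration of the three layers; every name, the
// token-list "tree", and the builder output are invented stand-ins,
// not the contributed code.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ThreeLayerSketch {

    // Layer 1 (QueryParser): turn the query text into a "tree",
    // represented here as a flat token list for brevity.
    static List<String> parse(String query) {
        return new ArrayList<>(Arrays.asList(query.trim().split("\\s+")));
    }

    // Layer 2 (QueryNodeProcessor): a configurable chain of tree -> tree steps.
    static List<String> process(List<String> tree,
                                List<Function<List<String>, List<String>>> chain) {
        for (Function<List<String>, List<String>> processor : chain) {
            tree = processor.apply(tree);
        }
        return tree;
    }

    // Example processor: normalize terms to lower case, leaving the operator alone.
    static List<String> lowercaseTerms(List<String> tree) {
        List<String> out = new ArrayList<>();
        for (String tok : tree) {
            out.add(tok.equals("AND") ? tok : tok.toLowerCase());
        }
        return out;
    }

    // Layer 3 (QueryBuilder): turn the processed tree into a query object,
    // here just a '+a +b'-style string standing in for a Lucene Query.
    static String build(List<String> tree) {
        if (tree.size() == 3 && tree.get(1).equals("AND")) {
            return "+" + tree.get(0) + " +" + tree.get(2);
        }
        return String.join(" ", tree);
    }

    static String run(String query, List<Function<List<String>, List<String>>> chain) {
        return build(process(parse(query), chain));
    }
}
```

With this shape, a new syntax only needs a new layer-1 parser, and new semantics only need a processor or builder swapped into the chain.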

Furthermore, the query parser uses flexible configuration objects, which
are based on AttributeSource/Attribute. It also uses message classes
that allow resource bundles to be attached. This makes it possible to
translate messages, which is an important feature of a query parser.
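The resource-bundle idea can be illustrated with the standard java.util.ResourceBundle machinery; the bundle class and the message key below are made up for this example:

```java
// Minimal illustration of localizable parser messages via the standard
// ResourceBundle/MessageFormat machinery; the bundle class and the
// message key are made up for this sketch.
import java.text.MessageFormat;
import java.util.ListResourceBundle;

public class ParserMessagesSketch {

    // Inline bundle standing in for a real, per-locale .properties file.
    public static class Messages extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "INVALID_SYNTAX", "Syntax error in query: {0}" },
            };
        }
    }

    // Look up a message key and fill in its arguments.
    static String format(String key, Object... args) {
        return MessageFormat.format(new Messages().getString(key), args);
    }
}
```

Translating the parser's messages then comes down to supplying a bundle per locale rather than touching the parser code.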

This design allows us to develop different query syntaxes very quickly.
Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
underlying processors and builders in a few days. We now have a 100%
compatible Lucene query parser, which means the syntax is identical and
all query parser test cases pass on the new one too (using a wrapper).


Recent posts show that there is demand for query syntax improvements,
e.g. improved range query syntax or operator precedence. There are
already different QP implementations in Lucene+contrib; however, I think
we did not keep them all up to date and in sync. This is not too
surprising, because usually when fixes and changes are made to the main
query parser, people don't make the corresponding changes in the contrib
parsers. (I'm guilty here too)
With this new architecture it will be much easier to maintain different
query syntaxes, as the actual code for the first layer is fairly small.
All syntaxes would benefit from patches and improvements we make to the
underlying layers, which will make supporting different syntaxes much
more manageable.

So if there is interest we would like to contribute this work to Lucene.
I think the amount of code (~6K LOC) is higher than in a usual patch,
but also lower than some contrib modules. So I'm not sure if we could
contribute it as a normal patch or maybe a software grant?
We could also maybe think about adding it as a contrib module initially,
and if people like it move it to the core at a later point. I'd actually
prefer this approach over committing to the core directly, as it would
make it easier to make Luis and Adriano contrib committers on the new
module, which of course makes sense as nobody knows the code better than
they do.

-Michael




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: New flexible query parser

Mark Miller
Very interesting. Can this parser solve the Lucene query syntax
precedence issues? Would be great to match the current syntax with full
precedence support.

It sounds like a great bit of work to move forward too - I'll be the
first to chime in that the current implementation could use improvement,
and your implementation sounds great in prose. Would be nice to skim the
code though.

Many of the things that it would be nice to do (perhaps add span support
to the standard syntax with an on/off toggle?, etc.) are very difficult
to build on the current architecture. What you describe indicates these
types of things might become easier than they are today.

My vote for contrib would depend on the state of the code - if it passes
all the tests and is truly back compat, and is not crazy slower, I don't
see why we don't move it in right away depending on confidence levels.
That would ensure use and attention that contrib often misses. The old
parser could hang around in deprecation.

- Mark

Michael Busch wrote:

> [...]


--
- Mark

http://www.lucidimagination.com






Re: New flexible query parser

Michael Busch
On 3/17/09 12:39 AM, Mark Miller wrote:
> Very interesting. Can this parser solve the Lucene query syntax
> precedence issues? Would be great to match the current syntax with
> full precedence support.
>
Yes. In fact, in our product we use a slightly different query syntax.
It has operator precedence, and also <=, >= syntax for range queries
(which was wished for in a different thread here...).

> It sounds like a great bit of work to move forward too - I'll be the
> first to sound in that the current implementation could use
> improvement, and your implementation sounds great in prose. Would be
> nice to skim the code though.
>
We're preparing a patch - should be ready soon.

> Many of the things that it would be nice to do (perhaps add span
> support to the standard syntax with an on/off toggle?, etc.) are very
> difficult to build on the current architecture. What you describe
> indicates these types of things might become easier than they are today.
>
Yes, all these things should be much easier to add compared to the old QP.

> My vote for contrib would depend on the state of the code - if it
> passes all the tests and is truly back compat, and is not crazy
> slower, I don't see why we don't move it in right away depending on
> confidence levels. That would ensure use and attention that contrib
> often misses. The old parser could hang around in deprecation.
I think we can postpone this decision until we have submitted the code
and gotten some feedback. I personally think this is pretty solid code
with good unit tests and documentation. So I'd also be fine with adding
it to the core.

>
> - Mark
>
> Michael Busch wrote:
>> [...]




Re: New flexible query parser

Adriano Crestani
Hi everyone,

> Very interesting. Can this parser solve the Lucene query syntax precedence issues? Would be great to match the current syntax with full precedence support.

Definitely. Actually, the one we have today already has precedence, and to pass all the Lucene test cases I had to write a processor that removes this precedence and mimics Lucene's non-precedence behavior. It was as easy as that: write a processor and insert it into the processor chain. Piece of cake : )
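The "write a processor and insert it into the chain" idea can be illustrated with a toy tree-rewriting step. The node shape and names here are invented, and flattening same-operator nodes merely stands in for the actual precedence-removing processor, which is more involved:

```java
// Toy version of a chained tree-rewriting processor: it flattens nested
// same-operator nodes. The node shape and names are invented for this
// sketch; real precedence handling is more involved.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenProcessorSketch {

    static class Node {
        final String op;                       // "AND", "OR", or a bare term
        final List<Node> kids = new ArrayList<>();

        Node(String op, Node... kids) {
            this.op = op;
            this.kids.addAll(Arrays.asList(kids));
        }

        String show() {
            if (kids.isEmpty()) return op;
            List<String> parts = new ArrayList<>();
            for (Node k : kids) parts.add(k.show());
            return op + "(" + String.join(",", parts) + ")";
        }
    }

    // The "processor": hoist children of a same-operator child into the
    // parent, e.g. AND(a,AND(b,c)) becomes AND(a,b,c).
    static Node flatten(Node n) {
        if (n.kids.isEmpty()) return n;
        Node out = new Node(n.op);
        for (Node kid : n.kids) {
            Node flat = flatten(kid);
            if (flat.op.equals(n.op) && !flat.kids.isEmpty()) {
                out.kids.addAll(flat.kids);
            } else {
                out.kids.add(flat);
            }
        }
        return out;
    }
}
```

Because each processor is just a tree-to-tree step, adding or removing a behavior like this is a matter of editing the chain configuration.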

> Many of the things that it would be nice to do (perhaps add span support to the standard syntax with an on/off toggle?, etc.) are very difficult to build on the current architecture. What you describe indicates these types of things might become easier than they are today.
>
> Yes, all these things should be much easier to add compared to the old QP.

Yes, as Michael already said, it's much easier with this new architecture. We would basically need to change the QueryParser so it supports any new syntax (if there is any new syntax), write one or more processors to handle any new logic (if there is any), and create a new SpanQueryBuilder that creates SpanQuery objects instead of the regular Query objects; we could then switch between the SpanQueryBuilder and the regular QueryBuilder whenever we want to generate SpanQuery or regular Query objects. The point is that this new architecture is very flexible and incremental, completely different from the one Lucene has today.
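The builder swap described here can be sketched as programming against a builder interface; the interface and the string outputs below are illustrative stand-ins, not the contributed classes:

```java
// Sketch of the builder swap: the pipeline depends only on a builder
// interface, so span and non-span builders are interchangeable. The
// interface and string outputs are invented stand-ins, not the
// contributed classes.
public class BuilderSwapSketch {

    interface QueryBuilder {
        String build(String term);
    }

    // Stands in for a builder producing regular Query objects.
    static class TermQueryBuilder implements QueryBuilder {
        public String build(String term) { return "TermQuery(" + term + ")"; }
    }

    // Same input, different output family: stands in for SpanQuery objects.
    static class SpanQueryBuilder implements QueryBuilder {
        public String build(String term) { return "SpanTermQuery(" + term + ")"; }
    }

    // Swapping builders is then a one-line configuration change.
    static String buildWith(QueryBuilder builder, String term) {
        return builder.build(term);
    }
}
```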

Best Regards,
Adriano Crestani Campos

On Mon, Mar 16, 2009 at 4:49 PM, Michael Busch <[hidden email]> wrote:
> [...]



Re: New flexible query parser

Paul Elschot

On Tuesday 17 March 2009 00:23:37 Michael Busch wrote:

> Hello,
>
> in my team at IBM we have used a different query parser than Lucene's in
> our products for quite a while. Recently we spent a significant amount
> of time in refactoring the code and designing a very generic
> architecture, so that this query parser can be easily used for different
> products with varying query syntaxes.
>
> ...
>
> The goal was to separate syntax and semantics of a query. E.g. 'a AND
> b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.

I'd like to try 'AND a, b, c', too, and only require brackets when nesting.
With hindsight, the brackets in the Surround language are sometimes
a bit of a burden for the user. But they do make it easier to define a syntax.

> ...
>
> 1. The upper layer is the parsing layer which simply transforms the
> query text string into a QueryNodeTree. Currently our implementations of
> this layer use javacc.

Lucene has moved from javacc to jflex, IIRC mostly because of parsing
speed. Any comments on that?

> 2. The query node processors do most of the work. ...
> 3. The third layer is also a configurable chain of builders, which
> transform the QueryNodeTree into Lucene Query objects.

In the Surround language in contrib there are actually only two layers;
the QueryNodeProcessor layer is missing there.

> With this new architecture it will be much easier to maintain different
> query syntaxes, as the actual code for the first layer is not very much.
> ...

For example, an option to get rid of redundant layers of BooleanQueries
would be welcome.

> So if there is interest we would like to contribute this work to Lucene.

I'd like to port the Surround language onto it, and perhaps even create
a syntax extension (from the standard parser) for the result.

Regards,

Paul Elschot


Re: New flexible query parser

Grant Ingersoll

On Mar 16, 2009, at 7:23 PM, Michael Busch wrote:
>
> So if there is interest we would like to contribute this work to  
> Lucene.
> I think the amount of code (~6K LOC) is higher than in a usual patch,
> but also lower than some contrib modules. So I'm not sure if we could
> contribute it as a normal patch or maybe a software grant?

Yes, it would need to be done through a software grant. I can
facilitate that. The process is relatively painless. See
http://incubator.apache.org/ip-clearance/index.html and
http://svn.apache.org/repos/asf/incubator/public/trunk/site-author/ip-clearance/ip-clearance-template.xml

Basically, file it as a patch and then let the list know, and I can
start the paperwork. It would really help if you would go through the
template and make sure that it is easy for me to answer all the
questions in it (things like licenses, MD5 hashes, etc.)

-Grant



Re: New flexible query parser

Michael Busch
Thanks Grant, I'll go through the template tonight.

Luis and Adriano are preparing a patch - they should be ready in a day
or two. So we can simply open a Jira issue and attach as a normal patch?

-Michael

On 3/17/09 12:06 PM, Grant Ingersoll wrote:

>
> On Mar 16, 2009, at 7:23 PM, Michael Busch wrote:
>>
>> So if there is interest we would like to contribute this work to Lucene.
>> I think the amount of code (~6K LOC) is higher than in a usual patch,
>> but also lower than some contrib modules. So I'm not sure if we could
>> contribute it as a normal patch or maybe a software grant?
>
> Yes, it would need to be done through a software grant.  I can
> facilitate that.  The process is relatively painless.  See
> http://incubator.apache.org/ip-clearance/index.html and
> http://svn.apache.org/repos/asf/incubator/public/trunk/site-author/ip-clearance/ip-clearance-template.xml 
>
>
> Basically, file it as a patch and then let the list know, and I can
> start the paperwork.  It would really help if you would go through the
> template and make sure that it is easy for me to answer all the
> questions in it (things like licenses, MD5 hashes, etc.)
>
> -Grant
>




Re: New flexible query parser

Grant Ingersoll-2

On Mar 17, 2009, at 11:18 AM, Michael Busch wrote:

> Thanks Grant, I'll go through the template tonight.
>
> Luis and Adriano are preparing a patch - they should be ready in a  
> day or two. So we can simply open a Jira issue and attach as a  
> normal patch?

Yes, just don't commit it and email me back on this thread what issue  
it is.  I likely won't be able to get to it until next week at the  
earliest, but maybe we can just figure it out at ACEU.

-Grant



Re: New flexible query parser

Michael Busch
OK, sounds good. Thanks, Grant.

-Michael

On Tue, Mar 17, 2009 at 10:02 AM, Grant Ingersoll <[hidden email]> wrote:

> On Mar 17, 2009, at 11:18 AM, Michael Busch wrote:
>
>> Thanks Grant, I'll go through the template tonight.
>>
>> Luis and Adriano are preparing a patch - they should be ready in a day or two. So we can simply open a Jira issue and attach as a normal patch?
>
> Yes, just don't commit it and email me back on this thread what issue it is.  I likely won't be able to get to it until next week at the earliest, but maybe we can just figure it out at ACEU.
>
> -Grant



Re: New flexible query parser

hossman
In reply to this post by Mark Miller-3

: My vote for contrib would depend on the state of the code - if it passes all
: the tests and is truly back compat, and is not crazy slower, I don't see why
: we don't move it in right away depending on confidence levels. That would
: ensure use and attention that contrib often misses. The old parser could hang
: around in deprecation.

FWIW: It's always bugged me that the existing queryParser is in the "core"
anyway ... as i've mentioned before: I'd love to see us move towards
putting more features and add-on functionality in contribs and keeping the
core as lean as possible: just the core functionality for indexing &
searching ... when things are split up, it's easy for people who want
every lucene feature to include a bunch of jars; it's harder for people
who want to run lucene in a small footprint (embedded apps?) to extract
classes from a big jar.

so my vote would be to make it a contrib ... even if we do deprecate the
current query parser because this can be 100% back compatible -- it just
makes it a great opportunity to get query parsing out of the core.




-Hoss




Re: New flexible query parser

Michael Busch
On 3/20/09 10:58 PM, Chris Hostetter wrote:

> : My vote for contrib would depend on the state of the code - if it passes all
> : the tests and is truly back compat, and is not crazy slower, I don't see why
> : we don't move it in right away depending on confidence levels. That would
> : ensure use and attention that contrib often misses. The old parser could hang
> : around in deprecation.
>
> FWIW: It's always bugged me that the existing queryParser is in the "core"
> anyway ... as i've mentioned before: I'd love to see us move towards
> putting more features and add-on functionality in contribs and keeping the
> core as lean as possible: just the core functionality for indexing &
> searching ... when things are split up, it's easy for people who want
> every lucene feature to include a bunch of jars; it's harder for people
> who want to run lucene in a small footprint (embedded apps?) to extract
> classes from a big jar.
>    
+1. I'd love to see Lucene going in such a direction.

However, I'm a little worried about contrib's reputation. I think it
contains components with differing levels of activity, maturity and support.
So maybe instead of moving things from core into contrib to achieve the
goal you mentioned, we could create a new folder named e.g.
'components', which will contain stuff that we claim is as stable,
mature and supported as the core, just packaged into separate jars.
Those jars should then only have dependencies on the core, but not on
each other. They would also follow the same backwards-compatibility and
other requirements as the core. Thoughts?

-Michael

> so my vote would be to make it a contrib ... even if we do deprecate the
> current query parser because this can be 100% back compatible -- it just
> makes it a great opportunity to get query parsing out of the core.
>
>
>
>
> -Hoss
>
>




Modularization (was: Re: New flexible query parser)

Michael Busch
On 3/21/09 12:27 AM, Michael Busch wrote:

> +1. I'd love to see Lucene going into such a direction.
>
> However, I'm a little worried about contrib's reputation. I think it
> contains components with differing levels of activity, maturity and
> support.
> So maybe instead of moving things from core into contrib to achieve
> the goal you mentioned, we could create a new folder named e.g.
> 'components', which will contain stuff that we claim is as stable,
> mature and supported as the core, just packaged into separate jars.
> Those jars should then only have dependencies on the core, but not on
> each other. They would also follow the same backwards-compatibility
> and other requirements as the core. Thoughts?

I guess something very similar has been proposed and discussed here:
http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894
(same link that Hoss sent while having his deja vu)...

-Michael



Re: Modularization (was: Re: New flexible query parser)

Michael McCandless-2
I think we are mixing up source code modularity with
bundling/packaging.

Honestly, I would not mind much where the source code lives in svn, so
long as a developer, upon downloading Lucene 2.9, can go to *one*
place (javadocs) for Lucene's "queries & filters" and see
{Int,Long}NumberRangeFilter in there.

We are not there today: a developer must first realize there's a whole
separate place to look for "other" queries (contrib/queries).  Then
the developer browses that and likely becomes confused/misled by what
TrieRangeQuery means (is it a letter trie?).

My goal here is Lucene's consumability -- when someone new says "hey I
heard about this great search library called Lucene; let me go try it
out" I want that first impression to be as solid as possible.  I think
this is very important for growing Lucene's community.  This is why
"out of the box" defaults are so crucial (eg changing IW from flushing
every 10 docs to every 16 MB gained sizable throughput).

How many times have we seen a review, article, blog post, etc.,
comparing Lucene to other search libraries only to incorrectly
complain because "Lucene can't do XYZ" or "Lucene's indexing
performance is poor", etc, because they didn't dig in to learn all the
tunings/options/tricks we all know you are supposed to do?  (It
frustrates me to no end when this happens).  This then hurts Lucene's
adoption because others read such articles and conclude Lucene is a
non-starter.

We all ought to be concerned with Lucene's adoption & growth with time
(I am), and first-impression consumability / out of the box defaults
are big drivers of that.

What if (maybe for 3.0, since we can mix in 1.5 sources at that
point?) we change how Lucene is bundled, such that core queries and
contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
lucene-analyzers-3.0.jar would include contrib/analyzers/* and
org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.

Mike

Michael Busch wrote:

> On 3/21/09 12:27 AM, Michael Busch wrote:
>> +1. I'd love to see Lucene going into such a direction.
>>
>> However, I'm a little worried about contrib's reputation. I think  
>> it contains components with differing levels of activity, maturity  
>> and support.
>> So maybe instead of moving things from core into contrib to achieve  
>> the goal you mentioned, we could create a new folder named e.g.  
>> 'components', which will contain stuff that we claim is as stable,  
>> mature and supported as the core, just packaged into separate jars.  
>> Those jars should then only have dependencies on the core, but not  
>> on each other. They would also follow the same backwards-
>> compatibility and other requirements as the core. Thoughts?
>
> I guess something very similar has been proposed and discussed here: http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894
> (same link that Hoss sent while having his deja vu)...
>
> -Michael
>




RE: Modularization (was: Re: New flexible query parser)

Uwe Schindler

> Honestly, I would not mind much where the source code lives in svn, so
> long as a developer, upon downloading Lucene 2.9, can go to *one*
> place (javadocs) for Lucene's "queries & filters" and see
> {Int,Long}NumberRangeFilter in there.
>
> We are not there today: a developer must first realize there's a whole
> separate place to look for "other" queries (contrib/queries).  Then
> the developer browses that and likely becomes confused/misled by what
> TrieRangeQuery means (is it a letter trie?).

That is a problem. contrib/queries is a typical example of a
contribution that is almost always used in third-party projects (e.g. Solr):
it is stable, depends on nothing but the core, and is (at the moment) 1.4
compatible. Other contributions have external dependencies
or need a newer Java version than the core.
I would split the two types of contributions and give the stable,
core-only ones a higher ranking (e.g. put them into the
top-level changes list). When we release 2.9, nobody will realize that
there is a new TrieRangeFilter in contrib/queries, because it is not in the
top-level changes list. The new contrib/spatial should have visibility too.
 

> My goal here is Lucene's consumability -- when someone new says "hey I
> heard about this great search library called Lucene; let me go try it
> out" I want that first impression to be as solid as possible.  I think
> this is very important for growing Lucene's community.  This is why
> "out of the box" defaults are so crucial (eg changing IW from flushing
> every 10 docs to every 16 MB gained sizable throughput).
>
> How many times have we seen a review, article, blog post, etc.,
> comparing Lucene to other search libraries only to incorrectly
> complain because "Lucene can't do XYZ" or "Lucene's indexing
> performance is poor", etc, because they didn't dig in to learn all the
> tunings/options/tricks we all know you are supposed to do?  (It
> frustrates me to no end when this happens).  This then hurts Lucene's
> adoption because others read such articles and conclude Lucene is a
> non-starter.

I know this problem. And about contrib/queries: most projects that
use Lucene (e.g. Solr) always use some of the contrib jars, and almost
every time contrib/queries. But starters, like the journalists writing those
articles, only take the core and test something with it.

So splitting the whole of Lucene into different parts is better (these
people must then think about all the available packages and which ones
they need for their project):

> We all ought to be concerned with Lucene's adoption & growth with time
> (I am), and first-impression consumability / out of the box defaults
> are big drivers of that.
>
> What if (maybe for 3.0, since we can mix in 1.5 sources at that
> point?) we change how Lucene is bundled, such that core queries and
> contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
> lucene-analyzers-3.0.jar would include contrib/analyzers/* and
> org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.

This is even better! +1

I would propose:
- core: indexer, documents, IndexReader, Searcher and the default
directory stores (fs, mmap, nio)
- queries: the current core queries plus contrib/queries
- queryparser (the new one? Or two separate packages for the old and new
ones): this should really be removed from core. A lot of people think
they can only query Lucene using the query parser and do not even try to
build their Boolean queries manually; they often fail when it gets
complicated and the query parser cannot help, e.g. when querying
non-tokenized fields (this package would depend on queries)
- analysis: completely remove the analyzers from core; only the abstract
Analyzer would stay there, plus KeywordAnalyzer for indexing without an
analyzer or with only non-tokenized fields
- highlighting
- custom sorting separate?
- spatial
- ...
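To illustrate Uwe's point that the query parser is optional: queries can be assembled directly as objects. The sketch below uses simplified stand-ins that only echo the shape of Lucene's BooleanQuery/TermQuery/Occur API (they are not the real classes); the equivalent of parsing 'a AND b' on a non-tokenized field is just two MUST clauses, with no analyzer or parser involved:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins echoing the shape of Lucene's API; not the real classes.
enum Occur { MUST, SHOULD, MUST_NOT }

class TermQuery {
    final String field, text;
    TermQuery(String field, String text) { this.field = field; this.text = text; }
}

class BooleanQuery {
    static class Clause {
        final TermQuery query; final Occur occur;
        Clause(TermQuery q, Occur o) { query = q; occur = o; }
    }
    final List<Clause> clauses = new ArrayList<Clause>();
    void add(TermQuery q, Occur o) { clauses.add(new Clause(q, o)); }

    // Render in the classic "+field:term" syntax for readability.
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            if (sb.length() > 0) sb.append(' ');
            if (c.occur == Occur.MUST) sb.append('+');
            if (c.occur == Occur.MUST_NOT) sb.append('-');
            sb.append(c.query.field).append(':').append(c.query.text);
        }
        return sb.toString();
    }
}

public class ManualQueryDemo {
    public static void main(String[] args) {
        // Build the equivalent of "a AND b" on a non-tokenized "id" field
        // by hand, instead of going through a query parser.
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery("id", "a"), Occur.MUST);
        bq.add(new TermQuery("id", "b"), Occur.MUST);
        System.out.println(bq); // +id:a +id:b
    }
}
```

Building the tree directly sidesteps the analyzer/escaping problems Uwe mentions for non-tokenized fields, which is exactly why the queries package should not depend on the parser.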

We then could change our contrib SVN accounts and have new roles like
(core-committer, queries-committer,...)

Uwe




Re: Modularization (was: Re: New flexible query parser)

Grant Ingersoll-2
In reply to this post by Michael McCandless-2

On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:
> What if (maybe for 3.0, since we can mix in 1.5 sources at that
> point?) we change how Lucene is bundled, such that core queries and
> contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
> lucene-analyzers-3.0.jar would include contrib/analyzers/* and
> org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.


Since we are just talking about packaging, why can't we have both/all  
of the above?  Individual jars, as well as one "big" jar, that  
contains everything (or, everything that has only dependencies we can  
ship, or "everything" that we deem important for an OOTB experience).  
I, for one, find it annoying to have to go get snowball, analyzers,  
spellchecking and highlighting separate in most cases b/c I almost  
always use all of them and don't particularly care if there are extra  
classes in a JAR, but can appreciate the need to do that in specific  
instances where leaner versions are needed.  After all, the Ant magic  
to do all of this is pretty trivial given we just need to combine the  
various jars into a single jar (while keeping the indiv. ones).

If there is a sense that some contribs aren't maintained or aren't as  
"good", then we need to ask ourselves whether they are:
1. stable and solid and don't need much care and are doing just fine  
thank you very much, or,
2. need to be archived, since they only serve as a distraction, or
3. in need of a new champion to maintain/promote them

-Grant



Re: Modularization

Michael Busch
In reply to this post by Michael McCandless-2
On 3/21/09 11:26 AM, Michael McCandless wrote:

> I think we are mixing up source code modularity with
> bundling/packaging.
>
> Honestly, I would not mind much where the source code lives in svn, so
> long as a developer, upon downloading Lucene 2.9, can go to *one*
> place (javadocs) for Lucene's "queries & filters" and see
> {Int,Long}NumberRangeFilter in there.
> We are not there today: a developer must first realize there's a whole
> separate place to look for "other" queries (contrib/queries).  Then
> the developer browses that and likely becomes confused/misled by what
> TrieRangeQuery means (is it a letter trie?).
>
> My goal here is Lucene's consumability -- when someone new says "hey I
> heard about this great search library called Lucene; let me go try it
> out" I want that first impression to be as solid as possible.  I think
> this is very important for growing Lucene's community.  This is why
> "out of the box" defaults are so crucial (eg changing IW from flushing
> every 10 docs to every 16 MB gained sizable throughput).
>
So this guy landing on http://lucene.apache.org/java/docs/index.html
sees the "Overview" section first. That one only gives a very short
introduction to what Lucene is. He might then look at "Features", which
is also not very specific. I think the next thing would be to look
for the documentation of the newest release, so he would click on
"Lucene 2.4.1 Documentation". The landing page doesn't say much, except
to tell you to look for the javadocs and other docs in the menu. Maybe
the "Getting Started" link would be the first one to go to, but it's
also pretty far down the list, so probably he would click on the
javadocs first. Now he encounters "All, Core, Demo, Contrib". Until now
he hasn't read the word "Contrib" anywhere. We basically have no
documentation anywhere that introduces the concept of contribs, or says
where to find them, I think? Even the "Contributions" section talks about
something else. So that guy probably then looks through the demo and
examples and ends up using only core features until becoming more
familiar with Lucene as a whole. Maybe he actually ends up buying LIA(2) :)

> How many times have we seen a review, article, blog post, etc.,
> comparing Lucene to other search libraries only to incorrectly
> complain because "Lucene can't do XYZ" or "Lucene's indexing
> performance is poor", etc, because they didn't dig in to learn all the
> tunings/options/tricks we all know you are supposed to do?  (It
> frustrates me to no end when this happens).  This then hurts Lucene's
> adoption because others read such articles and conclude Lucene is a
> non-starter.
>
> We all ought to be concerned with Lucene's adoption & growth with time
> (I am), and first-impression consumability / out of the box defaults
> are big drivers of that.
>
> What if (maybe for 3.0, since we can mix in 1.5 sources at that
> point?) we change how Lucene is bundled, such that core queries and
> contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
> lucene-analyzers-3.0.jar would include contrib/analyzers/* and
> org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.
>

So yeah I like this and 3.0 is a good opportunity to do this. I think a
big part of this work should be good documentation. As you mentioned,
Mike, it should be very simple to get an overview of what the different
modules are. So there should be the list of the different modules,
together with a short description for each of them and infos about where
to find them (which jar). Then by clicking on e.g. queries, the user
would see the list of all queries we support.

But I think we should still have "main modules", such as core, queries,
analyzers, ... and separately e.g. "sandbox modules", for the things
currently in contrib that are experimental or, as Mark called them,
"graveyard contribs" :) ... even though we might as well then ask
whether we shouldn't really just bury the latter ones...

> Mike
>
> Michael Busch wrote:
>
>> On 3/21/09 12:27 AM, Michael Busch wrote:
>>> +1. I'd love to see Lucene going into such a direction.
>>>
>>> However, I'm a little worried about contrib's reputation. I think it
>>> contains components with differing levels of activity, maturity and
>>> support.
>>> So maybe instead of moving things from core into contrib to achieve
>>> the goal you mentioned, we could create a new folder named e.g.
>>> 'components', which will contain stuff that we claim is as stable,
>>> mature and supported as the core, just packaged into separate jars.
>>> Those jars should then only have dependencies on the core, but not
>>> on each other. They would also follow the same
>>> backwards-compatibility and other requirements as the core. Thoughts?
>>
>> I guess something very similar has been proposed and discussed here:
>> http://www.nabble.com/Moving-SweetSpotSimilarity-out-of-contrib-to19267437.html#a19320894 
>>
>> (same link that Hoss sent while having his deja vu)...
>>
>> -Michael
>>
>
>




Re: Modularization

Michael McCandless-2
> Maybe he actually ends up buying LIA(2) :)

LIA/2 suffers the same false dichotomy, and it drives me crazy there
too: we put all "contrib" packages in a different chapter, even though
it'd make much more sense to cover all analyzers in one chapter, all
queries in one chapter, etc.

I find myself cross-referencing over to TrieRangeQuery in Chapter 8,
from LIA's search chapter (Chapter 3), and it's awkward.

> So yeah I like this and 3.0 is a good opportunity to do this. I
> think a big part of this work should be good documentation. As you
> mentioned, Mike, it should be very simple to get an overview of what
> the different modules are.  So there should be the list of the
> different modules, together with a short description for each of
> them and infos about where to find them (which jar).  Then by
> clicking on e.g. queries, the user would see the list of all queries
> we support.

I agree: revamping the web-site for a better top-down introduction of
Lucene's features should be part of 3.0.

And I don't think the sudden separation of "core" vs "contrib" should
be so prominent (or even visible); it's really a detail of how we
manage source control.

When looking at the website I'd like to read that Lucene can do hit
highlighting, powerful query parsing, spell checking, analyze
different languages, etc.  I couldn't care less that some of these happen
to live under a "contrib" subdirectory somewhere in the source control
system.

> But I think we should still have "main modules", such as core,
> queries, analyzers, ... and separately e.g. "sandbox modules?", for
> the things currently in contrib that are experimental or, as Mark
> called them, "graveyard contribs" :) ... even though we might then
> as well ask the questions if we can not really bury the latter
> ones...

Could we, instead, adopt some standard way (in the package javadocs)
of stating the maturity/activity/back compat policies/etc of a given
package?

> Since we are just talking about packaging, why can't we have
> both/all of the above?  Individual jars, as well as one "big" jar,
> that contains everything (or, everything that has only dependencies
> we can ship, or "everything" that we deem important for an OOTB
> experience).  I, for one, find it annoying to have to go get
> snowball, analyzers, spellchecking and highlighting separate in most
> cases b/c I almost always use all of them and don't particularly
> care if there are extra classes in a JAR, but can appreciate the
> need to do that in specific instances where leaner versions are
> needed.  After all, the Ant magic to do all of this is pretty
> trivial given we just need to combine the various jars into a single
> jar (while keeping the indiv. ones)

+1

So I think the beginnings of a rough proposal is taking shape, for 3.0:

  1. Fix the web site to give a better intro to Lucene's features,
     without exposing the core vs. contrib distinction, which is a
     false one to the Lucene consumer

  2. When releasing, we make a single JAR holding core & contrib
     classes for a given area.  The final JAR files don't contain a
     "core" vs "contrib" distinction.

  3. We create a "bundled" JAR that has the common packages
     "typically" needed (index/search core, analyzers, queries,
     highlighter, spellchecker)

Mike



Re: Modularization (was: Re: New flexible query parser)

DM Smith
In reply to this post by Grant Ingersoll-2

On Mar 21, 2009, at 7:23 AM, Grant Ingersoll wrote:

>
> On Mar 21, 2009, at 11:26 AM, Michael McCandless wrote:
>> What if (maybe for 3.0, since we can mix in 1.5 sources at that
>> point?) we change how Lucene is bundled, such that core queries and
>> contrib/query/* are in one JAR (lucene-query-3.0.jar)?  And
>> lucene-analyzers-3.0.jar would include contrib/analyzers/* and
>> org/apache/lucene/analysis/*.  And lucene-queryparser.jar, etc.
>
>
> Since we are just talking about packaging, why can't we have both/
> all of the above?  Individual jars, as well as one "big" jar, that  
> contains everything (or, everything that has only dependencies we  
> can ship, or "everything" that we deem important for an OOTB  
> experience).  I, for one, find it annoying to have to go get  
> snowball, analyzers, spellchecking and highlighting separate in most  
> cases b/c I almost always use all of them and don't particularly  
> care if there are extra classes in a JAR, but can appreciate the  
> need to do that in specific instances where leaner versions are  
> needed.  After all, the Ant magic to do all of this is pretty  
> trivial given we just need to combine the various jars into a single  
> jar (while keeping the indiv. ones)
>
> If there is a sense that some contribs aren't maintained or aren't  
> as "good", then we need to ask ourselves whether they are:
> 1. stable and solid and don't need much care and are doing just fine  
> thank you very much, or,
> 2. need to be archived, since they only serve as a distraction, or
> 3. in need of a new champion to maintain/promote them

From a user's perspective (i.e. mine):
I like the idea of having more jars. Specifically, I'd like a jar
devoted solely to reading an index. Ultimately, I'd like it to work in
a J2ME environment, but that is an entirely different thread.

There are parts that are needed for both reading and writing
(directory, analyzers, tokens, and such). And there are parts dealing
only with writing.

There is a distinction between core and contrib regarding backward  
compatibility and quality (perhaps perceived quality).

To me the hardest part in wrapping my head around contrib is that I am
not clear on why something is in contrib: what it can do, whether it is
just an example or an alternate way of doing something, or whether it is
useful exactly as provided.

There are parts of contrib that I see as essential to my application  
(pretty much Grant's list), that I can use as is. While there are many  
different applications of Lucene, my guess is that a non-trivial  
application of Lucene needs to use various contribs. Some contribs are  
high quality and I think deserve the kind of attention that core gets.

What I'd like to see is not more stuff move into core from contrib.  
But rather that we have two levels of contrib: One recommended for use  
and maintained at the same level as core. The other is stuff that is  
"use if you find it useful, and at your own risk". That is, as it is  
today.

I understand the desire to have one jar do it all. Nothing wrong with  
having that too, perhaps lucene-essentials.jar that holds all useful,  
recommended, highly maintained, well-explained stuff.

As to the whole question of the OOTB experience for reviewers: today it
is "what does lucene-core.jar do?". With more jars it would be "what
does this core collection of jars do?" or "what does lucene-essentials do?".

-- DM Smith







Re: Modularization

Michael Busch
In reply to this post by Michael McCandless-2
On 3/21/09 1:36 PM, Michael McCandless wrote:
And I don't think the sudden separation of "core" vs "contrib" should
be so prominent (or even visible); it's really a detail of how we
manage source control.

When looking at the website I'd like read that Lucene can do hit
highlighting, powerful query parsing, spell checking, analyze
different languages, etc.  I could care less that some of these happen
to live under a "contrib" subdirectory somewhere in the source control
system.

  
OK, so I think we all agree about the packaging. But I believe it is also important
how the source code is organized. Maybe Lucene consumers don't care too much,
however, Lucene is an open source project. So we also want to attract possible
contributors with a nicely organized code base. If there is a clear separation between
the different components on a source code level, becoming familiar with Lucene as a
contributor might not be so overwhelming.

Besides that, I think a one-to-one mapping between the packaging and the source code
has no disadvantages. (and it would certainly make the build scripts easier!)

  
But I think we should still have "main modules", such as core,
queries, analyzers, ... and separately e.g. "sandbox modules?", for
the things currently in contrib that are experimental or, as Mark
called them, "graveyard contribs" :) ... even though we might then
as well ask the questions if we can not really bury the latter
ones...
    

Could we, instead, adopt some standard way (in the package javadocs)
of stating the maturity/activity/back compat policies/etc of a given
package?

This makes sense; e.g. we could release new modules as beta versions (= use at own risk,
no backwards-compatibility).

And if we start a new module (e.g. a GSoC project) we could exclude it from a release
easily if it's truly experimental and not in a release-able state.

So I think the beginnings of a rough proposal are taking shape, for 3.0:

  1. Fix web site to give a better intro to Lucene's features, without
     exposing the false (to the Lucene consumer) core vs. contrib
     distinction

  2. When releasing, we make a single JAR holding core & contrib
     classes for a given area.  The final JAR files don't contain a
     "core" vs "contrib" distinction.

  3. We create a "bundled" JAR that has the common packages
     "typically" needed (index/search core, analyzers, queries,
     highlighter, spellchecker)

+1 to all three points.
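As a sketch of what point 3's "bundled" JAR amounts to, the snippet below merges the entries of several per-area jars into one jar using java.util.jar; the jar names and class entries are invented for illustration, and a real build would do this in the build scripts rather than in Java code:

```java
import java.io.*;
import java.util.*;
import java.util.jar.*;

public class BundleJars {

    // Create a tiny jar holding the given (empty) entries, standing in
    // for a per-area jar such as lucene-core.jar.
    static void makeJar(File f, String... entries) throws IOException {
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(f))) {
            for (String e : entries) {
                out.putNextEntry(new JarEntry(e));
                out.closeEntry();
            }
        }
    }

    // Copy every entry of the input jars into one bundled jar, so the
    // consumer sees a single artifact with no core-vs-contrib split.
    static void bundle(File bundled, File... inputs) throws IOException {
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(bundled))) {
            for (File in : inputs) {
                try (JarFile jar = new JarFile(in)) {
                    for (Enumeration<JarEntry> en = jar.entries(); en.hasMoreElements(); ) {
                        JarEntry e = en.nextElement();
                        out.putNextEntry(new JarEntry(e.getName()));
                        try (InputStream is = jar.getInputStream(e)) {
                            is.transferTo(out);
                        }
                        out.closeEntry();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical per-area jars and class entries for the demo.
        File core = new File("lucene-core.jar");
        File highlighter = new File("lucene-highlighter.jar");
        makeJar(core, "org/apache/lucene/index/IndexWriter.class");
        makeJar(highlighter, "org/apache/lucene/search/highlight/Highlighter.class");

        File bundled = new File("lucene-bundle.jar");
        bundle(bundled, core, highlighter);

        try (JarFile jar = new JarFile(bundled)) {
            jar.stream().forEach(e -> System.out.println(e.getName()));
        }
    }
}
```

This only demonstrates the packaging idea; duplicate entry names across inputs would need to be resolved (java.util.zip rejects them), which is one reason the per-area jars should not overlap.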

Mike



Re: Modularization

Michael McCandless-2
Michael Busch <[hidden email]> wrote:

>> And I don't think the sudden separation of "core" vs "contrib"
>> should be so prominent (or even visible); it's really a detail of
>> how we manage source control.
>
>> When looking at the website I'd like to read that Lucene can do hit
>> highlighting, powerful query parsing, spell checking, analyze
>> different languages, etc.  I could care less that some of these
>> happen to live under a "contrib" subdirectory somewhere in the
>> source control system.
>
> OK, so I think we all agree about the packaging. But I believe it is
> also important how the source code is organized. Maybe Lucene
> consumers don't care too much, however, Lucene is an open source
> project. So we also want to attract possible contributors with a
> nicely organized code base. If there is a clear separation between
> the different components on a source code level, becoming familiar
> with Lucene as a contributor might not be so overwhelming.

+1

We want the source code to be well organized: consumability by Lucene
developers (not just Lucene users) is also important for Lucene's
future growth.

> Besides that, I think a one-to-one mapping between the packaging and
> the source code has no disadvantages. (and it would certainly make
> the build scripts easier!)

Right.

So, towards that... why even break out contrib vs core, in source
control?  Can't we simply migrate contrib/* into core, in the right
places?

>> Could we, instead, adopt some standard way (in the package
>> javadocs) of stating the maturity/activity/back compat policies/etc
>> of a given package?
>
> This makes sense; e.g. we could release new modules as beta versions
> (= use at own risk, no backwards-compatibility).

In fact we already have a 2.9 Jira issue opened to better document the
back-compat/JDK version requirements of all packages.

I think, like we've done with core lately when a new feature is added,
we could have the default assumption be full back compatibility, but
then those classes/methods/packages that are very new and may change
simply say so clearly in their javadocs.
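One way such a marker could look, as a hedged sketch (the `Experimental` annotation below is invented for illustration, not an existing Lucene API; a plain javadoc tag would work just as well):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical marker: classes carrying it make no back-compat promise.
// The unmarked default stays full back compatibility.
@Retention(RetentionPolicy.RUNTIME)
@interface Experimental {
    String since() default "";
}

// A brand-new module released "at your own risk" would opt out explicitly.
@Experimental(since = "3.0")
class NewQueryModule {
}

public class BackCompatCheck {
    public static void main(String[] args) {
        Experimental tag = NewQueryModule.class.getAnnotation(Experimental.class);
        // A javadoc doclet or release tool could surface this tag so the
        // policy is visible to users without a core/contrib distinction.
        System.out.println(tag != null
                ? "experimental since " + tag.since()
                : "stable (full back compat)");
    }
}
```

The point is only that the policy lives per class/package rather than per source-control directory; whether it is an annotation or a javadoc convention is a detail.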

> And if we start a new module (e.g. a GSoC project) we could exclude
> it from a release easily if it's truly experimental and not in a
> release-able state.

Right.

>> So I think the beginnings of a rough proposal are taking shape, for
>> 3.0:

>>   1. Fix web site to give a better intro to Lucene's features,
>>       without exposing the false (to the Lucene consumer) core
>>       vs. contrib distinction
>>
>>   2. When releasing, we make a single JAR holding core & contrib
>>       classes for a given area.  The final JAR files don't contain a
>>       "core" vs "contrib" distinction.
>>
>>   3. We create a "bundled" JAR that has the common packages
>>       "typically" needed (index/search core, analyzers, queries,
>>       highlighter, spellchecker)
>
> +1 to all three points.

OK.

So I guess I'm proposing adding:

   4. Move contrib/* under src/java/*, updating the javadocs to state
       back compatibility promises per class/package.

I think net/net this'd be a great simplification?

Mike
