Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Michael McCandless-2
I think this is a good idea!  LuSolr ;)  (kidding)

I agree with all of your points Yonik.

What do other people think...?

Mike

On Wed, Feb 24, 2010 at 2:20 PM, Yonik Seeley <[hidden email]> wrote:

> I've started to think that a merge of Solr and Lucene would be in the
> best interest of both projects.
>
> Recently, Solr has pulled back from using Lucene trunk (or even the
> latest version), as the increased amount of change between releases
> (and in-between releases) made it impractical to deal with. This is a
> pretty big negative for Lucene, since Solr is the biggest Lucene user
> (where people are directly exposed to Lucene for the express purpose
> of developing search features).  I know Solr development has always
> benefited hugely from users using trunk, and Lucene trunk has now lost
> all the Solr users.
>
> Some in Lucene development have expressed a desire to make Lucene more
> of a complete solution, rather than just a core full-text search
> library... things like a data schema, faceting, etc.  The Lucene
> project already has an enterprise search platform with these
> features... that's Solr.  Trying to pull popular pieces out of Solr
> makes life harder for Solr developers, brings our projects into
> conflict, and is often unsuccessful (witness the largely failed
> migration of FunctionQueries from Solr to Lucene).  For Lucene to
> achieve the ultimate in usability for users, it can't require Java
> experience... it needs higher level abstractions provided by Solr.
>
> The other benefit to Lucene would be to bring features to developers
> much sooner... Solr has had features years before they were developed
> in Lucene, and currently has more developers working with it.  Especially
> with Solr not using Lucene trunk, if a Solr developer wants a feature
> quickly, they cannot add it to Lucene (even if it might make sense
> there) since that introduces a big unpredictable lag before that
> version of Lucene makes its way into Solr.
>
> The current divide is a bit unnatural.  For maximum benefit of both
> projects, it seems like Solr and Lucene should essentially merge.
> Lucene core would essentially remain as it is, but:
> 1) Solr would go back to using Lucene's trunk
> 2) For new Solr features, there would be an effort to abstract them such
> that non-Solr users could use the functionality (faceting, field
> collapsing, etc)
> 3) For new Lucene features, there would be an effort to integrate them into Solr.
> 4) Releases would be synchronized... Lucene and Solr would release at
> the same time.
>
> -Yonik
>

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Robert Muir
+1

--
Robert Muir
[hidden email]

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Simon Willnauer
+1

So many people ask me when Solr will have all the Lucene features and
how quickly Solr keeps up. If we can make it happen somehow, I think it would
be a huge improvement. Except for Mark Miller's resume :)

simon


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Marvin Humphrey
In reply to this post by Michael McCandless-2
On Fri, Feb 26, 2010 at 03:20:58PM -0500, Michael McCandless wrote:
> I think this is a good idea!  LuSolr ;)  (kidding)
>
> I agree with all of your points Yonik.
>
> What do other people think...?

My ideal would be to go the opposite direction: shrink Lucene to a minimal
specification, and put all serious functionality into plugins.

On the other hand, making giant bloatware official policy seems like the
natural progression for Lucene. ;)

Marvin Humphrey


RE: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Uwe Schindler
-1, I don't use Solr, and I still want to be able to use Lucene without any Solr bloat! I tend to agree with Marvin's comment.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]



Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Mark Miller-3
You would still be able to. I still have some misgivings too, but this
should not be one of them. Lucene would still exist without Solr for
those that don't use Solr.

On 02/26/2010 04:44 PM, Uwe Schindler wrote:

> -1, I don't use Solr, and I still want to be able to use Lucene without any Solr bloat! I tend to agree with Marvin's comment.


--
- Mark

http://www.lucidimagination.com




RE: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

steve_rowe
In reply to this post by Michael McCandless-2
On 02/24/2010 at 2:20 PM, Yonik Seeley wrote:
> I've started to think that a merge of Solr and Lucene would be in the
> best interest of both projects.

The Sorlucene :) merger could be achieved virtually, i.e. via policy, rather than physically merging:

1. Transfer Solr stuff that logically belongs in Lucene over to Lucene.
2. Make Solr depend on Lucene trunk.
3. Block any future commits to either project that don't have a coordinating change for the other project.
4. Coordinate releases.

Done.

Steve


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Yonik Seeley-2-2
On Fri, Feb 26, 2010 at 5:15 PM, Steven A Rowe <[hidden email]> wrote:
> On 02/24/2010 at 2:20 PM, Yonik Seeley wrote:
>> I've started to think that a merge of Solr and Lucene would be in the
>> best interest of both projects.
>
> The Sorlucene :) merger could be achieved virtually, i.e. via policy, rather than physically merging:

Everything is virtual here anyway :-)
I agree with Mike that a single dev list is highly desirable.  There
would still be separate downloads.  What to do with some of the other
stuff is unspecified.

Committers would need to be merged though - that's the only way to
make a change across projects w/o breaking stuff.

-Yonik

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Michael McCandless-2
To make this more concrete, I think this is roughly what's being
proposed:

  * Merging the dev lists into a single list.

  * Merging committers.

  * When a change is committed to Lucene, it must pass all Solr
    tests.

  * Release both at once.

These things would not change:

  * Most importantly, the source code would remain factored into
    separate dirs/modules.

  * User's lists should remain separate.

  * Web sites would remain separate.

  * Solr & Lucene are still separate downloads, separate JARs,
    separate subdirs in the source tree, etc.

The outside world still sees Solr & Lucene as separate entities.  It's
only that they would now be developed/released in synchrony.

There are some important gains by doing this:

  * Single source for all the code dup we now have across the
    projects (my original reason, specifically on analyzers, for
    starting this).

  * Whenever a new feature is added to Lucene, we'd work through what
    the impact is to Solr.  This can still mean we separately develop
    exposure in Solr, but it'd get us to at least more immediately
    think about it.

  * Solr is Lucene's biggest direct user -- most people who use Lucene
    use it through Solr -- so having it more closely integrated means
    we know sooner if we broke something.

  * If we were integrated, I could test right now whether flex breaks
    anything in Solr.  I can't do that today since Solr isn't upgraded to 3.1.

Recent big changes (eg segment based searching, Version, attr based
tokenstream api) caused a lot of work in Solr that could've been much
smoother had Solr "been there" as we were working through them.

Recent new features, eg near-real-time search, which are unavailable
in Solr still, would have at least had some discussion about how to
expose this in Solr.

Over time (and we don't have to do this right on day 1) we can make
core capabilities available to pure Lucene.  EG core Lucene users
should be able to use faceting, use a schema, etc.

I think this idea makes a lot of sense and I think now is a good time
to do it.  Yes, this is a big change, but I think the gains are sizable.
As Lucene & Solr diverge more, it'll only become harder and harder to
merge.

Robert's massive patch on SOLR-1657, upgrading most of Solr's analyzers
to 3.0, is aging... while other changes to analyzers are being
proposed (SOLR-1799).  If we were integrated (or at least single
source for analyzers), Robert would already have committed it.

Mike


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Ian Holsman (Lists)
I'm not a committer here (or on SOLR), so I can't vote, but I'm
generally against this. On the flip side, I've been using SOLR for
quite a while.

Firstly, SOLR is not the only application that uses Lucene as a webservice.

Waiting for SOLR developers to implement refactorings and changes made
to the core will hamper Lucene development, and things like katta,
elastic search, neo4j, and zoie will be treated like second-class
citizens and suffer.

It will also hamper innovative new developments, as now 'oh.. this will
break SOLR', or 'SOLR can't use that easily' will stop them. I'm curious
how the NRT enhancements and payload changes would have gone if they had
to wait for SOLR to change stuff to make them work. And most of the SOLR
devs are on the Lucene dev list anyway.

SOLR should just be treated like any API user of Lucene, and Lucene
should not be limited by SOLR.


As for the original reason: I support breaking out the analyzers and
making them more generic, or pushing the changes SOLR (and Nutch
and whoever) have made back down into the core.

As for the assertion that SOLR is the largest user of Lucene, I don't
even know how you could back that up, and even if it is today, that
might change tomorrow.
The web is a fickle place.

So I'm pretty happy with how things are going today. Lucene is a
library that other things can include. SOLR is a webservice using Lucene.



Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Shalin Shekhar Mangar
In reply to this post by Michael McCandless-2
On Sun, Feb 28, 2010 at 4:27 PM, Michael McCandless <
[hidden email]> wrote:

> To make this more concrete, I think this is roughly what's being
> proposed:
>
>  * Merging the dev lists into a single list.
>
>  * Merging committers.
>
>  * When a change is committed to Lucene, it must pass all Solr
>    tests.
>
>  * Release both at once.
>
> These things would not change:
>
>  * Most importantly, the source code would remain factored into
>    separate dirs/modules.
>
>  * User's lists should remain separate.
>
>  * Web sites would remain separate.
>
>  * Solr & Lucene are still separate downloads, separate JARs,
>    separate subdirs in the source tree, etc.
>
> The outside world still sees Solr & Lucene as separate entities.  It's
> only that they would now be developed/released in synchrony.
>
>
+1

--
Regards,
Shalin Shekhar Mangar.

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Mattmann, Chris A (3010)
In reply to this post by Ian Holsman (Lists)
Hi All,

+1, I'm with Ian on this one. Loose coupling is always better in these types of situations...

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [hidden email]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Marvin Humphrey
In reply to this post by Michael McCandless-2
On Sun, Feb 28, 2010 at 05:57:05AM -0500, Michael McCandless wrote:
> Robert's massive patch on SOLR-1657, upgrading most Solr's analyzers
> to 3.0, is aging... while other changes to analyzers are being
> proposed (SOLR-1799).  If we were integrated (or at least single
> source for analyzers), Robert would already have committed it.

Is Analyzer's interface mature and stable enough to break out?  Massive
patches which can't be applied easily... that doesn't seem like a good sign.

On the other hand, if Analyzers are installed independently, they can have
their own version, which could advance independently of Lucene.  The need for
matchVersion would go away in the context of analysis, to be replaced by
a traditional versioning system which I think users would find easier to
grok.

Marvin Humphrey


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Michael Busch
In reply to this post by Michael McCandless-2
I'm not very happy with this proposal. I certainly understand what it is
trying to achieve, though. I'd like to see a tighter integration
and communication between Lucene core and SOLR too, but the proposed
requirements seem much too strict. For example, I think it's a good
idea for SOLR to ride on Lucene's trunk again. This will show
potential problems of API changes and new features in Lucene much more
quickly. It will also help SOLR to use new Lucene features much more
quickly.

However, I'm -1 for these points:

  * When a change is committed to Lucene, it must pass all Solr tests.
  * Release both at once.

SOLR is a consumer of Lucene's API. So what this requirement basically
translates to is that I, as a Lucene committer, now have to not only
make sure Lucene's backwards-compatibility is ensured, but also that I
make all necessary changes in SOLR. So I have to know much more code
suddenly and potentially make many more changes. But this doesn't
help all the other Lucene consumers out there. I invested several
weeks upgrading our software at IBM to 3.0 APIs, because I had 5000
compile errors.
I think the Lucene backwards-compatibility policy is very strict
already and it often takes more time working on bw-compat than the
actual feature. With the additional requirement above this will get
worse, and I'm afraid it might slow down Lucene's progress.

I don't disagree that things like moving function queries from SOLR to
Lucene have failed - but we have to ask why they weren't added to
Lucene in the first place. Was there ever a discussion whether those
queries should be added to Lucene or SOLR when they were developed? I'd
also love to see a powerful facet engine in Lucene, with SOLR
building its faceting features on top of those APIs.

So I'm +1 for better communication (maybe even merging the dev lists) and
especially talking about where a new feature should live before
working on a patch.

  Michael

On 2/28/10 2:57 AM, Michael McCandless wrote:

> To make this more concrete, I think this is roughly what's being
> proposed:
>
>    * Merging the dev lists into a single list.
>
>    * Merging committers.
>
>    * When a change is committed to Lucene, it must pass all Solr
>      tests.
>
>    * Release both at once.
>
> These things would not change:
>
>    * Most importantly, the source code would remain factored into
>      separate dirs/modules.
>
>    * Users' lists should remain separate.
>
>    * Web sites would remain separate.
>
>    * Solr & Lucene are still separate downloads, separate JARs,
>      separate subdirs in the source tree, etc.
>
> The outside world still sees Solr & Lucene as separate entities.  It's
> only that they would now be developed/released in synchrony.
>
> There are some important gains by doing this:
>
>    * Single source for all the code dup we now have across the
>      projects (my original reason, specifically on analyzers, for
>      starting this).
>
>    * Whenever a new feature is added to Lucene, we'd work through what
>      the impact is to Solr.  This can still mean we separately develop
>      exposure in Solr, but it'd get us to at least more immediately
>      think about it.
>
>    * Solr is Lucene's biggest direct user -- most people who use Lucene
>      use it through Solr -- so having it more closely integrated means
>      we know sooner if we broke something.
>
>    * Right now I could test whether flex breaks anything in Solr.  I
>      can't do that today since Solr isn't upgraded to 3.1.
>
> Recent big changes (eg segment based searching, Version, attr based
> tokenstream api) caused a lot of work in Solr that could've been much
> smoother had Solr "been there" as we were working through them.
>
> Recent new features, eg near-real-time search, which are unavailable
> in Solr still, would have at least had some discussion about how to
> expose this in Solr.
>
> Over time (and we don't have to do this right on day 1) we can make
> core capabilities available to pure Lucene.  EG core Lucene users
> should be able to use faceting, use a schema, etc.
>
> I think this idea makes a lot of sense and I think now is a good time
> to do it.  Yes, this is a big change, but I think the gains are sizable.
> As Lucene & Solr diverge more, it'll only become harder and harder to
> merge.
>
> Robert's massive patch on SOLR-1657, upgrading most of Solr's analyzers
> to 3.0, is aging... while other changes to analyzers are being
> proposed (SOLR-1799).  If we were integrated (or at least single
> source for analyzers), Robert would already have committed it.
>
> Mike
>
> On Fri, Feb 26, 2010 at 5:20 PM, Yonik Seeley
> <[hidden email]>  wrote:
>    
>> On Fri, Feb 26, 2010 at 5:15 PM, Steven A Rowe<[hidden email]>  wrote:
>>      
>>> On 02/24/2010 at 2:20 PM, Yonik Seeley wrote:
>>>        
>>>> I've started to think that a merge of Solr and Lucene would be in the
>>>> best interest of both projects.
>>>>          
>>> The Sorlucene :) merger could be achieved virtually, i.e. via policy, rather than physically merging:
>>>        
>> Everything is virtual here anyway :-)
>> I agree with Mike that a single dev list is highly desirable.  There
>> would still be separate downloads.  What to do with some of the other
>> stuff is unspecified.
>>
>> Committers would need to be merged though - that's the only way to
>> make a change across projects w/o breaking stuff.
>>
>> -Yonik
>>
>>      
>    


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Shalin Shekhar Mangar
In reply to this post by Ian Holsman (Lists)
On Sun, Feb 28, 2010 at 9:37 PM, Ian Holsman <[hidden email]> wrote:

>
> waiting for SOLR developers to implement re-factorings and changes made to
> the core will hamper lucene development.
> and things like katta, elastic search, neo4j, and zoie will be treated like
> 2nd class citizens and suffer.
>

Lucene changes don't need to be in Solr immediately and they won't be, until
somebody has the itch. Many Lucene bugs have been caught by Solr's tests and
making sure that a change passes Solr's test suite is a good thing. A Lucene
change that fails Solr's tests is either a bug or a backwards-incompatible
API change. If it is the latter then I believe changing Solr is a good
lesson in the magnitude of changes needed in a typical Lucene application.
Possibly, those lessons can lead to a more flexible/simpler API. This is
relevant for new features as well. For example, look at how the trie range
query was affected when Solr came into the picture.

I know that many Lucene developers like to use newer features as soon as
possible. But seriously, how many update their Lucene applications to
support these changes in sync with a patch or even trunk? _*Striving*_ to
keep Solr in sync with Lucene will give instant feedback which, I think,
will help us build better APIs and give Lucene users a better experience.

Consider another argument: Solr's use of Lucene can be advertised as a
best-practice which can be a huge help for Lucene users. You want to know
how to add caching on top of Lucene? See Solr. Replication? See Solr etc.

As far as the other projects are concerned, I don't see why they will be
treated as 2nd class citizens. The Lucene core will continue to be separate
and if some of Solr's features are available to those projects in an easy to
assimilate Java API, they too benefit from it. It is a win-win situation.


> It will also hamper innovative new developments, as now 'oh.. this will
> break SOLR', or 'SOLR can't use that easily' will stop them. I'm curious how
> the NRT enhancements and payload changes would have gone if they had to wait
> for SOLR to change stuff to make them work. and most of the SOLR dev's are
> on the lucene dev list anyway.
>

Again, nobody is proposing that all new features must have corresponding
support in Solr. New features are anyway designed to be backward compatible
and all the proposal says is that the changes should not break Solr, which
makes sense.

--
Regards,
Shalin Shekhar Mangar.

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Mark Miller-3
In reply to this post by Michael Busch
On 02/28/2010 12:52 PM, Michael Busch wrote:
> ... I think it's a good
> idea for SOLR to ride on Lucene's trunk again...
> However, I'm -1 for these points:
>
>  * When a change is committed to Lucene, it must pass all Solr tests.
>  * Release both at once.
>
>

These are huge reasons why we *don't* want SOLR to ride on Lucene's
trunk anymore.

bq. but we have to ask why they weren't added to Lucene in the first place.

Because the two communities are fairly separate in a lot of ways. This
is one of the things a potential merge would solve. We can say that the
projects should communicate more all we like - the history of saying
such things implies there will be no changes though.

I'm still +0 here, but I'm starting to lean towards merge just sitting
here disagreeing with everyone arguing against :)

Solr is actually part of the project "Lucene" along with Lucene-Java.
The divide now is actually almost unnatural considering how things
are organized.

To those arguing that this would make Solr a first class citizen of
Lucene over other search solutions that use Lucene, that actually
already is the case, and the way things are set up, it should be. Solr is
part of the Lucene project. Other Lucene search engines are not. That
doesn't mean we shouldn't consider Lucene changes in the context of all
the projects that may use it, but Solr already is a first class citizen.
It's not just some project using Lucene - it's *the* Lucene project's
Search Server. Lucene devs *should* consider Solr when developing on
Lucene Java - they are the same project - Lucene.

--
- Mark

http://www.lucidimagination.com




Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Mark Miller-3
In reply to this post by Shalin Shekhar Mangar
On 02/28/2010 01:32 PM, Shalin Shekhar Mangar wrote:
> A Lucene
> change that fails Solr's tests is either a bug or a backwards-incompatible
> API change.
>    

Not always. I still argue that per segment searching was a valid change
that was backwards compatible - but it broke Solr because Solr ignored
MultiSearcher and went on the assumption that a single Searcher had
access to the entire index. That's somewhat against the design of
Lucene, which doesn't (and can't) make such assumptions.

--
- Mark

http://www.lucidimagination.com




Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Mattmann, Chris A (3010)
In reply to this post by Mark Miller-3
Hi Mark,

Thanks for the feedback. My concern is that if the two communities are pretty separate, then it is going to be more difficult to merge them, and it's not always a good thing to take separate modules (or communities) and integrate them into a monolith, whether physically in the code or community-wise. A bunch of us learned this the hard way in OODT-ville at NASA, and we moved towards a more loosely coupled solution, even at the expense of "being out of date" from time to time. This experience makes it difficult for me to support such a move...

Thanks!

Cheers,
Chris



On 2/28/10 10:39 AM, "Mark Miller" <[hidden email]> wrote:

On 02/28/2010 12:52 PM, Michael Busch wrote:
> ... I think it's a good
> idea for SOLR to ride on Lucene's trunk again...
> However, I'm -1 for these points:
>
>  * When a change is committed to Lucene, it must pass all Solr tests.
>  * Release both at once.
>
>

These are huge reasons why we *don't* want SOLR to ride on Lucene's
trunk anymore.

bq. but we have to ask why they weren't added to Lucene in the first place.

Because the two communities are fairly separate in a lot of ways. This
is one of the things a potential merge would solve. We can say that the
projects should communicate more all we like - the history of saying
such things implies there will be no changes though.

I'm still +0 here, but I'm starting to lean towards merge just sitting
here disagreeing with everyone arguing against :)

Solr is actually part of the project "Lucene" along with Lucene-Java.
The divide now is actually almost unnatural considering how things
are organized.

To those arguing that this would make Solr a first class citizen of
Lucene over other search solutions that use Lucene, that actually
already is the case, and the way things are set up, it should be. Solr is
part of the Lucene project. Other Lucene search engines are not. That
doesn't mean we shouldn't consider Lucene changes in the context of all
the projects that may use it, but Solr already is a first class citizen.
It's not just some project using Lucene - it's *the* Lucene project's
Search Server. Lucene devs *should* consider Solr when developing on
Lucene Java - they are the same project - Lucene.

--
- Mark

http://www.lucidimagination.com






++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [hidden email]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Jason Rutherglen
In reply to this post by Mark Miller-3
I think it's Solr rather than SOLR. :-)  A little birdy told me so...

On Sun, Feb 28, 2010 at 10:39 AM, Mark Miller <[hidden email]> wrote:

> On 02/28/2010 12:52 PM, Michael Busch wrote:
>>
>> ... I think it's a good
>> idea for SOLR to ride on Lucene's trunk again...
>> However, I'm -1 for these points:
>>
>>  * When a change is committed to Lucene, it must pass all Solr tests.
>>  * Release both at once.
>>
>>
>
> These are huge reasons why we *don't* want SOLR to ride on Lucene's trunk
> anymore.
>
> bq. but we have to ask why they weren't added to Lucene in the first place.
>
> Because the two communities are fairly separate in a lot of ways. This is
> one of the things a potential merge would solve. We can say that the
> projects should communicate more all we like - the history of saying such
> things implies there will be no changes though.
>
> I'm still +0 here, but I'm starting to lean towards merge just sitting here
> disagreeing with everyone arguing against :)
>
> Solr is actually part of the project "Lucene" along with Lucene-Java. The
> divide now is actually almost unnatural considering how things
> are organized.
>
> To those arguing that this would make Solr a first class citizen of Lucene
> over other search solutions that use Lucene, that actually already is the
> case, and the way things are set up, it should be. Solr is part of the Lucene
> project. Other Lucene search engines are not. That doesn't mean we shouldn't
> consider Lucene changes in the context of all the projects that may use it,
> but Solr already is a first class citizen. It's not just some project using
> Lucene - it's *the* Lucene project's Search Server. Lucene devs *should*
> consider Solr when developing on Lucene Java - they are the same project -
> Lucene.
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

Grant Ingersoll-2
In reply to this post by Michael Busch

On Feb 28, 2010, at 9:52 AM, Michael Busch wrote:

> I'm not very happy with this proposal. I certainly understand what is
> being tried to achieve though. I'd like to see a tighter integration
> and communication between Lucene core and SOLR too, but the proposed
> requirements seem much too strict. For example, I think it's a good
> idea for SOLR to ride on Lucene's trunk again. This will show
> potential problems of API changes and new features in Lucene much more
> quickly. It will also help SOLR to use new Lucene features much more quickly.
>
> However, I'm -1 for these points:
>
> * When a change is committed to Lucene, it must pass all Solr tests.

Not sure why more tests would be a negative.  The Solr tests exercise quite a bit of Lucene functionality as well.

-Grant