Moving SweetSpotSimilarity out of contrib

Shai Erera
Hi,

Following Doron's quality work enhancements in TREC 2007 (http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team), I was wondering if it's possible to move the SweetSpotSimilarity to Lucene's main code stream (out of "contrib" that is).
It shows significant improvement over the default similarity.

I'm not suggesting to replace the DefaultSimilarity (as the default) with SweetSpot, but just expose SweetSpot as part of Lucene's core. It will help me use it, since I cannot use the contrib packages easily in my environment (legal issues), but can use Lucene's core more freely.

Any objections?

Thanks,
Shai

Re: Moving SweetSpotSimilarity out of contrib

Grant Ingersoll-2

On Sep 2, 2008, at 6:07 AM, Shai Erera wrote:

> Hi,
>
> Following Doron's quality work enhancements in TREC 2007 (http://wiki.apache.org/lucene-java/TREC_2007_Million_Queries_Track_-_IBM_Haifa_Team), I was wondering if it's possible to move the SweetSpotSimilarity to Lucene's main code stream (out of "contrib" that is).
> It shows significant improvement over the default similarity.

My understanding is it requires a bit of tuning, right?  I'd want to make sure people have the right information to use it intelligently, but otherwise, it seems reasonable.

> I'm not suggesting to replace the DefaultSimilarity (as the default) with SweetSpot, but just expose SweetSpot as part of Lucene's core. It will help me use it, since I cannot use the contrib packages easily in my environment (legal issues), but can use Lucene's core more freely.

This strikes me as really odd. The contrib modules are released under the exact same terms as the core, but heh, I'm not a lawyer...  Is there anything you think we should be concerned with?

-Grant


Re: Moving SweetSpotSimilarity out of contrib

Shai Erera
From a legal standpoint, whenever we need to use open-source code, somebody has to inspect the code and 'approve' it. This inspection makes sure there's no use of 3rd party libraries, to which we'd need to get open-source clearance as well.

This process was done for Lucene core, but not for contrib, in my company. AFAIU, this process should be done by a company if it wants to (usually mandatory when you integrate open-source code in your products). Therefore I don't think the Lucene community should be concerned with this.

The only thing that the community can do is to move as much as possible to the core, so that if a company inspects the code, it will cover as much as possible. Of course, this may sound too 'broad' of a statement and I definitely don't think everything should belong to 'core'. My understanding is that the 'contrib' packages include 3rd party libraries (like Snowball), while there are packages which do not require any 3rd party libs (like SweetSpotSimilarity). For those that require 3rd party libs, it makes sense to leave them in contrib. For those that don't, per-request, it might make sense to move them to 'core' in order to encourage people to use them. That's why I was asking if it's a problem to move SweetSpot to 'core'.

As for your questions on SweetSpot, from what I understand in the code, an application should configure it with different values, depending on the TF computation method it wants to use (hyperbolic or baseline). The default implementation in SweetSpot for tf() is to use the baseline method, while an application can extend SweetSpot and override tf() to use the hyperbolic one.
An application can also configure the length norm parameters for different fields.
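
To make the two tf() methods mentioned above concrete, here is a minimal standalone sketch of the curves as I understand them from the contrib class's documentation. It does not use Lucene at all: the class name, method signatures and the parameter values in main() are hypothetical and chosen only for illustration, so check the formulas against the actual source before relying on them.

```java
/** Standalone sketch of the two tf() curves; NOT the Lucene class itself. */
final class SweetSpotTfSketch {

    /**
     * Baseline tf (assumed formula): flat at `base` for up to `min`
     * occurrences, then square-root growth that continues from `base`.
     */
    static float baselineTf(float freq, float base, float min) {
        if (freq == 0.0f) {
            return 0.0f; // a term that does not occur contributes nothing
        }
        return (freq <= min)
                ? base
                : (float) Math.sqrt(freq + (base * base) - min);
    }

    /**
     * Hyperbolic tf (assumed formula): an S-shaped curve bounded by `min`
     * and `max`, steepest around `xoffset` occurrences.
     */
    static float hyperbolicTf(float freq, float min, float max,
                              double base, float xoffset) {
        if (freq == 0.0f) {
            return 0.0f;
        }
        double x = freq - xoffset;
        double up = Math.pow(base, x);
        double down = Math.pow(base, -x);
        float result = min
                + (float) ((max - min) / 2.0 * ((up - down) / (up + down) + 1.0));
        // For very large freq, up overflows to infinity and the ratio is NaN;
        // the curve has saturated by then, so fall back to the upper bound.
        return Float.isNaN(result) ? max : result;
    }

    public static void main(String[] args) {
        // Illustrative parameter values only (assumptions, not recommendations).
        for (float freq : new float[] {1f, 5f, 10f, 20f, 100f}) {
            System.out.println(freq
                    + " baseline=" + baselineTf(freq, 1.5f, 5f)
                    + " hyperbolic=" + hyperbolicTf(freq, 0f, 2f, 1.3, 10f));
        }
    }
}
```

The practical difference is that the baseline curve keeps growing with term frequency (only more slowly), while the hyperbolic curve levels off at its upper bound, so very repetitive documents stop gaining score.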

From what I read, the code is well documented. Perhaps Doron can add some high-level documentation on what's the benefit of each tf() computation method, or give some references. But the defaults seem to make sense, so an application can definitely start with the default (if it wants to).
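
For the length norm parameters, a similar standalone sketch may help show what there is to tune. It assumes the formula is 1/sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1), which is my reading of the contrib class; the class and method names below are hypothetical, and the values in main() are illustrative rather than the shipped defaults.

```java
/** Standalone sketch of a "sweet spot" length norm; NOT the Lucene class itself. */
final class SweetSpotLengthNormSketch {

    /**
     * Returns 1.0 for any field length inside [min, max] (the sweet spot)
     * and decays outside it; `steepness` controls how fast the decay is.
     */
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        float distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // Hypothetical sweet spot of 10..100 terms for some field.
        for (int len : new int[] {1, 10, 50, 100, 200, 1000}) {
            System.out.println(len + " terms -> " + lengthNorm(len, 10, 100, 0.5f));
        }
    }
}
```

With min and max both set to 1 the plateau collapses to a single point and the norm simply falls off with length, which is why a field whose typical length is known benefits from an explicitly configured range.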

Shai


Re: Moving SweetSpotSimilarity out of contrib

hossman

: From a legal standpoint, whenever we need to use open-source code, somebody
: has to inspect the code and 'approve' it. This inspection makes sure there's
: no use of 3rd party libraries, to which we'd need to get open-source
: clearance as well.
:
: This process was done for Lucene core, but not for contrib, in my company.
: AFAIU, this process should be done by a company if it wants to (usually
: mandatory when you integrate open-source code in your products). Therefore I
: don't think the Lucene community should be concerned with this.

You should talk to whomever you need to talk to at your company about
revising the approach you are taking -- the core vs contrib distinction in
Lucene-Java is one of our own making that is completely artificial.  With
Lucene 2.4 we could decide to split what is currently known as the "core"
into 27 different directories, none of which are called core, and all of
which have an interdependency on each other.  We're not likely to, but we
could -- and then where would your company be?

What you should be concerned with is what gets released: a Lucene-Java
release contains all of the Lucene-Java core code as well as the contrib
code ... it should be considered one cohesive unit released by the Apache
Lucene project.  Things like Solr, Nutch, Mahout on the other hand -- they
are released separately by the Apache Lucene project.

: The only thing that the community can do is to move as much as possible to
: the core, so that if a company inspects the code, it will cover as much as

Doing this would actually be a complete reversal of the goals discussed in
the near past:  increasing our use of the contrib structure for new
features that aren't inherently tied to the "guts" of Lucene.  The goal
being to keep the "core" jar as small as possible for people who want to
develop apps with a small footprint.

At one point there was even talk of refactoring additional code out of the
core and into a contrib (this was already done with some analyzers when
Lucene became a TLP).


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Moving SweetSpotSimilarity out of contrib

Nadav Har'El
On Tue, Sep 02, 2008, Chris Hostetter wrote about "Re: Moving SweetSpotSimilarity out of contrib":

>
> : From a legal standpoint, whenever we need to use open-source code, somebody
> : has to inspect the code and 'approve' it. This inspection makes sure there's
> : no use of 3rd party libraries, to which we'd need to get open-source
> : clearance as well.
>
> You should talk to whomever you need to talk to at your company about
> revising the approach you are taking -- the core vs contrib distinction in
> Lucene-Java is one of our own making that is completely artificial.  With
> Lucene 2.4 we could decide to split what is currently known as the "core"
> into 27 different directories, none of which are called core, and all of
> which have an interdependency on each other.  We're not likely to, but we
> could -- and then where would your company be?

I can't really defend the lawyers (sometimes you get the feeling that they
are out to slow you down, rather than help you :( ), but let me try to explain
where this sort of thinking comes from, because I think it is actually quite
common.

Lucene makes the claim that it has the "apache license", so that any company
can (to make a long story short) use this code. But when a company sets out
to use Lucene, can it take this claim at face value? After all, what happens
if somebody steals some proprietary code and puts it up on the web claiming it
has the apache license - does it give the users of that stolen code any
rights? Of course not, because the rights weren't the distributor's to give
out in the first place.

So it is quite natural that when a company wants to use some open-source
code it doesn't take the license at face value, and rather does some "due
diligence" to verify that the people who published this code really owned
the rights to it. E.g., the company lawyers might want to do some background
checks on the committers, look at the project's history (e.g., that it doesn't
have some "out of the blue" donations from vague sources), check the code and
comments for suspicious strings, patterns, and so on.

When you need to inspect the code, naturally you need to decide what you
inspect. This particular company chose to inspect only the Lucene core,
perhaps because it is smaller, has fewer contributors, and has the vast
majority of what most Lucene users need. Inspecting all the contrib - with
all its foreign language analyzers, stuff like gdata and other rarely used
stuff - may be too hard for them. But then, the question I would ask is -
why not inspect the core *and* the few contribs that interest you? For
example, SweetSpotSimilarity (which you need) and other generally useful
stuff like Highlighter and SnowballAnalyzer.

> Doing this would actually be a complete reversal of the goals discussed in
> the near past:  increasing our use of the contrib structure for new
> features that aren't inherently tied to the "guts" of Lucene.  The goal
> being to keep the "core" jar as small as possible for people who want to
> develop apps with a small footprint.

I agree that this is an important goal.

> At one point there was even talk of refactoring additional code out of the
> core and into a contrib (this was already done with some analyzers when
> Lucene became a TLP)

--
Nadav Har'El                        |      Wednesday, Sep  3 2008, 3 Elul 5768
IBM Haifa Research Lab              |-----------------------------------------
                                    |Promises are like babies: fun to make,
http://nadav.harel.org.il           |but hell to deliver.


Re: Moving SweetSpotSimilarity out of contrib

Shai Erera
Thanks all for the "legal" comments.

Can we consider moving the SweetSpotSimilarity to "core" because of the quality improvements it introduces to search? I tried to emphasize that that's the main reason, but perhaps I didn't do a good job at that, since the discussion has turned into a legal issue :-).


Re: Moving SweetSpotSimilarity out of contrib

Mark Miller-3
In reply to this post by Nadav Har'El
I think it's a fair question that, regardless of the legal mumbo jumbo
provoking it, should be considered on its merits: is it something
important enough to bulk up the core with, the trade-off being that
more people will find it helpful and can use it with slightly less hassle?

I have seen discussion about core vs contrib before, and from what
I saw, the distinction and rules are not quite clear. I would think
though, if the new Similarity is really that much better than the old,
it might actually benefit in core. There is no doubt core gets more
attention on both the user and developer side, and important pieces with
general usages should probably be there.

I haven't used it myself, so I won't guess (too much <g>), but the
question to me seems to be, is SweetSpot important enough to move to
core? Are there enough good reasons? And even if so, is it ready to move
to core? Contrib also seems to be somewhat of a possible incubation area...


Re: Moving SweetSpotSimilarity out of contrib

mark harwood
In reply to this post by Shai Erera
Not tried SweetSpot so can't comment on worthiness of moving to core but agree with the principle that we can't let the hassles of a company's "due diligence" testing dictate the shape of core vs contrib.

For anyone concerned with the overhead of doing these checks a company/product of potential interest is "Black Duck".
I don't work for them and don't offer any endorsement but simply point them out as something you might want to take a look at.

Cheers
Mark




Re: Moving SweetSpotSimilarity out of contrib

hossman
In reply to this post by Mark Miller-3

: saw, the distinction and rules are not quite clear. I would think though, if
: the new Similarity is really that much better than the old, it might actually
: benefit in core. There is no doubt core gets more attention on both the user
: and developer side, and important pieces with general usages should probably
: be there.

I see a Chicken/Egg argument here ... Perhaps contribs would get more
attention if we used them more -- as in: put more stuff in them.

: I haven't used it myself, so I won't guess (too much <g>), but the question to
: me seems to be, is SweetSpot important enough to move to core? Are there
: enough good reasons? And even if so, is it ready to move to core? Contrib also
: seems to be somewhat of a possible incubation area...

I think that's the wrong question to ask.  I would rather ask the question
"Is X decoupled enough from Lucene internals that it can be a contrib?"
Things like IndexWriter, IndexReader, Document and TokenStream really need
to be "core" ... but things like the QueryParser, and most of our
analyzers don't.  Having lots of loosely coupled mini-libraries that
respect good API boundaries seems more reusable and generally saner than
"all of this code is useful and lots of people want it so throw it into
the kitchen sink"

We don't need to go hog wild gutting things out of the core ... but I
don't think we should be adding new things to the core just because they
are "generally useful".


-Hoss



Re: Moving SweetSpotSimilarity out of contrib

Mark Miller-3
I would agree with you if I was wrong about the contrib/core attention
thing, but I don't think I am. It seems as if you have been arguing that
contrib is really just an extension of core, on par with core, but just
in different libs, and to keep core lean and mean, anything not needed
in core shouldn't be there - sounds like an idea I could get behind, but
seems to ignore the reality:

The user/dev focus definitely seems to be on core. Some of contrib is a
graveyard in terms of dev and use I think. I think it's still entangled
in its "sandbox" roots.

Contrib lacks many requirements of core code - it can be java 1.5, it
doesn't have to be backward compatible, etc. Putting something in core
ensures it's treated as a Lucene first class citizen, stuff in contrib is
not held to such strict standards.

Even down to the people working on the code, there is a lower bar to
become a contrib committer than a full core committer (see my contrib
committer status <g>).

It's not that I don't like what you propose, but I don't buy it as very
viable the way things are now. IMO we would need to do some work to make
it a reality. It can be said that's the way it is, but my view of things
doesn't jibe with it.

I may have miswritten "generally useful". What I meant was, if the
sweet spot sim is better than the default sim, but a bit harder to use
because of config, perhaps it is "core" enough to go there, as often it
may be better to use. Again, I fully believe it would get more attention
and be 'better' maintained. I did not mean to set the bar at "generally
useful" and I apologize for my imprecise language (one of my many faults).


RE: Moving SweetSpotSimilarity out of contrib

steve_rowe
In reply to this post by hossman
On 09/03/2008 at 2:00 PM, Chris Hostetter wrote:

> On 09/03/2008 at 8:40 AM, Mark Miller wrote:
> > I haven't used it myself, so I won't guess (too much <g>), but the
> > question to me seems to be, is SweetSpot important enough to move to
> > core? Are there enough good reasons? And even if so, is it ready to
> > move to core? Contrib also seems to be somewhat of a possible
> > incubation area...
>
> I think that's the wrong question to ask.  I would rather ask the
> question "Is X decoupled enough from Lucene internals that it can be a
> contrib?" Things like IndexWriter, IndexReader, Document and TokenStream
> really need to be "core" ... but things like the QueryParser, and most
> of our analyzers don't.  Having lots of loosely coupled mini-libraries
> that respect good API boundaries seems more reusable and generally saner
> than "all of this code is useful and lots of people want it so throw it
> into the kitchen sink"
>
> We don't need to go hog wild gutting things out of the core ... but I
> don't think we should be adding new things to the core just
> because they are "generally useful".

One of core's requirements is: no external dependencies.  Although many contrib components meet this requirement, there is no structural differentiation between them and those that don't.  So from the point of view of simplifying lawyers' licensing labors :), it might make sense to split off a "contrib-no-ext-deps".

Steve


Re: Moving SweetSpotSimilarity out of contrib

Michael McCandless-2
In reply to this post by Mark Miller-3
Another important driver is the "out-of-the-box experience".

It's crucial that Lucene has good starting defaults for everything
because many developers will stick with these defaults and won't
discover the wiki page that says you need to do X, Y and Z to get
better relevance, indexing speed, searching speed, etc.  This then
makes Lucene look bad, not only to these Lucene users but then also to
the end users who use their apps that say "Powered by Lucene".

It also affects Lucene's adoption/growth over time: when a potential
new user is just "trying Lucene out" we want our defaults to shine
because those new users will walk away if Lucene doesn't compare well
to other engines that are well-tuned out-of-the-box.

I remember a while back we discussed an article comparing performance
of various search engines and we were disappointed that the author
didn't do X, Y and Z to let Lucene compete fairly.  If we had good
defaults that wouldn't have happened (or, at least to a lesser
extent).

Obviously we can't default everything perfectly since at some point
there are hard tradeoffs to be made and every app is different, but if
SweetSpotSimilarity really gives better relevance for many/most apps,
and doesn't have any downsides (I haven't looked closely myself), I
think we should get it into core?

You know... it's almost like we need a "standard distro" (drawing
analogy to Linux) for Lucene, which would be the core plus cherry-pick
certain important contrib modules (highlighter, SweetSpotSimilarity,
snowball, spellchecker, etc.) and bundle them together.  See,
highlighting is obviously well "decoupled" from Lucene's core, so it
should remain in contrib, yet is also clearly a very important function
in nearly every search engine.

Mike

Mark Miller wrote:

> I would agree with you if I was wrong about the contrib/core  
> attention thing, but I don't think I am. It seems as if you have  
> been arguing that contrib is really just an extension of core, on  
> par with core, but just in different libs, and to keep core lean and  
> mean, anything not needed in core shouldn't be there - sounds like  
> an idea I could get behind, but seems to ignore the reality:
>
> The user/dev focus definitely seems to be on core. Some of contrib  
> is a graveyard in terms of dev and use I think. I think its still  
> entangled in its "sandbox" roots.
>
> Contrib lacks many requirements of core code - it can be java 1.5,  
> it doesn't have to be backward compatible, etc. Putting something in  
> core ensures its treated as a Lucene first class citizen, stuff in  
> contrib is not held to such strict standards.
>
> Even down to the people working on the code, there is a lower bar to  
> become a contrib commiter than a full core committer (see my contrib  
> committer status <g>).
>
> Its not that I don't like what you propose, but I don't buy it as  
> very viable the way things are now. IMO we would need to do some  
> work to make it a reality. It can be said thats the way it is, but  
> my view of things doesnt jive with it.
>
> I may have mis written "generally useful". What I meant was, if the  
> sweet spot sim is better than the default sim, but a bit harder to  
> use because of config, perhaps it is "core" enough to go there, as  
> often it may be better to use. Again, I fully believe it would get  
> more attention and be 'better' maintained. I did not mean to set the  
> bar at "generally useful" and I apologize for my imprecise language  
> (one of my many faults).
>> I think that's the wrong question to ask.  I would rather ask the  
>> question "Is X decoupled enough from Lucene internals that it can  
>> be a contrib?"  Things like IndexWriter, IndexReader, Document and  
>> TokenStream really need to be "core" ... but things like the  
>> QueryParser, and most of our analyzers don't.  Having lots of  
>> loosely coupled mini-libraries that respect good API boundaries  
>> seems more reusable and generally saner than "all of this code is
>> useful and lots of people want it so throw it into the kitchen sink"
>>
>> We don't need to go hog wild gutting things out of the core ... but
>> I don't think we should be adding new things to the core just
>> because they are "generally useful".
>>
>>
>> -Hoss
>>
>>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Moving SweetSpotSimilarity out of contrib

mark harwood
>> Another important driver is the "out-of-the-box experience".
>> we need a "standard distro" ... which would be the core plus
>> cherry-pick certain important contrib modules (highlighter,
>> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them
>> together.

Is that not Solr, or at least the start of a path that ultimately ends
up there?
I suspect any attempts at "bundling" Lucene code may snowball until
you've rebuilt Solr.

If anything I suspect a more interesting initiative might be to
"unbundle" Solr and see some more of its features emerge as standalone
modules in Lucene/contrib (or a suitably renamed area e.g. "extensions")?






Re: Moving SweetSpotSimilarity out of contrib

Michael McCandless-2

markharw00d wrote:

> >> Another important driver is the "out-of-the-box experience".
> >> we need a "standard distro" ... which would be the core plus
> >> cherry-pick certain important contrib modules (highlighter,
> >> SweetSpotSimilarity, snowball, spellchecker, etc.) and bundle them
> >> together.
> Is that not Solr, or at least the start of a path that ultimately  
> ends up there?
> I suspect any attempts at "bundling" Lucene code may snowball until  
> you've rebuilt Solr.

Yeah I guess it is... though Solr includes the whole webapp too,  
whereas I think there's a natural bundle that wouldn't include that.

Still, I think it's important for Lucene itself to have strong  
defaults out of the box.

> If anything I suspect a more interesting initiative might be to
> "unbundle" Solr and see some more of its features emerge as
> standalone modules in Lucene/contrib (or a suitably renamed area
> e.g. "extensions")?

I like that!

Mike


Re: Moving SweetSpotSimilarity out of contrib

Yonik Seeley-2
On Wed, Sep 3, 2008 at 4:55 PM, Michael McCandless
<[hidden email]> wrote:
>> I suspect any attempts at "bundling" Lucene code may snowball until you've
>> rebuilt Solr.
>
> Yeah I guess it is... though Solr includes the whole webapp too, whereas I
> think there's a natural bundle that wouldn't include that.

One thing we are looking at for Solr2 is making it more useful for
advanced embedded users.
I expect a non-webapp version too.

-Yonik


Re: Moving SweetSpotSimilarity out of contrib

Grant Ingersoll-2
In reply to this post by Michael McCandless-2

On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote:
>
> Obviously we can't default everything perfectly since at some point
> there are hard tradeoffs to be made and every app is different, but if
> SweetSpotSimilarity really gives better relevance for many/most apps,
> and doesn't have any downsides (I haven't looked closely myself), I
> think we should get it into core?

Well, we only have 2 data points here:  Hoss' original position that  
it was helpful, and Doron's Million Query work.  Has anyone else  
reported benefit?  And in that regard, the difference between OOTB and  
SweetSpot was 0.154 vs. 0.162 for MAP.  Not a huge amount, but still  
useful.  In that regard, there are other length normalization  
functions (namely approaches that don't favor very short documents as  
much) that I've seen benefit applications as well, but as Erik is  
(in)famous for saying "it depends".  In fact, if we go solely based on  
the million query work, we'd be better off having the Query Parser  
create phrase queries automatically for any query w/ more than 1 term  
(0.19 vs 0.154) before we even touch length normalization.
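
To put those MAP figures in relative terms: against the 0.154 baseline, SweetSpot's 0.162 is roughly a 5% gain, while automatic phrase queries at 0.19 is roughly 23%. A minimal Java sketch of that arithmetic follows; the class and method names are illustrative only, and the three MAP values are the ones quoted above.

```java
import java.util.Locale;

public class MapGain {

    /** Relative improvement of candidate over baseline, as a fraction. */
    public static double relativeGain(double baseline, double candidate) {
        return (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        double ootb = 0.154;       // out-of-the-box MAP quoted above
        double sweetSpot = 0.162;  // SweetSpotSimilarity MAP quoted above
        double phrase = 0.19;      // automatic phrase queries MAP quoted above

        // Print each gain as a signed percentage of the baseline.
        System.out.println(String.format(Locale.ROOT,
            "SweetSpot vs OOTB: %+.1f%%", 100 * relativeGain(ootb, sweetSpot)));
        System.out.println(String.format(Locale.ROOT,
            "Phrase queries vs OOTB: %+.1f%%", 100 * relativeGain(ootb, phrase)));
    }
}
```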

I've long argued that Lucene needs to take on the relevance question
more head on, and in an open source way; until then, we are merely
guessing at what's better, w/o empirical evidence that can be easily
reproduced.   TREC is just one data point, and is often discounted as
not being all that useful in the real world.

I'm on the fence, though.  I agree w/ Hoss that core should be "core"  
and I don't think we want to throw more and more into core, but I also  
agree w/ Mike in that we want good, intelligent defaults for what we  
do have in core.

-Grant


Re: Moving SweetSpotSimilarity out of contrib

Doron Cohen-2
My thought was to move SSS to core as a step towards
making it the default, if and when there is more evidence it is
better than the current default - it just felt right as a cautious
step - I mean first move it to core so that it is more exposed
and used, and only after a while, maybe, if the evidence is mostly
positive, make it the default.

On Thu, Sep 4, 2008 at 12:04 AM, Grant Ingersoll <[hidden email]> wrote:

> On Sep 3, 2008, at 3:00 PM, Michael McCandless wrote:
>
> > Obviously we can't default everything perfectly since at some point
> > there are hard tradeoffs to be made and every app is different, but if
> > SweetSpotSimilarity really gives better relevance for many/most apps,
> > and doesn't have any downsides (I haven't looked closely myself), I
> > think we should get it into core?
>
> Well, we only have 2 data points here:  Hoss' original position that it was helpful, and Doron's Million Query work.  Has anyone else reported benefit?  And in that regard, the difference between OOTB and SweetSpot was 0.154 vs. 0.162 for MAP.  Not a huge amount, but still useful.  In that regard, there are other length normalization functions (namely approaches that don't favor very short documents as much) that I've seen benefit applications as well, but as Erik is (in)famous for saying "it depends".  In fact, if we go solely based on the million query work, we'd be better off having the Query Parser create phrase queries automatically for any query w/ more than 1 term (0.19 vs 0.154) before we even touch length normalization.
>
> I've long argued that Lucene needs to take on the relevance question more head on, and in an open source way; until then, we are merely guessing at what's better, w/o empirical evidence that can be easily reproduced.   TREC is just one data point, and is often discounted as not being all that useful in the real world.
>
> I'm on the fence, though.  I agree w/ Hoss that core should be "core" and I don't think we want to throw more and more into core, but I also agree w/ Mike in that we want good, intelligent defaults for what we do have in core.
>
> -Grant



Re: Moving SweetSpotSimilarity out of contrib

hossman
In reply to this post by Mark Miller-3

: Contrib lacks many requirements of core code - it can be java 1.5, it doesn't
: have to be backward compatible, etc. Putting something in core ensures it's
: treated as a Lucene first class citizen, stuff in contrib is not held to such
: strict standards.

"Contribs" as an idea lack those requirements -- but that doesn't mean
individual contribs can't enforce them in order to be "more solid"
contribs on par with core.

The bottom line is that contribs are about modularization, and
compartmentalization of features.  We want to be able to build small
compact jars with well defined dependencies so that if someone wants basic
indexing plus highlighting they know exactly what jars they need ... they
don't have to worry about being surprised at run time by a dependency on
some random class in o.a.l.misc, and they don't have to load every
Lucene jar (or one monolithic Lucene jar) just to play it safe.

At the end of the day there's no reason not to think of the "core" lucene
code base as anything other than a contrib which is not allowed to have
dependencies, and needs to be Java 1.4 compatible.  We could easily
imagine revamping the Lucene code base something like this...

        mv contrib modules
        mkdir modules/core
        mv src/java modules/core/src
        mv src/test modules/core/test
        mv src/demo modules/demo

...it would be a natural migration of what we have now (and it would
simplify the build process quite a bit).

: It's not that I don't like what you propose, but I don't buy it as very viable
: the way things are now. IMO we would need to do some work to make it a
: reality. It can be said that's the way it is, but my view of things doesn't jibe
: with it.

I won't disagree that contribs may seem like second class citizens at the
moment; I just think that it would be better to take steps to elevate the
concept of contribs in people's minds (by moving more things into
contribs, solidifying the policies around individual contribs,
etc...) than to feed the perception by "promoting" things out of a contrib
and into the core without any technical reason for doing so (ie: a new
feature that requires tighter dependency; making SSS the default
Similarity; etc...)


-Hoss



Re: Moving SweetSpotSimilarity out of contrib

hossman
In reply to this post by Michael McCandless-2

: Another important driver is the "out-of-the-box experience".

I honestly have no idea what an OOTB experience for Lucene-Java means ...
For Solr I understand, for Nutch I understand ... for a java library????

The closest thing we can do to describing an OOTB experience is making a
good demo ... and there's no reason the demo can't utilize contrib jars
and tweak settings to be better tuned than the default (if the default is
the way it is for back-compat reasons)

: It's crucial that Lucene has good starting defaults for everything
: because many developers will stick with these defaults and won't
: discover the wiki page that says you need to do X, Y and Z to get
: better relevance, indexing speed, searching speed, etc.  This then
: makes Lucene look bad, not only to these Lucene users but then also to
: the end users who use their apps that say "Powered by Lucene".

But then we get into that back-compat concern.

With something like Solr, we have hardcoded defaults and then we have
recommended settings.  The recommended settings go in the example configs
that ship with every release, but we leave the hardcoded defaults as
backwards compatible as possible except in extreme cases -- in those
cases, we make sure there's a simple setting to restore the old behavior.

For a library like Lucene-Java, the nearest equivalent I can think of is a
global properties file (eeecch) or some static factories for producing
objects that have different compatibility guarantees.  ie:
Similarity.getDefaultSimilarity() is guaranteed to always return an
equivalent Similarity impl for all 1.X
releases, but Similarity.getRecommendedSimilarityForShortText() might
change between every release;  ditto for things like "new
StandardAnalyzer(...)" vs "Analyzer.getRecommendedAnalyser(...)"
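
A rough sketch of that static-factory idea, in plain Java with no Lucene dependency. Only the two factory method names come from the paragraph above, and they are treated here as hypothetical; the LengthNorm interface, the class name and the 10-term floor are made up purely for illustration.

```java
public class SimilarityFactories {

    /** Illustrative stand-in for a Similarity; models only length normalization. */
    public interface LengthNorm {
        double lengthNorm(int numTerms);
    }

    /** Stable factory: behavior is meant to stay fixed across a major version. */
    public static LengthNorm getDefaultSimilarity() {
        return numTerms -> 1.0 / Math.sqrt(numTerms);
    }

    /** Recommended factory: the returned behavior may change between releases. */
    public static LengthNorm getRecommendedSimilarityForShortText() {
        // Assumed tuning (arbitrary): treat any doc shorter than 10 terms as
        // if it had 10 terms, so very short docs are not favored as much.
        final int floor = 10;
        return numTerms -> 1.0 / Math.sqrt(Math.max(numTerms, floor));
    }

    public static void main(String[] args) {
        LengthNorm stable = getDefaultSimilarity();
        LengthNorm tuned = getRecommendedSimilarityForShortText();
        for (int n : new int[] {1, 5, 10, 100}) {
            System.out.println(n + " terms: stable=" + stable.lengthNorm(n)
                + " recommended=" + tuned.lengthNorm(n));
        }
    }
}
```

Callers who need back-compat would use the first factory and accept its frozen behavior; callers who want the current best guess would use the second and accept that scores may shift after an upgrade.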


-Hoss



Re: Moving SweetSpotSimilarity out of contrib

hossman
In reply to this post by Doron Cohen-2

: My thought was to move SSS to core as a step towards
: making it the default, if and when there is more evidence it is
: better than current default - it just felt right as a cautious
: step - I mean first move it to core so that it is more exposed

If people really want to make SSS the default similarity, then of
course it would be necessary to move it into the core ... but I can't
think of any reason for the intermediate step.

SSS defaults to being functionally equivalent to DefaultSimilarity -- its
behavior will differ from the current default only if you call one of the
setters to specify a sweet spot (or subclass it to get the hyperbolic tf
function), unless it has a really blatant bug.
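
The equivalence of the defaults can be checked numerically. The sketch below is standalone Java, not the contrib class itself. It assumes the length-norm formula given in the SweetSpotSimilarity javadoc, 1/sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1), with assumed defaults min=1, max=1, steepness=0.5, and it assumes DefaultSimilarity uses 1/sqrt(numTerms). The class and method names are illustrative.

```java
public class LengthNormSketch {

    /** Assumed DefaultSimilarity-style length norm; numTerms must be >= 1. */
    public static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Assumed SweetSpot-style length norm: flat (1.0) for min <= numTerms <= max. */
    public static double sweetSpotLengthNorm(int numTerms, int min, int max, double steepness) {
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        // With the assumed defaults (1, 1, 0.5) the two columns should agree.
        for (int n : new int[] {1, 10, 100, 1000}) {
            System.out.println(n + " terms: default=" + defaultLengthNorm(n)
                + " sweetSpot(defaults)=" + sweetSpotLengthNorm(n, 1, 1, 0.5)
                + " sweetSpot(50..200)=" + sweetSpotLengthNorm(n, 50, 200, 0.5));
        }
    }
}
```

With min = max = 1 and steepness 0.5 the expression under the square root reduces to numTerms, which is why the two functions agree; the 50..200 plateau in the last column is an arbitrary example of a configured sweet spot.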



-Hoss

