Index-time Boosting

Index-time Boosting

Walter Underwood, Netflix
I'm trying to figure out how to set per-field boosts in Solr at index time.
For example, if I want the title to be boosted by a factor of 8, I could
do that in a query, or I could add the title text with a boost of 8 to the
default text field along with the body text (with a boost of 1).

For other engines I've worked with, this gives a lot more performance at
the cost of some flexibility -- you need to reindex to change the
weightings.

I don't see an obvious way to do this in a Solr schema, though it might
make sense to add a boost attribute to copyField.

Any ideas? Did I miss something?

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Index-time Boosting

Yonik Seeley-2
On 10/20/06, Walter Underwood <[hidden email]> wrote:
> I'm trying to figure out how to set per-field boosts in Solr at index time.

The update XML syntax supports both document boosts and field boosts.
http://wiki.apache.org/solr/UpdateXmlMessages
A document boost is simply multiplied into the boost for each field
(this is standard lucene... nothing is done differently in Solr w.r.t.
boosts).
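As a rough illustration of the update message described above, here is a small Python sketch that builds an `<add>` message carrying both a document boost and a field boost, and computes the effective per-field boost as their product. The field names (`title`, `body`), the boost values, and the helper names are hypothetical, and the `boost` attributes reflect the update XML syntax of the Solr version discussed in this thread.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc_boost, fields):
    """Build a Solr <add> update message.

    fields is a list of (name, value, boost) tuples; boost may be None
    to leave the field at its default boost.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc", {"boost": str(doc_boost)})
    for name, value, boost in fields:
        attrs = {"name": name}
        if boost is not None:
            attrs["boost"] = str(boost)
        field = ET.SubElement(doc, "field", attrs)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def effective_boost(doc_boost, field_boost=1.0):
    """The document boost is simply multiplied into each field boost."""
    return doc_boost * field_boost


# Hypothetical document: boosted title, unboosted body.
message = build_add_message(
    2.0,
    [("title", "Doc Title", 8.0), ("body", "Body text", None)],
)
print(message)
print(effective_boost(2.0, 8.0))
```

The resulting string would be POSTed to the update handler; how that is done depends on your client and is left out here.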

> For example, if I want the title to be boosted by a factor of 8, I could
> do that in a query, or I could add the title text with a boost of 8 to the
> default text field along with the body text (with a boost of 1).

Ahhh. there's the problem.  Boosts in Lucene are per document per
*field*.  You can't boost some tokens over others in the same field,
and multi-valued fields in Lucene act as if they are catenated for
indexing purposes (position gaps aside).

The index-time boost is currently part of the "norms": each norm is an
eight-bit float (a single byte) that's the product of the length
normalization factor and the index-time boost.  For any given indexed
field, there is only one norm per document.  If you look at a lucene
index, these are the .f0, .f1,
.f2 files (a norm array for each indexed field).  Since they contain
one byte per document, you can easily tell how many documents are in
each segment by a simple glance at these files.
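To make the one-byte norm concrete, here is a Python sketch of how a boost and a length normalization factor might be combined and then squeezed into a single byte. The `field_norm` formula (boost divided by the square root of the term count) follows Lucene's default length normalization as I understand it, and the byte encoding is a transcription of the small-float scheme I believe Lucene uses for norms; treat the exact bit layout, and all function names here, as assumptions to verify against your Lucene version.

```python
import math
import struct

# Offset separating values too small to represent from the encodable range.
ZERO_POINT = (63 - 15) << 3


def field_norm(boost, num_terms):
    """Index-time boost times length normalization (1 / sqrt(num_terms))."""
    return boost / math.sqrt(num_terms)


def float_to_byte(f):
    """Lossily encode a non-negative float into one byte (0..255)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21  # keep only the exponent and the top mantissa bits
    if small <= ZERO_POINT:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> 1
    if small >= ZERO_POINT + 0x100:
        return 255  # overflow saturates at the largest byte
    return small - ZERO_POINT


def byte_to_float(b):
    """Decode a byte produced by float_to_byte back into a float."""
    if b == 0:
        return 0.0
    bits = (b << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


# A 4-term title with boost 8 has norm 8 / sqrt(4) = 4.0.
norm = field_norm(8.0, 4)
print(norm, byte_to_float(float_to_byte(norm)))

# Nearby boosts collapse to the same byte, so fine-grained boosts are lost.
print(float_to_byte(8.0) == float_to_byte(8.1))
```

Because only one such byte exists per document per field, there is nowhere to record a different weight for individual tokens within the field.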

> For other engines I've worked with, this gives a lot more performance at
> the cost of some flexibility -- you need to reindex to change the
> weightings.

Index time boosting only makes sense when you boost the fields of some
documents over the same fields of other documents.  If you *always*
boost title in every document, it makes no sense to use an index-time
boost... it is no faster than a query time boost, and is less
flexible.
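A toy model (not Lucene's actual scoring formula) may help show why a boost applied uniformly at index time buys nothing over the same boost at query time. The `toy_score` function and its inputs are hypothetical; the only point is that both boosts enter the score as plain multipliers.

```python
def toy_score(tf, idf, norm, index_boost=1.0, query_boost=1.0):
    """Simplified score: every boost is just another multiplicative factor."""
    return tf * idf * norm * index_boost * query_boost


# Hypothetical (tf, idf, norm) values for a title match in three documents.
docs = {"a": (1, 2.0, 0.5), "b": (2, 2.0, 0.25), "c": (1, 2.0, 1.0)}

# Title boosted by 8 in *every* document at index time...
index_time = {d: toy_score(*v, index_boost=8.0) for d, v in docs.items()}
# ...versus the same factor applied once at query time.
query_time = {d: toy_score(*v, query_boost=8.0) for d, v in docs.items()}

print(index_time == query_time)
```

The index-time variant only earns its keep when the factor differs from document to document, which is the case described above.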

> I don't see an obvious way to do this in a Solr schema, though it might
> make sense to add a boost attribute to copyField.

Given the current lucene restrictions, this wouldn't seem to be useful.

-Yonik

Re: Index-time Boosting

Mike Klaas
In reply to this post by Walter Underwood, Netflix
On 10/20/06, Walter Underwood <[hidden email]> wrote:
> I'm trying to figure out how to set per-field boosts in Solr at index time.
> For example, if I want the title to be boosted by a factor of 8, I could
> do that in a query, or I could add the title text with a boost of 8 to the
> default text field along with the body text (with a boost of 1).
<>
> Any ideas? Did I miss something?

Index-time boosts can be set per-document or per-document-field.
There is no facility for setting the boost of a part of text added to
a field (as you suggest above) (which is really a shame, as such
functionality would lend huge flexibility to index-time boosting!).

You can, however, easily set index-time boosts for fields in solr:

<doc>
   <field name="title" boost="8.1">Doc Title</field>
</doc>

You must do this for every document.  (Be careful for multi-valued
fields--you should only set the boost for _one_ value input to the
field).  I doubt you will see any performance difference compared to
query-time boosting.  There are a few optimizations in solr that only
trigger when boosts are one, but I'm not sure exactly what those are.

Finally, it can be much faster to search a single field rather than
multiple fields.  One hacky way of achieving this is to make a field
which receives a single copy of contents and eight copies of title.
This is imperfect, as it messes up length normalization and
summarizing.
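The "eight copies of title" hack can be sketched in a few lines of Python. The function name, the sample text, and the whitespace tokenization are all hypothetical stand-ins for whatever your indexing client and analyzer do; the sketch only shows how the repetition inflates both the term frequency and the field length.

```python
from collections import Counter


def build_combined_field(title, body, title_copies=8):
    """Concatenate repeated copies of the title with one copy of the body."""
    return " ".join([title] * title_copies + [body])


combined = build_combined_field("Napoleon Dynamite", "body text goes here")

# Naive whitespace tokenization, standing in for a real analyzer.
tokens = combined.lower().split()
term_freq = Counter(tokens)

print(term_freq["napoleon"])  # title terms now count eight times
print(len(tokens))            # field length grows too, which skews length normalization
```

The inflated token count is the length-normalization side effect mentioned above: a two-word title now contributes sixteen tokens to the field.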

-Mike

Re: Index-time Boosting

Yonik Seeley-2
On 10/20/06, Mike Klaas <[hidden email]> wrote:
> Index-time boosts can be set per-document or per-document-field.
> There is no facility for setting the boost of a part of text added to
> a field (as you suggest above) (which is really a shame, as such
> functionality would lend huge flexibility to index-time boosting!).

I wonder what the index-size cost would be though...
Anyway, there has been discussion of flexible indexing on the Lucene
list in the past few months, with one application being
boost-per-position.

> You must do this for every document.  (Be careful for multi-valued
> fields--you should only set the boost for _one_ value input to the
> field).

Good point... I believe they are all multiplied together in Lucene.
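Assuming the boosts of a multi-valued field really are multiplied together as suggested here, a short Python sketch shows how quickly a repeated boost compounds. The function name and values are hypothetical.

```python
import math


def effective_field_boost(value_boosts):
    """Product of the boosts given to each value of a multi-valued field."""
    return math.prod(value_boosts)


# Boosting every one of three values by 8 compounds to 512...
print(effective_field_boost([8.0, 8.0, 8.0]))
# ...while boosting only one value gives the intended factor of 8.
print(effective_field_boost([8.0, 1.0, 1.0]))
```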

> There are a few optimizations in solr that only
> trigger when boosts are one, but I'm not sure exactly what those are.

There were optimizations that hoisted mandatory boolean clauses with a
zero boost into a cached filter (I got that optimization from
Doug/Nutch).  That optimization is no longer in the normal code paths
that return DocSets/DocLists, and it probably doesn't matter given
that one can now explicitly specify filter queries themselves via fq
params.

Is fq documented anywhere???  It's very useful for speeding up complex
queries since they are cached independently from the main query.
Just yesterday I sped up some queries from an average latency of .550
seconds to .004 seconds by pulling out some mandatory clauses that
matched the majority of documents in the index into a fq.
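For anyone who has not used fq yet, here is a minimal Python sketch of pulling mandatory clauses out of the main query and sending them as separate fq parameters. The field names and clause values are hypothetical, and passing more than one fq parameter is an assumption about your Solr version; the sketch only builds the query string and does not contact a server.

```python
from urllib.parse import urlencode, parse_qs


def build_select_query(main_query, filter_queries):
    """Encode a main query plus separately cached filter queries."""
    params = [("q", main_query)]
    params += [("fq", fq) for fq in filter_queries]
    return urlencode(params)


# Before: everything in q, e.g. "+title:dynamite +type:movie +available:true".
# After: only the scoring clause stays in q; the broad mandatory clauses move to fq.
query_string = build_select_query(
    "title:dynamite",
    ["type:movie", "available:true"],
)
print(query_string)
```

Since each fq is cached on its own, a filter that matches most of the index is computed once and reused across queries with different q values.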

> Finally, it can be much faster to search a single field rather than
> multiple fields.  One hacky way of achieving this is to make a field
> which receives a single copy of contents and eight copies of title.
> This is imperfect, as it messes up length normalization and
> summarizing.

And you can't make the title field count 8 times as much :-)

I've seen people simply *add* the title field multiple times to the
general search field in an attempt to boost it.  I can't say how well
it worked.

-Yonik

Re: Index-time Boosting

Walter Underwood, Netflix
In reply to this post by Mike Klaas
On 10/20/06 10:24 AM, "Mike Klaas" <[hidden email]> wrote:

> Finally, it can be much faster to search a single field rather than
> multiple fields.  One hacky way of achieving this is to make a field
> which receives a single copy of contents and eight copies of title.
> This is imperfect, as it messes up length normalization and
> summarizing.

Matching a token eight times is probably faster than fetching
a second field. For titles, the normalization probably should
be turned off anyway. Normalization is really there to compare
1000 word docs with 8000 word docs, not 3 word titles with 6 word
titles.

Maybe I'll try one searchable field per weight and check that
for performance. Any rules of thumb about how the performance
changes when different numbers of fields are searched?

Thanks for all the help. I'm trying to avoid premature optimization,
but I'm starting with a load of 1-2 million queries/day, so I need
to be ready to make it perform.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Index-time Boosting

Yonik Seeley-2
In reply to this post by Yonik Seeley-2
On 10/20/06, Yonik Seeley <[hidden email]> wrote:
> On 10/20/06, Mike Klaas <[hidden email]> wrote:
> > Finally, it can be much faster to search a single field rather than
> > multiple fields.  One hacky way of achieving this is to make a field
> > which receives a single copy of contents and eight copies of title.

> I've seen people simply *add* the title field multiple times to the
> general search field in an attempt to boost it.

Ahh, I read that too fast... that's what you were saying.

-Yonik

Re: Re: Index-time Boosting

Mike Klaas
In reply to this post by Walter Underwood, Netflix
On 10/20/06, Walter Underwood <[hidden email]> wrote:

> On 10/20/06 10:24 AM, "Mike Klaas" <[hidden email]> wrote:
>
> > Finally, it can be much faster to search a single field rather than
> > multiple fields.  One hacky way of achieving this is to make a field
> > which receives a single copy of contents and eight copies of title.
> > This is imperfect, as it messes up length normalization and
> > summarizing.
>
> Matching a token eight times is probably faster than fetching
> a second field.

Definitely.  Particularly if you are not using span queries, in which
case, the eight times is just a change in count.

> For titles, the normalization probably should
> be turned off anyway. Normalization is really there to compare
> 1000 word docs with 8000 word docs, not 3 word titles with 6 word
> titles.

Ah, but normalization is extremely valuable to make the title weigh
more heavily than the 1000-word content field.  I generally leave the
default normalization for title fields, and do a hack for content
fields where I set a minimum length (you generally don't prefer 5-word
docs to 1000-word docs)

> Maybe I'll try one searchable field per weight and check that
> for performance. Any rules of thumb about how the performance
> changes when different numbers of fields are searched?

With OR queries I'd expect it to be linear for similarly-sized fields.
 Smaller fields will be much faster than longer ones, of course
(searching title+contents should be much less than double the cost of
search just contents).

> Thanks for all the help. I'm trying to avoid premature optimization,
> but I'm starting with a load of 1-2 million queries/day, so I need
> to be ready to make it perform.

With that kind of query load, your optimization work should be largely
focused on caching, imo.  Also consider that solr should be able to
scale up in terms of query rate well by adding another server, using
the built-in replication, and throwing a load-balancer in front of it.

-Mike

Re: Index-time Boosting

Yonik Seeley-2
In reply to this post by Walter Underwood, Netflix
On 10/20/06, Walter Underwood <[hidden email]> wrote:

> On 10/20/06 10:24 AM, "Mike Klaas" <[hidden email]> wrote:
>
> > Finally, it can be much faster to search a single field rather than
> > multiple fields.  One hacky way of achieving this is to make a field
> > which receives a single copy of contents and eight copies of title.
> > This is imperfect, as it messes up length normalization and
> > summarizing.
>
> Matching a token eight times is probably faster than fetching
> a second field. For titles, the normalization probably should
> be turned off anyway. Normalization is really there to compare
> 1000 word docs with 8000 word docs, not 3 word titles with 6 word
> titles.

Right.  Depending on the nature of your titles, turning off length
normalization can sometimes improve relevance.

> Maybe I'll try one searchable field per weight and check that
> for performance. Any rules of thumb about how the performance
> changes when different numbers of fields are searched?

If it's a disjunction, it's pretty linear I'd say.  I think
time(A OR B) will be close to time(A) + time(B)

> Thanks for all the help. I'm trying to avoid premature optimization,
> but I'm starting with a load of 1-2 million queries/day, so I need
> to be ready to make it perform.

That definitely seems doable.
How big is your index?
What's the form of your queries (AND, or sloppy phrase queries I'd imagine?)

If this is for netflix (and isn't confidential), are you just
searching across DVD info/description, or in customer comments too?
If it is DVD's you're searching, that can't be a large collection, and
you should be in really good shape.  You might even try indexing
things in separate fields and searching across all those fields while
assigning boosts separately... it should be fast enough.  You might
also check out the dismax handler if you haven't yet.
Any future plans for utilizing the faceted search?
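The "separate fields, separate boosts" suggestion can be sketched as a query-string builder in Python. The field names, boost values, and function name are hypothetical; the output uses standard Lucene query syntax (`field:(terms)^boost`), and the sketch does no escaping of special characters, which a real client would need.

```python
def boosted_query(terms, field_boosts):
    """Search the same terms in several fields, each with its own query-time boost."""
    clause = " ".join(terms)
    parts = [
        f"{field}:({clause})^{boost}"
        for field, boost in field_boosts.items()
    ]
    return " ".join(parts)


# Title weighted 8x relative to body, chosen at query time.
query = boosted_query(["napoleon", "dynamite"], {"title": 8, "body": 1})
print(query)
```

Changing the weights here is a matter of editing the dictionary, with no reindexing, which is the flexibility argument made earlier in the thread.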

-Yonik

Re: Index-time Boosting

Walter Underwood, Netflix
On 10/20/06 11:17 AM, "Yonik Seeley" <[hidden email]> wrote:

> That definitely seems doable.
> How big is your index?

Currently, the index is 65,000 movies plus actors and directors. A pretty
small corpus.

> What's the form of your queries (AND, or sloppy phrase queries I'd imagine?)

We'll start with "OR", because I think an all-terms default is a really
bad idea. If someone searches for "X-Men 3: The Final Battle", we
need to show them "X-Men 3: The Last Stand".

We'll need some sort of fuzzy matching and sloppy phrases. You should
see the misspellings for "Napoleon Dynamite" ("NEPOLINIAN DYNOMITE").

> If this is for netflix (and isn't confidential), are you just
> searching across DVD info/description, or in customer comments too?

We'll start with the basics and test other things. We are always
testing something new.

> If it is DVD's you're searching, that can't be a large collection, and
> you should be in really good shape.  You might even try indexing
> things in separate fields and searching across all those fields while
> assigning boosts separately... it should be fast enough.  You might
> also check out the dismax handler if you haven't yet.
> Any future plans for utilizing the faceted search?

We have a well-developed browsing design, so I'd rather not mix
facets in with that. Two other things work against using facets:
most of our queries are known-item searches, and I think that
facets work best when there is very broad agreement on the categories.
For example, clothing and food work well, but the subject facets
at Barnes and Noble don't help at all.

I have not checked out dismax.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: Index-time Boosting

Erik Hatcher

On Oct 20, 2006, at 2:34 PM, Walter Underwood wrote:
> We have a well-developed browsing design, so I'd rather not mix
> facets in with that. Two other things work against using facets:
> most of our queries are known-item searches, and I think that
> facets work best when there is very broad agreement on the categories.
> For example, clothing and food work well, but the subject facets
> at Barnes and Noble don't help at all.

What about adding user feedback (err, tagging)?  You can make those be
facets and intersect them in very interesting ways (along with any
other categorizations the resources themselves carry along).  Let me
tag things and find 'em my own way :)

        Erik, a satisfied Netflix customer


Re: Index-time Boosting

Chris Hostetter-3
In reply to this post by Yonik Seeley-2

: Is fq documented anywhere???  It's very useful for speeding up complex

i just added it to CommonQueryParameters



-Hoss


Re: Index-time Boosting

Chris Hostetter-3
In reply to this post by Walter Underwood, Netflix

: We'll start with "OR", because I think an all-terms default is a really
: bad idea. If someone searches for "X-Men 3: The Final Battle", we
: need to show them "X-Men 3: The Last Stand".

BooleanQuery.setMinimumNumberShouldMatch can help reduce the cruft with
this approach .. the dismax handler enables it using the param "mm"
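To illustrate the idea behind a minimum-should-match setting, here is a toy Python model that applies it to the X-Men example from this thread. It is not how Lucene or dismax implement it: the tokenizer is a naive regex, the function names are hypothetical, and a real engine evaluates the optional clauses against the index rather than against one title string.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_min_should_match(query, document, min_should_match):
    """True if at least min_should_match distinct query terms occur in the document."""
    doc_terms = set(tokenize(document))
    hits = sum(1 for term in set(tokenize(query)) if term in doc_terms)
    return hits >= min_should_match


query = "X-Men 3: The Final Battle"
title = "X-Men 3: The Last Stand"

# Four of the six query terms (x, men, 3, the) appear in the real title.
print(matches_min_should_match(query, title, 4))
print(matches_min_should_match(query, title, 5))
```

A pure OR query corresponds to a threshold of 1 and an all-terms query to a threshold of 6 here; a value in between keeps the near-miss title while dropping documents that share only a stray word.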

: I have not checked out dismax.




-Hoss