Two Solr Announcements: CNET Product Search and DisMax

Two Solr Announcements: CNET Product Search and DisMax

Chris Hostetter-3

I've got two related announcements to make, which I think are pretty
cool...

The first is that the search result pages for CNET Shopper.com are now
powered by Solr.  You may be thinking "Didn't he announce that last year?"
... not quite.  CNET's faceted product listing pages for browsing products
by category have been powered by Solr for about a year now, but up
until a few weeks ago, searching for products by keywords was still
powered by a legacy system.  I was working hard to come up with a good
mechanism for building Lucene queries based on user input that would
allow us to leverage our "domain expertise" about consumer technology
products to ensure that users got the best matches.

Which brings me to my second announcement:  I've just committed a new
SolrRequestHandler called the "DisMaxRequestHandler" into the Solr
subversion repository.

This query handler supports a simplified version of the Lucene QueryParser
syntax.  Quotes can be used to group phrases, and +/- can be used to
denote mandatory and optional clauses ... but all other Lucene query
parser special characters are escaped to simplify the user experience.
The handler takes responsibility for building a good query from the user's
input, using BooleanQueries containing DisjunctionMaxQueries across the
fields and boosts you specify.  It also allows you to provide additional
boosting queries, boosting functions, and filtering queries to artificially
affect the outcome of all searches.  These options can all be specified as
init parameters for the handler in your solrconfig.xml, or overridden in
the Solr query URL.
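
For reference, a rough sketch of what registering such a handler in
solrconfig.xml might look like -- the field names and boost values below
are made up for illustration, and the exact parameter names are documented
in the DisMaxRequestHandler javadocs:

```xml
<!-- hypothetical registration: the qf/pf/bq values below are
     illustrative, not the actual CNET or example configuration -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- query fields, each with its own boost, for the
         DisjunctionMaxQueries built from the user's terms -->
    <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
    <!-- phrase fields: boost docs where the terms appear close together -->
    <str name="pf">text^0.2 features^1.1 name^1.5</str>
    <!-- an additional boosting query applied to every search -->
    <str name="bq">inStock:true^2.0</str>
  </lst>
</requestHandler>
```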

The code in this plugin is what is now powering CNET product search.

I've updated the "example" solrconfig.xml to take advantage of it; you can
take it for a spin right now if you build from scratch using subversion.
Otherwise you'll have to wait for the solr-2006-05-21.zip nightly release
due out in a few hours.  Once you've got it, the javadocs for
DisMaxRequestHandler contain the details about all of the options it
supports, and here are a few URLs you can try out using the product data
in the exampledocs directory...

Normal results for the word "video" using the StandardRequestHandler with
the default search field...
  http://localhost:8983/solr/select/?q=video&fl=name+score&qt=standard

The "dismax" handler is configured to search across the text, features,
name, sku, id, manu, and cat fields all with varying boosts designed to
ensure that "better" matches appear first, specifically: documents which
match on the name and cat fields get higher scores...
  http://localhost:8983/solr/select/?q=video&qt=dismax

...note that this instance is also configured with a default field list,
which can be overridden in the URL...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=*,score

You can also override which fields are searched on, and how much boost
each field gets...
  http://localhost:8983/solr/select/?q=video&qt=dismax&qf=features^20.0+text^0.3

Another instance of the handler is registered using the qt "instock" and
has slightly different configuration options, notably a filter for (you
guessed it) inStock:true...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=name,score,inStock
  http://localhost:8983/solr/select/?q=video&qt=instock&fl=name,score,inStock
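
A second instance like "instock" might be registered along these lines --
a sketch only; the filtering parameter shown is an assumption, so check
the handler's javadocs for the actual option name:

```xml
<!-- hypothetical: the same handler class registered under a second qt
     name, with a default filter so only in-stock products are returned -->
<requestHandler name="instock" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="fq">inStock:true</str>
  </lst>
</requestHandler>
```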

One of the other really cool features in this handler is robust
support for specifying the "BooleanQuery.minimumNumberShouldMatch" you
want used based on how many terms are in your user's query.
This allows flexibility for typos and partial matches.  For the
dismax handler, 1 and 2 word queries require that all of the optional
clauses match, but for 3-5 word queries one missing word is allowed...
  http://localhost:8983/solr/select/?q=belkin+ipod&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+apple&qt=dismax
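
The behavior described above could plausibly be expressed with a
conditional min-should-match init parameter along these lines -- a sketch
only; the parameter name and expression syntax here are assumptions, so
consult the javadocs for the real format:

```xml
<!-- hypothetical: 1-2 clause queries require every clause; for more
     than 2 clauses, one clause may be missing.  Note the XML-escaped
     '<' in the conditional expression. -->
<str name="mm">2&lt;-1</str>
```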

Just like the StandardRequestHandler, it supports the debugQuery
option for viewing the parsed query and the score explanations for each
doc...

  http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax&debugQuery=1
  http://localhost:8983/solr/select/?q=video+card&qt=dismax&debugQuery=1


...That's the overall gist of it.  I hope other people find it useful
out of the box -- and even if it doesn't meet your needs, hopefully it
gives you some good ideas of the types of things that can be done in a
SolrRequestHandler that aren't supported natively with the Lucene
QueryParser.  If you do decide to write your own handler, make sure to
take a look at the new SolrPluginUtils class as well -- it provides
some nice reusable methods that came in handy when writing the
DisMaxRequestHandler.




-Hoss


facet id values for indexed documents

maustin
I have a database of products to search.  I plan to have a design with
categories/facet groups/facets.  Similar to this:
<categories>
     <category id="1" label="TopLevel" query="+cat:1">
         <category id="2" label="SecondLevel" query="+cat:2">
             <category id="3" label="ThirdLevel" query="+cat:3">
                <group id="1" label="Price">
                      <facet id="1" label="Under $20" query="+price:[0 TO 20]" />
                      <facet id="2" label="$21 - $40" query="+price:[21 TO 40]" />
                      <facet id="3" label="$41 - $60" query="+price:[41 TO 60]" />
                      <facet id="4" label="Over $60" query="+price:[61 TO 9999]" />
                </group>
                <group id="2" label="Manufacturer">
                        <facet id="5" label="Sony" query="+mfg:sony" />
                </group>
            </category>.......

.......

After looking over the fields in the default solr schema, I am having
trouble deciding where to put my facet id values.  It seems like I should
add a field called tags, or facets, or something similar.  Also, I'm adding
a description field.  I guess I'm trying to verify that there is not a
better solution or field that I should be using to keep my product
descriptions and tags/facet values.  Because these two fields are probably
common, I was wondering why they aren't in the default schema?

Thanks,
Mike


Re: facet id values for indexed documents

Chris Hostetter-3

: After looking over the fields in the default solr schema, I am having
: trouble deciding where to put my facet id values.  Seems like I should add a
: field called tags, or facets, or something similar.  Also, I'm adding a
: description field.  I guess I'm trying to verify that there is not a better
: solution or field value that I should be using to keep my product
: descriptions and tags/facet values?  Because these two fields are probably
: common, I was wondering why they aren't in the default schema?

First off, there is no default solr schema ... just a sample schema
provided with the example to give you an idea of all the different types
of things that are possible in a Solr schema -- by all means you should
make your own schema to fit the needs of your index.

Second: there's really no need to put the information about what facets
you have into documents in your index -- you certainly could do that (In
my case, in addition to having one document per product, i have one
"metadata document" per category which contains stored fields with
structured information about what facets i want) but you could just as
easily store that information externally in configuration files -- either
directly in the solrconfig.xml so that it's passed to your request
handler's init() method, or in a file whose name is specified in the
solrconfig.xml.
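
As a sketch of that second approach, the facet file's name could be handed
to a custom handler through its init params -- the handler class and
parameter name here are hypothetical, for illustration only:

```xml
<!-- hypothetical custom handler that reads facet definitions from an
     external XML file named in its init params -->
<requestHandler name="faceted" class="com.example.FacetedRequestHandler">
  <str name="facetConfigFile">conf/facets.xml</str>
</requestHandler>
```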

Once you've done that, you can parse the facet information and store it in
memory in any data structure you want ... based on your sample XML, I'm
guessing you'd have a structure that ultimately contained a lot of records
with numeric identifiers pointing at label/Query pairs.  When your handler
receives a request, it can walk your data structure looking at the IDs
specified in the request to figure out which Queries to execute to filter
the main result set, and which queries to execute to get a facet count ...
the filterCache should make sure that all of the individual DocSets are
cached (and warmed when the index changes).

Your example XML file looks like you are pursuing a very generic,
reusable, general purpose config format for faceted searching, along the
lines of some brainstorming that was done a while back...

http://www.nabble.com/metadata+about+result+sets--t1243321.html

Please keep the rest of us in the loop on your progress ... even if your
project isn't something you would be able to contribute back to Apache, any
insights you have while working on generic faceted searching would be
valuable to others.


-Hoss


RE: Two Solr Announcements: CNET Product Search and DisMax

Darren Vengroff-2
In reply to this post by Chris Hostetter-3
Chris,

Cool stuff.  Congratulations on launching.

I have a few scaling questions I hope you might be able to answer for me.
I'm keen to understand how solr performs under production loads with
significant real-time update traffic.  Specifically,

1. How many searches per second are you currently handling?
2. How big is the solr fleet?
3. What is your update rate?
4. What is the propagation delay from master to slave, i.e. how often do you
propagate and how long does it take per box?
5. What is your optimization schedule and how does it affect overall
performance of the system?

If anyone else out there has similar data from large-scale experiences with
solr, I'd love to hear those too.

Thanks,

-D



RE: Two Solr Announcements: CNET Product Search and DisMax

Chris Hostetter-3

: I have a few scaling questions I hope you might be able to answer for me.
: I'm keen to understand how solr performs under production loads with
: significant real-time update traffic.  Specifically,

These are all really good questions ... unfortunately I'm not sure that
I'm permitted to give out specific answers to some of them.  As far as
understanding Solr's ability to stand up under load, I'll see if I can get
some time/permission to run some benchmarks and publish the numbers (or
perhaps Yonik can do this as part of his prep for presenting at
ApacheConEU ... what do you think Yonik?)

: 1. How many searches per second are you currently handling?
: 2. How big is the solr fleet?

I'm going to have to put Q1 and Q2 in the "decline to state" category.

: 3. What is your update rate?

Hard to say ... I can tell you that our index contains roughly N
documents, and doing some greps of our logs I can see that on a typical
day our "master" server receives about N/2 "<add>" commands ... but this
doesn't mean that half of our logical data is changing every day; most of
those updates are to the same logical documents over and over.  It does
give you an idea of the amount of churn in lucene documents that's taking
place in a 24 hour period.

I should also point out that most of these updates come in big
spurts, but we've never encountered a situation where waiting for Solr to
index a document was a bottleneck -- pulling document data from our
primary datastore always takes longer than indexing the docs.

: 4. What is the propagation delay from master to slave, i.e. how often do you
: propagate and how long does it take per box?
: 5. What is your optimization schedule and how does it affect overall
: performance of the system?

The answers to Q4 and Q5 are related, and involve telling a much longer
story...

A year ago when we first started using Solr for faceted browsing, it had
Lucene 1.4.3 under the hood.  Our updating strategy involved issuing
commit commands after every batch of updates (where a single batch was
never bigger than 2000 documents), with snapshooter configured in a
postCommit listener, and snappuller on the slaves running every 10
minutes.  We optimized twice a day, but while optimizing we disabled the
processes that sent updates, because optimizing could easily take 20-30
minutes.  The index had thousands of indexed fields to support the
faceting we wanted, and this was the cause of our biggest performance
issue: the space needed for all those field norms.  (Yonik implemented the
OMIT_NORMS option in lucene 1.9 to deal with this.)

When we upgraded to Lucene 1.9 and started adding support for text
searching, our index got significantly smaller (even though we were adding
a lot of new tokenized fields), thanks to being able to turn off norms for
all of those existing faceting fields.  The other great thing about using
1.9 was that optimizing got a lot faster (I'm not certain if it's just
because of the reduced number of fields with norms, or if some other
improvement was made to how optimize works in lucene 1.9).  Optimizing our
index now typically takes only ~1 minute; the longest I've seen it take is
5 minutes.

While doing a lot of prelaunch profiling, we discovered that under extreme
loads there was a huge difference in the outliers between an optimized and
a non-optimized index -- we always knew querying an optimized index was
faster on average than querying an unoptimized index, we just didn't
realize how big the gap got when you looked at the non-average cases.

Sooo... since optimize times got so much shorter, and the benefits of
always querying an optimized index were so easy to see, we changed the
solrconfig.xml for our master to only snapshoot on postOptimize, modified
our optimize cron to run every 30 minutes, and modified the snappuller
crons on the slaves to check for new snapshots more often (every 5 minutes,
I think).
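
For anyone wanting to replicate this, the snapshooting side of that setup
would look something like the following listener in solrconfig.xml -- a
sketch based on the example configuration; the paths and the wait flag may
differ in your install:

```xml
<!-- hypothetical: run snapshooter only after an optimize completes,
     so slaves only ever pull optimized snapshots -->
<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>
```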

This means we are only ever snappulling complete copies of our index,
twice an hour.  So the typical max delay in how long it takes for an
update on the master to show up on the slave is ~35 minutes -- the average
delay being 15-20 minutes.

If we were concerned about reducing this delay, we could (even with our
current strategy of only pulling optimized indexes to the slaves), but this
is fast enough for our purposes, and allows us to really take advantage
of the filterCaches on the slaves.


-Hoss


RE: Two Solr Announcements: CNET Product Search and DisMax

Darren Vengroff-2
Thanks Hoss.  This is really useful information.

I understand you may not be able to answer 1 and 2 directly, so how about if
I combine them into one question that doesn't require you to release quite
as much information.  Could you tell me how many tps you do per box, and a
rough spec of what the boxes are?  I.e. the ratio of the answers to
questions 1 and 2.

Thanks,

-D



RE: Two Solr Announcements: CNET Product Search and DisMax

Chris Hostetter-3

: I understand you may not be able to answer 1 and 2 directly, so how about if
: I combine them into one question that doesn't require you to release quite
: as much information.  Could you tell me how many tps you do per box, and a
: rough spec of what the boxes are?  I.e. the ratio of the answers to
: questions 1 and 2.

I can't, because of three things:
  a) I'm not comfortable stating any specifics without checking with a
     few people first.
  b) I don't know what the specs of our boxes are (days I have to think
     about hardware are days my manager isn't doing his job properly).
  c) I don't know what the ratio is.

...I might be able to remedy all three of those issues when I go back to
work on Tuesday, but honestly I don't think that ratio is going to be as
useful to you as you might think.  We tend to underutilize our machines to
the point that if half our tier goes down, while we're updating half of
all the products in our catalog because a premiere merchant just
discounted all of their products, *and* it just happens to be the peak
online shopping day of the year -- we still want our average response time
to be the same for our users as it was a month before.

Like I said, what would be more useful is if I (or maybe Yonik) can
find some time to look up some numbers from our performance testing, to
tell you, for a box of type ____ with a collection of size N, what does
the graph of X non-stop concurrent users vs. average response time of Y
look like while snaploading is happening every M minutes.

I'll try to get back to everybody on all of this.


-Hoss


RE: Two Solr Announcements: CNET Product Search and DisMax

Chris Hostetter-3

: Like I said, what would be more useful is if I (or maybe Yonik) can
: find some time to look up some numbers from our performance testing to
: tell you for a box of type ____ with a collection of size N what does
: the graph of X non stop concurrent users vs average response time of Y
: look like while snaploading is happening every M minutes.


I've added a Wiki page for "Solr Performance Data" and added some info
from testing we did at CNET a while back.  I was hoping to be able to
provide some numbers using the DisMaxRequestHandler as it is in Solr -- but
I didn't have any time to run specific tests on it -- the results I posted
to that wiki page are for a modified version that does more work
-- so the response times should be the "upper bound" of what you expect.


Everyone should feel free to add whatever performance data of their own
they feel comfortable sharing to this wiki page.



-Hoss