real-time updates

Ryan McKinley
Is it an ok idea to design an app with solr where you assume data will
be indexed immediately?  For example, after a user uploads an image -
immediately use solr to search a collection that will include this new
image?

Essentially I'm asking if it is ok to call <commit/> often: up to many
times per second from multiple sources.

thanks
ryan

Re: real-time updates

Walter Underwood, Netflix
On 1/24/07 1:15 AM, "Ryan McKinley" <[hidden email]> wrote:

> Is it an ok idea to design an app with solr where you assume data will
> be indexed immediately?  For example, after a user uploads an image -
> immediately use solr to search a collection that will include this new
> image?
>
> Essentially I'm asking if it is ok to call <commit/> often: up to many
> times per second from multiple sources.

You can do that, but Solr will flush the query cache on every
commit, so this will hurt performance. Also, a commit is more
expensive than an update, so that will use more CPU for indexing.
Doing a commit once every ten seconds or every thousand docs
(for example) would be a lot more efficient.
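
As a rough illustration of that kind of client-side batching, here is a minimal Python sketch. The class name `CommitBatcher`, its parameters, and the `commit` callback are all hypothetical and not part of Solr; the callback stands in for whatever actually sends <commit/> to the server.

```python
import time
from typing import Callable


class CommitBatcher:
    """Issue a commit after `max_docs` adds or `max_seconds`, whichever comes first."""

    def __init__(self, commit: Callable[[], None], max_docs: int = 1000,
                 max_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._commit = commit            # hypothetical: sends <commit/> to Solr
        self._max_docs = max_docs
        self._max_seconds = max_seconds
        self._clock = clock
        self._pending = 0                # docs added since the last commit
        self._last_commit = clock()

    def added(self, count: int = 1) -> bool:
        """Record `count` added docs; return True if this triggered a commit."""
        self._pending += count
        too_many = self._pending >= self._max_docs
        too_old = self._clock() - self._last_commit >= self._max_seconds
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Commit pending docs now (for example at shutdown) and reset the counters."""
        if self._pending:
            self._commit()
        self._pending = 0
        self._last_commit = self._clock()
```

Note that this sketch only checks the time limit when another document is added, so a real client would probably also want a timer to flush a quiet queue.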

wunder
--
Walter Underwood
Search Guru, Netflix



Re: real-time updates

Yonik Seeley-2
On 1/24/07, Ryan McKinley <[hidden email]> wrote:
> Is it an ok idea to design an app with solr where you assume data will
> be indexed immediately?  For example, after a user uploads an image -
> immediately use solr to search a collection that will include this new
> image?
>
> Essentially I'm asking if it is ok to call <commit/> often: up to many
> times per second from multiple sources.

Really depends on your collection size, the query rate, and what your
tolerance is for queries that might take longer sometimes.

I'd cut way down or remove any autowarming.
If you normally sort by a field, you might want that as a single
static warming query.  The first sort on a field populates the
fieldCache and that can take some time.

Is it really the case that stuff needs to be *immediately* searchable
(as in, computer immediately), or do you just want a user to be able
to search for something they just added (in which case, at least
seconds should be OK, not fractions of a second)?

-Yonik

Re: real-time updates

Ryan McKinley
> >
> > Essentially I'm asking if it is ok to call <commit/> often: up to many
> > times per second from multiple sources.
>
> Really depends on your collection size, the query rate, and what your
> tolerance is for queries that might take longer sometimes.
>



> I'd cut way down or remove any autowarming.
> If you normally sort by a field, you might want that as a single
> static warming query.  The first sort on a field populates the
> fieldCache and that can take some time.
>
> Is it really the case that stuff needs to be *immediately* searchable
> (as in, computer immediately), or do you just want a user to be able
> to search for something they just added (in which case, at least
> seconds should be OK, not fractions of a second)?
>

Not computer immediately, just user immediately: within a second
(hopefully not two), but the thing that was added must show up in the
results.

My solrconfig.xml is essentially the same as the example one (with
listeners commented out), and it has autowarmCount="256".

Are you saying to change to autowarmCount="0" and add a listener like:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    ...
  </arr>
</listener>

thanks

Re: real-time updates

Yonik Seeley-2
On 1/24/07, Ryan McKinley <[hidden email]> wrote:

> > >
> > > Essentially I'm asking if it is ok to call <commit/> often: up to many
> > > times per second from multiple sources.
> >
> > Really depends on your collection size, the query rate, and what your
> > tolerance is for queries that might take longer sometimes.
> >
>
> > I'd cut way down or remove any autowarming.
> > If you normally sort by a field, you might want that as a single
> > static warming query.  The first sort on a field populates the
> > fieldCache and that can take some time.
> >
> > Is it really the case that stuff needs to be *immediately* searchable
> > (as in, computer immediately), or do you just want a user to be able
> > to search for something they just added (in which case, at least
> > seconds should be OK, not fractions of a second)?
> >
>
> not computer immediately, just user immediately - within a second
> (hopefully not two), but the thing that was added must show up in the
> results.
>
> My solrconfig.xml is essentially the same as the example one.  (with
> listeners commented out) and autowarmCount="256"
>
> Are you saying to change autowarmCount="0" and add a listener like:
>
> <listener event="newSearcher" class="solr.QuerySenderListener">
>       <arr name="queries">
>    ...
> </listener>

Yep... and just put in anything essential (if all your queries are
sorted queries on certain fields, then make sure you do a sort on
those fields as a warming query... the first sort on a field populates
a fieldCache entry).  Or, if your collection is small enough, perhaps
you don't need any warming at all for the queries to have acceptable
latencies (even the first).

You can see how long warming is currently taking by looking at the
logs, and seeing when the searcher was opened, and when it was
registered.

Committing this fast won't work with our current replication strategy.
You will need to directly query the server(s) being updated.
Replication could still be used to provide a hot standby though.

A feature that I think would really help your case is enforcement of a
minimum commit "wait" time.  As in, if the last commit happened less
than 1 or 2 seconds ago, block the commit until the time limit has
passed.  An easier method might be an autocommit feature that is time
based, so the client doesn't have to do explicit commits... Solr would
guarantee that anything added would be committed within n seconds.
Eases the burden on clients and still allows for some aggregation.
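
Until something like that exists in Solr, the minimum commit "wait" idea can be approximated on the client side. The following is only a sketch under stated assumptions: `ThrottledCommitter`, its parameters, and the `commit` callback are hypothetical names, and the callback stands in for whatever actually sends <commit/> to the server.

```python
import threading
import time
from typing import Callable, Optional


class ThrottledCommitter:
    """Block each commit until `min_interval` seconds have passed since the last one."""

    def __init__(self, commit: Callable[[], None], min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._commit = commit              # hypothetical: sends <commit/> to Solr
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()      # serializes commits across threads
        self._last_commit: Optional[float] = None

    def commit(self) -> float:
        """Commit, waiting first if the previous commit was too recent.

        Returns the number of seconds spent waiting.
        """
        with self._lock:
            waited = 0.0
            if self._last_commit is not None:
                elapsed = self._clock() - self._last_commit
                remaining = self._min_interval - elapsed
                if remaining > 0:
                    self._sleep(remaining)
                    waited = remaining
            self._commit()
            self._last_commit = self._clock()
            return waited
```

This version simply delays each caller in turn; it does not merge several waiting commit requests into one, which a real implementation would probably want to do.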

-Yonik