Question on searcher creation limit

Paul Waite
Hi chaps,

I currently have a lucene app. which is a daemon listening on a port for
XML requests and servicing these. I'm intending to switch to using Solr in
the near(ish) future, but have a question.

In my daemon, for servicing incoming search requests I manage index
Searchers in a cache via the following cases:

1) there is no cached Searcher:
   - create a new Searcher and cache it

2) there is a cached Searcher, and index has NOT been updated since it
was created:
   - use cached Searcher.

3) there is a cached Searcher, and index HAS been updated since it
was created:
   - if the cache limit has not been reached, flag the current Searcher
as "old", create a new Searcher and cache it; otherwise (cache limit
reached) return the current Searcher.

Searchers flagged as "old" get retired as soon as they finish servicing any
existing requests.


The question I have regarding Solr is: does it do something similar? If
so can someone please describe the process - thanks.

I am particularly interested in a provision for a cap on creation of new
Searchers, as described in (3) above. As my index has grown, and the system
updates it frequently, I found that an open-ended approach to creation of
new Searchers in the event of index change was leading to out-of-memory
errors.

Cheers,
Paul.
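
The caching scheme in cases (1) to (3) above can be sketched roughly as follows. This is only an illustration of the logic, written in Python rather than against the Lucene API; the `Searcher` stand-in, the `SearcherCache` class and the idea of passing the index version in from the caller are all assumptions, not Paul's actual code.

```python
import threading


class Searcher:
    """Hypothetical stand-in for an index searcher tied to one index version."""

    def __init__(self, index_version):
        self.index_version = index_version
        self.active_requests = 0
        self.old = False
        self.closed = False


class SearcherCache:
    """Caps the number of live searchers, following cases (1) to (3)."""

    def __init__(self, limit):
        self.limit = limit
        self.current = None
        self.live = []  # current searcher plus "old" ones still serving requests
        self.lock = threading.Lock()

    def acquire(self, index_version):
        with self.lock:
            cur = self.current
            if cur is None:
                cur = self._open(index_version)            # case 1
            elif cur.index_version != index_version:
                if len(self.live) < self.limit:            # case 3, under the cap
                    cur.old = True
                    self._retire_if_idle(cur)
                    cur = self._open(index_version)
                # cap reached: keep serving from the stale current searcher
            cur.active_requests += 1                       # case 2 lands here directly
            return cur

    def release(self, searcher):
        with self.lock:
            searcher.active_requests -= 1
            self._retire_if_idle(searcher)

    def _open(self, index_version):
        s = Searcher(index_version)
        self.live.append(s)
        self.current = s
        return s

    def _retire_if_idle(self, s):
        # "Old" searchers are retired once their last request finishes.
        if s.old and s.active_requests == 0 and not s.closed:
            s.closed = True
            self.live.remove(s)
```

Each request would call `acquire()` with the index version it observes and `release()` when done; the cap bounds memory at the cost of occasionally serving from a stale searcher.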



Re: Question on searcher creation limit

Yonik Seeley-2
This explains some of it:
http://wiki.apache.org/solr/SolrCaching

So, there is normally only a single searcher hanging around.
When a new searcher is opened, it is opened and warmed in the
"background" so there are two searchers for the duration of warming.

A new searcher is only opened on an explicit <commit> command... Solr
does not attempt to detect when an index has changed (doing so could
result in inconsistencies, and worse performance).

If explicit commits are done fast enough, and warming takes long
enough, one can get into the situation where there are multiple
searchers warming in the background at the same time.  We currently
handle this by ensuring clients behave.

-Yonik
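
To see how commit frequency and warming time interact, here is a small back-of-the-envelope sketch. It is not Solr code: the function name is made up, and it assumes every commit opens a new searcher that warms for a fixed number of seconds, which real warming will not do exactly.

```python
def max_concurrent_warming(commit_times, warm_seconds):
    """Return the peak number of searchers warming at the same time.

    Assumption: each commit starts one warming searcher that takes
    exactly warm_seconds to finish.
    """
    events = []
    for t in commit_times:
        events.append((t, 1))                  # warming starts
        events.append((t + warm_seconds, -1))  # warming ends
    # Tuples sort ends (-1) before starts (+1) at the same instant,
    # so back-to-back warms are not counted as overlapping.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

With commits spaced further apart than the warming time the peak stays at one; once commits arrive faster than warming completes, the warming searchers pile up.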


Re: Question on searcher creation limit

Paul Waite
Yonik Seeley wrote:

> This explains some of it:
> http://wiki.apache.org/solr/SolrCaching
>
> So, there is normally only a single searcher hanging around.
> When a new searcher is opened, it is opened and warmed in the
> "background" so there are two searchers for the duration of warming.
>
> A new searcher is only opened on an explicit <commit> command... Solr
> does not attempt to detect when an index has changed (doing so could
> result in inconsistencies, and worse performance).
>
> If explicit commits are done fast enough, and warming takes long
> enough, one can get into the situation where there are multiple
> searchers warming in the background at the same time.  We currently
> handle this by ensuring clients behave.


That all looks excellent. Obviously the <commit> is aimed at sensible
batching of index updates. In our case index updates can't be batched and
are done immediately a news item hits the system.

However if I understand the above properly, we should implement a strategy
which at least limits the commit frequency, to prevent the scenario you
describe in the last para above.

Cheers,
Paul.



 


Re: Question on searcher creation limit

Yonik Seeley-2
On 10/23/06, Paul Waite <[hidden email]> wrote:
> However if I understand the above properly, we should implement a strategy
> which at least limits the commit frequency, to prevent the scenario you
> describe in the last para above.

Right.  If a new news item comes in, you could do a commit immediately and then
ensure that another commit is not done within "x" amount of time.

Some ideas on features I've never had the chance to implement are
minimumCommitFrequency (like described above) and
auto commit (let solr decide when to commit... after x docs or y
seconds w/o a commit).

I'd accept patches that baked any of that stuff into Solr :-)

-Yonik
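
The "commit now, then refuse further commits for x seconds" idea might look something like the sketch below. It is an assumption-laden illustration, not a Solr feature: `CommitThrottle`, `request_commit` and `flush_if_due` are hypothetical names, and `do_commit` stands for whatever actually sends the commit.

```python
import time


class CommitThrottle:
    """Allows at most one commit per min_interval_seconds."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_commit = None
        self.pending = False        # True when a commit was refused and is still owed

    def request_commit(self, do_commit):
        """Commit now if allowed; otherwise remember that one is owed."""
        now = self.clock()
        if self.last_commit is None or now - self.last_commit >= self.min_interval:
            do_commit()
            self.last_commit = now
            self.pending = False
            return True
        self.pending = True
        return False

    def flush_if_due(self, do_commit):
        """Call periodically so a deferred commit is not lost."""
        if self.pending:
            return self.request_commit(do_commit)
        return False
```

The `pending` flag matters: without it, a document added during the quiet period would stay invisible until some later, unrelated commit.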

Re: Question on searcher creation limit

Paul Waite
Yonik Seeley wrote:
> On 10/23/06, Paul Waite <[hidden email]> wrote:
> > However if I understand the above properly, we should implement a
> > strategy which at least limits the commit frequency, to prevent the
> > scenario you describe in the last para above.
>
> Right.  If a new news item comes in, you could do a commit immediately
> and then ensure that another commit is not done within "x" amount of
> time.
>
> Some ideas on features I've never had the chance to implement are
> minimumCommitFrequency (like described above) and
> auto commit (let solr decide when to commit... after x docs or y
> seconds w/o a commit).
>
> I'd accept patches that baked any of that stuff into Solr :-)
 
Yes, I was thinking along the same lines earlier when considering
how best to implement it. Certainly the place for that processing is right
inside Solr itself, because it's hard to drive the commit nicely from an
external source when it's kinda asynchronous like that. You have to
maintain the state of the commit somewhere and generally external
clients are in the 'use Solr then forget it until next time' sort of mode,
and that's the way they should stay.

So I'll have a look at doing it certainly since it seems to be a
requirement for my usage of Solr - don't expect it overnight though, as
I only had my first 30 minutes look at the code yesterday. ;-)

Cheers,
Paul.

Juggling relevance rankings

Walter Lewis-2
In reply to this post by Yonik Seeley-2
If I've missed this in the documentation just point me in the right
direction, otherwise ...

I'm interested in ways of tweaking result sets based on criteria
external to the search itself. For example:

    user searches for "foo AND bar" in a general "text" search
    search returns 100 records

I don't want to change the records in the set, just shuffle the order:
    boost the records with both foo and bar in the title field
    boost the records with "foo bar" ahead of those with "foo, baz, bop
and bar"
    boost the records with a 5 rating, more than those with a 4, or a 3
    boost the records where hasComment is true over those hasComment is
false
    boost really long records with lots of foo's and bar's relative to
the short record
        (what I tend to call "the lousy metadata wins" problem)

The key is not to produce raw sorts on any one of these, but rather to
hint the order based on various weightings of multiple factors.

Doable?

Walter Lewis

Re: Question on searcher creation limit

Paul Waite
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:

> On 10/23/06, Paul Waite <[hidden email]> wrote:
> > However if I understand the above properly, we should implement a
> > strategy which at least limits the commit frequency, to prevent the
> > scenario you describe in the last para above.
 

> Right.  If a new news item comes in, you could do a commit immediately
> and then ensure that another commit is not done within "x" amount of
> time.

> Some ideas on features I've never had the chance to implement are
> minimumCommitFrequency (like described above) and
> auto commit (let solr decide when to commit... after x docs or y
> seconds w/o a commit).
 
> I'd accept patches that baked any of that stuff into Solr :-)
 

http://wiki.apache.org/solr/SolrConfigXml

In the UpdateHandler section I see:

<updateHandler class="solr.DirectUpdateHandler2">

    <!-- autocommit pending docs if certain criteria are met -->
    <autocommit>
      <!-- NOTE: autocommit not implemented yet -->
      <maxDocs>10000</maxDocs>
      <maxSec>3600</maxSec>
    </autocommit>

    <!-- represents a lower bound on the frequency that commits
         may occur (in seconds). NOTE: not yet implemented
    -->
    <commitIntervalLowerBound>0</commitIntervalLowerBound>



So the hard part is already done - the parameters are named! ;-)

Cheers,
Paul.
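
For what it is worth, the decision those two parameters imply could be sketched as below. The names `maxDocs` and `maxSec` come from the wiki excerpt; everything else (the `AutoCommitPolicy` class, and measuring the time limit from the oldest uncommitted document) is an assumption about how such a feature might behave, since it was not implemented at the time.

```python
class AutoCommitPolicy:
    """Decides when pending documents should be committed."""

    def __init__(self, max_docs, max_sec):
        self.max_docs = max_docs
        self.max_sec = max_sec
        self.pending_docs = 0
        self.oldest_pending_at = None   # time the first uncommitted doc arrived

    def doc_added(self, now):
        if self.pending_docs == 0:
            self.oldest_pending_at = now
        self.pending_docs += 1

    def should_commit(self, now):
        if self.pending_docs == 0:
            return False                # nothing to commit
        if self.pending_docs >= self.max_docs:
            return True                 # doc-count trigger (maxDocs)
        return now - self.oldest_pending_at >= self.max_sec  # time trigger (maxSec)

    def committed(self):
        self.pending_docs = 0
        self.oldest_pending_at = None
```

An update handler would call `doc_added()` on each add, check `should_commit()` on adds and on a timer, and call `committed()` after a successful commit.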

Re: Juggling relevance rankings

Mike Klaas
In reply to this post by Walter Lewis-2
On 10/23/06, Walter Lewis <[hidden email]> wrote:
> If I've missed this in the documentation just point me in the right
> direction, otherwise ...

Tweaking relevance is a lucene question.  Solr adds almost nothing to
this--the various documents pertaining to lucene are most relevant.

Note that my ad-hoc knowledge of lucene query parser syntax is shaky,
so treat the examples as just that.

> I'm interested in ways of tweaking result sets based on criteria
> external to the search itself. For example:
>
>     user searches for "foo AND bar" in a general "text" search
>     search returns 100 records
>
> I don't want to change the records in the set but just shuffle the order by
>     boost the records with both foo and bar in the title field

Adjust your query.  Add title:(foo AND bar)^10

>     boost the records with "foo bar" ahead of those with "foo, baz, bop
> and bar"

Add an optional phrase query to your main query (note: dismax handler
can do this automatically)

>     boost the records with a 5 rating, more than those with a 4, or a 3

Use document boosts at index time, or function queries.

>     boost the records where hasComment is true over those hasComment is
> false

Use document boosts at index time.

>     boost really long records with lots of foo's and bar's relative to
> the short record
>         (what I tend to call "the lousy metadata wins" problem)

I don't really understand this one--this is typically the opposite of
what is needed.  However, this can be done by overriding the
Similarity class (and pointing solr at your custom class).

Note that the dismax handler already has facilities for doing many of
the tasks you require.  For instance, arbitrary "boost queries" can be
added to a query to tweak relevance.  You can specify a set of fields
with varying boosts over which a term will be searched, with a
corresponding set of fields for phrase queries.

The only really tricky thing to adjust here is the Similarity class,
which requires lots of experimentation--changes tend to have
far-reaching consequences.  Other factors can be played with more
readily.  debugQuery=true is your friend.

-Mike
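
As a rough illustration of the first two suggestions above (a boosted title clause plus an optional phrase clause), a query string could be assembled like this. Same caveat as Mike's: treat the syntax as an example only. The helper name and the phrase boost of 5 are made up for illustration, and the terms are assumed to contain no query-syntax special characters.

```python
def boosted_query(terms, title_boost=10, phrase_boost=5):
    """Build a Lucene-style query: a required clause plus optional boosts."""
    required = " AND ".join(terms)
    phrase = " ".join(terms)
    return (
        f"+({required}) "                       # must match: defines the result set
        f"title:({required})^{title_boost} "    # optional: ranks title matches higher
        f'"{phrase}"^{phrase_boost}'            # optional: ranks exact phrase higher
    )


print(boosted_query(["foo", "bar"]))
```

Only the first clause is required, so the optional clauses reorder the results without changing which records are returned.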

Re: Question on searcher creation limit

Yonik Seeley-2
In reply to this post by Paul Waite
On 10/23/06, Paul Waite <[hidden email]> wrote:

> http://wiki.apache.org/solr/SolrConfigXml
>
> In the UpdateHandler section I see:
>
> <updateHandler class="solr.DirectUpdateHandler2">
>
>     <!-- autocommit pending docs if certain criteria are met -->
>     <autocommit>
>       <!-- NOTE: autocommit not implemented yet -->
>       <maxDocs>10000</maxDocs>
>       <maxSec>3600</maxSec>
>     </autocommit>
>
>     <!-- represents a lower bound on the frequency that commits
>          may occur (in seconds). NOTE: not yet implemented
>     -->
>     <commitIntervalLowerBound>0</commitIntervalLowerBound>
>
> So the hard part is already done - the parameters are named! ;-)

Heh.  I forgot that stuff was there.  Feel free to change the
names/placement if you do have a go at it.

-Yonik

Re: Juggling relevance rankings

Chris Hostetter-3
In reply to this post by Mike Klaas

: Note that the dismax handler already has facilities for doing many of
: the tasks you require.  For instance, arbitrary "boost queries" can be
: added to a query to tweak relevance.  You can specify a set of fields
: with varying boosts over which a term will be searched, with a
: corresponding set of fields for phrase queries.

yep, using the dismax handler, this would give you something close to what
you describe...

  q=foo+bar&qf=title^2+body^1&pf=title+body&mm=1000&bq=hasComment:1&bf=rating



-Hoss
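
If the request is built programmatically, the parameters from the example above can be URL-encoded along these lines. The parameter values are copied from the post; the helper name is hypothetical, and how the dismax handler itself is selected is left out because the post does not say.

```python
from urllib.parse import urlencode, parse_qs


def dismax_params(user_query):
    """Return a URL-encoded query string for the dismax example above."""
    params = {
        "q": user_query,
        "qf": "title^2 body^1",   # fields searched, with per-field boosts
        "pf": "title body",       # fields used for the phrase boost
        "mm": "1000",             # value copied as-is from the post
        "bq": "hasComment:1",     # boost query
        "bf": "rating",           # boost function
    }
    return urlencode(params)


encoded = dismax_params("foo bar")
# Decoding gives back the original values, e.g. the qf field boosts.
print(parse_qs(encoded)["qf"])
```

Note that `urlencode` percent-encodes `^` and `:`, so the encoded string looks noisier than the hand-written one but decodes to the same values.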