Nrt and caching

Nrt and caching

Amit Nithian
Sorry, I'm a bit new to the NRT (near-real-time) stuff in Solr, but I'm trying to understand
the implications of frequent commits, cache rebuilding, and autowarming.
What are the best practices surrounding NRT searching, caches, and query
performance?

Thanks!
Amit

Re: Nrt and caching

Jason Rutherglen
Hi Amit,

If the caches were per-segment, then NRT would be optimal in Solr.

Currently the caches are stored per-multiple-segments, meaning after each
'soft' commit, the cache(s) will be purged.


Re: Nrt and caching

Yonik Seeley-2-2
On Sat, Jul 7, 2012 at 9:59 AM, Jason Rutherglen
<[hidden email]> wrote:
> Currently the caches are stored per-multiple-segments, meaning after each
> 'soft' commit, the cache(s) will be purged.

Depends which caches.  Some caches are per-segment, and some caches
are top level.
It's also a trade-off... for some things, per-segment data structures
would indeed turn around quicker on a reopen, but every query would be
slower for it.

-Yonik
http://lucidimagination.com
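
The reopen side of this trade-off can be pictured with a toy model. This is only a sketch and not Solr or Lucene code: `SegmentCacheDemo`, `reopenTopLevel`, and `reopenPerSegment` are hypothetical names, and a "segment" here is just a name plus a document count. It counts how many cache entries would have to be recomputed after a reopen under each strategy.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these names come from Solr or Lucene.
class SegmentCacheDemo {

    // Per-segment cache: segment name -> cached per-document values.
    static final Map<String, int[]> perSegmentCache = new HashMap<>();

    // Top-level strategy: one structure over the whole index, so every
    // entry is recomputed on each reopen. Returns the entries recomputed.
    static int reopenTopLevel(Map<String, Integer> segments) {
        int total = 0;
        for (int docCount : segments.values()) {
            total += docCount;
        }
        return total;
    }

    // Per-segment strategy: only segments not seen before are computed.
    // Entries for segments that no longer exist (e.g. merged away) are evicted.
    static int reopenPerSegment(Map<String, Integer> segments) {
        perSegmentCache.keySet().retainAll(segments.keySet());
        int rebuilt = 0;
        for (Map.Entry<String, Integer> e : segments.entrySet()) {
            if (!perSegmentCache.containsKey(e.getKey())) {
                perSegmentCache.put(e.getKey(), new int[e.getValue()]);
                rebuilt += e.getValue();
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        Map<String, Integer> segments = new LinkedHashMap<>();
        segments.put("_0", 1_000_000); // large, older segment
        reopenPerSegment(segments);    // initial warm-up

        segments.put("_1", 50);        // small segment added by a soft commit
        System.out.println("top-level rebuild:   " + reopenTopLevel(segments));
        System.out.println("per-segment rebuild: " + reopenPerSegment(segments));
    }
}
```

In this run the top-level strategy recomputes 1000050 entries and the per-segment strategy recomputes 50. The model covers only the reopen cost; the extra per-query work of consulting and combining several per-segment structures, which is the other half of the trade-off, is not modeled.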

Re: Nrt and caching

Jason Rutherglen
The field caches, which are used for sorting and basic [slower] facets, are
per-segment.  The result set, document, filter, and multi-value facet
caches are [in Solr] per-multi-segment.

Of these, the document, filter, and multi-value facet caches could be
converted to [performant] per-segment caches, as has been done in some other
Apache-licensed, Lucene-based search engines.


Re: Nrt and caching

Andy-152
So if I want to use multi-value faceting with NRT, I'd need to convert the cache to per-segment? How do I do that?

Thanks.



Re: Nrt and caching

Amit Nithian
Thanks for the responses. I guess my specific question is this: if I had
something that depended on the mapping between Lucene document IDs
and some object primary key, so that I could pull in external data from another
data source without a constant reindex, how would it be affected by soft
and hard commits? I'd prefer not to have to rebuild this mapping from
scratch on each soft or even hard commit if possible, since those seem to
happen frequently.

Also, can you explain why and how per-segment caches are used, and how a
client of the Lucene layer gets access to them or knows about them? I always
thought segments were an implementation detail, since they get merged on
optimize etc., so wouldn't that affect clients depending on segment-level
stuff? Or what am I missing?

Thanks again!
Amit
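
One way to picture a docid-to-primary-key mapping that survives commits is to key it by segment instead of by the whole index. The sketch below is a toy model and not Lucene or Solr code: `SegmentKeyMap`, `primaryKey`, `retainOnly`, and the `loader` function are hypothetical names, and the loader merely stands in for reading a key field from one segment. It assumes that a segment's documents keep their segment-local IDs for as long as that segment exists, and that a merge shows up as new segments replacing old ones.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model only: maps (segment name, segment-local doc id) -> primary key.
class SegmentKeyMap {

    // segment name -> primary keys indexed by segment-local doc id
    private final Map<String, long[]> keysBySegment = new HashMap<>();
    private int loads = 0;

    // Looks up a primary key, loading a segment's key array only the
    // first time that segment is seen.
    long primaryKey(String segment, int localDocId, Function<String, long[]> loader) {
        long[] keys = keysBySegment.computeIfAbsent(segment, s -> {
            loads++;
            return loader.apply(s);
        });
        return keys[localDocId];
    }

    // Call after a reopen with the names of the live segments, so that
    // arrays for segments that were merged away can be dropped.
    void retainOnly(Set<String> liveSegments) {
        keysBySegment.keySet().retainAll(liveSegments);
    }

    // Number of times a segment's keys had to be loaded.
    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        SegmentKeyMap map = new SegmentKeyMap();
        // Hypothetical loader: a real one would read a key field per segment.
        Function<String, long[]> loader = seg -> seg.equals("_0")
                ? new long[] {100L, 101L, 102L}
                : new long[] {200L};
        System.out.println(map.primaryKey("_0", 2, loader));
        System.out.println(map.primaryKey("_1", 0, loader));
        System.out.println("segments loaded: " + map.loads());
    }
}
```

With this layout a commit that adds one small segment costs one small load, while a merge costs a load proportional to the merged segment.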

Re: Nrt and caching

Jason Rutherglen
In reply to this post by Andy-152
Andy,

You'd need to hack on the Solr code, specifically the SimpleFacets class.
Solr uses UnInvertedField to build an in-memory doc -> terms mapping, which
would need to be cached per-segment.  Then you'd need to aggregate the
resulting per-segment counts.

There is another open source library that has taken the same basic faceting
approach (it is per-segment) and could be, anecdotally, faster; however, it
is built for Lucene 3.x at the moment.
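
The aggregation step mentioned above can be sketched in isolation. This is a toy illustration and not the SimpleFacets or UnInvertedField code: `PerSegmentFacetMerge` and `merge` are hypothetical names, and each segment's counts are assumed to be already available as a term -> count map. The counts are merged by term text because term ordinals are only meaningful within a single segment.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: sums facet counts computed separately for each segment.
class PerSegmentFacetMerge {

    // Merges per-segment term counts into index-wide counts, keyed by term text.
    static Map<String, Integer> merge(List<Map<String, Integer>> perSegmentCounts) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> segmentCounts : perSegmentCounts) {
            for (Map.Entry<String, Integer> e : segmentCounts.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> seg0 = Map.of("red", 3, "blue", 1);
        Map<String, Integer> seg1 = Map.of("red", 2, "green", 4);
        // "red" appears in both segments, so its counts are added together.
        System.out.println(merge(List.of(seg0, seg1)));
    }
}
```

This merge runs on every faceted query, which is the per-query price paid for cheaper reopens.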


Re: Nrt and caching

Andy-152
Jason,

If I just use stock Solr 4.0 without modifying the source code, does that mean multi-value faceting will be very slow when I'm constantly inserting/updating documents? 

Which open source library are you referring to? Will Solr adopt this per-segment approach any time soon?

Thanks



Re: Nrt and caching

Jason Rutherglen
Multi-value faceting is fast for queries; however, because it's cached
per-multi-segment, each soft commit will flush the cache, and it will be
rebuilt on the first query.  As the index grows, the cache becomes expensive to
build and consumes more RAM.

I am not aware of any open Jira issues with activity on adding this
feature to Solr.


Re: Nrt and caching

Karsten R.
In reply to this post by Andy-152
Hi Andy,

Multi-value faceting is a special case of taxonomy, so it is covered by the "org.apache.lucene.facet" package (lucene/facet).
This is not per-segment, but it works without a "per IndexSearcher" cache.

So IMHO the taxonomy faceting will work with NRT.

Because of the new TermsEnum#ord() method, the class UnInvertedField has already lost half of its lines of code. UnInvertedField would work per segment if the "ordinal position for a term" did not change on a commit, which is the basic idea of the taxonomy solution.

So I am quite sure that Solr will adopt this approach at some point.
I do not know about "soon".
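
The idea of ordinals that do not change on a commit can be pictured with an append-only label table. This is a toy sketch and not the lucene/facet implementation: `StableOrdinals`, `ordinalFor`, and `labelFor` are hypothetical names. A label gets the next free number the first time it is seen and keeps that number afterwards, so data keyed by ordinal stays valid when new labels arrive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: assigns each label a permanent, append-only ordinal.
class StableOrdinals {

    private final Map<String, Integer> ordinals = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    // Returns the existing ordinal for a label, or assigns the next free one.
    int ordinalFor(String label) {
        Integer existing = ordinals.get(label);
        if (existing != null) {
            return existing;
        }
        int ord = labels.size();
        ordinals.put(label, ord);
        labels.add(label);
        return ord;
    }

    // Reverse lookup: ordinal -> label.
    String labelFor(int ordinal) {
        return labels.get(ordinal);
    }

    public static void main(String[] args) {
        StableOrdinals table = new StableOrdinals();
        int solr = table.ordinalFor("solr");
        table.ordinalFor("lucene");
        table.ordinalFor("apache"); // sorts before "solr", but "solr" keeps its ordinal
        System.out.println(solr == table.ordinalFor("solr"));
    }
}
```

By contrast, an ordinal defined as a term's position in sorted order shifts whenever a new term sorts ahead of it.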

Best regards
   Karsten

in context:
http://lucene.472066.n3.nabble.com/Nrt-and-caching-tp3993612p3993700.html
