Caching FuzzyQuery

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Caching FuzzyQuery

Timo Nentwig-7
Hi!

Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
BooleanQueries.

Timo

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

hossman

: Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
: caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
: BooleanQueries.

Applications are certainly welcome to do this (there is nothing to stop
you from calling rewrite before passing the query to your Searcher, i
believe the overhead of calling rewrite on a query that's already been
rewritten is fairly low) but I don't think it would be a good idea to add
something like this to the core ...for starters we are trying to move
away from "hidden" caches like this that are not transparent (and
controllable) but the users because they have the potential to eat up a
lot of ram.  But also: he amount of time needed to rewrite the query is
probably not vastly more expensive then the anout of time to execute the
search .. you might as well cache the entire result keyed off of the
orriginal query (and not just the rewritten query object).


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

Timo Nentwig-7
On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> : BooleanQueries.
>
> Applications are certainly welcome to do this (there is nothing to stop
> you from calling rewrite before passing the query to your Searcher, i
> believe the overhead of calling rewrite on a query that's already been
> rewritten is fairly low) but I don't think it would be a good idea to add

Why should subsequent rewrites be faster? The query is being rewritten every
time over and over again. Even *if* you can profit by buffered IO you sill
have a plenty of string levenshtein OPs.

I'm against caching in general because you always run into some hard to
understand and examine problem but this seems to be one of the rare cases
where caching makes sense.

I attached a small test app, the index contains 2.2 million docs and 5 million
terms, I search for a pretty common term which was rewritten to 15 terms and
hit roughly 4.000 docs (I also tried a term that was rewritten to 1 term and
hit about 300 docs, made no difference):

rewritten in 809
Overall search time: 842
rewritten in 271
Overall search time: 274
rewritten in 216
Overall search time: 219
rewritten in 180
Overall search time: 182
rewritten in 184
Overall search time: 186
rewritten in 220
Overall search time: 226
rewritten in 207
Overall search time: 208
rewritten in 181
Overall search time: 183
rewritten in 183
Overall search time: 185
rewritten in 180
Overall search time: 181

$ vmstat -S M
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0    757    298     56    384    0    0    21    36   39    9  5  1 94  0

> something like this to the core ...for starters we are trying to move
> away from "hidden" caches like this that are not transparent (and

Well, at least the existing of such an decorator (which you explicitly have to
use) will give you a hint that this is performance hot spot. I took me quite
some time to figure it out...

> controllable) but the users because they have the potential to eat up a
> lot of ram.  But also: he amount of time needed to rewrite the query is
> probably not vastly more expensive then the anout of time to execute the
> search .. you might as well cache the entire result keyed off of the
> orriginal query (and not just the rewritten query object).
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Test.java (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

Yonik Seeley-2
On Dec 15, 2007 2:23 PM, Timo Nentwig <[hidden email]> wrote:

> On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> > : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> > : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> > : BooleanQueries.
> >
> > Applications are certainly welcome to do this (there is nothing to stop
> > you from calling rewrite before passing the query to your Searcher, i
> > believe the overhead of calling rewrite on a query that's already been
> > rewritten is fairly low) but I don't think it would be a good idea to add
>
> Why should subsequent rewrites be faster?

Hoss means calling rewrite on the *result* of a rewrite.
So the application would call rewrite, cache the resulting query, and
then use that already rewritten query from then on.  Lucene wouldn't
know it had already been rewritten, and would call rewrite again, but
it should be fast.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

Timo Nentwig-7
On Saturday 15 December 2007 20:48:38 Yonik Seeley wrote:

> On Dec 15, 2007 2:23 PM, Timo Nentwig <[hidden email]> wrote:
> > On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
> > > : Actually FuzzyQuery.rewrite() is pretty expensive so why not
> > > : introduce a caching decorator? A WeakHashMap with key==IndexReader
> > > : and value==LRU of BooleanQueries.
> > >
> > > Applications are certainly welcome to do this (there is nothing to stop
> > > you from calling rewrite before passing the query to your Searcher, i
> > > believe the overhead of calling rewrite on a query that's already been
> > > rewritten is fairly low) but I don't think it would be a good idea to
> > > add
> >
> > Why should subsequent rewrites be faster?
>
> Hoss means calling rewrite on the *result* of a rewrite.

Uh? That's what I mean (propose), too... But currently nothing's cached at
all.

Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key =
IndexReader and value = LRU.

> So the application would call rewrite, cache the resulting query, and
> then use that already rewritten query from then on.  Lucene wouldn't
> know it had already been rewritten, and would call rewrite again, but
> it should be fast.
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

hossman

: > Hoss means calling rewrite on the *result* of a rewrite.
:
: Uh? That's what I mean (propose), too... But currently nothing's cached at
: all.
:
: Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key =
: IndexReader and value = LRU.

yes .. you can do that.  you don't need anything in the core lucene
Searchers to do it for you (ie: nothing needs refactored out, because
redundent rewrite calls are cheap) ... in the paragraph below,
"application" refers to your code, "Lucene" refers to the code in the
lucene JAR...

: > So the application would call rewrite, cache the resulting query, and
: > then use that already rewritten query from then on.  Lucene wouldn't
: > know it had already been rewritten, and would call rewrite again, but
: > it should be fast.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

hossman
In reply to this post by Timo Nentwig-7

: I'm against caching in general because you always run into some hard to
: understand and examine problem but this seems to be one of the rare cases
: where caching makes sense.

For you maybe, but i've actually seen very few instances in practice where
people who are speed concious use FuzzyQueries (or other queries with
really expensive rewrite methods) to a large degree .... in the few casees
people do:
  1) they usually go whole hog and cache search results
     (ie: query+sort => list of docs)
  2) they want direct control of the cache.

: Well, at least the existing of such an decorator (which you explicitly have to
: use) will give you a hint that this is performance hot spot. I took me quite
: some time to figure it out...

i would argue that for a lot of people something like "Solr" is that
decorator ... but if you'd like to submit a contrib that does something
like this with a nice generic API by all means please do.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Caching FuzzyQuery

Timo Nentwig-7
In reply to this post by Timo Nentwig-7


On Tue, 11 Dec 2007, Timo Nentwig wrote:

> Date: Tue, 11 Dec 2007 13:27:59 +0100
> From: Timo Nentwig <[hidden email]>
> Reply-To: [hidden email]
> To: [hidden email]
> Subject: Caching FuzzyQuery
>
> Hi!
>
> Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
> caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
> BooleanQueries.

So, didn't become a popular approach/topic. Anyway, I wrote an aspect and
it does the job for a couple of months now, so I'd like to share it.

HTH
Timo


In order to enable runtime AspectJ weaving start your JVM with

    -javaagent:./foo/bar/lib/aspectjweaver.jar

and place the following file called "aop.xml" in META-INF.

<aspectj>
         <aspects>
                 <aspect name="foo.bar.FuzzyQueryCachingAspect" />
         </aspects>

         <weaver options="-verbose -showWeaveInfo">
                 <include within="org.apache.lucene.search.*" />
                 <include within="org.apache.lucene.index.*" />
         </weaver>
</aspectj>


@Aspect
public class FuzzyQueryCachingAspect
{
         private static final Log LOG = LogFactory.getLog(
FuzzyQueryCachingAspect.class );
         // your cache impl goes here
         private static final Caching CACHE = Caching.instance(
Caching.FUZZY );

         @Before( "execution(public void
org.apache.lucene.index.IndexReader+.close())" )
         public void flush( final JoinPoint jp ) throws Throwable
         {
                 assert jp.getThis() == jp.getTarget();

                 final IndexReader r = (IndexReader)jp.getThis();
                 final Directory d = r.directory();

                 final String group = name( d );

                 CACHE.remove( group );
                 LOG.info( "Removed from " + CACHE + ": " + group );
         }

         @Around( "execution(public org.apache.lucene.search.Query
org.apache.lucene.search.FuzzyQuery.rewrite(org.apache.lucene.index.IndexReader))
&& args(reader)" )
         public Object cache( final IndexReader reader, final
ProceedingJoinPoint jp ) throws Throwable
         {
                 assert jp.getThis() == jp.getTarget();

                 final FuzzyQuery q = (FuzzyQuery)jp.getThis();
                 final String term = q.getTerm().text();

                 final String group = name( reader.directory() );
                 final String key = term + "~" + q.getMinSimilarity();

                 final Object e = CACHE.get( group, key );
                 if( e != null )
                 {
                         LOG.debug( term + " found in cache" );
                         return e;
                 }

                 final long t0 = System.currentTimeMillis();
                 final Query fuzzy = (Query)jp.proceed();
                 CACHE.put( group, key, fuzzy );
                 LOG.info( term + " rewritten in " +
(System.currentTimeMillis() - t0) + "ms (" + group + ")" );

                 return fuzzy;
         }

         private static final String name( final Directory d )
         {
                 if( d instanceof FSDirectory ) return
((FSDirectory)d).getFile().getAbsolutePath();

                 return String.valueOf( d.hashCode() );
         }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]