cache warming optimization

cache warming optimization

Erik Hatcher
I'm interested in improving my existing custom cache warming by being  
selective about what updates rather than rebuilding completely.

How can I tell what documents were updated/added/deleted from the old  
cache to the new IndexSearcher?

Thanks,
        Erik

Re: cache warming optimization

Walter Underwood, Netflix
On 2/7/07 10:04 AM, "Erik Hatcher" <[hidden email]> wrote:

> I'm interested in improving my existing custom cache warming by being
> selective about what updates rather than rebuilding completely.
>
> How can I tell what documents were updated/added/deleted from the old
> cache to the new IndexSearcher?

We could add a system-maintained timestamp field. LDAP has that.

Knowing which documents were added or changed doesn't actually
work for this, because the new or changed documents might now
match queries that they didn't match before. Add a term to a
document, and it shows up in new queries. Those queries need
to be re-run.

In order to selectively warm, you need to know which terms
changed. Build a set of all terms in documents before they
are updated and all from the new documents. Then extract
the terms from each query. If a query has any term that
is in the set from the document changes, that query must
be re-run.
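
A minimal sketch of the term-set check described above, in plain Java. The class and method names are hypothetical and not part of Solr or Lucene. It assumes the terms of each changed document (old and new versions) have already been extracted as "field:text" strings, and that every cached query can be reduced to a plain set of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not a Solr or Lucene API.
class SelectiveWarming {

    // Union of the terms from the old versions of the changed documents
    // and the terms from their new versions.
    static Set<String> changedTerms(List<Set<String>> oldDocTerms,
                                    List<Set<String>> newDocTerms) {
        Set<String> changed = new HashSet<>();
        for (Set<String> terms : oldDocTerms) {
            changed.addAll(terms);
        }
        for (Set<String> terms : newDocTerms) {
            changed.addAll(terms);
        }
        return changed;
    }

    // A cached query must be re-run if any of its terms is in the changed set.
    static List<String> queriesToRerun(Map<String, Set<String>> queryTerms,
                                       Set<String> changed) {
        List<String> rerun = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : queryTerms.entrySet()) {
            for (String term : e.getValue()) {
                if (changed.contains(term)) {
                    rerun.add(e.getKey());
                    break; // one matching term is enough
                }
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        Set<String> oldDoc = new HashSet<>(Arrays.asList("title:cat"));
        Set<String> newDoc = new HashSet<>(Arrays.asList("title:cat", "title:dog"));
        Set<String> changed = changedTerms(Arrays.asList(oldDoc), Arrays.asList(newDoc));

        Map<String, Set<String>> cachedQueries = new LinkedHashMap<>();
        cachedQueries.put("title:dog", new HashSet<>(Arrays.asList("title:dog")));
        cachedQueries.put("title:bird", new HashSet<>(Arrays.asList("title:bird")));

        System.out.println("re-run: " + queriesToRerun(cachedQueries, changed));
    }
}
```

Queries that share no term with the changed set can keep their old cached results, subject to the docid caveats raised elsewhere in this thread.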

We used to do something similar manually for stemmer dictionary
changes. The same would be necessary for changes to protwords.txt.
Search for the old and new forms, and reindex only the matching
documents.

This is very efficient for stemmer changes, but I'm not sure
how well it would work for document changes. If your documents
are a good match to your queries (and I hope they are), a few
changes could match many queries, and then you are back to a full
re-warm.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: cache warming optimization

Chris Hostetter-3
In reply to this post by Erik Hatcher

: I'm interested in improving my existing custom cache warming by being
: selective about what updates rather than rebuilding completely.
:
: How can I tell what documents were updated/added/deleted from the old
: cache to the new IndexSearcher?

Cache warming in Solr is based mainly around the idea of "what keys were
in the old cache?" rather than "what's changed?" ... because regardless of what
updates may have happened, wholesale docid shifts might have taken place.

Of course, if you are dealing with a custom cache where the values aren't
DocSets or DocLists but your own custom objects that don't know about
individual docIds, this doesn't really affect you as much.

I'm not entirely sure i understand your situation, but one trick yonik
found that really improved the cache warming in a custom CacheRegenerator
i had was in dealing with big metadata documents that i was parsing into
objects for use in a custom request handler.  He pointed out that if i
put the Lucene Document in my CacheValue objects, then when
warming my newCache, i could do a search on the newSearcher, get the
Document back, and if it was the same as the Document in the value from my
oldCache i could copy it wholesale instead of redoing all of the parsing
(this was complicated by Document not supporting equals, but you get the
idea)
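
A rough sketch of that reuse trick, in plain Java. Everything here is a stand-in: `CacheValue`, `warm`, and the two function parameters are hypothetical, and a map of field names to stored values takes the place of the Lucene Document so that a content comparison is available. It is not the actual CacheRegenerator interface.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: stand-in types, not Solr or Lucene classes.
class ReuseOnWarm {

    // Cache value that keeps the raw stored fields next to the parsed object.
    static final class CacheValue {
        final Map<String, List<String>> storedFields;
        final Object parsed;

        CacheValue(Map<String, List<String>> storedFields, Object parsed) {
            this.storedFields = storedFields;
            this.parsed = parsed;
        }
    }

    // Build the new cache from the old keys. newSearcherLookup stands in for
    // "search the new searcher and get the document back"; parser stands in
    // for the expensive parsing step.
    static Map<String, CacheValue> warm(
            Map<String, CacheValue> oldCache,
            Function<String, Map<String, List<String>>> newSearcherLookup,
            Function<Map<String, List<String>>, Object> parser) {
        Map<String, CacheValue> newCache = new LinkedHashMap<>();
        for (Map.Entry<String, CacheValue> e : oldCache.entrySet()) {
            Map<String, List<String>> current = newSearcherLookup.apply(e.getKey());
            if (current == null) {
                continue; // document is gone from the new index
            }
            CacheValue old = e.getValue();
            if (current.equals(old.storedFields)) {
                newCache.put(e.getKey(), old); // unchanged: copy wholesale
            } else {
                newCache.put(e.getKey(), new CacheValue(current, parser.apply(current)));
            }
        }
        return newCache;
    }
}
```

Comparing field-name-to-values maps is one way around the missing equals; whether it is cheap enough depends on how large the stored fields are relative to the parsing cost.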


I suppose to try and make CacheRegenerators' lives easier, we could expose
the SolrIndexSearcher used with the oldCache -- but i'm still not sure how
useful that would be ... "diffing" two IndexSearchers isn't very easy,
but i suppose in some cases comparing the TermEnums for some fields
(like the uniqueKey field for example) might be helpful.
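
A sketch of what comparing two term enumerations could look like, in plain Java. Two `Iterator<String>`s stand in for the TermEnums of the uniqueKey field on the old and new searchers; the class name is hypothetical. It assumes both iterators return terms in ascending `String.compareTo` order with no nulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: plain iterators stand in for Lucene term enumerations.
class SortedTermDiff {

    static final class Diff {
        final List<String> added = new ArrayList<>();
        final List<String> removed = new ArrayList<>();
    }

    // Single merge pass over two sorted term streams.
    static Diff diff(Iterator<String> oldTerms, Iterator<String> newTerms) {
        Diff d = new Diff();
        String o = oldTerms.hasNext() ? oldTerms.next() : null;
        String n = newTerms.hasNext() ? newTerms.next() : null;
        while (o != null || n != null) {
            int cmp = (o == null) ? 1 : (n == null) ? -1 : o.compareTo(n);
            if (cmp == 0) {
                // Key present in both indexes.
                o = oldTerms.hasNext() ? oldTerms.next() : null;
                n = newTerms.hasNext() ? newTerms.next() : null;
            } else if (cmp < 0) {
                d.removed.add(o); // only in the old index
                o = oldTerms.hasNext() ? oldTerms.next() : null;
            } else {
                d.added.add(n); // only in the new index
                n = newTerms.hasNext() ? newTerms.next() : null;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        Diff d = diff(Arrays.asList("doc1", "doc2", "doc4").iterator(),
                      Arrays.asList("doc2", "doc3", "doc4").iterator());
        System.out.println("added=" + d.added + " removed=" + d.removed);
    }
}
```

Note that a document updated in place keeps the same uniqueKey term, so a key-only diff like this finds additions and removals but not updates.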


-Hoss

Re: cache warming optimization

Karl Wettin
In reply to this post by Erik Hatcher

7 feb 2007 kl. 19.04 skrev Erik Hatcher:

> I'm interested in improving my existing custom cache warming by  
> being selective about what updates rather than rebuilding completely.

I know it is not Solr, but I've made great progress on my cache that  
updates affected results only, on insert and delete. It's available  
in LUCENE-550, and based on the InstantiatedIndex and NotifiableIndex  
available in the same patch. Java 1.5. Perhaps that is something you
can take a look at for some ideas.

--
karl