Leveraging filter cache in queries

Leveraging filter cache in queries

Fabio Confalonieri
Hello,

I've just found Lucene and Solr and I'm thinking of using them in our current project, essentially an ads portal (something very similar to www.oodle.com).

I see our needs have already surfaced on the mailing list: it's the refine-search problem you have sometimes called faceted browsing, which is the basis of the CNet browsing architecture. We have ads in different categories, each with different attributes ("fields" in Lucene language): say, the motors-car category has make, model, price and color, while real-estates-houses has bathroom ranges, bedroom ranges, etc.

I understand you developed Solr partly so that the filter cache stores bitsets of search results, giving a fast way to intersect those bitsets, count the resulting sub-queries, and present the counts for refinement searches (I have read the CNet announcement, the Nines-related thread and some other related threads).

Actually we thought of storing, for every category, the possible sub-query attributes with their possible values/ranges in a MySQL database (which we use for every other non-search-related task), much as you do with CNet, storing the possible subqueries of a query in a Lucene document.

Now what I haven't understood is whether the Solr StandardRequestHandler automatically creates and caches filters from normal queries submitted to the Solr select servlet, possibly with some syntax clue.
I tried a query like "+field:value^0", which returns a great number of hits (on a total test set of 100.000 documents), but I see only the query cache growing and the filter cache always empty. Is this normal? I've tried to check all the cache configuration but I don't understand whether filters are auto-generated from normal queries.

A more general question: is all the CNet logic of intersecting bitsets available through the servlet, or do I have to write some Java code to be plugged into Solr?
In that case, which is the correct level at which to do this? Perhaps a new RequestHandler understanding some new query syntax to exploit filters?

We only need a sort on a single, precalculated rank field stored as a range field, so we don't need relevance and consequently don't need scores (which is a prerequisite for using BitSets, if I understand correctly).

Thank you, I hope I have explained my doubts well.

Fabio

PS: I think Solr and Lucene are really great work!
When we have finished, I'll be happy to add our project (for a major press group here in Italy) to the public websites list in the Solr Wiki.

Re: Leveraging filter cache in queries

Yonik Seeley
On 5/12/06, Fabio Confalonieri <[hidden email]> wrote:
> I tried a query like "+field:value^0" which returns a great number of Hits
> (on a total test of 100.000 documents), but I see only the query cache
> growing and the filter cache always empty. Is this normal ? I've tried to
> check all the cache configuration but I don't understand if filters are
> auto-generated from normal queries.

There is currently no syntax in the standard request handler that
understands filters.

Converting certain "heavy" term queries to filters when they have a
zero boost was something Doug pointed me at and I borrowed directly
from Nutch very early on, before Solr had its own caching.

The optimization code is still sort-of in Solr, but
 - it's not called by default anymore... people needing faceted
browsing currently need their own plugin anyway, and they can then
specify filters directly.
 - its caching is not integrated into Solr's caching

Filters *can* be generated and used to satisfy whole queries when the
following optimization is turned on in solrconfig.xml:
    <!-- An optimization that attempts to use a filter to satisfy a search.
         If the requested sort does not include score, then the filterCache
         will be checked for a filter matching the query. If found, the filter
         will be used as the source of document ids, and then the sort will be
         applied to that. -->
    <useFilterForSortedQuery>true</useFilterForSortedQuery>
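
To illustrate the idea behind that option, here is a minimal sketch in plain Java. It is not Solr's implementation: the class FilterSortedQuerySketch, the map standing in for the filterCache and the rank array are all hypothetical names made up for this example. It only shows the flow described in the comment above: look up a cached filter for the query, use it as the source of document ids, then apply a non-score sort.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterSortedQuerySketch {
    // Hypothetical stand-in for a filter cache: query string -> matching doc ids.
    private final Map<String, BitSet> filterCache = new HashMap<>();
    // Hypothetical precomputed rank per document (array index = doc id).
    private final int[] rank;

    FilterSortedQuerySketch(int[] rank) {
        this.rank = rank;
    }

    void cacheFilter(String query, BitSet docs) {
        filterCache.put(query, docs);
    }

    /** Returns up to limit doc ids ordered by descending rank, or null on a cache miss. */
    List<Integer> topDocs(String query, int limit) {
        BitSet docs = filterCache.get(query);
        if (docs == null) {
            return null; // a real handler would fall back to a normal search here
        }
        List<Integer> ids = new ArrayList<>();
        for (int d = docs.nextSetBit(0); d >= 0; d = docs.nextSetBit(d + 1)) {
            ids.add(d);
        }
        // No scores are needed: the sort uses only the precomputed rank.
        ids.sort(Comparator.comparingInt((Integer d) -> rank[d]).reversed());
        return ids.subList(0, Math.min(limit, ids.size()));
    }

    public static void main(String[] args) {
        FilterSortedQuerySketch sketch = new FilterSortedQuerySketch(new int[] {10, 50, 30, 20});
        BitSet cars = new BitSet();
        cars.set(0);
        cars.set(1);
        cars.set(3);
        sketch.cacheFilter("category:cars", cars);
        System.out.println(sketch.topDocs("category:cars", 2)); // prints [1, 3]
    }
}
```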

> A more general question: Is all the CNet logic of intersecting bitsets
> available through the servlet or have I to write some java code to be
> plugged in Solr?

The nitty-gritty of getting intersection counts is in Solr, but you
still need to ask Solr for each facet count individually, and you
still need to know which counts to ask for.  That's the part you
currently still need a custom request handler for.
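
To make the intersection-count idea concrete, here is a minimal sketch that uses plain java.util.BitSet instead of Solr's own set classes. The class FacetCountSketch and the facet labels are hypothetical; the point is only that each facet count is the size of the intersection between the base result set and that facet's cached bit set, and that the caller has to supply the list of facets to count.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class FacetCountSketch {
    /** For each facet, counts the documents of the base result set that also match it. */
    static Map<String, Integer> facetCounts(BitSet baseDocs, Map<String, BitSet> facetFilters) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : facetFilters.entrySet()) {
            BitSet intersection = (BitSet) baseDocs.clone(); // copy so the base set stays intact
            intersection.and(e.getValue());
            counts.put(e.getKey(), intersection.cardinality());
        }
        return counts;
    }

    /** Helper: builds a bit set from a list of doc ids. */
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet base = bits(0, 1, 2, 5); // docs matching the user's query
        Map<String, BitSet> filters = new LinkedHashMap<>();
        filters.put("color:red", bits(0, 2, 3));
        filters.put("color:blue", bits(1, 4));
        System.out.println(facetCounts(base, filters)); // prints {color:red=2, color:blue=1}
    }
}
```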

> In this case which is the correct level to make this, perhaps a new
> RequestHandler understanding some new query syntax to exploit filters.

Yes, a new RequestHandler. From there, the easiest way is to pass
extra parameters (not changing the query syntax passed as "q").

> We only need a sort on a single and precalculated rank field stored as a
> range field, so we don't need relevance and consequently don't need scores
> (which is a prerequisite for using BitSets, if I understand well).

You can do relevancy scoring *and* do facets at the same time... there
is no incompatibility there.

-Yonik

Re: Leveraging filter cache in queries

Erik Hatcher
In reply to this post by Fabio Confalonieri

On May 12, 2006, at 9:06 AM, Fabio Confalonieri wrote:

> I see our needs have already surfaced in the mailing list, it's the refine
> search problem You have sometime called faceted browsing and which is the
> base of CNet browsing architecture: we have ads with different categories
> which have different attributes ("fields" in lucene language), say
> motors-car category has make,model,price,color and real-estates-houses has
> bathrooms ranges, bedrooms ranges, etc...
>
> I understand You have developed Solr also to have filter cache storing
> bitset of search results to have a fast way to intersect those bitsets to
> count resulting sub-queries and present the count for refinement searches (I
> have read the announcement of CNet and the Nines related thread and also
> some other related thread).

As Yonik has pointed out, Solr provides some nice facilities to build
upon, but the actual implementation is still custom for this sort of
thing.  For example, here's (pseudo)code showing how my intersecting
BitSet (and soon to become DocSet) processing works:

   private Query createConstraintMask(final Map facetCache, String[] constraints,
                                      BitSet constraintMask, IndexReader reader)
       throws ParseException, IOException {
     // BooleanQuery used for all full-text expression constraints, but not for facets
     Query query = new BooleanQuery();
     constraintMask.set(0, constraintMask.size()); // light up all documents initially

     if (constraints != null) {

       // Loop over all constraints, ANDing all cached bit sets with the constraint mask
       for (String constraint : constraints) {
         if (constraint == null || constraint.length() == 0) continue;

         // constraint looks like this: [-]field:value
         int colonPosition = constraint.indexOf(':');

         if (colonPosition <= 0) continue;

         String field = constraint.substring(0, colonPosition);
         boolean invert = false;
         if (field.startsWith("-")) {
           invert = true;
           field = field.substring(1);
         }

         String value = constraint.substring(colonPosition + 1);

         BitSet valueMask;
         if (!field.equals("?")) {
           // facetCache is from a custom Solr cache currently
           Map fieldMap = (Map) facetCache.get(field);
           if (fieldMap == null) continue;  // field name doesn't correspond to predefined facets

           valueMask = (BitSet) fieldMap.get(value);
           if (valueMask == null) {
             valueMask = new BitSet(constraintMask.size());
             System.out.println("invalid value requested for field " + field + ": " + value);
           }
         } else {
           Query clause = // some query from parsing "value";
           // this should change to get the DocSet from Solr's facilities :)
           QueryFilter filter = new QueryFilter(clause);
           valueMask = filter.bits(reader);
         }

         if (!invert) {
           constraintMask.and(valueMask);
         } else {
           // This is what would be nice for DocSets to be capable of
           constraintMask.andNot(valueMask);
         }
       }
     }

     if (((BooleanQuery) query).getClauses().length == 0) {
       query = new MatchAllDocsQuery();
     }

     return query;
   }


And then basically it gets called like this in my custom handler:

     BitSet constraintMask = new BitSet(reader.numDocs());
     Query query = createConstraintMask(facetCache, req.getParams("constraint"),
                                        constraintMask, reader);
     DocList results = req.getSearcher().getDocList(query, new BitDocSet(constraintMask),
                                                    sort, req.getStart(), req.getLimit());
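
For readers who want to try the masking logic without Lucene or Solr on the classpath, here is a self-contained sketch of the facet branch only, using java.util.BitSet. The class ConstraintMaskSketch, the nested-map facet cache and the sample fields are assumptions made for this example, and the full-text "?" branch is left out. One deliberate difference from the handler code above: the mask is sized from an explicit maxDoc argument rather than BitSet.size(), since size() reports allocated storage (a multiple of 64 bits) and not the number of documents; with Lucene, doc ids run up to maxDoc(), which can exceed numDocs() when there are deletions.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class ConstraintMaskSketch {
    /** ANDs (or AND-NOTs) the cached bit set of each "[-]field:value" constraint into one mask. */
    static BitSet constraintMask(Map<String, Map<String, BitSet>> facetCache,
                                 String[] constraints, int maxDoc) {
        BitSet mask = new BitSet(maxDoc);
        mask.set(0, maxDoc); // light up all documents initially
        if (constraints == null) {
            return mask;
        }
        for (String constraint : constraints) {
            if (constraint == null || constraint.isEmpty()) continue;
            int colon = constraint.indexOf(':');
            if (colon <= 0) continue; // not of the form [-]field:value
            String field = constraint.substring(0, colon);
            boolean invert = field.startsWith("-");
            if (invert) {
                field = field.substring(1);
            }
            String value = constraint.substring(colon + 1);
            Map<String, BitSet> fieldMap = facetCache.get(field);
            if (fieldMap == null) continue; // field is not a predefined facet
            // An unknown value matches no documents.
            BitSet valueMask = fieldMap.getOrDefault(value, new BitSet());
            if (invert) {
                mask.andNot(valueMask);
            } else {
                mask.and(valueMask);
            }
        }
        return mask;
    }

    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        Map<String, Map<String, BitSet>> cache = new HashMap<>();
        Map<String, BitSet> genre = new HashMap<>();
        genre.put("poetry", bits(0, 1, 2));
        cache.put("genre", genre);
        Map<String, BitSet> year = new HashMap<>();
        year.put("1850", bits(1, 2, 4));
        cache.put("year", year);
        String[] constraints = {"genre:poetry", "-year:1850"};
        System.out.println(constraintMask(cache, constraints, 5)); // prints {0}
    }
}
```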

[critique of this code more than welcome!]

My client (Ruby on Rails) is POSTing parameters that look like
this:

        constraint=#{invert}#{field}:#{constraint[:value]}

It works really well even before my refactoring to use
Solr's DocSet and caching capabilities, and I'm sure it'll do even
better leveraging its provided capabilities.  Really nice stuff!
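
For a client written in Java rather than Rails, the same parameters could be assembled roughly as follows. This is a sketch under assumptions: the class ConstraintParamsSketch, the sample fields and the inclusion of a "q" parameter are illustrative and not taken from the client described above. It uses URLEncoder.encode(String, Charset), which requires Java 10 or newer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ConstraintParamsSketch {
    /**
     * Builds a form-encoded body with one "constraint" parameter per entry.
     * Each entry is {invert, field, value}, where invert is "-" or "",
     * mirroring the template constraint=#{invert}#{field}:#{value}.
     */
    static String formBody(String q, String[][] constraints) {
        StringBuilder body = new StringBuilder("q=");
        body.append(URLEncoder.encode(q, StandardCharsets.UTF_8));
        for (String[] c : constraints) {
            String constraint = c[0] + c[1] + ":" + c[2];
            body.append("&constraint=");
            body.append(URLEncoder.encode(constraint, StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String[][] constraints = {{"", "genre", "Poetry"}, {"-", "year", "1850"}};
        // prints q=love&constraint=genre%3APoetry&constraint=-year%3A1850
        System.out.println(formBody("love", constraints));
    }
}
```

Repeating the parameter name once per constraint keeps the "q" syntax untouched and lets the handler read all values as an array.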

> A more general question: Is all the CNet logic of intersecting bitsets
> available through the servlet or have I to write some java code to be
> plugged in Solr?

Currently you have to piece it together.  The goal is to build these  
facilities more into the core, but we should do so based on folks  
implementing it themselves and contributing it, so that we can  
compare the needs that others have and come up with some great  
groundwork in the faceted browsing area just as Solr itself has built  
above raw Lucene.

So, let's all flesh this stuff out, compare/contrast real-world
working implementations, and factor the common pieces out on top.

As an example of another facility I've just added on top: the ability
to return all terms that match a client-provided prefix.  This is to
enable Google Suggest-like convenience, so that when someone types
"Yo" and pauses, an Ajaxifried UI will hit my Rails app, which in
turn will ping Solr with the prefix, and a custom request handler will
respond with the terms that match ("Yonik", for example) for a
specified field.  Not only that, but my implementation returns the
number of documents that match each term, constrained by the same
types of constraints as above, including full-text queries.  This allows
our users to pick people by typing a name rather than us having to
populate a drop-down (we'll still have some kind of browser interface
too, I'm sure), and it shows only names of folks involved in the document
set they are currently constraining their view to.
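
A rough sketch of that prefix lookup, using only the JDK: a sorted map stands in for the index's term dictionary, and each term's count is the size of its bit set intersected with the current constraint mask. The class PrefixSuggestSketch and its data are hypothetical and this is not the custom handler's actual code; it relies only on the fact that, in a sorted term list, all terms sharing a prefix are contiguous.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class PrefixSuggestSketch {
    /** Returns each term starting with prefix, with its document count under the mask. */
    static Map<String, Integer> suggest(TreeMap<String, BitSet> termDocs,
                                        String prefix, BitSet constraintMask) {
        Map<String, Integer> suggestions = new LinkedHashMap<>();
        // Start at the first term >= prefix and stop at the first term without it.
        for (Map.Entry<String, BitSet> e : termDocs.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            BitSet hits = (BitSet) e.getValue().clone(); // copy so the index data is untouched
            hits.and(constraintMask);
            if (!hits.isEmpty()) { // hide terms absent from the constrained document set
                suggestions.put(e.getKey(), hits.cardinality());
            }
        }
        return suggestions;
    }

    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        TreeMap<String, BitSet> terms = new TreeMap<>();
        terms.put("Erik", bits(0));
        terms.put("Yeats", bits(0, 1));
        terms.put("Yolanda", bits(3));
        terms.put("Yonik", bits(1, 2));
        System.out.println(suggest(terms, "Yo", bits(0, 1, 2))); // prints {Yonik=2}
    }
}
```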

I've been thinking about this in a general sense - if Solr were driven
by a slick servlet filter rather than servlets, then these types of
handlers could be plugged in a lot more easily, including automatic URL
handling rather than having to twiddle web.xml.  I realize that the
handler configuration allows this with the qt parameter, and I'm
leveraging that myself, but I think some HiveMind mojo could allow
true "plugins" to drop right into the classpath and be immediately
available (perhaps even hotly with some containers, but I personally
would rebuild a WAR, stop/deploy/restart).

> In this case which is the correct level to make this, perhaps a new
> RequestHandler understanding some new query syntax to exploit filters.

Back to your specific case currently, yes, a new request handler is  
needed to go above and beyond what the built-in standard one  
provides.  I expect a flood of cool handlers on top of Solr :)  and  
that is why I am thinking more along the lines of a true plugin  
architecture.

> We only need a sort on a single and precalculated rank field stored as a
> range field, so we don't need relevance and consequently don't need scores
> (which is a prerequisite for using BitSets, if I understand well).

You're pretty much right on!

> PS:I think Solr and Lucene are a really great work!
> I'll be happy when we have finished to add our project (a major press group
> here in Italy) to public websites in Solr Wiki.

I'm looking forward to your work on top of Solr!  I'm personally  
quite thrilled with it and really believe it'll go far.  If only I  
had more time to play with it myself rather than just contemplating  
it :)

        Erik