My Category Search Problem

My Category Search Problem

Vijay Santhanam
Hi Lucene Users!

 

I've been playing around with dotLucene on a few projects for about 4
months, and I've found Lucene to be exceptionally powerful, speedy and,
thanks to LIA, really easy to use.

But I've hit a problem that I fear will pose a performance problem for our
architecture and Lucene installation.

 

We have an index of about 100,000 documents with about 30 fields, built from
our database.

Each document in the index contains a TOKENIZED string field of Category
Names, so that each document can belong to many categories.

 

We have a new requirement to not only allow searches across the whole index,
but also to return the number of matching documents in each of the (150)
possible categories. This is like an Amazon search
(http://amazon.com/s/ref=nb_ss_gw/105-0072880-3737226?url=search-alias%3Daps&field-keywords=diamond&Go.x=0&Go.y=0&Go=Go),
where a category list is presented on the left with the number of results in
each category.

 

So far, I can think of two possible ways to implement this:

 

1. Create a QueryFilter for the user-entered query, and perform a
category field search for each category.
2. Create a separate index for each category, and sequentially (or
concurrently) search across all the indexes.

 

Does anyone know which of these two solutions is better?

 

Both solutions seem taxing to me because they both involve "number of
categories + 1" searches.

 

Regards,

 -V

 

 

Re: My Category Search Problem

kapilChhabra
Hi Vijay,
I have hit the same problem in the past and have evaluated various
techniques to solve it.
1. Using the QueryFilter
The idea is to
    a) create BitSets for each category once initially
    b) run the search and extract the BitSet for the search results
    c) Logically "AND" the result set with the category sets
    d) find the cardinality of each such result and finally display
This worked fine in my scenario but did not scale: performance
decreased as the number of categories increased, because of the
"AND"ing in the loop.
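
A rough sketch of steps a) to d) might look like the following. It uses plain java.util.BitSet and leaves Lucene out entirely; I am assuming the real sets would come from a filter's bits for the open reader. The class name, category names and document ids are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitSetCategoryCounts {

    // Steps c) and d): AND the query's result bits with each cached
    // category BitSet and record the cardinality of the intersection.
    static Map<String, Integer> countPerCategory(BitSet queryBits,
                                                 Map<String, BitSet> categoryBits) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Map.Entry<String, BitSet> entry : categoryBits.entrySet()) {
            // Clone so the query's BitSet is not modified by and().
            BitSet intersection = (BitSet) queryBits.clone();
            intersection.and(entry.getValue());
            counts.put(entry.getKey(), intersection.cardinality());
        }
        return counts;
    }

    // Helper for the demo: build a BitSet from a list of document ids.
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Step a): one BitSet per category, built once (hypothetical data).
        Map<String, BitSet> categoryBits = new LinkedHashMap<String, BitSet>();
        categoryBits.put("books", bits(0, 1, 4));
        categoryBits.put("jewelry", bits(1, 2));

        // Step b): the documents matched by the user's query (hypothetical).
        BitSet queryBits = bits(1, 3, 4);

        System.out.println(countPerCategory(queryBits, categoryBits));
        // prints {books=2, jewelry=1}
    }
}
```

The loop does one clone and one AND per category for every search, which is where the cost grows with the number of categories.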

2. Override the collect method of the HitCollector.
This method is called by Lucene for every document in the search results.
The idea is to:
    a) override the method to maintain a HashMap (this works just fine
for me) that maps each category to its hit count
    b) just keep incrementing the count for each category as and when it
is encountered in the search results.
    c) the HashMap can be blank in the beginning and new categories can
be added to it when encountered.

I am currently using the second method and it works.
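
For illustration, the counting logic could be sketched like this. It is not wired into Lucene: the class only mirrors the shape of HitCollector's collect(int doc, float score), and the doc-id-to-categories array is a hypothetical in-memory lookup that you would have to build from your own index.

```java
import java.util.HashMap;
import java.util.Map;

class CategoryCountingCollector {

    // Hypothetical lookup: categoriesByDoc[docId] holds that document's categories.
    private final String[][] categoriesByDoc;

    // Starts empty; categories are added as they are first encountered.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    CategoryCountingCollector(String[][] categoriesByDoc) {
        this.categoriesByDoc = categoriesByDoc;
    }

    // Same signature as HitCollector.collect: called once per matching document.
    public void collect(int doc, float score) {
        for (String category : categoriesByDoc[doc]) {
            Integer current = counts.get(category);
            counts.put(category, current == null ? 1 : current + 1);
        }
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    public static void main(String[] args) {
        String[][] categoriesByDoc = {
            {"books"},            // doc 0
            {"books", "jewelry"}, // doc 1
            {"jewelry"},          // doc 2
            {"music"}             // doc 3 (not a hit below)
        };
        CategoryCountingCollector collector =
                new CategoryCountingCollector(categoriesByDoc);

        // Stand-in for the search calling collect() on each hit.
        int[] hits = {0, 1, 2};
        for (int doc : hits) {
            collector.collect(doc, 1.0f);
        }
        System.out.println(collector.getCounts());
    }
}
```

Since collect runs for every hit, the category lookup should be something cheap that is already in memory; loading stored documents inside collect would likely be slow.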

Hope this helps.

Regards,
kapilChhabra


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: My Category Search Problem

Graham Stead-2
Hello Vijay,

I'm not sure whether such a change is feasible for you, but Solr has
supported facets for some time now. Solr is a front-end for Lucene that
provides a number of valuable features not found in Lucene itself: caching,
FunctionQueries, and facets, to name a few. It has an XML interface and can
be accessed from any language.
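
As a small sketch of what a facet request could look like, the snippet below only builds the request URL using Solr's facet=true and facet.field parameters. The host, port, handler path and the field name "category" are assumptions for illustration, not details from your setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SolrFacetUrl {

    // Build a select URL that asks for per-field facet counts
    // alongside the first ten results of the user's query.
    static String facetUrl(String baseUrl, String userQuery, String facetField) {
        return baseUrl
                + "?q=" + encode(userQuery)
                + "&rows=10"
                + "&facet=true"
                + "&facet.field=" + encode(facetField);
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance and field name.
        System.out.println(
                facetUrl("http://localhost:8983/solr/select", "diamond", "category"));
        // prints http://localhost:8983/solr/select?q=diamond&rows=10&facet=true&facet.field=category
    }
}
```

The XML response to such a request should then carry the per-category counts alongside the normal results, so no extra per-category searches are needed on the client side.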

Best regards,
-Graham

> -----Original Message-----
> From: Kapil Chhabra [mailto:[hidden email]]
> Sent: Monday, January 15, 2007 11:43 PM
> To: [hidden email]
> Subject: Re: My Category Search Problem