grouping search results

Mike Baranczak-2
The documents in my index will contain a "category" field. (We can  
assume that the number of possible categories will be small - 10 or  
so max - and that they'll be known in advance.) I need to be able to  
present the search results to the end user like this:

- top 10 results in category "x":
        1. sdfsdfsd
        2. dfgdgfdfg
        3. [...]
- top 10 results in category "y":
        1. gffgdgf
        2. kjhjkkghj
[...]

The first thing I thought of was to construct a boolean AND query for  
every possible category (from the user's query and a term query for  
the category); but this seems like it might be causing a lot of  
redundant work. My next idea was to create a QueryFilter from the  
user's query, and run a search for each category with this filter and  
a term query. Since the QueryFilter is supposed to cache results,  
this should theoretically be more efficient. So my questions to the  
Lucene gurus are:

1) Will the QueryFilter method really be more efficient?

2) Is there yet another way to accomplish what I need?

-MB


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: grouping search results

Karl Wettin-3
On Tue, 2006-05-09 at 13:46 -0400, Mike Baranczak wrote:

> The documents in my index will contain a "category" field. (We can  
> assume that the number of possible categories will be small - 10 or  
> so max - and that they'll be known in advance.) I need to be able to  
> present the search results to the end user like this:
>
> - top 10 results in category "x":
> 1. sdfsdfsd
> 2. dfgdgfdfg
> 3. [...]
> - top 10 results in category "y":
> 1. gffgdgf
> 2. kjhjkkghj
> [...]


> 2) Is there yet another way to accomplish what I need?

Did you consider placing more than one query? Perhaps it is enough to
iterate the top 100 scoring hits?
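
For illustration, a minimal sketch of that post-processing step in plain Java (not code from this thread). It assumes the ranked doc ids and a doc-id-to-category array are already in hand; GroupRankedHits, group, rankedDocs and categoryByDoc are made-up names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupRankedHits {
    /**
     * Buckets an already-ranked hit list by category.
     * rankedDocs holds doc ids, best hit first (e.g. the top 100 hits).
     * categoryByDoc is a hypothetical doc-id -> category lookup array.
     * Each bucket keeps at most perCategory doc ids, in rank order.
     */
    static Map<String, List<Integer>> group(int[] rankedDocs,
                                            String[] categoryByDoc,
                                            int perCategory) {
        // LinkedHashMap keeps categories in the order they first appear.
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int doc : rankedDocs) {
            List<Integer> bucket =
                groups.computeIfAbsent(categoryByDoc[doc], c -> new ArrayList<>());
            if (bucket.size() < perCategory) {
                bucket.add(doc);
            }
        }
        return groups;
    }
}
```

One caveat with this approach: a category with few matches among the top hits ends up with fewer than perCategory results, even if more matches exist further down the ranking.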


Re: grouping search results

Chris Hostetter-3
In reply to this post by Mike Baranczak-2

: redundant work. My next idea was to create a QueryFilter from the
: user's query, and run a search for each category with this filter and
: a term query. Since the QueryFilter is supposed to cache results,
: this should theoretically be more efficient. So my questions to the

if you did an approach like this, the scores for each document in each
result set would be the same for each set -- because the "Query" is just on
the category term -- the user's query would only be used to Filter, so
the score value would be ignored.

: 2) Is there yet another way to accomplish what I need?

Off the top of my head, the best way i can think to do this (if the list
of categories is fixed and known in advance as you said) is with a
HitCollector that maintains a Bounded PriorityQuery for each category.  as
it collects matches, it can look up which category they are in using the
FieldCache and add them to the appropriate queue.


-Hoss


Re: grouping search results

Mike Baranczak-2

On May 9, 2006, at 2:08 PM, Chris Hostetter wrote:

>
> : redundant work. My next idea was to create a QueryFilter from the
> : user's query, and run a search for each category with this filter and
> : a term query. Since the QueryFilter is supposed to cache results,
> : this should theoretically be more efficient. So my questions to the
>
> if you did an approach like this, the scores for each document in each
> result set would be the same for each set -- because the "Query" is just on
> the category term -- the user's query would only be used to Filter, so
> the score value would be ignored.

Damn. That's no good, then. What about doing it the opposite way:  
make a QueryFilter for each category (these could be cached between  
search sessions), and use those to filter the results from searching  
for the user's query? Would that actually be any faster than the  
original idea of constructing a boolean query for each category?


>
> : 2) Is there yet another way to accomplish what I need?
>
> Off the top of my head, the best way i can think to do this (if the list
> of categories is fixed and known in advance as you said)

Not exactly fixed, but it probably won't change too often, and it  
will definitely be known at query time. So close enough.

> is with a
> HitCollector that maintains a Bounded PriorityQuery for each category.  as
> it collects matches, it can look up which category they are in using the
> FieldCache and add them to the appropriate queue.

Did you mean PriorityQueue?

Can you explain what you mean by that? I'm looking at the javadocs  
for FieldCache, but there's no indication of how to obtain one.




Re: grouping search results

Chris Hostetter-3

: Damn. That's no good, then. What about doing it the opposite way:
: make a QueryFilter for each category (these could be cached between
: search sessions), and use those to filter the results from searching
: for the user's query? Would that actually be any faster than the
: original idea of constructing a boolean query for each category?

1) if your categories are defined by simple terms, there's no reason why
it would have to be a QueryFilter, you could make a simple TermFilter
class that would do the same thing

2) i don't think it would be much faster than doing 10 separate
BooleanQueries -- the advantage would be that in the BooleanQuery approach
the resulting scores will be affected by the category part of the query
... if it's a simple mandatory term query then there's no harm, because it
will affect all of the scores in that category equally -- but if it's more
complex (ie: if your portable music category is defined by the query
"name:mp3 or name:ipod or name:cd") then the TF/IDF of each term will
affect the score of the resulting documents, and could change the order.

: > is with a
: > HitCollector that maintains a Bounded PriorityQuery for each
: > category.  as
: > it collects matches, it can look up which category they are in
: > using the
: > FieldCache and add them to the appropriate queue.
:
: Did you mean PriorityQueue?

yes, sorry ... but i wasn't explicitly referring to the lucene
PriorityQueue class ... i just mean any data structure that will maintain
an ordered list of the N "biggest" items you give it.
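
As a rough illustration of such a structure, here is a minimal sketch built on java.util.PriorityQueue used as a min-heap, so the smallest retained item is the one evicted. BoundedTopN and its method names are hypothetical helpers, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps only the N largest items offered to it (hypothetical helper). */
class BoundedTopN<T> {
    private final int capacity;
    private final Comparator<T> order;
    private final PriorityQueue<T> heap; // min-heap: smallest retained item on top

    BoundedTopN(int capacity, Comparator<T> order) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
        this.order = order;
        this.heap = new PriorityQueue<>(capacity, order);
    }

    /** Offers an item; it is kept only if it ranks among the N largest so far. */
    void offer(T item) {
        if (heap.size() < capacity) {
            heap.add(item);
        } else if (order.compare(item, heap.peek()) > 0) {
            heap.poll();    // evict the current smallest
            heap.add(item);
        }
    }

    /** Returns a copy of the retained items, largest first. */
    List<T> largestFirst() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order.reversed());
        return out;
    }
}
```

Each offer costs O(log N), so collecting every match stays cheap even when the result set is large.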

: Can you explain what you mean by that? I'm looking at the javadocs
: for FieldCache, but there's no indication of how to obtain one.

FieldCache.DEFAULT.getStringIndex(reader, fieldName) ... or
FieldCache.DEFAULT.getInts(reader, fieldName) depending on how you store
your category info.  The resulting array can be used to lookup values by
docid very quickly for comparison (or to be used as a key in a Map of
queues)
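
To make the idea concrete, a minimal sketch in plain Java with no Lucene dependency. A String[] indexed by doc id stands in for whatever array the FieldCache call returns, and collect(int doc, float score) is shaped like the HitCollector callback. CategoryTopHits, Hit and topDocs are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class CategoryTopHits {
    /** A matching document and its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private static final Comparator<Hit> BY_SCORE =
        Comparator.comparingDouble((Hit h) -> h.score);

    private final String[] categoryByDoc; // assumption: stand-in for the FieldCache array
    private final int perCategory;
    private final Map<String, PriorityQueue<Hit>> queues = new HashMap<>();

    CategoryTopHits(String[] categoryByDoc, int perCategory) {
        this.categoryByDoc = categoryByDoc;
        this.perCategory = perCategory;
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        // Look up the category by doc id, then use it as the key into the map of queues.
        PriorityQueue<Hit> q = queues.computeIfAbsent(
            categoryByDoc[doc], c -> new PriorityQueue<>(BY_SCORE));
        if (q.size() < perCategory) {
            q.add(new Hit(doc, score));
        } else if (score > q.peek().score) {
            q.poll(); // drop the lowest-scoring hit kept so far
            q.add(new Hit(doc, score));
        }
    }

    /** Doc ids retained for a category, best score first. */
    List<Integer> topDocs(String category) {
        PriorityQueue<Hit> q = queues.get(category);
        if (q == null) {
            return Collections.emptyList();
        }
        List<Hit> hits = new ArrayList<>(q);
        hits.sort(BY_SCORE.reversed());
        List<Integer> docs = new ArrayList<>();
        for (Hit h : hits) {
            docs.add(h.doc);
        }
        return docs;
    }
}
```

With this shape a single pass over the matches fills every category's queue, and the scores come from the user's query alone.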



-Hoss

