Re: Trimming the list of docs returned.

7 messages
Re: Trimming the list of docs returned.

TomSolrList
Hi -

On 10/30/06, Yonik Seeley <[hidden email]> wrote:
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).

Not to be dense, but how do I use a custom HitCollector with Solr?

I've checked the wiki, and searched the mailing list, and don't see
anything. Is there a way to configure this, or do I just build a
custom version of Solr?

I have no problems doing this in Lucene, but I'm not quite sure where
to configure/code this in Solr.

Thanks,

Tom


On 10/30/06, Yonik Seeley <[hidden email]> wrote:
 > Hi Tom, I moderated your email in... you need to subscribe to prevent
 > your emails being blocked in the future.

Thanks. That's fixed, I hope. I was using the wrong address.

 > http://incubator.apache.org/solr/mailing_lists.html
 >
 > On 10/30/06, Tom <[hidden email]> wrote:
 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 >
 > You bring up an interesting problem that may be of general use.
 > Solr doesn't currently do this, but it should be possible (with some
 > work in the internals).
 >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group
 >
 > Documents belonging to only one group does make things easier.
 >
 > > I could just examine each returned document, and discard documents
 > > from groups I have seen before, but that seems slow (but I'm not sure
 > > there is a better alternative).
 > >
 > > The number of groups is fairly high percentage of the number of
 > > documents (maybe 5% of all documents), so building something like a
 > > filter for each group doesn't seem feasible.
 > >
 > > CustomHitCollector of some sort could work, but there is the comment
 > > in the javadoc about "should not call  Searcher.doc(int)
 > > or  IndexReader.document(int) on every  document number encountered."
 > > which would seem to be necessary to get the group id.
 >
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).
 >
 > It might be useful to see what Nutch does in this regard too.
 >
 > -Yonik
 >

Reply | Threaded
Open this post in threaded view
|

Re: Trimming the list of docs returned.

Yonik Seeley-2
On 11/8/06, Tom <[hidden email]> wrote:
> On 10/30/06, Yonik Seeley <[hidden email]> wrote:
>  > Yes, a custom hit collector would work.  Searcher.doc() would be
>  > deadly... but since each doc has at most one category, the FieldCache
>  > could be used (it quickly maps id to field value and was historically
>  > used for sorting).
>
> Not to be dense, but how do I use a custom HitCollector with Solr?

You would need a custom request handler, then just use the
SolrIndexSearcher you get with a request... it exposes all of the
Lucene IndexSearcher methods.

-Yonik
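To illustrate the idea from the earlier message, here is a minimal, self-contained sketch (hypothetical names, not the Solr/Lucene API): a collector caps the number of hits kept per group, looking up each doc's group in a precomputed array -- the array stands in for the FieldCache values, so no Searcher.doc() call is needed per hit.

```java
import java.util.*;

// Sketch only: emulates a HitCollector that caps hits per group.
// `groupOf` stands in for FieldCache values -- one group id per doc id.
public class TrimCollector {
    // Returns the doc ids kept, allowing at most `perGroup` docs from each
    // group; `docs` is assumed to arrive in score order, as from a collector.
    public static List<Integer> collect(int[] docs, String[] groupOf, int perGroup) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int doc : docs) {
            String group = groupOf[doc];
            int n = seen.getOrDefault(group, 0);
            if (n < perGroup) {          // keep only the first perGroup hits per group
                seen.put(group, n + 1);
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] groupOf = {"a", "a", "b", "a", "b"};
        int[] hits = {0, 1, 2, 3, 4};    // doc ids, already in score order
        System.out.println(collect(hits, groupOf, 1));
    }
}
```

In a real handler the `groupOf` lookup would come from FieldCache, and the capping would happen inside the collect() callback rather than as a post-pass.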



Re: Trimming the list of docs returned.

TomSolrList
Hi -

Recap:
>>  > > I'd like to be able to limit the number of documents returned from
>>  > > any particular group of documents, much as Google only shows a max of
>>  > > two results from any one website.
>>  > >
>>  > > The docs are all marked as to which group they belong to. There will
>>  > > probably be multiple groups returned from any search. Documents
>>  > > belong to only one group


It looks like, for trimming, the places I want to modify are
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in each matching group, whether by score or
sort, not just the first one that goes through the HitCollector.

But since I want to enable this on a per-request basis, I need some way
to get the parameters from the original request and pass them down to my
implementation of ScorePriorityQueue.

I'm trying to minimize the number of changes I'd have to make, so
I've defined another flag (like SolrIndexSearcher.GET_SCORES), and I
check and set it in a modified version of StandardRequestHandler.
This seems to work, and doesn't require me to change any method
signatures. Suggestions for other implementations are welcome!

Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===================================================================
--- src/java/org/apache/solr/request/StandardRequestHandler.java	(revision 470495)
+++ src/java/org/apache/solr/request/StandardRequestHandler.java	(working copy)
@@ -97,6 +97,10 @@
        // find fieldnames to return (fieldlist)
        String fl = p.get(SolrParams.FL);
        int flags = 0;
+      String trim = p.get("trim");
+      if ((trim == null) || !trim.equals("0"))
+       flags |= SolrIndexSearcher.TRIM_RESULTS;
+
        if (fl != null) {
          flags |= U.setReturnFields(fl, rsp);
        }

But, unsurprisingly, trimming vs. not trimming is being ignored with
regard to caching. How would I indicate that a query with trim=0 is
not the same as trim=1? I do still want to cache. But obviously, my
implementation won't work at the moment, since every query will be
served the cached results generated with whatever trim value the
first query happened to use.

Any suggestions for where to go poking around to fix this vs. caching?

Thanks,

Tom



Re: Trimming the list of docs returned.

Yonik Seeley-2
On 11/15/06, Tom <[hidden email]> wrote:
> It looks like that for trimming, the places I want to modify are in
> ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
> return the top item in the group that matches, whether by score or
> sort, not just the first one that goes through the HitCollector.

Wouldn't you actually need a priority queue per group?

> But, unsurprisingly, trimming vs. not trimming is being ignored with
> regard to caching. How would I indicate that a query with trim=0 is
> not the same as trim=1? I do still want to cache.

One hack: implement a simple query that delegates to another query and
encapsulates the trim value... that way hashCode/equals won't match
unless the trim does.

-Yonik

> But obviously, my
> implementation won't work at the moment, since all queries will cache
> the value generated using the results generated by the value of trim
> on the initial query.
>
> Any suggestions for where to go poking around to fix this vs. caching?
>
> Thanks,
>
> Tom

Re: Trimming the list of docs returned.

TomSolrList
At 01:35 PM 11/15/2006, you wrote:
>On 11/15/06, Tom <[hidden email]> wrote:
>>It looks like that for trimming, the places I want to modify are in
>>ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
>>return the top item in the group that matches, whether by score or
>>sort, not just the first one that goes through the HitCollector.
>
>Wouldn't you actually need a priority queue per group?

I'm still playing with implementations, but I think you just need a
max score for each group.

You can't just use a PriorityQueue (of either maxes, or of
PriorityQueues), since I don't think the Lucene PriorityQueue handles
entries whose value changes after insertion.
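The max-score-per-group bookkeeping can be sketched without any queue at all (hypothetical names, not Lucene code): keep the single best (doc, score) per group in a HashMap while collecting, which sidesteps the changed-after-insertion problem entirely.

```java
import java.util.*;

// Sketch only: track the highest-scoring doc per group in a HashMap,
// instead of a priority queue whose entries would need to be re-ordered
// when a group's max score changes after insertion.
public class BestPerGroup {
    // scores[i] is the score of doc i; groupOf[i] is its group.
    // Returns group id -> doc id of the highest-scoring doc in that group.
    public static Map<String, Integer> best(float[] scores, String[] groupOf) {
        Map<String, Integer> bestDoc = new HashMap<>();
        for (int doc = 0; doc < scores.length; doc++) {
            Integer cur = bestDoc.get(groupOf[doc]);
            if (cur == null || scores[doc] > scores[cur]) {
                bestDoc.put(groupOf[doc], doc);   // new max for this group
            }
        }
        return bestDoc;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        String[] groups = {"x", "x", "y", "y"};
        System.out.println(best(scores, groups));
    }
}
```

A final pass would then rank the surviving per-group winners by score (or sort field) to build the returned list.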


>>But, unsurprisingly, trimming vs. not trimming is being ignored with
>>regard to caching. How would I indicate that a query with trim=0 is
>>not the same as trim=1? I do still want to cache.
>
>One hack: implement a simple query that delegates to another query and
>encapsulates the trim value... that way hashCode/equals won't match
>unless the trim does.

Not sure what you mean by "delegates to another query". Could you
clarify or give me a pointer?

I was thinking in terms of just adding some guaranteed true clause to
the end when trimming, is that similar to what you were talking about?

Thanks,

Tom





Re: Trimming the list of docs returned.

Yonik Seeley-2
On 11/15/06, Tom <[hidden email]> wrote:
> >One hack: implement a simple query that delegates to another query and
> >encapsulates the trim value... that way hashCode/equals won't match
> >unless the trim does.
>
> Not sure what you mean by "delegates to another query". Could you
> clarify or give me a pointer?

Something like

public class TrimmedQuery extends Query {
   Query delegate;
   int trim;
   public TrimmedQuery(Query delegate, int trim) {
     this.delegate = delegate;
     this.trim = trim;
   }
   // now override hashCode + equals to include trim, and implement all
   // the other methods by delegating to `delegate`.
}

> I was thinking in terms of just adding some guaranteed true clause to
> the end when trimming, is that similar to what you were talking about?

Yes, that should work too.

-Yonik

Re: Trimming the list of docs returned.

Chris Hostetter-3

One other thing you'll need to watch out for is the filterCache ... Solr
has a setting (I forget the name at the moment) which tells the
SolrIndexSearcher that for sorted queries, it can reuse the DocSet from a
previous invocation of the Query and sort the cached DocSet to generate
the list -- but your set of documents returned is dependent on your sort
order, so you may actually want to put the sort option in your
TrimmedQuery as well to denote the uniqueness of the set of matched
Documents.

If you think about it, a completely generalized solution would allow the
"trimming" order to be independent of the sorting order, so a user could
ask for "Books matching the word 'Lucene', trimmed so only the most popular
matching book per publisher is returned, sorted by price." In which
case your Query needs to know that "Publisher" is the field you grouped on,
and "Popularity"/"desc" is the trimming you applied to each group -- and
now the usual DocList and DocSet caching will work flawlessly, regardless
of the fact that you sorted on "Price" this time, but next time you might
sort on "Popularity".
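That cache-key idea can be sketched as a plain class (hypothetical, not the real Lucene Query): the wrapper's identity covers everything that determines the *set* of matched documents -- the wrapped query, the group field, and the trimming order -- so two different trims can never collide in Solr's query or filter caches.

```java
import java.util.Objects;

// Sketch only: a cache key whose equals/hashCode include the group field
// and trimming order, not just the wrapped query, so differently-trimmed
// result sets get distinct cache entries.
public class TrimKey {
    final String query;       // stands in for the delegate Query
    final String groupField;  // e.g. "publisher"
    final String trimOrder;   // e.g. "popularity desc"

    public TrimKey(String query, String groupField, String trimOrder) {
        this.query = query;
        this.groupField = groupField;
        this.trimOrder = trimOrder;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof TrimKey)) return false;
        TrimKey t = (TrimKey) o;
        return query.equals(t.query)
            && groupField.equals(t.groupField)
            && trimOrder.equals(t.trimOrder);
    }

    @Override public int hashCode() {
        return Objects.hash(query, groupField, trimOrder);
    }
}
```

Note the sort order used for the final list is deliberately *not* part of the key, matching the point above: the matched set depends on the trim, not on how the survivors are finally sorted.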




-Hoss