Trimming the list of docs returned.

Tom-137
Hi -

I'd like to be able to limit the number of documents returned from
any particular group of documents, much as Google only shows a max of
two results from any one website.

The docs are all marked as to which group they belong to. There will
probably be multiple groups returned from any search. Documents
belong to only one group.

I could just examine each returned document, and discard documents
from groups I have seen before, but that seems slow (but I'm not sure
there is a better alternative).

The number of groups is a fairly high percentage of the number of
documents (maybe 5% of all documents), so building something like a
filter for each group doesn't seem feasible.

A custom HitCollector of some sort could work, but there is the comment
in the javadoc about "should not call Searcher.doc(int)
or IndexReader.document(int) on every document number encountered."
which would seem to be necessary to get the group id.

Does Solr add anything to Lucene in this regard?

Thanks,

Tom

Re: Trimming the list of docs returned.

Yonik Seeley-2
Hi Tom, I moderated your email in... you need to subscribe to prevent
your emails being blocked in the future.
http://incubator.apache.org/solr/mailing_lists.html

On 10/30/06, Tom <[hidden email]> wrote:
> I'd like to be able to limit the number of documents returned from
> any particular group of documents, much as Google only shows a max of
> two results from any one website.

You bring up an interesting problem that may be of general use.
Solr doesn't currently do this, but it should be possible (with some
work in the internals).

> The docs are all marked as to which group they belong to. There will
> probably be multiple groups returned from any search. Documents
> belong to only one group.

Documents belonging to only one group does make things easier.

> I could just examine each returned document, and discard documents
> from groups I have seen before, but that seems slow (but I'm not sure
> there is a better alternative).
>
> The number of groups is a fairly high percentage of the number of
> documents (maybe 5% of all documents), so building something like a
> filter for each group doesn't seem feasible.
>
> A custom HitCollector of some sort could work, but there is the comment
> in the javadoc about "should not call Searcher.doc(int)
> or IndexReader.document(int) on every document number encountered."
> which would seem to be necessary to get the group id.

Yes, a custom hit collector would work.  Searcher.doc() would be
deadly... but since each doc belongs to at most one group, the FieldCache
could be used (it quickly maps a document id to its field value and was
historically used for sorting).
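
[Editor's sketch] The collector idea above could look roughly like this.
It is a minimal, self-contained illustration, not Solr or Lucene code:
the class GroupLimitingCollector, its Hit type, and topHits() are
hypothetical names, and the groupByDoc array is only a stand-in for the
doc-number-to-value array that a FieldCache lookup on the group field
would provide. collect(int, float) is shaped like a hit-collector
callback, so it assumes hits may arrive in any order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keeps at most maxPerGroup best-scoring hits per group.
class GroupLimitingCollector {

    /** A collected hit: internal document number plus its score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Assumption: one group id per document, indexed by document number
    // (the kind of array a field cache would hand back).
    private final String[] groupByDoc;
    private final int maxPerGroup;

    // Per group, a min-heap on score so the weakest kept hit is on top.
    private final Map<String, PriorityQueue<Hit>> bestPerGroup = new HashMap<>();

    GroupLimitingCollector(String[] groupByDoc, int maxPerGroup) {
        if (maxPerGroup < 1) {
            throw new IllegalArgumentException("maxPerGroup must be >= 1");
        }
        this.groupByDoc = groupByDoc;
        this.maxPerGroup = maxPerGroup;
    }

    /** Called once per matching document; no stored document is loaded. */
    void collect(int doc, float score) {
        String group = groupByDoc[doc]; // plain array lookup
        PriorityQueue<Hit> heap = bestPerGroup.computeIfAbsent(
                group,
                g -> new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score)));
        if (heap.size() < maxPerGroup) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                  // drop the weakest hit of this group
            heap.add(new Hit(doc, score));
        }
    }

    /** Returns up to n surviving hits, best score first (ties by doc number). */
    List<Hit> topHits(int n) {
        List<Hit> all = new ArrayList<>();
        for (PriorityQueue<Hit> heap : bestPerGroup.values()) {
            all.addAll(heap);
        }
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```

The sketch trades memory for simplicity: it holds one small heap per
group seen in the result set, which may matter when the number of
groups is a large fraction of the index.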

It might be useful to see what Nutch does in this regard too.

-Yonik