MultiSearcher & skewed IDF values

MultiSearcher & skewed IDF values

kkrugler
Hi all,

I'm curious as to whether MultiSearcher (as of 1.9) does a good job
of blending search results, when the various indexes being searched
have significantly different characteristics.

For example, let's say I've got two indexes. One consists of
documents where the term "platypus" almost never occurs. This index
will have a very high IDF for that term.

The second index happens to be from the portion of the crawl that was
covering biology departments in Australian universities, so the term
"platypus" is significantly more common.

If I do a search on "platypus lifespan" using MultiSearcher, will
hits from the first index get an unnatural boost because of the
corresponding high IDF in that particular slice of the data? Or
should I expect that the results will (closely) match what I'd get
back if I merged the two indexes and used a regular searcher?
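
To make the concern concrete, here is a minimal sketch of how far apart the per-index IDF values could be. It assumes the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)); the document counts and frequencies are invented for illustration and do not come from any real crawl.

```python
import math

def classic_idf(doc_freq, num_docs):
    # Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for the term "platypus" in each index slice.
general = {"num_docs": 1_000_000, "platypus_df": 9}      # term almost never occurs
biology = {"num_docs": 100_000, "platypus_df": 9_999}    # term is common

idf_general = classic_idf(general["platypus_df"], general["num_docs"])
idf_biology = classic_idf(biology["platypus_df"], biology["num_docs"])

# IDF the term would have if both slices were merged into one index.
idf_merged = classic_idf(
    general["platypus_df"] + biology["platypus_df"],
    general["num_docs"] + biology["num_docs"],
)

print(f"general index: {idf_general:.2f}")
print(f"biology index: {idf_biology:.2f}")
print(f"merged index:  {idf_merged:.2f}")
```

If each slice scores with its own statistics, a hit from the general index is weighted by a much larger IDF than the same term would receive in a merged index, which is the boost the question is about.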

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: MultiSearcher & skewed IDF values

Andrzej Białecki
Ken Krugler wrote:

> Hi all,
>
> I'm curious as to whether MultiSearcher (as of 1.9) does a good job of
> blending search results, when the various indexes being searched have
> significantly different characteristics.
>
> For example, let's say I've got two indexes. One consists of documents
> where the term "platypus" almost never occurs. This index will have a
> very high IDF for that term.
>
> The second index happens to be from the portion of the crawl that was
> covering biology departments in Australian universities, so the term
> "platypus" is significantly more common.
>
> If I do a search on "platypus lifespan" using MultiSearcher, will hits
> from the first index get an unnatural boost because of the
> corresponding high IDF in that particular slice of the data? Or should
> I expect that the results will (closely) match what I'd get back if I
> merged the two indexes and used a regular searcher?

Unfortunately, this is still an open problem, and neither Nutch nor
Lucene does the right thing here. Please see NUTCH-92 for more
information and a sketch of a solution to this issue.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: MultiSearcher & skewed IDF values

Doug Cutting
Andrzej Bialecki wrote:
> Unfortunately, this is still an open problem, and neither Nutch nor
> Lucene does the right thing here. Please see NUTCH-92 for more
> information and a sketch of a solution to this issue.

Lucene's MultiSearcher now implements this correctly, no?  But Nutch's
distributed search does not.  Two round trips to each node are required:
the first to get IDF information for the query, and the second to get hits.

Doug
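
The two round trips described above can be sketched roughly as follows. This is an illustration only, not Nutch's or Lucene's actual code: `Node`, `stats`, `search` and `distributed_search` are hypothetical names, documents are plain token lists, and scoring is simplified to sqrt(tf) * idf^2 with no length norms or coordination factor.

```python
import math
from collections import Counter

class Node:
    """Hypothetical search node holding one index slice as lists of tokens."""

    def __init__(self, docs):
        self.docs = docs

    def stats(self, terms):
        # Round trip 1: report the local document count and per-term doc frequencies.
        df = Counter()
        for doc in self.docs:
            for term in set(terms) & set(doc):
                df[term] += 1
        return len(self.docs), df

    def search(self, terms, idf):
        # Round trip 2: score local documents using the supplied global IDF values.
        hits = []
        for doc_id, doc in enumerate(self.docs):
            score = sum(
                math.sqrt(doc.count(term)) * idf[term] ** 2
                for term in terms
                if term in doc
            )
            if score > 0:
                hits.append((score, doc_id))
        return hits

def distributed_search(nodes, terms):
    # Aggregate statistics from every node before any scoring happens.
    total_docs, total_df = 0, Counter()
    for node in nodes:
        num_docs, df = node.stats(terms)
        total_docs += num_docs
        total_df.update(df)
    # Global IDF, using the classic 1 + ln(numDocs / (docFreq + 1)) form.
    idf = {t: 1.0 + math.log(total_docs / (total_df[t] + 1)) for t in terms}
    # Collect hits from every node and merge them by score, best first.
    hits = []
    for node_id, node in enumerate(nodes):
        hits.extend((score, node_id, doc_id) for score, doc_id in node.search(terms, idf))
    return sorted(hits, reverse=True)
```

Because every node scores with the same aggregated statistics, the scores under this simplified model should match what a single merged index would produce, at the cost of the extra round trip.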