Duplicate Detection: Offline vs. Search Time


Shailesh Kochhar-2
Hi,

I'm trying to implement a duplicate detection method that doesn't delete
duplicate pages from the index. Essentially, I want to be able to
display all the duplicate URLs for a page in the search results instead
of just the one that was kept in the index.

There are two ways (and potentially more) that I can think of to implement this.

1. Offline duplicate detection which deletes the pages from the index
but stores references to the deleted pages with the copy that is kept.
The search results can then display all the URLs that have the same content.

2. Duplicate detection at search time that groups identical/similar
pages together. This method has the advantage that one could implement
duplicate detection that is sensitive to the query terms. However, it
would add a performance penalty to the search.
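The grouping step in option 2 can be sketched roughly as follows. This is a self-contained illustration, not Nutch code: the `Hit` record and `groupBySignature` method are hypothetical names, standing in for search hits that carry a stored content signature (in Nutch, something like the MD5-based signature). Hits sharing a signature are collected together so all duplicate URLs can be displayed in one result entry:

```java
import java.util.*;

public class SearchTimeDedup {
    // Minimal stand-in for a search hit: a URL plus a content signature.
    // In Nutch the signature would come from the index, e.g. an MD5 digest.
    record Hit(String url, String signature) {}

    // Group hits that share a content signature, preserving result order,
    // so every duplicate URL for a page can be shown together.
    static Map<String, List<String>> groupBySignature(List<Hit> hits) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Hit h : hits) {
            groups.computeIfAbsent(h.signature(), k -> new ArrayList<>())
                  .add(h.url());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("http://a.example/page", "sig1"),
            new Hit("http://mirror.example/page", "sig1"),
            new Hit("http://b.example/other", "sig2"));
        System.out.println(groupBySignature(hits));
    }
}
```

The query-sensitive variant mentioned above would replace the exact-signature key with a similarity measure computed over the query-relevant parts of each page, at the cost of more work per search.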

I'm not very familiar with the Nutch API, though I know there's an
MD5-signature-based deduping method in place and a Signature class to
extend for offline duplicate detection. I was wondering if anyone has
tried search-time deduping and what would be good places to implement it.

Any other suggestions/advice would be great.

Thanks,
   - Shailesh

Re: Duplicate Detection: Offline vs. Search Time

Doug Cutting
Shailesh Kochhar wrote:
> I'm not very familiar with the Nutch API, though I know there's an
> MD5-signature-based deduping method in place and a Signature class to
> extend for offline duplicate detection. I was wondering if anyone has
> tried search-time deduping and what would be good places to implement it.

Nutch already does search-time deduping.  By default it limits things to
two hits per host, but you can dedup by other fields and with other
per-dup counts.  This is available through NutchBean:

http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/NutchBean.html#search(org.apache.nutch.searcher.Query,%20int,%20int,%20java.lang.String)

and through the OpenSearch servlet.

Doug
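The per-dup limiting Doug describes ("two hits per host" by default, configurable field and count) can be sketched as follows. This is a standalone illustration of the behavior only; the `limitPerDup` method and its parameter names are hypothetical, not the actual NutchBean implementation:

```java
import java.util.*;

public class DedupFilter {
    // Keep at most hitsPerDup results sharing the same dedup-field value
    // (e.g. at most two hits per host), mirroring the search-time dedup
    // behavior described in the thread. Each hit is modeled as a map of
    // field names to values for simplicity.
    static List<String> limitPerDup(List<Map<String, String>> hits,
                                    String dedupField, int hitsPerDup) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            String key = hit.get(dedupField);
            int seen = counts.merge(key, 1, Integer::sum);
            if (seen <= hitsPerDup) {
                kept.add(hit.get("url"));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = List.of(
            Map.of("url", "http://x.example/1", "site", "x.example"),
            Map.of("url", "http://x.example/2", "site", "x.example"),
            Map.of("url", "http://x.example/3", "site", "x.example"),
            Map.of("url", "http://y.example/1", "site", "y.example"));
        // Analogue of the default: at most two hits per host.
        System.out.println(limitPerDup(hits, "site", 2));
    }
}
```

Deduping by a different field, such as a stored content signature, would just mean passing that field name and a count of 1 in place of the host field.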

Re: Duplicate Detection: Offline vs. Search Time

Shailesh Kochhar
Doug Cutting wrote:

> Shailesh Kochhar wrote:
>> I'm not very familiar with the Nutch API, though I know there's an
>> MD5-signature-based deduping method in place and a Signature class to
>> extend for offline duplicate detection. I was wondering if anyone has
>> tried search-time deduping and what would be good places to implement it.
>
> Nutch already does search-time deduping.  By default it limits things to
> two hits per host, but you can dedup by other fields and with other
> per-dup counts.  This is available through NutchBean:
>
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/NutchBean.html#search(org.apache.nutch.searcher.Query,%20int,%20int,%20java.lang.String)
>
>
> and through the OpenSearch servlet.
>

If I understand this correctly, you can only dedup by one field. This
would mean that if you were to implement and use content-based
deduplication, you'd have to give up limiting the number of hits per host.

Is this correct, or did I miss something?

   - Shailesh


Re: Duplicate Detection: Offline vs. Search Time

Doug Cutting
Shailesh Kochhar wrote:
> If I understand this correctly, you can only dedup by one field. This
> would mean that if you were to implement and use content-based
> deduplication, you'd have to give up limiting the number of hits per host.
>
> Is this correct, or did I miss something?

That's correct.  That's what's currently implemented.

Doug