New algo: Near duplicate detection

Otis Gospodnetic-2-2
This sounds simple and apparently it's effective...should anyone want to give it a try:

http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: New algo: Near duplicate detection

Dennis Kubes-2
I just saw that as well. I think it is worth having a go at implementing this.

Dennis

Re: New algo: Near duplicate detection

Andrzej Białecki-2
Dennis Kubes wrote:
> I just saw that as well. I think it is worth having a go at implementing this.
>
> Dennis
>
> Otis Gospodnetic wrote:
>> This sounds simple and apparently it's effective...should anyone want
>> to give it a try:
>>
>> http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html

Interesting, I agree it's worth checking. The reference to the use of
inverted indexes is intriguing. Perhaps we could reuse the existing
Lucene index that is being de-duplicated.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com