pages with duplicate content in search results

Edward Quick

Hi,

Even though I ran nutch dedup on my index, I still have pages with different URLs but exactly the same content (see the search result example below). From what I have read about dedup, this shouldn't happen, as it deletes the URL with the lowest score. Is there anything else I can try to get rid of these?

Thanks,
Ed.

Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online   Employee Self Service       ESS Home ... Description Document     Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)



Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online   Employee Self Service       ESS Home ... Description Document     Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)


Re: pages with duplicate content in search results

Dennis Kubes-2
If you are using more than one index then dedup will not work across
indexes.  A single index should dedup correctly unless the pages are not
exact duplicates but near duplicates.  The dedup process works on the URL
and a byte hash of the content.  If the content differs by even 1 byte,
the pages are not treated as duplicates.

Near-duplicate detection is another set of algorithms that haven't been
implemented in Nutch yet.  On the query site you can set the hitsPerSite
to 1 and it should limit your search results.

Dennis


Re: pages with duplicate content in search results

vishal vachhani
Dennis,
I am facing the same problem: in my crawl, the content of some URLs is
the same but the URLs are different. Could you please tell me how I can
set hitsPerSite to 1?

--Vishal

On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <[hidden email]> wrote:

> If you are using more than one index then dedup will not work across
> indexes.  A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates.  The dedup process works on url and
> byte hash.  If the content is even 1 byte different, it doesn't work.
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.  On the query site you can set the hitsPerSite to
> 1 and it should limit your search results.
>
> Dennis

Re: pages with duplicate content in search results

Dennis Kubes-2
In search.jsp lines 116-119:

   int hitsPerSite = 2;                            // max hits per site
   String hitsPerSiteString = request.getParameter("hitsPerSite");
   if (hitsPerSiteString != null)
     hitsPerSite = Integer.parseInt(hitsPerSiteString);
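
Because the value is read with request.getParameter, it can presumably also be overridden per request by appending hitsPerSite=1 to the search URL's query string (for example search.jsp?query=teraterm&hitsPerSite=1, where the "query" parameter name is an assumption), without editing the JSP. The sketch below is a hypothetical standalone helper, not part of Nutch: it mirrors the lines above but falls back to the default when the parameter is malformed, where the original would throw a NumberFormatException.

```java
// Hypothetical helper, not part of Nutch: mirrors the search.jsp logic above,
// but falls back to the default when the parameter is missing or malformed.
class HitsPerSiteParam {
    // Same default as in the quoted search.jsp lines.
    static final int DEFAULT_HITS_PER_SITE = 2;

    static int resolve(String rawValue) {
        if (rawValue == null) {
            return DEFAULT_HITS_PER_SITE;
        }
        try {
            return Integer.parseInt(rawValue.trim());
        } catch (NumberFormatException e) {
            // search.jsp as quoted would fail here; this sketch keeps the default.
            return DEFAULT_HITS_PER_SITE;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));  // parameter absent
        System.out.println(resolve("1"));   // ?hitsPerSite=1
        System.out.println(resolve("abc")); // malformed value
    }
}
```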

Hope that helps.

Dennis


Re: pages with duplicate content in search results

vishal vachhani
Thank you very much!

On Thu, Sep 25, 2008 at 9:26 PM, Dennis Kubes <[hidden email]> wrote:

> In search.jsp lines 116-119:
>
>  int hitsPerSite = 2;                            // max hits per site
>  String hitsPerSiteString = request.getParameter("hitsPerSite");
>  if (hitsPerSiteString != null)
>    hitsPerSite = Integer.parseInt(hitsPerSiteString);
>
> Hope that helps.
>
> Dennis


--
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv

RE: pages with duplicate content in search results

Edward Quick
In reply to this post by vishal vachhani



> Date: Thu, 25 Sep 2008 21:10:52 +0530
> From: [hidden email]
> To: [hidden email]
> Subject: Re: pages with duplicate content in search results
>
> Dennis,
>             I am facing same problem, in my crawl content of some urls are
> same but urls are different. Could you please tell me how I can set
> hitsPersite to 1 . ?

I changed hitsPerSite to 0 in the search.jsp (to get rid of the 'show all hits' button). It might be possible to set this in the web.xml or nutch-site.xml though?

>
> --Vishal
>
> On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <[hidden email]> wrote:
>
> > If you are using more than one index then dedup will not work across
> > indexes.  A single index should dedup correctly unless the pages are not
> > exact duplicates but near duplicates.  The dedup process works on url and
> > byte hash.  If the content is even 1 byte different, it doesn't work.


I only have one index, and have only crawled one site, which is the intranet at my work.
The pages definitely seem to be identical. I saved the source of both pages and the sizes were exactly the same too.


> >
> > Near duplicate detection is another set of algorithms that haven't been
> > implemented in Nutch yet.  On the query site you can set the hitsPerSite to
> > 1 and it should limit your search results.
> >
> > Dennis

RE: pages with duplicate content in search results

Edward Quick


> I only have one index, and have only crawled one domain site which is the Intranet at my work.
> The pages definitely seem to be identical. I saved the source from both pages and the sizes were exactly the same too.

Also, just to add to this: I checked the index with Luke, which shows the two URLs from my original post with the same titles but different timestamps, digests and boosts. :-(


Re: pages with duplicate content in search results

Andrzej Białecki-2
In reply to this post by Dennis Kubes-2
Dennis Kubes wrote:
> If you are using more than one index then dedup will not work across
> indexes.

This is incorrect. DeleteDuplicates works just fine with multiple
indexes, assuming you process all indexes in the same run of
DeleteDuplicates, so that it has a global view of all input indexes.
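
To illustrate why the global view matters, here is a minimal sketch of the idea only, not the DeleteDuplicates code itself. The Doc class, its fields, and the keep-the-highest-score rule are assumptions made for illustration. Two documents with the same digest can only be compared if both are in the same input list, whichever index they came from.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; Doc and its fields are hypothetical, not Nutch classes.
class GlobalDedupSketch {
    static final class Doc {
        final String index;   // which index the document came from
        final String url;
        final String digest;  // content signature
        final float score;

        Doc(String index, String url, String digest, float score) {
            this.index = index;
            this.url = url;
            this.digest = digest;
            this.score = score;
        }
    }

    // Keeps one document per digest (the highest-scoring one) across ALL inputs.
    static List<Doc> keepBestPerDigest(List<Doc> allDocs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc doc : allDocs) {
            Doc current = best.get(doc.digest);
            if (current == null || doc.score > current.score) {
                best.put(doc.digest, doc);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("index-a", "http://example.com/a", "digest-1", 1.0f));
        docs.add(new Doc("index-b", "http://example.com/b", "digest-1", 2.5f));
        for (Doc kept : keepBestPerDigest(docs)) {
            System.out.println(kept.index + " " + kept.url);
        }
    }
}
```

If each index were processed in its own run, both documents above would survive, since neither run would see the other copy.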

> A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates.  The dedup process works on url
> and byte hash.  If the content is even 1 byte different, it doesn't work.

This depends on the implementation of Signature. Indeed, the default
MD5HashSignature works this way.
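
As a small illustration of that exact-match behaviour, the sketch below hashes byte content with the JDK's MessageDigest. It is not the Nutch signature class; the class and method names are made up for the example. It shows that identical bytes give identical digests, while a one-character difference gives a different digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: plain JDK MD5 over raw bytes, not Nutch's signature code.
class ExactHashDemo {
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] pageA = "TeraTerm Pro is the standard Telnet tool.".getBytes(StandardCharsets.UTF_8);
        byte[] pageB = "TeraTerm Pro is the standard Telnet tool.".getBytes(StandardCharsets.UTF_8);
        byte[] pageC = "TeraTerm Pro is the standard Telnet tool!".getBytes(StandardCharsets.UTF_8);

        System.out.println(md5Hex(pageA).equals(md5Hex(pageB))); // same bytes
        System.out.println(md5Hex(pageA).equals(md5Hex(pageC))); // last character differs
    }
}
```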

>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.

Well, the existing TextProfileSignature can be used as a form of (crude)
near-duplicate detection, precisely because it is tolerant to small
changes in the input text.
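
To give a feel for how a profile-based signature can tolerate small changes, here is a deliberately simplified sketch. It is not the TextProfileSignature algorithm; the thresholds, the tokenization, and the class name are all assumptions. It hashes only the set of words that occur at least twice, so a change in a word that appears once does not alter the signature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Simplified illustration, NOT Nutch's TextProfileSignature.
class CrudeProfileSignature {
    static final int MIN_TOKEN_LENGTH = 2; // assumed threshold
    static final int MIN_COUNT = 2;        // assumed threshold

    static String signature(String text) {
        // Count lower-cased word tokens; TreeMap keeps them in sorted order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.length() >= MIN_TOKEN_LENGTH) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        // The profile is the sorted list of tokens seen at least MIN_COUNT times.
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= MIN_COUNT) {
                profile.append(entry.getKey()).append('\n');
            }
        }
        return md5Hex(profile.toString().getBytes(StandardCharsets.UTF_8));
    }

    static String md5Hex(byte[] content) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("MD5").digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "TeraTerm Pro is the standard Telnet tool. TeraTerm Pro handles printing. Updated Monday.";
        String b = "TeraTerm Pro is the standard Telnet tool. TeraTerm Pro handles printing. Updated Tuesday.";
        System.out.println(signature(a).equals(signature(b)));
    }
}
```

The crudeness shows in the edge cases: in this sketch, any two texts without repeated words share the same (empty) profile, so a real implementation needs more careful thresholds.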


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: pages with duplicate content in search results

Edward Quick


Thanks Andrzej.
How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?


Re: pages with duplicate content in search results

Andrzej Białecki-2
Edward Quick wrote:

> Thanks Andrzej.
> How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?

See the following property in your nutch-site.xml: db.signature.class.
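
A sketch of what that override might look like is below. The fully qualified class name is an assumption based on where the signature classes normally live in the Nutch source tree, so check it against nutch-default.xml in your version.

```xml
<!-- nutch-site.xml: the class name is assumed; verify it against nutch-default.xml -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```

Signatures already stored for crawled pages were presumably computed with the previous class, so the affected pages would likely need to be re-fetched or re-parsed and re-indexed before dedup sees the new values.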


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: pages with duplicate content in search results

David Jashi
In reply to this post by Dennis Kubes-2
Sorry for the off-topic question, but how do you make Nutch 0.9 search multiple indexes?

On Thu, Sep 25, 2008 at 4:42 PM, Dennis Kubes <[hidden email]> wrote:

> If you are using more than one index then dedup will not work across
> indexes.  A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates.  The dedup process works on url and
> byte hash.  If the content is even 1 byte different, it doesn't work.
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.  On the query site you can set the hitsPerSite to
> 1 and it should limit your search results.
>
> Dennis



--
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[hidden email]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
"კავკასუს ონლაინი"
+995(32)970368
[hidden email]