Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)


Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Jon Shoberg
Has anyone looked at modifying the Fetcher code to check for duplicate
content?  Not surprisingly, when query strings are allowed in the URL
there is a ton of duplicate content and re-fetching going on.

The Wiki provided a brief overview of the Fetcher and what calls are
made.  I modified the outputPage function in Fetcher.java to use a
MySQL DB to track MD5 hashes of URLs and of the content returned by
ParseText.getText().  This works "OK" and is nothing more than an
obvious hack.
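
For reference, the core of the hack, pulled out into a standalone class,
looks roughly like this (the table name, column names, and connection
details are just placeholders, and error handling / resource cleanup are
omitted; the real code is called from inside outputPage):

import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Standalone sketch of the duplicate check called from outputPage().
 * Table "content_hashes(url_md5 CHAR(32), content_md5 CHAR(32))" and the
 * JDBC settings are made up for illustration.
 */
public class DuplicateTracker {

  private final Connection conn;

  public DuplicateTracker(String jdbcUrl, String user, String pass) throws Exception {
    conn = DriverManager.getConnection(jdbcUrl, user, pass);
  }

  /** Hex-encoded MD5 of the given text. */
  static String md5(String text) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /**
   * Returns true if this content hash was seen before; otherwise records
   * the (url, content) hashes and returns false.
   */
  public boolean isDuplicate(String url, String parseText) throws Exception {
    String contentMd5 = md5(parseText);
    PreparedStatement check =
        conn.prepareStatement("SELECT 1 FROM content_hashes WHERE content_md5 = ?");
    check.setString(1, contentMd5);
    ResultSet rs = check.executeQuery();
    if (rs.next()) {
      return true;                       // already fetched this content
    }
    PreparedStatement insert =
        conn.prepareStatement("INSERT INTO content_hashes (url_md5, content_md5) VALUES (?, ?)");
    insert.setString(1, md5(url));
    insert.setString(2, contentMd5);
    insert.executeUpdate();
    return false;
  }
}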

Has anyone had significant success with modifying the Fetcher or plugins
to actively manage content duplication and fetcher performance in a
better way?

Thoughts? Ideas?

-j


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Jon Shoberg
Jon Shoberg wrote:
> Has anyone looked at modifying the Fetcher code to check for duplicate
> content?  Not surprisingly, when query strings are allowed in the URL
> there is a ton of duplicate content and re-fetching going on.
>
> Has anyone had significant success with modifying the Fetcher or plugins
> to actively manage content duplication and fetcher performance in a
> better way?

Nutch will dedup on a merge, but I am talking about managing
deduplication of content during the fetching process.

-j



Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Michael Ji
In reply to this post by Jon Shoberg
Hi Jon:

You have an interesting approach.

We are making a similar effort to avoid unnecessary indexing and data
duplication for pages whose content has not changed since the last
successful fetch.

I am thinking of adding an extra data field to the "fetchlist" data
structure that contains the content MD5 hash from the previous fetch.

If the current fetch gets the same content, I will skip the parsing and
indexing steps.
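
Roughly, the check would look something like this (just a sketch of the
idea -- the stored digest stands in for the extra fetchlist field, it is
not an actual Nutch API):

import java.security.MessageDigest;
import java.util.Arrays;

/**
 * Sketch of the proposed check only -- "previousDigest" stands in for the
 * extra MD5 field added to the fetchlist entry; this is not the real
 * Nutch FetchListEntry API.
 */
public class RefetchCheck {

  /** MD5 digest of the raw fetched content. */
  static byte[] md5(byte[] content) throws Exception {
    return MessageDigest.getInstance("MD5").digest(content);
  }

  /**
   * Returns true when the freshly fetched content hashes to the same value
   * stored after the last successful fetch, i.e. parsing and indexing can
   * be skipped for this page.
   */
  static boolean canSkip(byte[] fetchedContent, byte[] previousDigest) throws Exception {
    return previousDigest != null
        && Arrays.equals(md5(fetchedContent), previousDigest);
  }
}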

Any comments?

Michael Ji,


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Lukáš Vlček
Hi,
I need to solve a related problem. I have URLs with dynamic query strings
and I need to filter out specific parameters, because they only affect
the order of items on the result page (so from an HTML point of view the
page is not a duplicate, but from an information point of view it is).

Is there an easy way to filter specific parameters (like "&orderBy=name")
from the URL before indexing?

Lukas


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Andrzej Białecki-2
In reply to this post by Michael Ji
Michael Ji wrote:

> I am thinking of adding an extra data field to the "fetchlist" data
> structure that contains the content MD5 hash from the previous fetch.
>
> If the current fetch gets the same content, I will skip the parsing and
> indexing steps.

Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Michael Ji
Hi Andrzej:

That is exactly what I am trying to implement! I guess the patch is not
included in the new Nutch 0.7, right? At least I didn't find
"src/java/org/apache/nutch/db/FetchSchedule.java"
in the SVN source code.

I will try to apply the patch code myself and test it.

thanks,

Michael Ji,

