Seeking help in understanding – fetch, refetch & co.

Seeking help in understanding – fetch, refetch & co.

Daniel D.
Hello,

 I'm trying to understand how to start with an initial set of URLs and
then continue fetching newly discovered URLs and re-fetching existing
URLs (when they are due for a re-fetch).

I have run some tests in order to understand the software's behavior.
Now I have some questions for you and am seeking your help.

 
   1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
   but I noticed that the fetchInterval field of the Page object is set
   to current time + 7 days while the URL data is being read from the
   fetchlist. Can somebody explain why, or am I not reading the code
   correctly?
   2. I have modified the code to ignore the fetchInterval value coming
   from the fetchlist, so that fetchInterval stays equal to its initial
   value, the current time. After I run the following commands: fetch, db
   update, and generate db segments (the exact cycle is sketched after
   this list), I get a new fetchlist, but it doesn't include my original
   sites, even though their next fetch time should already be in the
   past. Can somebody help me understand when those URLs will be fetched?
   3. It looks like the fetcher fails to extract links from
   http://www.eltweb.com. I know that some formats (apparently including
   some HTML variations) are not supported. Where can I find information
   on what is currently supported?
   4. Some of the outlinks discovered during the fetch (for instance
   http://www.webct.com/software/viewpage?name=software_campus_edition or
   http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
   included in the next fetchlist after executing the [generate db
   segments] command). Is there a known reason for this? Is there
   documentation describing supported URL types?
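
For reference, here is the cycle I have been running (the db and segments
paths are from my local test setup, so treat them as illustrative):

   bin/nutch generate db segments    # write a new fetchlist under segments/
   s=`ls -d segments/2* | tail -1`   # pick the newest segment directory
   bin/nutch fetch $s                # fetch the pages in that fetchlist
   bin/nutch updatedb db $s          # fold the results back into the WebDB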

  I'm still new to this software; I have tried to explain what I did and
hope it was clear enough, but I'm not sure I have asked the right
questions.

 Thanks,

Daniel

Re: Seeking help in understanding – fetch, refetch & co.

Andrzej Białecki
Daniel D. wrote:

> Hello,
>
>  I'm trying to understand how to start with an initial set of URLs and
> then continue fetching newly discovered URLs and re-fetching existing
> URLs (when they are due for a re-fetch).
>
> I have run some tests in order to understand the software's behavior.
> Now I have some questions for you and am seeking your help.
>
>    1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
>    but I noticed that the fetchInterval field of the Page object is set
>    to current time + 7 days while the URL data is being read from the
>    fetchlist. Can somebody explain why, or am I not reading the code
>    correctly?

Yes. This is required so that you can generate many fetchlists in rapid
succession (for parallel crawling) without getting the same pages into
many fetchlists. It is, in effect, a flag saying "this page is already
in another fetchlist; wait 7 days before attempting to put it into
another fetchlist".

After you have fetched a segment and updated the db, this time is reset
to the fetchTime + fetchInterval.
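
Schematically, something like this (a minimal sketch with made-up names;
the real Page/FetchListTool code differs in detail):

   class RefetchSketch {
       static final long DAY_MS = 24L * 60 * 60 * 1000;

       // FetchListTool: when a page is emitted into a fetchlist
       static long lock(long now) {
           return now + 7 * DAY_MS;   // keeps the page out of other fetchlists
       }

       // db update: after the segment containing the page was fetched
       static long reschedule(long fetchTime, long fetchIntervalMs) {
           return fetchTime + fetchIntervalMs;
       }
   }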

>    2. I have modified the code to ignore the fetchInterval value coming
>    from the fetchlist, so that fetchInterval stays equal to its initial
>    value, the current time. After I run the following commands: fetch,
>    db update, and generate db segments, I get a new fetchlist, but it
>    doesn't include my original sites, even though their next fetch time
>    should already be in the past. Can somebody help me understand when
>    those URLs will be fetched?

It's difficult to say what changes you have made... I suggest sticking
with the current code until it makes more sense to you... ;-)

>    3. It looks like the fetcher fails to extract links from
>    http://www.eltweb.com. I know that some formats (apparently including
>    some HTML variations) are not supported. Where can I find information
>    on what is currently supported?

This site has a content redirect (using an HTML meta tag) on its home
page, and no other content. This was not supported in Nutch 0.6; you
need to get the latest SVN version in order to crawl such sites.
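
(Concretely, the home page consists of little more than a tag of the form

   <meta http-equiv="refresh" content="0; url=...">

in its <head>; the redirect target itself is not the point here.)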

>    4. Some of the outlinks discovered during the fetch (for instance
>    http://www.webct.com/software/viewpage?name=software_campus_edition or
>    http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
>    included in the next fetchlist after executing the [generate db
>    segments] command). Is there a known reason for this? Is there
>    documentation describing supported URL types?

Outlinks discovered during a crawl are added to the WebDB if (and only
if) they pass the urlfilter-regex; pages that pass the filter are then
included in the next fetchlist. This is normal.
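
Note that both of your example URLs contain a '?'. If you are using the
default url filter rules, they contain a line that skips probable CGI
query URLs, something like:

   # skip URLs containing characters used in CGI queries
   -[?*!@=]

Comment that line out (or add a more specific '+' rule above it) if you
want such URLs to be crawled.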

>
>   I'm still new to this software; I have tried to explain what I did and
> hope it was clear enough, but I'm not sure I have asked the right
> questions.

Some of this information is explained in more detail on the Nutch Wiki.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Seeking help in understanding – fetch, refetch & co.

Daniel D.
Andrzej,

Thanks a lot for your response.

> >    1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
> >    but I noticed that the fetchInterval field of the Page object is set
> >    to current time + 7 days while the URL data is being read from the
> >    fetchlist. Can somebody explain why, or am I not reading the code
> >    correctly?
>
> Yes. This is required so that you can generate many fetchlists in rapid
> succession (for parallel crawling) without getting the same pages into
> many fetchlists. It is, in effect, a flag saying "this page is already
> in another fetchlist; wait 7 days before attempting to put it into
> another fetchlist".
>
> After you have fetched a segment and updated the db, this time is reset
> to the fetchTime + fetchInterval.
>
> >    2. I have modified the code to ignore the fetchInterval value coming
> >    from the fetchlist, so that fetchInterval stays equal to its initial
> >    value, the current time. After I run the following commands: fetch,
> >    db update, and generate db segments, I get a new fetchlist, but it
> >    doesn't include my original sites, even though their next fetch time
> >    should already be in the past. Can somebody help me understand when
> >    those URLs will be fetched?
>
> It's difficult to say what changes you have made... I suggest sticking
> with the current code until it makes more sense to you... ;-)

  My objective is to learn how to crawl the Web with Nutch: start with an
initial set of URLs and keep discovering new pages and re-fetching
existing ones (when needed).

This modification was done for testing purposes only. I commented out the
assignment of a new value to fetchInterval in Page.readFields(). Note
that I don't have the code on this computer and am giving the function
name from memory.

My assumption was that, having crawled my 3 original URLs and discovered
some new URLs, I should next time see in the fetchlist my 3 original
URLs plus the new URLs (subject to the specified urlfilter-regex). I
wanted to see my original URLs re-crawled! I didn't find them in the new
fetchlist, hence my question: what am I missing here? Why are those URLs
not included in the fetchlist even though their fetch time has already
passed?

> >    3. It looks like the fetcher fails to extract links from
> >    http://www.eltweb.com. I know that some formats (apparently
> >    including some HTML variations) are not supported. Where can I find
> >    information on what is currently supported?
>
> This site has a content redirect (using an HTML meta tag) on its home
> page, and no other content. This was not supported in Nutch 0.6; you
> need to get the latest SVN version in order to crawl such sites.

  Thanks, I will get the newer version.

> >    4. Some of the outlinks discovered during the fetch (for instance
> >    http://www.webct.com/software/viewpage?name=software_campus_edition or
> >    http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
> >    included in the next fetchlist after executing the [generate db
> >    segments] command). Is there a known reason for this? Is there
> >    documentation describing supported URL types?
>
> Outlinks discovered during a crawl are added to the WebDB if (and only
> if) they pass the urlfilter-regex; pages that pass the filter are then
> included in the next fetchlist. This is normal.

 Thanks again, I will learn more about urlfilter-regex.

Regards,

Daniel

Re: Seeking help in understanding – fetch, refetch & co.

Andrzej Białecki
Daniel D. wrote:
> My assumption was that, having crawled my 3 original URLs and discovered
> some new URLs, I should next time see in the fetchlist my 3 original
> URLs plus the new URLs (subject to the specified urlfilter-regex). I
> wanted to see my original URLs re-crawled! I didn't find them in the new
> fetchlist, hence my question: what am I missing here? Why are those URLs
> not included in the fetchlist even though their fetch time has already
> passed?

Okay, you had the default fetch interval set to 1 day, right? You need
to check this, e.g. by dumping the DB (nutch readdb db -dumppageurl).

The next re-fetch is then due after 1 day. If you generate fetchlists
before that, only newly discovered pages will be added to them, because
your original pages are not yet due for re-fetching.
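
As a side note, rather than editing nutch-default.xml directly, it is
cleaner to put overrides in conf/nutch-site.xml, which is read after
nutch-default.xml; for your setting that would be something like:

   <property>
     <name>db.default.fetch.interval</name>
     <value>1</value>
   </property>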


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Seeking help in understanding – fetch, refetch & co.

Daniel D.
Hi Andrzej,

I was looking in the wrong place. I had modified the code to ignore the
fetchInterval value coming from the fetchlist, and I didn't realize until
now that URLs that are not yet due are simply not included in the
fetchlist. The code in FetchListTool.emitFetchList() is very easy to
follow and understand, and now it's clear to me. Thanks for your help.

 One more question:

 Where can I find information regarding the memory and disk usage of the
WebDB and the CPU usage of bin/nutch updatedb? I'm looking for something
like: for 1,000,000 documents the WebDB will take approximately XX GB,
and running bin/nutch updatedb on 1,000,000 documents will use up to XX
MB of RAM.

 Thanks,

Daniel