db_unfetched large number, but crawling not fetching any longer


db_unfetched large number, but crawling not fetching any longer

webdev1977
I was under the impression that setting topN for the crawl cycles would limit the number of items each iteration of the crawl would fetch/parse, but that by continuously running crawl cycles it would eventually get ALL the URLs. My continuous crawl has stopped fetching/parsing, and the stats from the crawldb indicate that db_unfetched is 133,359.

Why is it no longer fetching URLs if there are so many unfetched?

Re: db_unfetched large number, but crawling not fetching any longer

Sebastian Nagel
Could you explain what is meant by "continuously running crawl cycles"?

Usually, you run a crawl with a certain "depth", i.e. a maximum number of cycles.
If the depth is reached, the crawler stops even if there are still unfetched
URLs. If the generator produces an empty fetch list in one cycle, the crawler
stops before the depth is reached. The reasons for an empty fetch list may be:
  - no more unfetched URLs (trivial, but not in your case)
  - recent temporary failures: after a temporary failure (network timeout, etc.)
    a URL is blocked for one day.

Does one of these suggestions answer your question?
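
A quick way to check which case applies (a sketch, assuming the crawldb lives
at crawl/crawldb as in a default setup) is to look at the status breakdown
reported by readdb:

  # print per-status counts (db_unfetched, db_fetched, db_gone, retry counts, ...)
  bin/nutch readdb crawl/crawldb -stats

  # inspect the record of a single URL (status, fetch time, retries since fetch)
  bin/nutch readdb crawl/crawldb -url <some-url>

If many entries show a non-zero retry count (or, for a single URL, a fetch time
in the future), the temporary-failure case above is the likely explanation.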

Sebastian



Re: db_unfetched large number, but crawling not fetching any longer

webdev1977
I guess I STILL don't understand the topN setting.  Here is what I thought it would do:

Seed: file:////myfileserver.com/share1

share1 Dir listing:
file1.pdf ... file300.pdf, dir1 ... dir20

running the following in a never-ending shell script ($SOLR_URL is my Solr endpoint):

  while true; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex $SOLR_URL crawl/crawldb crawl/linkdb $SEGMENT
    bin/nutch solrdedup $SOLR_URL
  done

On the first iteration it would fetch the top 1000 scoring URLs. After this first iteration it would have 1000 URLs in the crawldb, and the next iteration would choose the next 1000 top-scoring URLs, and so on and so forth.

That means that eventually it would crawl ALL of the URLs. I am running this script and I see that db_fetched, db_unfetched, and the total number of URLs are all growing, but I am not seeing any new content being sent to Solr. I'm not sure what is going on here.
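
One way to narrow down where it stops (a sketch, assuming the crawl/segments
layout from the script above) is to check whether the newest segments actually
contain fetched and parsed content before the Solr steps run:

  # list all segments with their generated/fetched/parsed record counts
  bin/nutch readseg -list -dir crawl/segments

If the recent segments show zero fetched/parsed records, the problem is on the
generate/fetch side rather than in solrindex.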






Re: db_unfetched large number, but crawling not fetching any longer

webdev1977
I think I may have figured it out.. but I don't know how to fix it :-(

I have many PDFs and HTML files that have relative links in them. They are not from the originally hosted site, but are re-hosted. Nutch/Tika is trying to resolve the relative URLs it encounters by prepending the URL of the page that contained the link.

So if the first page was: http://mysite.com/web/myapp?id=12345
and that is an HTML file containing this link:

  <a href="link_to_new_place.htm">mylink</a>

it is producing this:

  http://mysite.com/web/myapp?id=12345link_to_new_place.htm

It is getting into the crawldb this way, but it is obviously not a valid URL. So my crawldb looks like it has 1,000,000 records, even though there should only be about 300,000.

Is there any way to stop this behavior?
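
One way to confirm where the bad URLs come from (a sketch; the URL below is just
the example from this post) is to run the parser checker on one of the affected
pages and look at the outlinks it reports:

  # fetch and parse a single page, printing status, metadata and outlinks
  bin/nutch parsechecker "http://mysite.com/web/myapp?id=12345"

If the outlinks come out with the relative path glued onto the query string, the
problem is in the parse/outlink resolution step rather than in the crawldb itself.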

Re: db_unfetched large number, but crawling not fetching any longer

remi tassing
I'm not sure I totally understand what you mean.

1. If you know exactly what the relative URLs are translated into, you can use
the regex URL normalizer (urlnormalizer-regex) to rewrite them into something that makes more 'sense'.
2. If you don't want those relative links to be included at all, you can use
the regex URL filter (urlfilter-regex) to block Nutch from crawling them.

Would that help?
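
For the second option, a minimal sketch of a rule for conf/regex-urlfilter.txt
(the exact pattern is an assumption based on the example URL in the previous
post, so adjust it to whatever actually shows up in the crawldb):

  # reject URLs where a relative link got appended directly onto the query string
  -myapp\?id=[0-9]+[a-zA-Z]

  # keep the default catch-all accept rule at the end of the file
  +.

Rules are applied top to bottom and the first match wins, so the reject line has
to come before the final accept-everything rule.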

Remi
