Problems with Fetcher threads?

Problems with Fetcher threads?

Jakob Heidebrecht
Hello,

Is there a known problem with fetching using many threads?

I injected a single URL into the DB and, in each of three test cases, ran three fetch cycles.

The first case used 1 fetcher thread; the second and third used 20 fetcher threads.

In the first case I got 102 pages,
in the second 19 pages, and
in the third 22 pages.

Everything else was the same in all three runs.

Is this a bug?
Could the server be kicking me out when I fetch from it with many threads at the
same time?

Regards,

Jakob


Re: Problems with Fetcher threads?

Doug Cutting-2
Are you crawling just a single site?  What is fetcher.threads.per.host
set to?  It is one by default, and the fetcher can only use multiple
threads effectively on a single site if fetcher.threads.per.host is
greater than one.  Otherwise these threads will conflict and fail to
fetch pages.

Doug
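
For readers following along, the setting discussed above could be overridden roughly as sketched below. This is only an illustrative fragment, not a recommended configuration: fetcher.threads.per.host is the property named in the reply, while fetcher.threads.fetch is assumed here to be the property that controls the total number of fetcher threads, and both values are made up for the example.

```xml
<!-- Fragment for nutch-site.xml; place it inside the file's root element. -->

<property>
  <!-- Property named in the reply above; stated there to default to 1. -->
  <name>fetcher.threads.per.host</name>
  <!-- Illustrative value: allow two threads on the same host at once.
       Raise it only if the target site can tolerate parallel requests. -->
  <value>2</value>
</property>

<property>
  <!-- ASSUMPTION: name of the property for the total fetcher thread count. -->
  <name>fetcher.threads.fetch</name>
  <!-- Illustrative value matching the 20 threads used in the question. -->
  <value>20</value>
</property>
```

With a single-site crawl and the per-host limit left at one, extra threads would have nothing useful to do, which matches the explanation given in the reply.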


max fetcher threads per host, buggy behaviour.

em-13
I'm experiencing a problem with max threads per host right now.

Nutch is completely ignoring the 'maximum threads per host' setting, as well
as the delay after one thread finishes with a host.

I have the version from 6/24.
The problem occurs regardless of whether I use the default settings (nothing
in nutch-site.xml regarding the fetcher) or specify fetcher
threads=20.

To reproduce:
Fetch something in several segments.
Merge several segments.
Replace in the configuration of regex-urlfilter.txt:
-[?*!@=]
with
-[*!@]
because I want to crawl all the forums in my target sites.

Delete the database, and recreate it again. (updatedb)
Start fetching again.

At this point I can see 20 URLs on the same host being fetched, and a bunch
of errors because the target sites cannot serve me 20 pages per 10
seconds.

Is this because I removed the default "?=" from the exclusion rule, or
something else? Any idea how to fetch a maximum of 1 page per host per
fetching run?

I partially solved the problem by splitting the fetching workload into 20
segments and fetching with 3-5 threads per segment, but this isn't a nice
solution, as I have to micro-manage all the fetch segments and merge them
afterward.

E.
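
To make the filter edit described in the message above concrete, here is a small Python sketch of what the two character classes accept and reject. It assumes that an exclusion rule rejects a URL when its pattern matches anywhere in the URL; the helper name and the example URLs are hypothetical, and the sketch says nothing about how the fetcher schedules threads.

```python
import re

# Character classes copied from the two regex-urlfilter.txt rules in the post.
DEFAULT_SKIP = re.compile(r"[?*!@=]")  # default rule: -[?*!@=]
RELAXED_SKIP = re.compile(r"[*!@]")    # edited rule:  -[*!@]


def is_skipped(rule, url):
    """Return True if the exclusion pattern matches anywhere in the URL.

    ASSUMPTION: rules are applied as unanchored searches.
    """
    return rule.search(url) is not None


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/index.html",
    "http://example.com/forum/viewtopic.php?t=42",
    "http://example.com/user@home",
]

for url in urls:
    print(url, is_skipped(DEFAULT_SKIP, url), is_skipped(RELAXED_SKIP, url))
```

Under that assumption, the edited rule lets query-string URLs such as forum topic pages through, so a crawl of a forum site would likely queue many more URLs for the same host than before.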

-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Thursday, July 07, 2005 2:25 PM
To: [hidden email]
Subject: Re: Problems with Fetcher threads?

Are you just crawling a single site?  Just one?  What is
fetcher.threads.per.host?   It is one by default, but only if
fetcher.threads.per.host is greater than one will the fetcher be able to
effectively use multiple threads to crawl a single site.  Otherwise
these threads will conflict and fail to fetch pages.

Doug
