New to nutch, seem to be problems


misc
Hello-

    My configuration and stats are at the end of this email.  I have set
up Nutch to crawl 100,000 URLs.  The first pass (of 100,000 items) went
well, but problems started after this.

    1. Generate takes many hours to complete.  It doesn't matter whether I
generate 1 million or 1,000 items; it takes about 5 hours either way.  Is
this normal?

    2. Fetch works great until it is done, and then it freezes up
indefinitely.  It can fetch 1,000,000 pages in about 12 hours, and all the
fetched content is in /tmp, but then it just sits there, never returning
to the command line.  I let it sit for about 12 hours and eventually broke
down and cancelled it.  If I then try to update the database, it of course
fails.

    3. Fetch2 runs very slowly.  Even though I am using 80 threads, I only
download one object every few seconds (one every 5 or 10 seconds).  From
the log, I can see that almost always 79 or 80 threads are spinWaiting.

    4. I can't tell whether fetch2 freezes like fetch does, as I haven't
been able to wait the many days it would take to go through a full fetch
with fetch2.

Configuration:

    Core Duo 2.4 GHz, 1 GB RAM, 750 GB hard drive.

    The machine has a dedicated 1 Gb Ethernet connection to the web, so
that certainly isn't the problem.

    I have tested on Nutch 0.9 and the newest daily build, from 2007-08-28.

    I seeded with 100,000 URLs from the Open Directory.  I first ran a pass
to load all 100,000, then generated with topN = 1,000,000 (10 times larger
than the first set of URLs).  The first pass had no problem; the second
pass (and beyond) is where the problems began.  The cycle I am running is
sketched below.
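
    Roughly, following the whole-web crawling tutorial (the directory names
are from my own setup):

    # seed the crawldb with the Open Directory URLs (the first pass)
    bin/nutch inject crawl/crawldb urls
    # select the next batch (the second pass used -topN 1000000)
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
    s=`ls -d crawl/segments/* | tail -1`    # the newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s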


Re: New to nutch, seem to be problems

misc

Hello-

    I will reply to my own post with new findings and observations.
Regarding the slowness of generate: I just don't believe that it should
take many hours to generate a list of any size from a database that is
only a couple million entries large.  I could do the equivalent on
plain-text lists using grep, sort, and uniq in just minutes.  I *must* be
doing something wrong.
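
    For comparison, assuming a flat text dump with one URL plus status and
score per line (a made-up format, just to illustrate the point), something
like this finishes in minutes on a few million lines:

    # drop already-fetched URLs, sort by score, keep the top million
    grep -v " fetched " crawldb_dump.txt | sort -rn -k3 | head -n 1000000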

    I dug into it today.  Could someone correct me if I am wrong on any of
this?  I couldn't find any written information about this anywhere.

    1. Generate seems to be broken into three phases, each a separate
MapReduce job.  The first phase runs through all the URLs in the crawldb
and throws out any that aren't eligible for crawling (by crawl date).

    2. The second phase partitions by hostname and ranks according to
frequency.  It also cuts out repeat requests to a host if the number is
too high (set by a parameter), and then sorts the URLs by frequency.

    3. The third phase updates the database to record that each URL is
being crawled and should not be handed out to anyone else.
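
    In command terms, all three phases appear to run inside the single
invocation below.  As far as I can tell, the per-host cutoff in phase 2 is
the generate.max.per.host property (that name is my reading of
nutch-default.xml, so correct me if it is wrong):

    # phases 1-3 all happen inside this one command
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
    # per-host cutoff: generate.max.per.host in nutch-default.xml
    # (default -1 = unlimited), overridable in conf/nutch-site.xml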

    By observing what was going on, I could see that the first phase seems
to take a couple of hours.  I can change Nutch's log level to DEBUG and
watch the rejected URLs being logged, and it does look slow: a couple per
second.  My db has about 200k crawled entries and about 2,000,000
uncrawled, so about 1 in 10 URLs should be rejected, which puts the
overall scan rate at only about 20 URLs per second.  That is way too slow.
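
    In case anyone wants to reproduce this, I turned the logging up roughly
as follows (the logger name is my guess from the package layout):

    # append to conf/log4j.properties, then re-run generate
    echo "log4j.logger.org.apache.nutch.crawl=DEBUG" >> conf/log4j.properties
    # watch the rejected URLs scroll by in the log
    tail -f logs/hadoop.log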

    I also looked to see whether DNS lookups were slowing me down, but as
far as I can tell they are not: first, the first phase doesn't even do
DNS, yet it is slow; and second, I used Wireshark to look for DNS lookups
and found none.

    Can someone tell me the expected time for generate to run?  6 hours is
too long!

                        thanks
                            -J



Re: New to nutch, seem to be problems

Tranquil
Hi,

You might want to use the broken-down method for whole-web search (see the
wiki for the Nutch tutorial, where the crawl method is explained).  It is
broken down into several separate commands.

Another thing, regarding DNS: in /etc/nscd.conf, change the line
"enable-cache  hosts  yes" to "no", as sketched below.  That way the nscd
daemon won't cache name resolution and you'll get more DNS resolving
working.

You can also try to minimize the bytes you are downloading from each site
(nutch-site.xml & nutch-default.xml).
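
Roughly like this (the exact whitespace in nscd.conf varies by
distribution, and 16384 is just an example value):

    # /etc/nscd.conf: turn the hosts cache off, then restart nscd
    sed -i 's/^\([[:space:]]*enable-cache[[:space:]]\+hosts[[:space:]]\+\)yes/\1no/' /etc/nscd.conf
    /etc/init.d/nscd restart    # restart command varies by distro

    # to cap per-page bytes, override http.content.limit (default 65536 in
    # nutch-default.xml) with a <property> entry in conf/nutch-site.xml:
    #   <name>http.content.limit</name>  <value>16384</value>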

Hope this helps.

Eyal


Re: New to nutch, seem to be problems

misc

Hello-

    Isn't it obvious from my post that I *am* already breaking my crawl
into separate commands?  That is why I can say things like "generate takes
many hours to complete."  Also, as I mentioned in my last posting, I am
almost sure that the slowness of generate isn't caused by DNS, because
Wireshark showed me that no DNS lookups are taking place.  Are you sure
you responded to the correct posting?

    Also, with all due respect, I am a little frustrated that after hours
of googling and three days of posting, no one can just tell me the simple
answer to the question "how long should I expect generate to take on a
database of 1-2 million items?"  10 minutes?  1 hour?  6 hours?  A day?
Certainly someone has done it, and if the answer really is 6 hours, then I
can stop spending so much time digging in the code trying to figure out
what is wrong.  I don't really think it is, though, considering that I
could process similar data with grep, sort, and uniq in a few minutes.

                        thanks
                            -J



Re: New to nutch, seem to be problems

misc
In reply to this post by misc

Hello-

    One more important piece of data about the problems I am having.
After waiting a really long time, I learned that fetch was not hung; it
was just really slow.  It took only a few hours to go through all the URLs
(the corresponding lines for each URL appear in hadoop.log, and all the
content was loaded), but then it took 24 more hours of waiting before the
phrase "fetcher done" appeared and fetch returned.  Why would fetch hang
after the fetching was done, before returning?

    Looking at the code it would seem that some of the fetcher threads must
be stuck for a long time.  Don't these time out?
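
    For reference, the knobs I am looking at now (property names are from
nutch-default.xml and can be overridden in nutch-site.xml; whether they
explain the 24-hour wait is exactly my question):

    # per-request network timeout, default 10000 ms
    grep -A 2 'http.timeout' conf/nutch-default.xml
    # number of fetcher threads, default 10
    grep -A 2 'fetcher.threads.fetch' conf/nutch-default.xml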

                        thanks

