.8 svn - fetcher performance..


.8 svn - fetcher performance..

Byron Miller-2
Anything I should change/tweak in my fetcher config
for the 0.8 release? I'm only getting 5 pages/sec, and I was
getting nearly 50 on 0.7 with 125 threads going.  Does
0.8 not use threads like 0.7 did?

I believe I'm just using the standard protocol-http
support, not http-client.

Re: .8 svn - fetcher performance..

Doug Cutting
Byron Miller wrote:
> Anything i should change/tweak on my fetcher config
> for .8 release? i'm only getting 5 pages/sec and i was
> getting nearly 50 on .7 with 125 threads going.  Does
> .8 not use threads like 7 did?

Byron,

Have you tried again more recently?  A number of bugs have been fixed in
0.8 in the past few weeks.  I think it is now much more stable.

Doug

Re: .8 svn - fetcher performance..

Doug Cook
In reply to this post by Byron Miller-2
Byron,

Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm running into a similar problem.

Re: .8 svn - fetcher performance..

kkrugler
Hi Doug,

>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>running into a similar problem.

We wound up dramatically increasing the number of threads, which
seemed to help solve the bandwidth utilization problem. With Nutch
0.7 we were running about 200 threads per crawler, and with Nutch 0.8
it's more like 2000+ threads...though you have to reduce the thread
stack size in this type of configuration.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
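A minimal sketch of the kind of override Ken describes, assuming the standard nutch-site.xml override mechanism and the 0.8 property name fetcher.threads.fetch (the value here is illustrative, not a recommendation; check nutch-default.xml in your build for the exact property names):

```xml
<!-- nutch-site.xml: override for a high-thread-count fetch.
     Property name assumed from nutch-default.xml; value is illustrative. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>2000</value>
    <description>Number of fetcher threads. Raising this well beyond the
    default only works if the per-thread stack is shrunk (e.g. ulimit -s
    and -Xss), so the JVM can host this many threads.</description>
  </property>
</configuration>
```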

Re: .8 svn - fetcher performance..

Sami Siren-2
Ken Krugler wrote:

> Hi Doug,
>
>> Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>> running into a similar problem.
>
>
> We wound up dramatically increasing the number of threads, which
> seemed to help solve the bandwidth utilization problem. With Nutch 0.7
> we were running about 200 threads per crawler, and with Nutch 0.8 it's
> more like 2000+ threads...though you have to reduce the thread stack
> size in this type of configuration.
>
The fetchlist seems to be sorted by URL. This leads to many threads being
blocked when the crawler is configured with a low number of threads
per host (default 1) and there are several URLs from the same host in the
fetchlist.

This could perhaps be improved by sorting by some other key?

--
 Sami Siren




Re: .8 svn - fetcher performance..

Thomas Delnoij-3
+1 for a solution to this pressing issue!

I am seeing the same problem, in my case two symptoms:

1) low fetch speeds
2) crawls end "before their time" with "aborting with xxx hung
threads" error message

I am doing a focused crawl on about 70,000 domains.
crawl.ignore.external.links is set to true.

In previous discussions on the list these issues have mainly been
attributed to crawls on such a limited set of domains.

Let me see if I understand this correctly: FetchLists are host-wise disjoint,
thus all URLs from the same domain are in the same FetchList. Folks
*not* on MapReduce are by definition always working with one Fetcher;
otherwise there could be many, in which case this mechanism prevents the
politeness rules from being disobeyed.

Could somebody confirm these assumptions are correct?

I have tried to work around the issues by changing the configuration.
I tried increasing fetcher.threads.fetch, http.timeout and
http.max.delays.

I also changed generate.max.per.host setting, following Doug's advice
of setting this value to TopN / Fetcher Threads, all to no lasting
avail.

So far, I haven't tried increasing the fetcher.threads.per.host to
more than 4 with 100 threads, though. I will do that now.

I really think we should gather some more data regarding fetch speed
problems. Maybe some of you who are seeing decent fetch speeds in a
focused crawl setup could share some of your tips for tuning the
installation.

Thanks a lot for your time if you read this far :)

Rgrds, Thomas Delnoij
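For reference, the settings Thomas mentions all live in nutch-site.xml. A sketch of one combination follows; the property names are assumed from nutch-default.xml, and every value here is purely illustrative, not a recommendation:

```xml
<!-- nutch-site.xml: illustrative values for the settings discussed above -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>4</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value> <!-- Doug's rule of thumb: roughly topN / fetcher threads -->
  </property>
  <property>
    <name>http.timeout</name>
    <value>10000</value> <!-- milliseconds -->
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100</value>
  </property>
</configuration>
```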




On 6/28/06, Sami Siren <[hidden email]> wrote:
> [...]

Re: .8 svn - fetcher performance..

Sami Siren-2

>> Fetchlist seems to be sorted by url. This leads to many threads being
>
OK, this isn't true, because it is sorted using HashComparator. For some
reason the generated list contains some parts which are more or less sorted
by host, and some parts that look more "random".

--
 Sami Siren


Re: .8 svn - fetcher performance..

Thomas Delnoij-3
> Ok, this isn't true, because it is sorted using HashComparator. For some
> reason the generated list contains some parts which are more or less
> sorted by host, and some parts that look more "random".

This is consistent with what I am seeing: the Fetcher slowing down for
a while, sometimes coming to a virtual halt (a lot of repeated fetch
speed lines, for instance "412052 pages, 95915 errors, 3.0
pages/s, 638 kb/s,"), then the Fetcher speeding up again and fetching
at acceptable speeds.

Rgrds, Thomas

deleting URL duplicates - never actually deleted?

Honda-Search Administrator
Maybe someone can explain to me how this works.

First, my setup.

I create a fetchlist each night with FreeFetchlistTool and fetch those
pages.  It often contains the same URLs that are already in the database,
but this tool gets the newest copies of those URLs.

I also run nutch dedup after everything is fetched, indexed, etc.  I then
merge the segments using the following command:

ls -d $segments_dir/* | xargs bin/nutch merge $index_dir

Every night the number of "duplicates" increases.  This is because the
duplicates from the day before are not actually deleted (I assume).

Is dedup removing them from some sort of master index and the segments
retain their original information?

If so, is there a way to merge the segments into one (or whatever) so that
duplicate URLs do not exist?  Would mergesegs do this?

Thanks for any help, and I hope my question is clear.

Matt


Re: deleting URL duplicates - never actually deleted?

Marko Bauhardt-2

Do you delete the duplicates before you merge the index? Run the
merge command first and then the dedup command.

But a better way is to create one index of all segments with the
index command and then run the dedup command on this one index.

Hope this helps,
Marko


On 29.06.2006 at 23:07, Honda-Search Administrator wrote:

> [...]


Re: deleting URL duplicates - never actually deleted?

Honda-Search Administrator
Marko,

Currently the shell command is as follows:

---
# index new segment
bin/nutch index $s1

# update the database
bin/nutch updatedb crawl/db $s1

# De-duplicate indexes
# "bogus" argument is ignored but needed due to
# a bug in the number of args expected
bin/nutch dedup crawl/segments bogus

# Merge indexes
ls -d crawl/segments/* | xargs bin/nutch merge crawl/index
---

Should I actually switch the last two commands around?

Matt

----- Original Message -----
From: "Marko Bauhardt" <[hidden email]>
To: <[hidden email]>
Sent: Friday, June 30, 2006 2:57 AM
Subject: Re: deleting URL duplicates - never actually deleted?


> [...]

Re: deleting URL duplicates - never actually deleted?

Marko Bauhardt-2
>
> # De-duplicate indexes
> # "bogus" argument is ignored but needed due to
> # a bug in the number of args expected
> bin/nutch dedup crawl/segments bogus
>

The dedup command works only on indexes, not on one or many
segments. The directory structure of an index looks like:
index/part-00000/SOME_LUCENE_FILES

Here is an example of the directory structure of a crawl:
crawl/segments/20060702232437
crawl/segments/20060702233040
crawl/linkdb
crawl/indexes  // this is the index of the two segments

Now you can run dedup: bin/nutch dedup crawl/indexes

If you run dedup on a folder which contains segments, an exception
should be thrown. Look at your logfiles and verify that the dedup
process runs without exceptions.

Marko
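Putting Marko's advice together, the nightly script quoted above would become something like the sketch below. The argument order is assumed from the 0.8 command-line tools and the crawl/ layout in his example; verify it against your build before relying on it:

```shell
# Sketch of the corrected nightly sequence (assumed 0.8 tool syntax):
# 1. index all segments into one indexes/ directory
bin/nutch index crawl/indexes crawl/db crawl/linkdb crawl/segments/*

# 2. dedup operates on indexes (part-NNNNN dirs), never on segments
bin/nutch dedup crawl/indexes

# 3. only then merge the de-duplicated indexes into the final index
bin/nutch merge crawl/index crawl/indexes
```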


Re: .8 svn - fetcher performance..

Zaheed Haque
In reply to this post by Doug Cook
On 6/28/06, Ken Krugler <[hidden email]> wrote:

> Hi Doug,
>
> >Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
> >running into a similar problem.
>
> We wound up dramatically increasing the number of threads, which
> seemed to help solve the bandwidth utilization problem. With Nutch
> 0.7 we were running about 200 threads per crawler, and with Nutch 0.8
> it's more like 2000+ threads...though you have to reduce the thread
> stack size in this type of configuration.

Hi Ken

Could you please give me some clue regarding the stack size at which you are
seeing the best bandwidth utilization? I have the following:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
max rt priority                 (-r) unlimited
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

What stack size should I play with? The default seems to be 8192 KB.
Also, are there any other parameters I should tweak? I often get a "too
many open files" problem, and I could never use my full bandwidth; I am
using about 10% of it. I have played around with ulimit -n "very high
number", which solves the "too many open files" error, but it's still not
utilizing all my bandwidth. Any help will be very much appreciated.

Thanks
Zaheed


> -- Ken
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "Find Code, Find Answers"
>

Re: .8 svn - fetcher performance..

kkrugler
>On 6/28/06, Ken Krugler <[hidden email]> wrote:
>>Hi Doug,
>>
>>>Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>>>running into a similar problem.
>>
>>We wound up dramatically increasing the number of threads, which
>>seemed to help solve the bandwidth utilization problem. With Nutch
>>0.7 we were running about 200 threads per crawler, and with Nutch 0.8
>>it's more like 2000+ threads...though you have to reduce the thread
>>stack size in this type of configuration.
>
>Hi Ken
>
>Could you please give me some clue regarding the stack size you are
>seeing the best bandwidth utilization...

Note that stack size twiddling is only done to allow for increasing
the number of fetcher threads without running out of JVM or OS memory.

>  I have the following
>
>core file size          (blocks, -c) 0
>data seg size           (kbytes, -d) unlimited
>max nice                        (-e) 20
>file size               (blocks, -f) unlimited
>pending signals                 (-i) unlimited
>max locked memory       (kbytes, -l) unlimited
>max memory size         (kbytes, -m) unlimited
>open files                      (-n) 1024
>pipe size            (512 bytes, -p) 8
>POSIX message queues     (bytes, -q) unlimited
>max rt priority                 (-r) unlimited
>stack size              (kbytes, -s) 8192
>cpu time               (seconds, -t) unlimited
>max user processes              (-u) unlimited
>virtual memory          (kbytes, -v) unlimited
>file locks                      (-x) unlimited
>
>What stack size should I play with the default seems to be 8192kb ?

We use something like ulimit -s 512 to set a 512K stack size at the OS level.

>also any onther parameters I should tweak?

We specify -Xss512K when running the fetch map-reduce task to set the
stack size in the JVM. But I don't remember off the top of my head
which of the many different config files this gets set in. Stefan?
>
>I often get too many open
>files problem

That's a separate issue.

>and I never could use my full bandwidth.. I am using
>about 10% of my bandwidth. I have played around with ulimit -n "very
>high number" which solves the "too many open files" but its not
>utilizing all my bandwidth, any help will be very much appreciated.

Try increasing the number of fetcher threads and reducing the stack
size. With 10 high-end servers in a cluster, we were able to max out
a 100 Mbps connection for brief periods, though as our crawl converged
(because it's a vertical crawl) the max rate eventually dropped to
about 50 Mbps.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
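The arithmetic behind Ken's advice, as a quick sketch: stack space is reserved per thread, so the achievable thread count is roughly bounded by available address space divided by stack size. The thread counts below are from this thread; the totals are illustrative back-of-the-envelope figures:

```shell
# Virtual address space reserved for thread stacks, in MB:
#   threads * stack_size_kb / 1024

# 2000 threads at the default 8192 KB stack: ~16 GB reserved
echo $(( 2000 * 8192 / 1024 ))   # 16000

# 2000 threads at a 512 KB stack (ulimit -s 512 / -Xss512k): ~1 GB
echo $(( 2000 * 512 / 1024 ))    # 1000
```

This is why the 0.7-era default stack makes 2000+ threads impractical, while a reduced stack brings the reservation back within reach of a 32-bit JVM.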

Re: .8 svn - fetcher performance..

Zaheed Haque
In reply to this post by Zaheed Haque
Ken:

Thank you very much for the info. I applied it to my testing environment
and I could see big changes in my bandwidth utilization. I have tried
it on a simple server and I could get a rather constant 25-29
pages/sec in a vertical crawl. Previously I was getting about 5-7
pages/sec.

Cheers
Zaheed


On 7/11/06, Ken Krugler <[hidden email]> wrote:
> [...]