throttling bandwidth

throttling bandwidth

waterwheel
My ISP called and said my Nutch crawler is chewing up 20 Mbit/s on a line
where we're only supposed to be using 10. Is there an easy way to tinker with
how much bandwidth we're using at once? I know we can change the number
of open threads the crawler has, but it seems to me this won't make a
huge difference. If I chop the number of open threads in half, won't it
just download half the pages, twice as fast? I stand to be corrected on
this.

Any other thoughts? It doesn't have to be correct or elegant as long as it
works.

Failing a reasonable solution in Nutch, is there some sort of Linux-level
tool that will easily allow me to throttle how much bandwidth the
crawl is using at once?

Thanks.


Re: throttling bandwidth

Rod Taylor-2
On Mon, 2006-01-16 at 18:02 -0500, Insurance Squared Inc. wrote:
> My ISP called and said my nutch crawler is chewing up 20mbits on a line
> he's only supposed to be using 10.   Is there an easy way to tinker with
> how much bandwidth we're using at once?  I know we can change the number
> of open threads the crawler has, but it seems to me this won't make a
> huge difference.  If I chop the number of open threads in half, it'll
> just download half the pages, twice as fast?  I stand to be corrected on
> this.

Bump the delay between pages and drop the number of threads tenfold.

Start increasing the thread count from there until you hit your target.
I've found I can get within 5% of my target bandwidth this way.
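
A rough back-of-the-envelope model of that tuning loop, in Python. This is only a sketch: the function names, the average page size and the per-page fetch time are made-up assumptions for illustration, not Nutch settings or measurements, and a real crawl will deviate from it.

```python
def estimated_mbits(threads, avg_page_kbytes, fetch_seconds, delay_seconds):
    """Rough average bandwidth if every thread repeats: fetch one page, then wait."""
    cycle_seconds = fetch_seconds + delay_seconds
    bits_per_page = avg_page_kbytes * 1000 * 8
    return threads * bits_per_page / cycle_seconds / 1_000_000


def threads_for_target(target_mbits, avg_page_kbytes, fetch_seconds, delay_seconds):
    """Largest thread count whose estimate stays at or below the target.

    Returns 1 even if a single thread already exceeds the target.
    """
    threads = 1
    while estimated_mbits(threads + 1, avg_page_kbytes,
                          fetch_seconds, delay_seconds) <= target_mbits:
        threads += 1
    return threads


# Hypothetical numbers: 25 kB pages that take 0.5 s each to download.
print(estimated_mbits(50, 25, 0.5, 0.0))     # 50 threads, no delay between pages
print(threads_for_target(10, 25, 0.5, 1.5))  # thread budget with a 1.5 s delay
```

In this model the delay is what makes the thread count a predictable knob: without it, a saturated link just lets the remaining threads finish each page faster, which is the effect the original question worries about.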

--
Rod Taylor <[hidden email]>

Re: throttling bandwidth

Andrzej Białecki-2
In reply to this post by waterwheel
I put my cluster behind a m0n0wall (http://m0n0.ch), which has a
built-in traffic shaper. This is based on FreeBSD, which I prefer over
Linux for such applications, but there are similar Linux solutions, or
commercial routers with built-in traffic shaping.

I think that you could also play some tricks with a bandwidth-limiting
proxy server, because protocol-httpclient can use a proxy.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: throttling bandwidth

kangas
In reply to this post by waterwheel
I'm not aware of any way to do this within Nutch (yet). I could be
wrong, though.

If you have the time and inclination to set up a Linux-based router,  
you could point your crawlers through it and use iproute2 to shape  
outbound traffic from that box.

http://lartc.org/howto/ is a pretty definitive writeup on this sort  
of stuff. Look at the sample config in section 9.2.2.2.
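
For readers who want a feel for what a shaper does before reading the HOWTO, here is a small Python simulation of token-bucket rate limiting, the idea behind shapers such as the Linux `tbf` qdisc. It is a sketch under stated assumptions: the class and function names are hypothetical, nothing here is part of iproute2, and it models timing on a virtual clock rather than touching real traffic.

```python
class TokenBucket:
    """Token bucket: tokens are bytes; they refill at a fixed rate up to a burst cap."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # time of the last refill, in seconds

    def delay_for(self, nbytes, now):
        """Return how long the sender should wait before sending nbytes at time `now`."""
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # Negative balance: wait until the refill brings it back to zero.
        return -self.tokens / self.rate


def simulated_transfer_seconds(chunk_sizes, rate_bytes_per_sec, burst_bytes):
    """Virtual-clock time at which the last chunk is released by the bucket."""
    bucket = TokenBucket(rate_bytes_per_sec, burst_bytes)
    now = 0.0
    for size in chunk_sizes:
        now += bucket.delay_for(size, now)
    return now


# Ten 250-byte chunks through a 1000 B/s bucket with a 500-byte burst allowance.
print(simulated_transfer_seconds([250] * 10, 1000, 500))
```

The burst allowance goes out immediately and everything after it is paced at the configured rate, so the long-run average cannot exceed that rate no matter how many crawler threads sit behind the shaper.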

--Matt

--
Matt Kangas / [hidden email]


RE: throttling bandwidth

Fuad Efendi
In reply to this post by waterwheel
For ISPs around the world, the most important thing is the number of active
TCP sessions.

Manufacturers such as Cisco sell/license their hardware with different
options: 1024 sessions, 65536 sessions, etc.

Backbones are shared between users, and you can shut out the others by using
up 1024 sessions.

ISPs don't like "download accelerators" such as wget, which use several TCP
sessions for a single file download.


Re: throttling bandwidth

Michael Nebel
In reply to this post by waterwheel
Hi,

I had a similar problem and installed a Squid proxy server. Squid
can limit bandwidth, and integrating it with Nutch was
pretty simple (just enter a proxy). Furthermore, it gives you another
place to block the crawling of particular websites.

If needed, I can assist you with the squid configuration.

Regards

        Michael

--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/
Re: throttling bandwidth

Jay Pound
In reply to this post by waterwheel
There are a number of Linux packages for QoS/traffic shaping. My favorite is
wondershaper; I haven't set it up since the 2.4 kernel, but it works well.
Also, if you're not inclined to do something that involved, your ISP can give
that machine's IP address a CAR (committed access rate) statement in
your/their Cisco router, limiting that particular machine to a maximum of x
bandwidth. Or, for the do-it-yourself solution, buy the cheapest POS router
(Linksys, generic): they won't be able to route 20 Mbit of data through the
NAT. At least, older ones couldn't get much more than 5 Mbit or so (newer ones
can do 9 Mbit+). So there are some solutions to your problem.
-J
Re: throttling bandwidth

Byron Miller-2
Just to add my 2 cents: for the most part, if you have
a decent NIC you could issue OS commands to drop
the port rate of your interface to 10 Mbit and not
waste CPU cycles on shaping/proxying.
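
On Linux, one way to issue such a command is `ethtool`. The Python sketch below only builds (and optionally runs) the command line; the helper names and the interface name `eth0` are assumptions for illustration, running it needs root, and the driver and the switch port on the other end both have to accept a forced speed.

```python
import subprocess


def ethtool_speed_command(interface, speed_mbit=10, duplex="full"):
    """Build the argv for forcing a link speed with ethtool (autonegotiation off)."""
    if duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    return ["ethtool", "-s", interface,
            "speed", str(speed_mbit), "duplex", duplex, "autoneg", "off"]


def apply_link_speed(interface, speed_mbit=10, duplex="full"):
    """Run the command; raises CalledProcessError if ethtool reports a failure."""
    subprocess.run(ethtool_speed_command(interface, speed_mbit, duplex), check=True)


# Print the command instead of running it, so it can be reviewed first.
print(" ".join(ethtool_speed_command("eth0")))
```

Note that this caps the whole interface, including any non-crawl traffic on that machine.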

Although I do recommend Squid for this, since I too use
it to further filter/offload regex/hostname blocks as
well.

-byron

Re: throttling bandwidth

Andrzej Białecki-2
In reply to this post by Fuad Efendi
Fuad Efendi wrote:
> For ISPs around-the-world, the most important thing is the Number of Active
> TCP Sessions.
>  

This is completely false. Having worked for an ISP I can assure you that
the most important metric is the amount of traffic, and its behavior
over time. TCP sessions? We don't need no stinking TCP, we route good
ol' IP ;)

Please check your facts before claiming something about all ISPs around
the world.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: throttling bandwidth

Jay Pound
In reply to this post by Byron Miller-2
I agree, the simplest solution is the best: drop the network card
speed back to 10 Mbit half duplex, and then you won't even be using all of
your ISP's 10 Mbit.
-J
RE: throttling bandwidth

Fuad Efendi
In reply to this post by Andrzej Białecki-2
Yes, it is completely wrong, if only because ISP employees usually ask
questions like "What is your OS version? What is your hard drive?" etc. I
gave very, very old info; maybe it was true 4-5 years ago.

Cisco licenses its PIX by number of concurrent TCP sessions, and that is not
IP... It is on a different layer...

Of course, ISPs may have different policies depending on their technology and
their connections to other ISPs; they are all intermediaries...

Which ISP have you worked for, UUNet? WorldCom...


TCP is over IP. Always.




RE: throttling bandwidth

Fuad Efendi
In reply to this post by Andrzej Białecki-2
Andrzej,


I think I really need to provide more details here, just as an example:

Ted Rogers is an ISP, and he has an 8000 Mbps synchronous connection to UUNet.
His hardware can remember (due to RAM and CPU limitations) up to 1,000,000
IP addresses, and 20,000 TCP ports for each "handshake". And his hardware
spreads bandwidth randomly and evenly between 1,000,000 x 20,000 =
20,000,000,000 TCP connections. Why evenly? Because it is the cheapest
solution: no CPU required, no network latency.

So, for instance, you use more bandwidth if you use more TCP connections
(because the connection to the UUNet backbone is shared between many users).

Now, consider a big building with a router on the roof that allows only 1024
TCP sessions... If you are using 512 TCP threads, you are guaranteed to use
at least 50% of the total bandwidth of the shared channel (if your "last mile"
allows it).

This is the case when a 100 Mbps "last mile" is dedicated, and a 1000 Gbps
"before last mile" is shared between 256 users with a 65536-session limit for
all of them.

ISP employees usually don't know such details. They are the "help desk", and
they usually ask "Could you please use less than 50Gb download per month?
What is your IE version?!"

So, suggestions to crawlers:
1. Consider educating the ISP's employees
2. Decrease the number of concurrent "alive" threads (aka concurrently open TCP
sockets)
3. Increase bandwidth




RE: throttling bandwidth

Fuad Efendi
I made a small assumption/mistake in a previous post. Not all of you are using
transport-layer routers (aka firewalls, or layer-4 routers).

But small in-house companies are almost always using SHDSL etc., IP over
ATM, IP over Frame Relay, ...

Hardware between the crawler and the web site always has limitations such as
CPU and RAM; and IP packets (layer 3 of OSI), and in some cases TCP (layer 4),
are randomly/evenly distributed...

If the hardware can send 1,000,000 IP packets per second, and you are
trying to send 1,999,999 IP packets per second, no one else can get
access to the Internet but you, even if you are using just 10% of the total
available bandwidth.

In some cases equipment gets overloaded even at 55-60% of the total
channel load.



-----Original Message-----
From: Fuad Efendi
...
hardware allows to remember (due to RAM and CPU limitations) up to 1,000,000
of IP addresses, and 20,000 TCP ports for each "handshake". And his hardware
randomize bandwidth evenly between 1,000,000 x 20,000 = 20,000,000,000 TCP
connections...

RE: throttling bandwidth

Fuad Efendi
I was totally wrong in my previous 3-4 posts.
ISPs route IP.
Thanks