[jira] Created: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning


Sebastian Nagel (Jira)
Nutch - Fetcher - HTTP - Performance Testing & Tuning
-----------------------------------------------------

         Key: NUTCH-109
         URL: http://issues.apache.org/jira/browse/NUTCH-109
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7, 0.6, 0.7.1, 0.8-dev    
 Environment: Nutch: Windows XP, J2SE 1.4.2_09
Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
    Reporter: Fuad Efendi


1. Establishing a TCP connection is expensive, not only for Nutch and the end-point web servers, but also for intermediary network equipment.

2. The web server creates a client thread and expects that Nutch really uses HTTP/1.1, or at least that Nutch sends "Connection: close" before calling "Socket.close()" in the JVM...

I need to run proper, objective tests, which will probably take 2-3 days; the new plugin crawled/parsed 23,000 pages in 1,321 seconds, while it seems that the existing http plugin needs a few days...

I am using a separate network segment with Windows XP (Nutch) and Suse Linux (Apache HTTPD + 120,000 pages).

Please find attached a new plugin based on http://www.innovation.ch/java/HTTPClient/

Please note:

Class HttpFactory contains a cache of HTTPConnection objects; each object runs its own thread and is fully thread-safe, so we can send multiple GET requests through a single instance:
   private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
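
To illustrate the idea of a fixed number of cached clients per host, here is a minimal sketch. It is not the code from the attached HttpFactory: the class name PerHostCache, the opener callback, and the round-robin choice of slot are all assumptions made for illustration, and it uses generics and lambdas that are not available on J2SE 1.4.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, NOT the attached HttpFactory: keeps at most
// clientsPerHost cached clients for every host and hands them out in turn.
final class PerHostCache<C> {
    private final int clientsPerHost;          // plays the role of http.clients.per.host
    private final Function<String, C> opener;  // assumed callback that opens one client for a host
    private final Map<String, Object[]> slots = new HashMap<>();
    private final Map<String, Integer> nextSlot = new HashMap<>();

    PerHostCache(int clientsPerHost, Function<String, C> opener) {
        this.clientsPerHost = clientsPerHost;
        this.opener = opener;
    }

    // Returns a cached client for the host, creating it lazily and
    // rotating round-robin over the per-host slots.
    @SuppressWarnings("unchecked")
    synchronized C get(String host) {
        Object[] hostSlots = slots.computeIfAbsent(host, h -> new Object[clientsPerHost]);
        int slot = nextSlot.getOrDefault(host, 0);
        nextSlot.put(host, (slot + 1) % clientsPerHost);
        if (hostSlots[slot] == null) {
            hostSlots[slot] = opener.apply(host);
        }
        return (C) hostSlots[slot];
    }
}
```

With clientsPerHost set to 3, the fourth call for the same host returns the client created by the first call, so no more than three connections per host are ever opened.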

I'll add more comments after finishing tests...



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Attachment: protocol-httpclient-innovation-0.1.0.zip

New plugin attached. You may experiment with commenting out this code in HttpFactory:
        static {
                CookieModule.setCookiePolicyHandler(null);
        }




[jira] Commented: (NUTCH-109) Nutch - Fetcher - HTTP - Performance Testing & Tuning

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331764 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

By default, Java 1.4 caches DNS-to-IP mappings forever...

   java.security.Security.setProperty("networkaddress.cache.ttl" , "10000");

- We need to add something like this in the code or configuration.
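
As a small sketch of how this could be wrapped: the security property networkaddress.cache.ttl is expressed in seconds, and -1 means "cache forever". The helper class DnsCacheConfig and its method name are hypothetical, not part of Nutch, and the call is assumed to run early at startup, before the first host lookup.

```java
import java.security.Security;

// Hypothetical helper, not part of Nutch.
final class DnsCacheConfig {
    // Sets the JVM-wide time-to-live, in seconds, for successful DNS lookups.
    // Assumption: this is called at startup, before any host name is resolved.
    static void setPositiveTtlSeconds(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }
}
```

For example, DnsCacheConfig.setPositiveTtlSeconds(10000) would match the value used above (roughly 2.8 hours).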



[jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]

Fuad Efendi updated NUTCH-109:
------------------------------

    Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)

I ran performance tests against a default Apache HTTPD web server installation serving 120,000 mirrored pages (I used Teleport Ultra to mirror HTML pages from www.apache.org, which took about 10 hours).

Everything ran in a separate LAN: Windows XP (client with Nutch 0.7.1) and Suse Linux 9.3 (server with Apache).

I measured a crawl of 21,000 pages (Depth=6, Threads=20); it seems it would take a few days to crawl all 120,000 pages:

Protocol-HTTPClient-Innovation:
1,321,470 milliseconds

Protocol-HTTP:
26,946,076 milliseconds

Protocol-HttpClient:
27,062,854 milliseconds
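
For readers who want the ratios, the figures above can be turned into pages per second and a relative speed-up with a few lines of arithmetic. This is only a sketch over the three reported numbers; the class and method names are made up for illustration, and the ratio says nothing about which settings each plugin ran with.

```java
// Worked arithmetic over the three elapsed times reported above.
final class CrawlTiming {
    static final long INNOVATION_MS = 1_321_470L;
    static final long PROTOCOL_HTTP_MS = 26_946_076L;
    static final long PROTOCOL_HTTPCLIENT_MS = 27_062_854L;
    static final int PAGES = 21_000;

    // Average throughput for a crawl of PAGES pages.
    static double pagesPerSecond(long elapsedMs) {
        return PAGES / (elapsedMs / 1000.0);
    }

    // How many times faster the candidate run was than the baseline run.
    static double speedup(long baselineMs, long candidateMs) {
        return (double) baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        System.out.printf("Innovation:  %.2f pages/s%n", pagesPerSecond(INNOVATION_MS));
        System.out.printf("HTTP:        %.2f pages/s%n", pagesPerSecond(PROTOCOL_HTTP_MS));
        System.out.printf("HttpClient:  %.2f pages/s%n", pagesPerSecond(PROTOCOL_HTTPCLIENT_MS));
        System.out.printf("Speed-up vs HTTP: %.1fx%n", speedup(PROTOCOL_HTTP_MS, INNOVATION_MS));
    }
}
```

The new plugin comes out at roughly 16 pages per second against a little under 1 page per second for the existing plugins, a ratio of about 20.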


P.S.
Please note, the Protocol-HTTPClient-Innovation plugin is only a basic version, v.0.1.0.
HttpFactory is still growing and contains a cache (3 TCP connections per host).
http://www.innovation.ch/java/HTTPClient/ is very old but _production_ level. The style of the source code may seem dated, and you may need to rename "enum" to "enumeration" in the downloaded source files in order to compile it :)))

Very popular load-generating tool is based on HTTPClient (Innovation):
http://grinder.sourceforge.net/
http://www.innovation.ch/java/HTTPClient/



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331847 ]

Doug Cutting commented on NUTCH-109:
------------------------------------

Is your HTTP client polite?  Does it only have a single connection open to the server at a time, and does it pause fetcher.server.delay between each request?  It looks as though you are permitting three simultaneous requests, and I can see no delays.

How did you configure protocol-http and protocol-httpclient?  One can configure these to use multiple connections per server by increasing fetcher.threads.per.host.  By default they will only make a single request at a time.  One can also configure these to not delay between requests by setting fetcher.server.delay to zero.  Such settings are not considered polite, but they will substantially improve fetcher performance.



Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Andrzej Białecki-2
In reply to this post by Sebastian Nagel (Jira)
Fuad Efendi (JIRA) wrote:

>      [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> ------------------------------
>
>     Summary: Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation  (was: Nutch - Fetcher - HTTP - Performance Testing & Tuning)
>
> I performed performance tests, using default Apache HTTPD Web-Server installation, with crawled 120,000 pages (I used Teleport Ultra to crawl HTML pages from www.apache.org, I spent probably 10 hours)
>
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1), and Suse Linux 9.3 (Server with Apache)
>
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds

This is interesting. Could you please check what is the difference in
this benchmark, if you set HttpVersion.HTTP_1_1 in
protocol-httpclient/HttpResponse.java:92 ?

Unfortunately, Nutch cannot use that library because it's LGPL.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331848 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I used the default settings of Nutch-0.7.1; I modified only <name>plugin.includes</name> in nutch-site.xml.

HTTPClient is polite enough: HTTPClient(host) creates a persistent TCP (and HTTP) connection, uses its own threads to manage this connection, and automatically handles all "Keep-Alive" logic; the default Keep-Alive is 60 seconds. I have not studied their API thoroughly and I haven't tested this...

HttpFactory has a default setting of 3 HTTPClient instances per host (meaning we have 3 TCP connections per single host, and we send multiple GET messages over a single HTTP connection without waiting for a reply)... I used 20 concurrent threads (so I sent a few HTTP requests per TCP channel).

A single HTTPConnection instance is fully thread-safe and allows multiple threads to send multiple GET requests over a single TCP connection... and all replies are in sync... I can run a separate test for this.




RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Fuad Efendi
In reply to this post by Andrzej Białecki-2
I have to perform another test... At least we know that the problem is in
the network layer...
I believe it is not only HTTP_1_1: establishing a TCP connection also takes
a long time (including intermediary equipment such as routers, firewalls,
per-IP load balancers, ...)

In my sample, HttpFactory caches TCP connections (3 sockets per host), and
HTTPClient automatically re-establishes HTTP Keep-Alive every 60 seconds;
HttpClient/Apache probably also has this functionality, which we don't use
yet...

Thanks,
Fuad

>This is interesting. Could you please check what is the difference in
>this benchmark, if you set HttpVersion.HTTP_1_1 in
>protocol-httpclient/HttpResponse.java:92 ?

>Unfortunately, Nutch cannot use that library because it's LGPL.
--
Best regards,
Andrzej Bialecki     <><


Re: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

AJ Chen-2
In reply to this post by Sebastian Nagel (Jira)
Fuad,
Several days for 120,000 pages? That's very slow. Could you show some status
lines in the log file? (grep "status:") What's the bandwidth you have?

-AJ

On 10/11/05, Fuad Efendi (JIRA) <[hidden email]> wrote:

>
> [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> ------------------------------
>
> Summary: Nutch - Fetcher - Performance Test - new
> Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance
> Testing & Tuning)
>
> I performed performance tests, using default Apache HTTPD Web-Server
> installation, with crawled 120,000 pages (I used Teleport Ultra to crawl
> HTML pages from www.apache.org <http://www.apache.org>, I spent probably
> 10 hours)
>
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1),
> and Suse Linux 9.3 (Server with Apache)
>
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take
> few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds
>

RE: [jira] Updated: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Fuad Efendi
> Several days for 120,000 pages? That's very slow. Could you show some status
> lines in the log file? (grep "status:") What's the bandwidth you have?

AJ,

I mean: I haven't tried to run "-depth 20"; I ran "-depth 6" and crawled
21,000 pages in 7-8 hours... I mirrored 120,000 pages from www.apache.org
using Teleport Ultra, about 10 hours in total for that crawl (8 Mbps download, 10
threads).

During the 3 tests I crawled (each time) 21,000 pages from a _local_ web site (in
the same LAN segment, 100 Mbps); the existing plugins required 8 hours per 21,000
pages, so I couldn't try 120,000 pages...



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331857 ]

Doug Cutting commented on NUTCH-109:
------------------------------------

Comparing protocol-http and protocol-httpclient with default settings, which permit only a single request at a time with five second delays between each request, to something that permits three simultaneous connections with no delays is not a fair comparison.  There is probably some advantage to using "Keep-Alive", but these benchmarks do not measure it.  To make a fair comparison you must configure Nutch with fetcher.server.delay=0 and fetcher.threads.per.host=3.


[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331877 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Ok, I'll do it tonight.
I believe fetcher.server.delay means "wait for a response from the server, then throw a timeout exception".
I can also run 1000 threads; then we will have a fair comparison even with fetcher.server.delay=50 seconds (fair because of the many threads: we will probably have 20 requests per second, 20 * 50 = 1000).



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

This method:
  private static InetAddress blockAddr(URL url) throws ProtocolException {...}

I checked it in both classes:
  org.apache.nutch.protocol.http.Http
  org.apache.nutch.protocol.httpclient.Http

Default settings (nutch-default.xml):
  fetcher.server.delay=5.0 (seconds)
  fetcher.threads.per.host=1

The blockAddr(...) method blocks an Internet address for the fetcher.server.delay amount of time; it "blocks" this address for all threads except the current thread. The rest of the threads are sleeping; the number of sleeping threads is limited by
  fetcher.threads.per.host
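
A rough sketch of this kind of per-host blocking, to make the mechanism concrete. It is not the actual blockAddr(...) code: HostThrottle and reserve are hypothetical names, it assumes one request at a time per host (fetcher.threads.per.host=1), and it counts the delay from the start of the previous request, which may differ from what Nutch does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-host politeness; NOT the actual blockAddr(...) code.
final class HostThrottle {
    private final long delayMs;  // plays the role of fetcher.server.delay
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostThrottle(long delayMs) {
        this.delayMs = delayMs;
    }

    // Reserves the next request slot for the host and returns how long the
    // calling thread should sleep (in ms) before sending, given the current time.
    synchronized long reserve(String host, long nowMs) {
        long earliest = Math.max(nowMs, nextAllowed.getOrDefault(host, nowMs));
        nextAllowed.put(host, earliest + delayMs);
        return earliest - nowMs;
    }
}
```

A fetcher thread would call reserve(host, System.currentTimeMillis()) and sleep for the returned number of milliseconds; requests to other hosts are not delayed.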

So, by playing with these parameters we can probably improve performance; I'm going to run new performance tests.

New plugin does not use this:
  http.timeout=10000
  http.content.limit=65536

The Keep-Alive timeout is very important; the default "Keep-Alive" timeout of the new plugin is 60 seconds (it automatically closes the HTTP connection after 60 seconds).

1. We are establishing the TCP transport: 100-300 milliseconds X 2-3 times (TCP handshake? some IP packets...)
2. Apache HTTPD Server creates a client thread to handle our requests: 1 second (more or less; try Internet Explorer: the first page takes a few seconds to download, then browsing works very fast because we have a personal thread on the server).
3. Line 135, HttpResponse.java:
     get.releaseConnection();

Unfortunately, we won't benefit from HTTP/1.1 even if I modify parameters such as
   HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
- we close the connection at the end...
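
To show why closing the connection after every request matters, here is a back-of-the-envelope model of the three points above. The class and method names are made up, and the model is deliberately crude: it assumes strictly sequential requests on one connection and a fixed setup cost per new connection.

```java
// Crude cost model (an assumption for illustration, not a measurement):
// sequential requests, fixed connection setup cost, fixed transfer time.
final class HandshakeCost {
    // Total time when every request opens its own TCP connection.
    static long withoutKeepAliveMs(int requests, long connectMs, long transferMs) {
        return requests * (connectMs + transferMs);
    }

    // Total time when one persistent connection serves all requests.
    static long withKeepAliveMs(int requests, long connectMs, long transferMs) {
        return connectMs + requests * transferMs;
    }
}
```

With 100 requests, a 200 ms setup cost (inside the 100-300 ms range mentioned above) and an assumed 50 ms transfer time, the first method gives 25,000 ms and the second 5,200 ms.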

We have network equipment limitations too: we can't reach more than 65000 threads over a single LAN card, and the JVM is good (but it is better to have multiple JVMs/processes, 100 threads each...)

We can load the network segment to only 30% due to those handshakes and delays...

Compare with any freely available web-grabber tool, even IE/Netscape: downloading a single big file can use 99% of network capacity, while downloading multiple HTML pages uses only 20-30% (I saw this in Teleport Pro during downloads from multiple sites linked to Apache, 10 threads)

Apache's MultiThreadedExample.java  uses single instance of HttpClient for multiple threads,
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/httpclient/trunk/src/examples/MultiThreadedExample.java?view=markup

        // Create an HttpClient with the MultiThreadedHttpConnectionManager.
        // This connection manager must be used if more than one thread will
        // be using the HttpClient.
        HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());
        // Set the default host/protocol for the methods to connect to.
        // This value will only be used if the methods are not given an absolute URI
        httpClient.getHostConfiguration().setHost("jakarta.apache.org", 80, "http");


The same was done in the new plugin, with very small, basic code.

I am going to run new tests; any suggestions are highly welcome...
It will take a few days (10 hours per test).


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, but also for intermediary network equipment
> 2. Web Server creates Client thread and hopes that Nutch really uses HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; new plugin crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing http-plugin needs few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains cache of HTTPConnection objects; each object run each thread; each object is absolutely thread-safe, so we can send multiple GET requests using single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Otis Gospodnetic-2-2
Hi,

I find it a bit hard to follow your various ideas here, but I'll add my
comments to some parts below.

--- "Fuad Efendi (JIRA)" <[hidden email]> wrote:

>     [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 ]
>
> Fuad Efendi commented on NUTCH-109:
> -----------------------------------
>
> This method:
>   private static InetAddress blockAddr(URL url) throws ProtocolException {...}

Where is this method?

> I checked it in both classes:
>   org.apache.nutch.protocol.http.Http
>   org.apache.nutch.protocol.httpclient.Http
>
> Default settings (nutch-default.xml):
>   fetcher.server.delay=5.0 (seconds)
>   fetcher.threads.per.host=1
>
> blockAddr(...) method blocks Internet Address for
> fetcher.server.delay amount of time, it "blocks" this address for all
> threads except current thread. Rest of threads are in Sleep() state;
> amount of sleeping threads is limited by
>   fetcher.threads.per.host

That doesn't sound right.  That property is not meant for specifying
sleep time, but rather the number of threads that are allowed to hit
the same host at the same time.  In other words, this lets you control
the degree of parallelization, so to speak.  That is the equivalent of
those "3 TCP connections" you were mentioning yesterday.

fetcher.server.delay is what specifies "sleep between requests" time.
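
To make the difference between the two settings concrete, here is a minimal sketch of a per-host gate. It is not the code of blockAddr(); the class HostGate, its methods, and the explicit timestamps passed in (used so the behaviour can be checked without sleeping) are all assumptions made for illustration. One field plays the role of fetcher.threads.per.host (how many threads may hit a host at once), the other the role of fetcher.server.delay (how long to wait after a request to that host finishes).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host gate; an illustration only, not Nutch's blockAddr() code. */
class HostGate {
    private final int maxThreadsPerHost; // role of fetcher.threads.per.host
    private final long delayMillis;      // role of fetcher.server.delay, in milliseconds
    private final Map<String, Integer> inFlight = new HashMap<>();
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public HostGate(int maxThreadsPerHost, long delayMillis) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.delayMillis = delayMillis;
    }

    /** Returns true (and counts the caller as active) if host may be fetched at nowMillis. */
    public synchronized boolean tryAcquire(String host, long nowMillis) {
        int active = inFlight.getOrDefault(host, 0);
        long earliest = nextAllowed.getOrDefault(host, 0L);
        if (active >= maxThreadsPerHost || nowMillis < earliest) {
            return false; // the caller should sleep and retry later
        }
        inFlight.put(host, active + 1);
        return true;
    }

    /** Marks one fetch as finished and starts the delay window for this host. */
    public synchronized void release(String host, long nowMillis) {
        int active = inFlight.getOrDefault(host, 0);
        if (active > 0) {
            inFlight.put(host, active - 1);
        }
        nextAllowed.put(host, nowMillis + delayMillis);
    }
}
```

With new HostGate(1, 5000), a second thread asking for the same host is refused while the first fetch is in flight, and again for 5 seconds after it finishes; other hosts are unaffected.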

> So, playing with this parameters we can probably improve performance;
> I'm going to perform new performance tests.
>
> New plugin does not use this:
>   http.timeout=10000
>   http.content.limit=65536

This may affect your benchmark.  I don't know how much, but it will.

> Keep-Alive timeout is very important; default "Keep-Alive" timeout of
> a new plugin is 60 seconds (it automatically closes HTTP after 60
> seconds).
>
> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
> times (TCP HandShake? some IP packets...)
> 2. Apache HTTPD Server creates Client thread to handle our requests,
> 1 second (more or less, try Internet Explorer, first page takes few
> second to download, then browsing works very fast - we have personal
> Thread on the Server).

This is often due to the initial hostname address lookup, when the
domain name server doesn't have the host name IP address already
cached.

> 3. Line 135, HttpResponse.java:
>      get.releaseConnection();
>
> Unfortunately we won't use HTTP/1.1 even if I modify some parameters
> such as
>    HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
> - we close connection at the end...

Have you seen Kelvin Tan's patch?
You should take a look, it's in JIRA, and addresses some of the
HTTP/1.1 issues that you are concerned about.

> We have network equipment limitations too, we can't reach more than
> 65000 threads over single LAN card, and JVM is good (but better is to
> have multiple JVM/processes, 100 threads each...)

65000 threads?  What are you trying to fetch?  The whole web?


Otis



Re: [jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

em-13

>>We have network equipment limitations too, we can't reach more than
>>65000 threads over single LAN card, and JVM is good (but better is to
>>have multiple JVM/processes, 100 threads each...)
>
>65000 threads?  What are you trying to fetch?  The whole web?
>
>Otis

65000/100 = 650 processes.

What kind of server do you have?

[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331897 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Oops... I need to learn more!
[protocol-httpclient] Http.java is a singleton that uses MultiThreadedHttpConnectionManager.
It uses a single instance of HttpClient for all hosts and all threads.


[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331904 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I can't use email right now, so I'm putting my comments here:
===
>Have you seen Kelvin Tan's patch?
>You should take a look, it's in JIRA, and addresses some of the
>HTTP/1.1 issues that you are concerned about.

And my reply:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html

===
>>   private static InetAddress blockAddr(URL url) throws ProtocolException {...}

>Where is this method?

[protocol-httpclient] & [protocol-http], Http.java

===
>> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
>> times (TCP HandShake? some IP packets...)
>> 2. Apache HTTPD Server creates Client thread to handle our requests,
>> 1 second (more or less, try Internet Explorer, first page takes few
>> second to download, then browsing works very fast - we have personal
>> Thread on the Server).

>This is often be due to the initial hostname address lookup, when the
>domain name server doesn't have the host name IP address already
>cached.

No. DNS lookup happens only once per JVM lifecycle; the handshakes in 1 and 2 happen many times.

===
>> We have network equipment limitations too, we can't reach more than
>> 65000 threads over single LAN card, and JVM is good (but better is to
>> have multiple JVM/processes, 100 threads each...)

>65000 threads?  What are you trying to fetch?  The whole web?

It was an example for people trying to use more threads for better performance; they can't use more than 65000. Also, nobody has tested the JVM; Sun's JVM 1.3.1 performed badly with more than 100 threads (at least on Windows).



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331907 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

>> ...try Internet Explorer, first page takes few
>> second to download, then browsing works very fast - we have personal
>> Thread on the Server

>This is often be due to the initial hostname address lookup, when the
>domain name server doesn't have the host name IP address already
>cached.

However, I have a local DNS server on the network, and it has a cache... Windows also has its own DNS cache:
%SystemRoot%\system32\svchost.exe -k NetworkService



[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331913 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

I was totally wrong and unfair:
====
>Have you seen Kelvin Tan's patch?
>You should take a look, it's in JIRA, and addresses some of the
>HTTP/1.1 issues that you are concerned about.

http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg01037.html 
====

I need to perform tests with Kelvin Tan's patch too.


suspicious outlink count

em-13
In reply to this post by em-13
202443 Pages consumed: 130000 (at index 130000). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.

If there is maxoutlinks already specified in the xml config, why does
nutch bother counting anything over that again?