Nutch doesn't go through HTTP proxy.

Marcin Okraszewski-3
I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I have set the http.proxy.host and http.proxy.port properties in my nutch-site.xml. The proxy does not require authorization. Nutch picks the settings up - I can see them in the log (see below). But I still get java.net.UnknownHostException.

Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really tries to use the proxy. There is a request from Nutch to the proxy for robots.txt, and the response is "404 Not Found". There is no following request for the actual page, only the one for robots.txt.

Any ideas what is wrong?
Marcin Okraszewski

2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.host = <my_proxy_host>
2007-05-15 17:38:59,465 INFO  http.Http - http.proxy.port = <my_proxy_port>
2007-05-15 17:38:59,465 INFO  http.Http - http.timeout = 10000
2007-05-15 17:38:59,465 INFO  http.Http - http.content.limit = 65536
2007-05-15 17:38:59,465 INFO  http.Http - http.agent = YetAnotherSearchEngine/Nutch-0.9
2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.blocking = true
2007-05-15 17:38:59,465 INFO  http.Http - protocol.plugin.check.robots = true
2007-05-15 17:38:59,466 INFO  http.Http - fetcher.server.delay = 100
2007-05-15 17:38:59,466 INFO  http.Http - http.max.delays = 100
2007-05-15 17:38:59,832 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:340)
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:212)
2007-05-15 17:38:59,832 ERROR http.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
2007-05-15 17:38:59,832 ERROR http.Http - Caused by: java.net.UnknownHostException: www.gral.pl: www.gral.pl
2007-05-15 17:38:59,832 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1128)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName0(InetAddress.java:1098)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getAllByName(InetAddress.java:1061)
2007-05-15 17:38:59,833 ERROR http.Http - at java.net.InetAddress.getByName(InetAddress.java:958)
2007-05-15 17:38:59,833 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.blockAddr(HttpBase.java:336)
2007-05-15 17:38:59,833 ERROR http.Http - ... 2 more
2007-05-15 17:38:59,834 INFO  fetcher.Fetcher - fetch of <crawl_site> failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: <crawl_site>: <crawl_site>


Re: Nutch doesn't go through HTTP proxy.

Michael Wechner
Marcin Okraszewski wrote:

>I tried to run Nutch 0.9 from my network, which requires HTTP proxy access. I have set the http.proxy.host and http.proxy.port properties in my nutch-site.xml. The proxy does not require authorization. Nutch picks the settings up - I can see them in the log (see below). But I still get java.net.UnknownHostException.
>
>Interestingly, I used Wireshark (or Ethereal) to check whether Nutch really tries to use the proxy. There is a request from Nutch to the proxy for robots.txt, and the response is "404 Not Found". There is no following request for the actual page, only the one for robots.txt.
>
>Any ideas what is wrong?

IIRC we had to patch Nutch to make it work with a proxy, but that was
Nutch 0.8 and I don't have that code available right now. You might
want to search JIRA for possible patches. That said, it seems something
has already been done about this:

http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt

issues 21

HTH

Michael

--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
+41 44 272 91 61


Re: Nutch doesn't go through HTTP proxy.

Emmanuel JOKE
In reply to this post by Marcin Okraszewski-3
I had the same issue.

You need to use a tool like http://java-ntlm-proxy.sourceforge.net/ to
get past the proxy.
You will have to edit its configuration file to add your proxy server
hostname, port, login and password.

Then you need to configure your Nutch process to point to this process. You
should add the following to nutch-site.xml:
<property>
  <name>http.proxy.host</name>
  <value>hostname of the machine where is located the NTLMProxy</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>port of the NTLMProxy process </value>
  <description>The proxy port.</description>
</property>

I also suggest adding this property to avoid any hostname resolution
conflicts:
<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
  <description>If false, fetcher threads are counted per host name instead of per resolved IP address.</description>
</property>

Hope this helps.


Re: Nutch doesn't go through HTTP proxy.

Marcin Okraszewski-3
Thanks a lot!! That was exactly it - the fetcher.threads.per.host.by.ip property. As my network is isolated from Internet DNS, the fetcher couldn't resolve the host name, which it needs in order to group by IP. Setting the property to false resolved the problem. I didn't think of it.
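
For anyone hitting the same thing, the behaviour can be illustrated with a small Java sketch. This is not Nutch's code: HostKeySketch, Resolver and queueKey are hypothetical names made up for illustration. The only real call is InetAddress.getByName, the lookup that appears in the stack trace. The sketch assumes the fetcher needs one key per server to group its threads, and that the key is either the host name or the resolved IP address.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative sketch only; names are hypothetical and not taken from Nutch.
class HostKeySketch {

    /** Hypothetical hook so the DNS lookup can be replaced in the demo. */
    interface Resolver {
        String resolve(String host) throws UnknownHostException;
    }

    /** Real lookup: the same JDK call that fails in the stack trace. */
    static final Resolver DNS = host -> InetAddress.getByName(host).getHostAddress();

    /**
     * Returns the key used to group requests to one server.
     * byIp = true  : resolve the host locally, which needs working DNS.
     * byIp = false : use the host name as-is, no local lookup at all.
     */
    static String queueKey(String host, boolean byIp, Resolver resolver)
            throws UnknownHostException {
        if (byIp) {
            return resolver.resolve(host);
        }
        return host;
    }

    public static void main(String[] args) throws Exception {
        // Simulates a network without Internet DNS: every lookup fails.
        Resolver noDns = host -> { throw new UnknownHostException(host); };

        // Grouping by host name works, because nothing is resolved locally.
        System.out.println(queueKey("www.example.com", false, noDns));

        // Grouping by IP fails before any request could reach the proxy.
        try {
            queueKey("www.example.com", true, noDns);
        } catch (UnknownHostException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

With grouping by host name, name resolution is left to the proxy, which is why an isolated network only works in that mode. Passing HostKeySketch.DNS instead of noDns would perform a real lookup.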

Thanks a lot for help.
Marcin

 
