fetcher handling on 302 redirect

Yuzo Kanomata
Hi,

I'm using Nutch 0.7.2 and ran into this problem.

I was attempting to crawl some pages starting from a URL for which the web
server sends back a 302 redirect. Instead of reading the location line and
crawling there, Nutch does not seem to be able to retrieve the correct content.

I checked some values in Fetcher.java.

Specifically, in the run method, the protocol status is 16 (EXCEPTION) and
the protocol content is null during the fetch cycle:

Protocol protocol = ProtocolFactory.getProtocol(url);
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
switch(pstat.getCode()) {

This seems odd in that I would expect a 302 to generate a
ProtocolStatus.MOVED value.

Am I making a mistake, such as not enabling a plugin?

I have the plugin.includes property in conf/nutch-default.xml set to:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

The URL I am starting from is:

<http://proxy.arts.uci.edu/gamelab/portal/>

I both issued an HTTP GET for the URL via telnet and ran tcpdump on the
web server; the server is reporting back a 302 with a valid location field.
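
A check like the one above can also be done on a captured response. The following is only a minimal sketch in plain Java, not Nutch code: the class name RawResponseCheck, its two helper methods, and the example.com URL are all hypothetical. It reads the status code from the status line and collects the header fields under lower-cased names, since HTTP header field names are case-insensitive.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: inspects a raw HTTP response
// as it might be captured from a telnet session.
class RawResponseCheck {

    /** Returns the numeric code from a status line such as "HTTP/1.1 302 Found". */
    static int statusCode(String rawResponse) {
        String statusLine = rawResponse.split("\r?\n", 2)[0];
        String[] parts = statusLine.trim().split("\\s+");
        return Integer.parseInt(parts[1]);
    }

    /** Collects header fields, keyed by lower-cased name, up to the first blank line. */
    static Map<String, String> headers(String rawResponse) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] lines = rawResponse.split("\r?\n");
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // end of the header section
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "name: value" line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            result.put(name, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up capture; the target URL is a placeholder.
        String captured = "HTTP/1.1 302 Found\r\n"
                + "location: http://example.com/new/\r\n"
                + "Content-Length: 0\r\n"
                + "\r\n";
        System.out.println(statusCode(captured));              // prints 302
        System.out.println(headers(captured).get("location")); // prints http://example.com/new/
    }
}
```

Because the names are normalized before they are stored, the lookup succeeds whether the server writes the field as Location, location, or LOCATION.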

Thanks,

Yuzo

Re: fetcher handling on 302 redirect

Yuzo Kanomata
I discovered the source of my problem.

The web server reports back 302 with the header field as:
location: url
but Nutch expects
Location: url

I fixed my problem by adding a few lines to Http.java.

Specifics and Patch:
--------------------

In package org.apache.nutch.protocol.http, class Http handles the 302
response as:

url = new URL(url,response.getHeader("Location"));

so it tries to match the exact string "Location" to get the redirect target,
but the server I am dealing with reports back "location".

My patch is to change:

File: Http.java
Dist: 0.7.1 and 0.7.2
dir location from Nutch download:
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
method: public ProtocolOutput getProtocolOutput(FetchListEntry fle)
below:
} else if (code >= 300 && code < 400) {   // handle redirect

Change:

url = new URL(url,response.getHeader("Location"));

to:

String loc = response.getHeader("Location");
if (loc == null) {
    // fall back to the lower-case field name that this server sends
    loc = response.getHeader("location");
}
url = new URL(url, loc);

This fixes the specific problem I have been having.
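
The patch above covers "Location" and "location" but would still miss other spellings such as "LOCATION". Since HTTP header field names are case-insensitive, a more general approach is to make the lookup itself ignore case. The following is only a sketch of that idea in plain Java, not the Nutch response class: RedirectHeaders, its methods, and the example.com URLs are hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a response's header storage; not Nutch code.
class RedirectHeaders {

    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats
    // "Location", "location" and "LOCATION" as the same key.
    private final Map<String, String> fields =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void add(String name, String value) {
        fields.put(name, value);
    }

    String get(String name) {
        return fields.get(name);
    }

    /** Resolves the redirect target against the URL that was fetched. */
    static URI resolveRedirect(URI base, RedirectHeaders headers) {
        String loc = headers.get("Location");
        if (loc == null) {
            throw new IllegalStateException("redirect without a Location header");
        }
        return base.resolve(loc.trim());
    }

    public static void main(String[] args) {
        RedirectHeaders headers = new RedirectHeaders();
        headers.add("location", "index.html"); // lower-case, as this server sends it

        URI base = URI.create("http://example.com/portal/");
        System.out.println(resolveRedirect(base, headers));
        // prints http://example.com/portal/index.html
    }
}
```

With this kind of storage the calling code can keep asking for "Location" and no per-spelling fallback is needed.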

HTH

Yuzo



--On Wednesday, June 14, 2006 6:04 PM -0700 Yuzo Kanomata
<[hidden email]> wrote:

> Hi,
>
> I'm using Nutch 0.7.2 and ran into this problem.
>
> I was attempting to crawl some pages starting with a url that the
> webserver sends back a 302 redirect. Instead of retrieving the location
> line and crawling there, Nutch seems not to be able to get the correct
> content out.
>
> I checked some values in Fetcher.java.
>
> Specifically in the run method the values of protocol status is 16
> (EXCEPTION) and protocol content is null during the fetch cycle:
>
> Protocol protocol = ProtocolFactory.getProtocol(url);
> ProtocolOutput output = protocol.getProtocolOutput(fle);
> ProtocolStatus pstat = output.getStatus();
> switch(pstat.getCode()) {
>
> This seems odd in that I would expect a 302 to generate a
> ProtocolStatus.MOVED value.
>
> Is there a mistake I'm doing like not enabling a plugin?
>
> I have the conf/nutch-default.xml plugin property set at:
>
> <property>
>   <name>plugin.includes</name>
>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> The url I am starting from is:
>
> <http://proxy.arts.uci.edu/gamelab/portal/>
>
> I did both a telnet http GET of the url and ran a tcpdump on the
> webserver and it is reporting back 302 with a valid location field.
>
> Thanks,
>
> Yuzo