Fetch failed with protocol status: gone(11)

Fetch failed with protocol status: gone(11)

Robert Scavilla
Hi again, and thanks in advance for your kind help.

Nutch 1.14

I am getting the following error message when crawling a site:
Fetch failed with protocol status: gone(11), lastModified=0:
https://www.sitename.com

The only documentation I can find says:

> public static final int GONE = 11;
> /** Resource has moved permanently. New url should be found in args. */
>
I'm not sure what this means. When I load the page in my browser it shows
status codes 200 or 304 for all resources.

The problem only exists on a single site - other sites crawl fine.

I saved a page from the site locally and that page fetches successfully.

Can you please steer me in the right direction? Many thanks,
...bob

Re: Fetch failed with protocol status: gone(11)

Sebastian Nagel-2
Hi Bob,

The relevant Javadoc comment is the one directly above the variable declaration (here a constant):
  /** Resource is gone. */
  public static final int GONE = 11;

In more detail, GONE results from one of the following HTTP status codes:
 400 Bad Request
 401 Unauthorized
 410 Gone   (*permanently* gone, as opposed to 404 Not Found)
See src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
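
As a rough, standalone illustration (not the actual HttpBase code, just a sketch of the mapping listed above; the class name is made up, and 11 is the ProtocolStatus.GONE constant quoted earlier):

  // Hypothetical, simplified sketch of the mapping above -- the real
  // logic lives in HttpBase / ProtocolStatus inside Nutch.
  public class GoneMappingSketch {
    static final int GONE = 11;                 // ProtocolStatus.GONE

    static boolean mapsToGone(int httpCode) {
      switch (httpCode) {
        case 400:                               // Bad Request
        case 401:                               // Unauthorized
        case 410:                               // Gone (permanently, unlike 404)
          return true;
        default:
          return false;
      }
    }

    public static void main(String[] args) {
      // A 400 response from the server ends up reported as "gone(11)".
      System.out.println(mapsToGone(400) + " -> gone(" + GONE + ")");
    }
  }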

My guess would be that "www.sitename.com" requires authentication.

Just repeat the request as
 bin/nutch parsechecker \
    -Dstore.http.headers=true \
    -Dstore.http.request=true \
    ... <url>

(I guess you're already using parsechecker or indexchecker)
This will show the HTTP headers where you'll find the exact HTTP status code.
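
For an independent cross-check outside Nutch (assuming curl is available; the agent string below is only a placeholder, use whatever your crawler sends), fetching just the headers with the same user-agent should show the raw status line:

  curl -sI -A 'YourAgent/Nutch-1.14' https://www.sitename.com/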

Best,
Sebastian

Re: Fetch failed with protocol status: gone(11)

Robert Scavilla
Thank you, Sebastian. I added the run-time parameters and the output is
identical. I am not seeing the HTTP status codes, though.

The log file shows:

2019-12-17 15:37:36,602 INFO  parse.ParserChecker - fetching:
https://www.avalonpontoons.com/
2019-12-17 15:37:36,872 INFO  protocol.RobotRulesParser - robots.txt
whitelist not configured.
2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.host = null
2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.port = 8080
2019-12-17 15:37:36,873 INFO  http.Http - http.proxy.exception.list = false
2019-12-17 15:37:36,873 INFO  http.Http - http.timeout = 10000
2019-12-17 15:37:36,873 INFO  http.Http - http.content.limit = -1
2019-12-17 15:37:36,873 INFO  http.Http - http.agent = FFDevBot/Nutch-1.14 (
fourfront.us)
2019-12-17 15:37:36,873 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-12-17 15:37:36,873 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-12-17 15:37:36,873 INFO  http.Http - http.enable.cookie.header = true

The command line shows:
  $NUTCHl/bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true https://www.avalonpontoons.com/
fetching: https://www.avalonpontoons.com/
robots.txt whitelist not configured.
Fetch failed with protocol status: gone(11), lastModified=0:
https://www.avalonpontoons.com/



Re: Fetch failed with protocol status: gone(11)

Sebastian Nagel-2
Hi Bob,

> I am not seeing the HTTP status codes, though.

Sorry, yes you're right. The headers are recorded but parsechecker
does not print them if fetching fails.

The server responds with "400 Bad Request" if the user-agent string
contains "nutch", which is reproducible by:
  wget --header 'User-Agent: nutch' -d https://www.avalonpontoons.com/
  ...
  ---response begin---
  HTTP/1.1 400 Bad Request
  ...

You could set the user-agent string:

 bin/nutch parsechecker \
   -Dhttp.agent.name=somethingelse \
   -Dhttp.agent.version='' ...

and this site should work. It is recommended to send a meaningful user-agent string.
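
If you want to make this permanent instead of passing -D options on every run, the usual place is conf/nutch-site.xml, roughly like this (the agent name is only a placeholder, pick one that identifies your crawler):

  <!-- sketch for conf/nutch-site.xml; "somethingelse" is only a placeholder -->
  <property>
    <name>http.agent.name</name>
    <value>somethingelse</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value></value>
  </property>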

Best,
Sebastian

