RE: Sitemap URL's concatenated, causing status 14 not found

RE: Sitemap URL's concatenated, causing status 14 not found

Markus Jelsma-2
Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

 
 
-----Original message-----

> From:Sebastian Nagel <[hidden email]>
> Sent: Tuesday 29th May 2018 11:33
> To: [hidden email]
> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
>
> > I agree that this is not the ideal error behaviour, but I guess the code was written from the assumption that the document is valid and conformant.
>
> Over time the crawler-commons sitemap parser has been extended to get as much as possible from
> non-conforming sitemaps as well. Of course, it's hard to foresee and handle all possible mistakes...
> The equivalent syntax error for sitemaps (a missing closing or next <url> in <urlset>) is handled.
>
> @Markus: Please open an issue for crawler-commons
>   https://github.com/crawler-commons/crawler-commons/issues/
>
> Thanks,
> Sebastian
>
>
> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> > Hi Markus,
> >
> > I don’t believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>.
> > See also https://www.sitemaps.org/protocol.html#index and https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> > I agree that this is not the ideal error behaviour, but I guess the code was written from the assumption that the document is valid and conformant.
> >
> > Yossi.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <[hidden email]>
> >> Sent: 25 May 2018 23:45
> >> To: User <[hidden email]>
> >> Subject: Sitemap URL's concatenated, causing status 14 not found
> >>
> >> Hello,
> >>
> >> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> >> Nutch thinks those two sitemap URLs are actually one, consisting of both
> >> concatenated.
> >>
> >> Here is https://www.saxion.nl/sitemap.xml
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <ns2:sitemapindex
> >> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> >> <sitemap>
> >> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> >> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> >> </sitemap>
> >> </ns2:sitemapindex>
> >>
> >> This seems fine, but Nutch attempts, and obviously fails, to load:
> >>
> >> 2018-05-25 16:27:50,515 ERROR [Thread-30] org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap. Status code: 14 for https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> >>
> >> What is going on here? Why does Nutch, or CC's sitemap util, behave like this?
> >>
> >> Thanks,
> >> Markus
> >
>
>

Re: Sitemap URL's concatenated, causing status 14 not found

Sebastian Nagel
Hi Markus,

ok, no problem. Done:
  https://github.com/crawler-commons/crawler-commons/issues/213

Sebastian

On 06/07/2018 12:21 AM, Markus Jelsma wrote:

> Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub account. If you would do the honours of opening a ticket, please do so.
>
> Apologies,
> Markus
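
For readers who want to reproduce the symptom, here is a minimal sketch in Python using the sitemap index quoted in this thread. It is not the crawler-commons or Nutch code. The function names `naive_locs` and `tolerant_locs` are hypothetical, and the "naive" strategy is only an assumption about one way the concatenation could arise: a parser that expects exactly one <loc> per <sitemap> and joins whatever character data it collects for that entry. Note also that in the quoted document only the root element carries the ns2 prefix, so the child elements are in no namespace at all; the sketch therefore compares local names only.

```python
import xml.etree.ElementTree as ET

# The sitemap index from the original post, reproduced verbatim.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix so prefixed and unprefixed tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def sitemap_entries(root):
    """Yield every direct <sitemap> child of the index root."""
    for child in root:
        if local_name(child.tag) == "sitemap":
            yield child


def naive_locs(root):
    """Hypothetical strict reading: one URL per <sitemap>, built by joining
    the text of all its <loc> children (assumes the document is valid)."""
    urls = []
    for entry in sitemap_entries(root):
        parts = [(loc.text or "").strip()
                 for loc in entry if local_name(loc.tag) == "loc"]
        joined = "".join(parts)
        if joined:
            urls.append(joined)
    return urls


def tolerant_locs(root):
    """Hypothetical lenient reading: every non-empty <loc> is its own URL,
    even when several appear inside a single <sitemap>."""
    urls = []
    for entry in sitemap_entries(root):
        for loc in entry:
            text = (loc.text or "").strip()
            if local_name(loc.tag) == "loc" and text:
                urls.append(text)
    return urls


root = ET.fromstring(SITEMAP_INDEX.encode("utf-8"))
print(naive_locs(root))
print(tolerant_locs(root))
```

The first call returns a single string with both URLs run together, which matches the URL in the logged error. The second returns the two sitemap URLs separately. The cleaner fix remains on the publisher's side: emit one <sitemap> element per <loc>, as the protocol requires.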