Sitemap URL's concatenated, causing status 14 not found

Markus Jelsma-2
Hello,

We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but Nutch thinks those two sitemap URLs are actually a single URL consisting of both concatenated.

Here is https://www.saxion.nl/sitemap.xml

<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>

This seems fine, but Nutch attempts to load the following URL and, obviously, fails:

2018-05-25 16:27:50,515 ERROR [Thread-30] org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap. Status code: 14 for https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml

What is going on here? Why does Nutch, or CC's sitemap util, behave like this?

Thanks,
Markus

RE: Sitemap URL's concatenated, causing status 14 not found

Yossi Tamari
Hi Markus,

I don’t believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>.
See also https://www.sitemaps.org/protocol.html#index and https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
I agree that this is not the ideal error behaviour, but I guess the code was written on the assumption that the document is valid and conformant.
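
To make the rule concrete, here is a minimal Python sketch (standard library only, nothing to do with Nutch itself) that counts the <loc> children of each <sitemap> entry. The helper names and the CORRECTED_INDEX document are assumptions made for illustration. The corrected index is simply the posted one split into one <sitemap> per <loc>, with the namespace declared as the default on the root so that the unprefixed child elements fall inside the sitemaps namespace as well.

```python
import xml.etree.ElementTree as ET

# The index exactly as posted in this thread: one <sitemap> with two <loc>.
INDEX_FROM_THREAD = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""

# Assumed fix (not taken from the thread): one <sitemap> per <loc>.
CORRECTED_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</sitemapindex>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def loc_counts(index_xml):
    """Return the number of <loc> children for each <sitemap> entry.

    Matching is done on local names only, because in the posted document
    the unprefixed <sitemap> and <loc> elements are in no namespace at all.
    """
    root = ET.fromstring(index_xml.encode("utf-8"))
    counts = []
    for entry in root:
        if local_name(entry.tag) != "sitemap":
            continue
        counts.append(sum(1 for child in entry if local_name(child.tag) == "loc"))
    return counts


def has_one_loc_per_sitemap(index_xml):
    """True if there is at least one <sitemap> and each has exactly one <loc>."""
    counts = loc_counts(index_xml)
    return bool(counts) and all(n == 1 for n in counts)


if __name__ == "__main__":
    print(loc_counts(INDEX_FROM_THREAD))
    print(loc_counts(CORRECTED_INDEX))
```

For the index posted above this reports [2], and for the corrected version [1, 1]. It only checks the one rule discussed here; it is not a substitute for validating against siteindex.xsd.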

        Yossi.

RE: Sitemap URL's concatenated, causing status 14 not found

Markus Jelsma-2
Ah, of course, I missed that!

Thanks,
Markus
 

Re: Sitemap URL's concatenated, causing status 14 not found

Sebastian Nagel
> I agree that this is not the ideal error behaviour, but I guess the code was written from the
> assumption that the document is valid and conformant.

Over time the crawler-commons sitemap parser has been extended to get as much as possible from
non-conforming sitemaps as well. Of course, it's hard to foresee and handle all possible mistakes...
The equivalent syntax error for sitemaps (a missing closing/next <url> in <urlset>) is handled.
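
To illustrate the kind of handling meant here, below is a hypothetical Python sketch using the standard xml.sax module. It is not the crawler-commons code, and the class and function names are made up for this example. It shows one plausible way such a concatenation can arise in an event-based parser (text is collected until </sitemap> closes) and a more tolerant variant that emits a URL each time a </loc> closes.

```python
import xml.sax

# The index as posted at the start of this thread.
INDEX_FROM_THREAD = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


class NaiveLocHandler(xml.sax.ContentHandler):
    """Assumes one <loc> per <sitemap>: emits a URL only at </sitemap>."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._buf = []
        self._in_loc = False

    def startElement(self, name, attrs):
        if name == "loc":
            self._in_loc = True

    def characters(self, content):
        # Text is only collected while inside a <loc> element.
        if self._in_loc:
            self._buf.append(content)

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
        elif name == "sitemap":
            # The buffer may hold the text of several <loc> siblings by now.
            self.urls.append("".join(self._buf).strip())
            self._buf = []


class TolerantLocHandler(NaiveLocHandler):
    """Emits a URL at every </loc>, so extra <loc> siblings stay separate."""

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
            url = "".join(self._buf).strip()
            self._buf = []
            if url:
                self.urls.append(url)


def extract_urls(handler_cls, index_xml):
    """Parse index_xml with the given handler class and return its URLs."""
    handler = handler_cls()
    xml.sax.parseString(index_xml.encode("utf-8"), handler)
    return handler.urls


if __name__ == "__main__":
    print(extract_urls(NaiveLocHandler, INDEX_FROM_THREAD))
    print(extract_urls(TolerantLocHandler, INDEX_FROM_THREAD))
```

On the posted index, the naive handler yields the single concatenated URL seen in the Nutch log, while the tolerant one yields the two sitemap URLs separately. Whether the real parser fails for this exact reason would need to be checked in the crawler-commons source.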

@Markus: Please open an issue for crawler-commons
  https://github.com/crawler-commons/crawler-commons/issues/

Thanks,
Sebastian

