ok, no problem. Done:
> Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub account. If you would do the honours of opening a ticket, please do so.
>
> Apologies,
> Markus
>
>
>
> -----Original message-----
>> From: Sebastian Nagel <[hidden email]>
>> Sent: Tuesday 29th May 2018 11:33
>> To: [hidden email]
>> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
>>
>>> I agree that this is not the ideal error behaviour, but I guess the code was written from the assumption that the document is valid and conformant.
>>
>> Over time the crawler-commons sitemap parser has been extended to get as much as possible from
>> non-conforming sitemaps as well. Of course, it's hard to foresee and handle all possible mistakes...
>> The equivalent syntax error for sitemaps (missing closing/next <url> in <urlset>) is handled.
>>
>> @Markus: Please open an issue for crawler-commons:
>> https://github.com/crawler-commons/crawler-commons/issues/
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
>>> Hi Markus,
>>>
>>> I don’t believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>.
>>> See also https://www.sitemaps.org/protocol.html#index and
>>> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
>>> I agree that this is not the ideal error behaviour, but I guess the code was written from the assumption that the document is valid and conformant.
>>>
>>> Yossi.
>>>
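The rule quoted above, exactly one <loc> per <sitemap>, can be checked mechanically. What follows is only a minimal Python sketch, not Nutch or crawler-commons code; `local_name` and `loc_counts` are hypothetical helper names. It counts the <loc> children of each <sitemap> in the index from this thread, matching on local names because only the root element of that document carries the ns2 prefix.

```python
import xml.etree.ElementTree as ET

# The sitemap index quoted in this thread, with the wrapped lines rejoined.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def loc_counts(xml_text):
    """Return the number of <loc> children for each <sitemap> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    counts = []
    for element in root.iter():
        if local_name(element.tag) == "sitemap":
            locs = [c for c in element if local_name(c.tag) == "loc"]
            counts.append(len(locs))
    return counts


counts = loc_counts(SITEMAP_INDEX)
print(counts)  # [2]: one <sitemap> holding two <loc> elements
print(all(n == 1 for n in counts))  # False, so the index is non-conforming
```

A conforming index would wrap each <loc> in its own <sitemap>, giving a count of 1 for every entry.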
>>>> -----Original Message-----
>>>> From: Markus Jelsma <[hidden email]>
>>>> Sent: 25 May 2018 23:45
>>>> To: User <[hidden email]>
>>>> Subject: Sitemap URL's concatenated, causing status 14 not found
>>>>
>>>> Hello,
>>>>
>>>> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
>>>> Nutch thinks those two sitemap URLs are actually one, consisting of both
>>>> concatenated.
>>>>
>>>> Here is https://www.saxion.nl/sitemap.xml:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
>>>>   <sitemap>
>>>>     <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
>>>>     <loc>https://www.saxion.nl/content-sitemap.xml</loc>
>>>>   </sitemap>
>>>> </ns2:sitemapindex>
>>>>
>>>> This seems fine, but Nutch attempts, and obviously fails, to load:
>>>>
>>>> 2018-05-25 16:27:50,515 ERROR [Thread-30]
>>>> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
>>>> Status code: 14 for
>>>> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
>>>>
>>>> What is going on here? Why does Nutch, or CC's sitemap util, behave like this?
>>>>
>>>> Thanks,
>>>> Markus
>>>
>>
>>
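One way the concatenated URL in the log could arise is sketched below in Python. The assumptions are important: this is not the crawler-commons parser (which is Java), and nothing in this thread confirms the actual mechanism. The hypothetical handler `SitemapIndexHandler` and its `reset_on_loc` flag only illustrate the difference between keeping one text buffer per <sitemap>, which joins the text of both <loc> elements, and starting a fresh buffer at every <loc>, which recovers both URLs from the non-conforming index.

```python
import xml.sax

# The sitemap index quoted in this thread, with the wrapped lines rejoined.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>"""


class SitemapIndexHandler(xml.sax.ContentHandler):
    """Hypothetical handler; collects URLs from <loc> text."""

    def __init__(self, reset_on_loc):
        super().__init__()
        self.reset_on_loc = reset_on_loc
        self.in_loc = False
        self.buffer = []
        self.urls = []

    def startElement(self, name, attrs):
        if name == "sitemap":
            self.buffer = []
        elif name == "loc":
            self.in_loc = True
            if self.reset_on_loc:
                # Lenient mode: every <loc> starts with an empty buffer.
                self.buffer = []

    def characters(self, content):
        # Only text inside <loc> is kept; whitespace between tags is ignored.
        if self.in_loc:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            self.in_loc = False
            if self.reset_on_loc:
                self._emit()
        elif name == "sitemap" and not self.reset_on_loc:
            # Strict mode: assumes one <loc> per <sitemap>, emits once here.
            self._emit()

    def _emit(self):
        url = "".join(self.buffer).strip()
        if url:
            self.urls.append(url)
        self.buffer = []


def parse(xml_bytes, reset_on_loc):
    handler = SitemapIndexHandler(reset_on_loc)
    xml.sax.parseString(xml_bytes, handler)
    return handler.urls


print(parse(SITEMAP_INDEX, reset_on_loc=False))  # one concatenated URL
print(parse(SITEMAP_INDEX, reset_on_loc=True))   # two separate URLs
```

With `reset_on_loc=False` the single result is the same string that appears in the SitemapProcessor error above; with `reset_on_loc=True` both sitemap URLs come out separately.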