Sitemap detection bug?

Sitemap detection bug?

Michael Chen
Hi,

I've been unable to detect the sitemap for
https://www.mscdirect.com/robots.txt. After some searching, I think it
might be due to the line-spacing format of their robots.txt. I tried
user-agent=Googlebot, but that didn't help either. Could someone
reproduce the problem?
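For anyone who wants to reproduce this outside Nutch, here is a minimal sketch using Python's stdlib `urllib.robotparser`. The robots.txt content below is a hypothetical file modeled on the reported layout (a Sitemap line attached directly to a User-agent group with no blank line in between); the actual MSCDirect file content is an assumption here, not a quote.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the reported layout: the Sitemap
# directive follows a User-agent group with no separating blank line.
robots_txt = """\
User-agent: Googlebot
Disallow: /checkout
Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Per the sitemaps.org protocol, the Sitemap directive is independent of
# the User-agent line, so a conforming parser reports it regardless of
# which group it happens to be adjacent to.
print(rp.site_maps())
```

If the stdlib parser finds the sitemap for a file shaped like this, the blank-line theory would point at the specific parser Nutch 2.x uses rather than at the file being malformed.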

Thanks!

Michael

Re: Sitemap detection bug?

Michael Chen
Hi Sebastian,

Sorry, I forgot to reply to the list.

I remember enabling debug logging once before and found that robots.txt parsing stops after it finds the entry relevant to the crawler ID. Is the sitemap information displayed there too?
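To illustrate where such a bug could hide (a hypothetical sketch of the failure mode, not crawler-commons' actual code): if a parser treats Sitemap as belonging to the User-agent group it appears in, and only keeps directives from the group matching the crawler ID, then a Sitemap line glued to some other crawler's group with no blank line in between is silently dropped. A spec-conforming parser instead scans the whole file for Sitemap lines:

```python
def extract_sitemaps(robots_txt):
    # Spec-conforming: the Sitemap directive is independent of User-agent
    # groups, so collect it from anywhere in the file.
    sitemaps = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps

def extract_sitemaps_groupbound(robots_txt, agent):
    # Hypothetical buggy variant: Sitemap is only recorded while inside
    # the group matching `agent`, so a Sitemap line attached to another
    # crawler's group (no blank line in between) is lost.
    sitemaps, in_matching_group = [], False
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            in_matching_group = value.lower() in (agent.lower(), "*")
        elif key == "sitemap" and in_matching_group:
            sitemaps.append(value)
    return sitemaps

robots_txt = """\
User-agent: Googlebot
Disallow: /checkout
Sitemap: https://www.example.com/sitemap.xml
"""
print(extract_sitemaps(robots_txt))                      # sitemap found
print(extract_sitemaps_groupbound(robots_txt, "nutch"))  # sitemap lost
```

This is only a model of the symptom described above; whether the real parser is group-bound in this way is exactly what the debug logging should reveal.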

It would also be great if someone could test it on 2.x, which should be very quick. I'm positive that something specific to MSCDirect is blocking the sitemap extraction; other sites work.

Thanks!
Michael

> On Aug 18, 2017, at 12:41, Sebastian Nagel <[hidden email]> wrote:
>
> Hi Michael,
>
> yes, I tried the mentioned sitemap with crawler-commons. The sitemap URL was detected in the
> robots.txt file. It needs some more debugging. The problem for me: I haven't run 2.x in any
> production crawler, so it will take longer for me to get into it.
>
> But would you mind moving all discussions to user@nutch? It's important
> to keep them public, as a sort of documentation.
>
> Thanks,
> Sebastian
>
>
>> On 08/18/2017 08:10 PM, Michael Chen wrote:
>> Could you check it for mscdirect.com? Some documentation on sitemaps suggests that there should be
>> a blank line before Sitemap entries, which MSCDirect doesn't have. It might also have something to
>> do with the crawler ID?
>>
>> Please let me know if I can provide you with any additional information.
>>
>> Thank you!
>>
>> Michael
>>
>>
>>> On 08/18/2017 06:16 AM, Sebastian Nagel wrote:
>>> Hi Michael,
>>>
>>> I've checked crawler-commons, which is used for robots.txt parsing (the recent version and also 0.5,
>>> as used by Nutch 2.x). It seems to work, but it needs a closer look to see where the problem is.
>>>
>>> Best,
>>> Sebastian
>>>
>>>> On 08/18/2017 03:40 AM, Michael Chen wrote:
>>>> Hi,
>>>>
>>>> I've been unable to detect the sitemap for https://www.mscdirect.com/robots.txt. After some
>>>> searching, I think it might be due to the line-spacing format of their robots.txt. I tried
>>>> user-agent=Googlebot, but that didn't help either. Could someone reproduce the problem?
>>>>
>>>> Thanks!
>>>>
>>>> Michael
>>>>
>>
>