ASP Parser

Seth Taylor
I recently installed and configured Nutch from source. From what I've
read, Nutch by default parses only text and HTML documents. The site
I'm trying to crawl consists entirely of ASP pages. I added the ASP
MIME type to the mime-type.xml document. What else do I need to do
for Nutch to crawl ASP pages?

Thanks,
Seth

Re: ASP Parser

Jérôme Charron
> I recently installed and configured Nutch from source. From what I've
> read, Nutch by default parses only text and HTML documents. The site
> I'm trying to crawl consists entirely of ASP pages. I added the ASP
> MIME type to the mime-type.xml document. What else do I need to do
> for Nutch to crawl ASP pages?

Correct me if I'm wrong, but ASP is like JSP: a page that is interpreted on
the server side and can generate any type of document (mostly plain HTML).
So you don't need to add ASP support to Nutch, since your ASP pages
almost certainly generate HTML.
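
To make that concrete, here is a minimal Python sketch of choosing a parser from the Content-Type the server returns rather than from the .asp extension in the URL. It is not Nutch code: the function names and the PARSEABLE set are assumptions made up for illustration, based only on the statement above that text and HTML are handled by default.

```python
# Hypothetical set of types a crawler handles out of the box (an assumption
# for this sketch, based on "text and HTML documents" in the question).
PARSEABLE = {"text/html", "text/plain"}


def base_mime_type(content_type_header: str) -> str:
    """Drop parameters such as charset and normalise the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_parseable(content_type_header: str) -> bool:
    """Decide from the response header, not from the URL's extension."""
    return base_mime_type(content_type_header) in PARSEABLE


if __name__ == "__main__":
    # An ASP page that renders HTML typically answers with a header like this.
    print(is_parseable("text/html; charset=ISO-8859-1"))
    print(is_parseable("application/pdf"))
```

If the pages come back as text/html, the crawler has nothing ASP-specific to deal with; the extension in the URL plays no part in this check.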

Jerome


--
http://motrech.free.fr/
http://frutch.free.fr/
Re: [Nutch-general] ASP Parser

David Spencer-2
Seth Taylor wrote:

> I recently installed and configured Nutch from source. From what I've
> read, Nutch by default parses only text and HTML documents. The site
> I'm trying to crawl consists entirely of ASP pages. I added the ASP
> MIME type to the mime-type.xml document. What else do I need to do
> for Nutch to crawl ASP pages?

You probably need to check the URL filter (conf/crawl-urlfilter.txt)
and make sure the pages are not being rejected. Note that there might be a
pattern that rejects URLs containing query arguments, so you might want to
disable it if the pages take args.
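
As a rough illustration of the filter behaviour described above, here is a small Python sketch of an accept/reject rule list. It is not Nutch's implementation: the function names, the www.example.com host, and the rule set are assumptions for illustration. It also assumes that a leading '+' accepts, a leading '-' rejects, the first matching pattern decides, and a URL that matches nothing is rejected.

```python
import re

# Rules written in the style of conf/crawl-urlfilter.txt. The host and the
# exact patterns are illustrative assumptions, not a copy of any real config.
STRICT_RULES = r"""
# skip URLs containing characters that suggest a query string
-[?*!@=]
# accept pages on the (hypothetical) site being crawled
+^http://www\.example\.com/
# reject everything else
-.
"""

# "Disabling" the query-string rule by commenting it out.
RELAXED_RULES = STRICT_RULES.replace("-[?*!@=]", "#-[?*!@=]")


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching pattern decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    url = "http://www.example.com/item.asp?id=3"
    print(accepts(parse_rules(STRICT_RULES), url))   # query rule is active
    print(accepts(parse_rules(RELAXED_RULES), url))  # query rule disabled
```

With the query-string rule active, a page such as item.asp?id=3 is dropped before it is ever fetched, whatever MIME type it would have returned; with the rule commented out, the same URL passes.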

I would think that there is no ASP MIME type per se; surely the
average ASP page returns an HTML document?!

> Thanks,
> Seth
