I have what I assume to be a simple user issue with nutch-0.8-dev. I'm
trying to do a single site crawl on a Fedora Core 4 Linux machine. The site
I'm crawling consists
of Perl (Catalyst, to be specific) and PHP (an app called Gallery and
an instance of MediaWiki).
The issue I'm having is that Nutch does not seem to crawl the gallery
section of the site.
There are links from the main site to gallery, and I've listed the
top-level gallery URL in the initial URL list I pass to nutch crawl.
Sorry for the length of the message, but I wanted to try to provide as
much information about
the problem as I could.
Nutch does crawl the wiki and perl sections of the site.
It is possible that the URL filter is preventing the links from being
crawled, especially if they contain characters such as ? or ; (e.g.
a PHP session ID). Can you post an example of a link?
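
To illustrate what I mean, here is a rough Python sketch of how a
regex-based URL filter of this kind behaves. It is not Nutch's actual
code. It assumes rules that start with + (accept) or - (reject), that
the first rule whose pattern is found anywhere in the URL decides, and
that a URL matching no rule is rejected. The rule list, the
example.com host and the gallery URLs are all made up for the example;
your own filter file may well differ.

```python
import re

# Hypothetical rule set, written in the "+/- regex" style described above.
# These are assumptions for illustration, not a copy of any real config.
RULES = [
    ("-", r"[?*!@=;]"),                             # reject probable query / session-id URLs
    ("+", r"^http://([a-z0-9]*\.)*example\.com/"),  # accept the (made-up) site being crawled
    ("-", r"."),                                    # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: treat the URL as rejected.
    return False


if __name__ == "__main__":
    # Made-up URLs of the sort a PHP gallery application might produce.
    samples = [
        "http://www.example.com/gallery/",
        "http://www.example.com/gallery/main.php?g2_itemId=15",
        "http://www.example.com/gallery/main.php;PHPSESSID=abc123",
    ]
    for url in samples:
        print("accepted" if accepts(url) else "rejected", url)
```

If your gallery links look like the second or third sample, a rule
like the first one would drop them before they are ever fetched, and
relaxing that rule for the gallery path would be the thing to try.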