First, I set the following rules
# skip file:, ftp:, & mailto: urls
# skip image and other suffixes we can't yet parse
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# skip everything else
in conf/crawl-urlfilter.txt and ran this command: "bin/nutch crawl urls -dir crawled -depth 3". I can crawl http://guide.kanook.com, but I can't crawl http://www.kapook.com. Why can't some web pages be crawled at all? Also, the output directory after the crawl doesn't contain the segments directory that Nutch search needs; it has only
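[Editor's note: the message above quotes only the comment lines from conf/crawl-urlfilter.txt, not the actual filter patterns. For context, the stock Nutch 0.9 file pairs each of those comments with a regex rule roughly like the sketch below; the exact patterns in the poster's file are unknown, and MY.DOMAIN.NAME is the placeholder shipped with Nutch that must be edited by hand.]

```
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|mov|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.
```

The key point is the single `+` (include) line: a URL is only fetched if it matches an include pattern before hitting the final `-.` catch-all, so a site whose host does not match that line is silently skipped and no segments get written for it. This is one plausible explanation for the symptoms described, not a confirmed diagnosis.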
Re: [jira] Created: (NUTCH-599) nutch crawl and index problem
I replied to this query of yours yesterday on
[hidden email]. If you haven't received the reply,
you have probably not subscribed to the nutch-user mailing list. If
you haven't subscribed, please do so by sending a blank mail to
Nutch 0.9 works fine for us, so this is not a bug in Nutch 0.9. This
looks like a configuration problem at your end. Please discuss it
on [hidden email] instead of submitting it as a
bug against Nutch.