RE: modify nutch to boost crawl for certain keywords
Filtering based on keywords in the URL is quite easy and doesn't
require source changes. Just edit the conf/regex-urlfilter.txt
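file. If you want to filter for "puppies", you would have an
accept rule matching that keyword, followed by a final rule,
commented "# Skip everything else", that rejects every other URL.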
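
To make the rule behaviour concrete, here is a small standalone Java
sketch. It assumes the usual rule format of one regex per line,
prefixed with + (accept) or - (reject), where the first matching rule
decides and a URL matching no rule is dropped. The class name
UrlFilterSketch and the example URLs are made up for illustration and
are not part of Nutch; the lines in RULES show what the file entries
might look like.

```java
import java.util.regex.Pattern;

// Illustrative only: mimics first-match-wins evaluation of +/- rule lines.
// This is NOT Nutch's own filter class.
class UrlFilterSketch {
    // Rule lines as they might appear in conf/regex-urlfilter.txt (assumed).
    static final String[] RULES = {
        "# Accept any URL containing the keyword",
        "+puppies",
        "# Skip everything else",
        "-."
    };

    static boolean accepts(String url) {
        for (String line : RULES) {
            // Skip blank lines and comments.
            if (line.isEmpty() || line.startsWith("#")) continue;
            boolean accept = line.charAt(0) == '+';
            Pattern pattern = Pattern.compile(line.substring(1));
            // The first rule whose regex is found in the URL decides.
            if (pattern.matcher(url).find()) return accept;
        }
        return false; // no rule matched: the URL is dropped
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://example.com/puppies/adopt.html")); // true
        System.out.println(accepts("http://example.com/kittens.html"));       // false
    }
}
```

Rule order matters: if the "-." line came first, it would match every
URL and nothing would be accepted.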
Just boosting a URL so that it is fetched ahead of others, rather
than filtering, would probably require source changes though. The
place to start would probably be FetchListTool.java, which contains
the main method for the "bin/nutch generate" step.
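
The general idea of boosting can be sketched independently of the
Nutch sources. The following standalone Java example does not use
FetchListTool or any Nutch API; the names KeywordBoostSketch,
KEYWORDS, BOOST and the base score of 1.0 are all assumptions made
for illustration. It adds a fixed bonus to a URL's score for each
keyword it contains and then orders candidates by descending score,
which is roughly what a modified generate step would need to do
before cutting the fetch list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: keyword-based ordering of candidate URLs.
// None of these names come from Nutch.
class KeywordBoostSketch {
    static final List<String> KEYWORDS = Arrays.asList("puppies", "kennel"); // assumed
    static final float BOOST = 10.0f;                                        // assumed

    // Add BOOST to the base score once for every keyword found in the URL.
    static float score(String url, float baseScore) {
        String lower = url.toLowerCase();
        float score = baseScore;
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) score += BOOST;
        }
        return score;
    }

    // Return a copy of the list with the highest-scoring URLs first.
    // List.sort is stable, so equally scored URLs keep their input order.
    static List<String> order(List<String> urls) {
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(Comparator.comparingDouble((String u) -> score(u, 1.0f)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "http://a.example/kittens",
            "http://b.example/puppies",
            "http://c.example/puppies/kennel");
        for (String url : order(candidates)) {
            System.out.println(url);
        }
    }
}
```

Unlike the URL filter, this keeps every URL as a candidate and only
changes the order in which they would be picked.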
>I have been looking at the nutch sources to see how to modify it such
>that nutch will crawl only links with certain keywords but to no avail.
>Can anyone please help me out. Even just pointing me to the right
>classes to start with would be of great help.
>I was told there is a nutch book that I can buy. Would this book be of
>help? Would you recommend it for my purpose?