modify nutch to boost crawl for certain keywords

Jason Viloria
Hi

I have been looking through the Nutch sources to see how to modify them so
that Nutch crawls only links containing certain keywords, but I have not
found the right place to make the change.

Can anyone please help me out? Even just pointing me to the right
classes to start with would be a great help.

I was told there is a Nutch book that I can buy. Would this book be of
help? Would you recommend it for my purpose?

Thanks
Jason

RE: modify nutch to boost crawl for certain keywords

Howie Wang
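Filtering based on keywords in the URL is quite easy and doesn't
require source changes. Just edit the conf/regex-urlfilter.txt
file. Rules are checked from top to bottom, and the first one that
matches decides whether the URL is kept (+) or dropped (-). If you
want to crawl only URLs containing "puppies", you would have
entries like: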

+.*puppies

# Skip everything else
-.
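
To make the effect of those two rules concrete, here is a small stand-alone Python sketch that mimics how I understand the filter to behave. It is not Nutch code. It assumes first-match-wins ordering and partial (find-style) regex matching, and the names RULES and accepts are made up for this illustration.

```python
import re

# Rules in file order: (sign, compiled regex).
# Mirrors the two entries above; this is an illustration, not Nutch's parser.
RULES = [
    ("+", re.compile(r".*puppies")),  # keep anything containing "puppies"
    ("-", re.compile(r".")),          # skip everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # search() matches anywhere in the string (assumed find-style matching)
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

if __name__ == "__main__":
    for url in ("http://example.com/puppies/1.html",
                "http://example.com/kittens.html"):
        print(url, accepts(url))
```

One thing to watch for: under these rules a seed URL that does not contain the keyword is dropped too, so nothing linked from it will ever be discovered.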

Boosting a URL so that it is fetched ahead of others, rather than
filtering out everything else, would probably require source changes
though. The place to start would probably be FetchListTool.java, which
contains the main() for the "bin/nutch generate" step.
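
As a rough picture of what "boosting" could mean at generate time, here is a stand-alone Python sketch of the idea only: score each candidate URL, add a bonus for keyword hits, and take the top N for the fetch list. None of this is taken from FetchListTool.java. The names KEYWORDS, keyword_score and top_n, and the base_score and boost values, are hypothetical.

```python
KEYWORDS = ("puppies",)  # hypothetical boost terms

def keyword_score(url, base_score=1.0, boost=10.0, keywords=KEYWORDS):
    """Base score plus a fixed bonus for each keyword found in the URL."""
    lowered = url.lower()
    hits = sum(1 for kw in keywords if kw in lowered)
    return base_score + boost * hits

def top_n(urls, n):
    """Pick the n highest-scoring URLs; ties keep their original order."""
    return sorted(urls, key=keyword_score, reverse=True)[:n]

if __name__ == "__main__":
    candidates = [
        "http://a.example/kittens",
        "http://a.example/puppies",
        "http://a.example/about",
    ]
    print(top_n(candidates, 2))
```

Unlike the URL filter, nothing is thrown away here: non-matching URLs are still eligible, they just sort lower when the fetch list is cut off at N.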

Howie
