[HOW-TO] How to make Nutch Ignore META Tags



Rajasekar Karthik
A common problem when indexing a site is that robots META tags (noindex, nofollow) prevent Nutch from indexing pages or following links. It is good practice to obey the site's rules. But if the site owner is OK with you ignoring these tags, you can make Nutch ignore them.

In the file HtmlParser.java, located at src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java

comment out the following lines:
if (!metaTags.getNoIndex()) {               // okay to index
if (!metaTags.getNoFollow()) {              // okay to follow links

and, of course, the closing brace of each if block, so that the code inside each block runs unconditionally. After this, just rebuild the Nutch jar and war files.
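To illustrate what removing those two guards does, here is a minimal, self-contained sketch. It is not the Nutch source: the MetaTags class below is a hypothetical stand-in that only mimics the getNoIndex() and getNoFollow() calls quoted above, and the ignoreMeta switch is my own assumption, used so both behaviours can be compared in one program.

```java
import java.util.ArrayList;
import java.util.List;

public class MetaGuardSketch {

    // Hypothetical stand-in for the parser's robots META state (not the Nutch class).
    static final class MetaTags {
        private final boolean noIndex;
        private final boolean noFollow;

        MetaTags(boolean noIndex, boolean noFollow) {
            this.noIndex = noIndex;
            this.noFollow = noFollow;
        }

        boolean getNoIndex() { return noIndex; }
        boolean getNoFollow() { return noFollow; }
    }

    // Returns the actions a parser would take for a page.
    // ignoreMeta == false models the original guards;
    // ignoreMeta == true models the guards being commented out.
    static List<String> parse(MetaTags metaTags, boolean ignoreMeta) {
        List<String> actions = new ArrayList<>();
        if (ignoreMeta || !metaTags.getNoIndex()) {   // okay to index
            actions.add("index");
        }
        if (ignoreMeta || !metaTags.getNoFollow()) {  // okay to follow links
            actions.add("follow");
        }
        return actions;
    }

    public static void main(String[] args) {
        // A page that declares both noindex and nofollow.
        MetaTags restrictive = new MetaTags(true, true);
        System.out.println(parse(restrictive, false)); // prints []
        System.out.println(parse(restrictive, true));  // prints [index, follow]
    }
}
```

With the guards in place, a page marked noindex, nofollow yields no text and no outlinks. With them removed, the page is treated like any other.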

Why would you want to do this?
* The site owner does not want to change the site's code, but you still want to make the site available for indexing and searching.

Any other suggestions are welcome. Thanks.