Nutch HTMLParseFilters


Gaurav Agarwal

I recently started using Nutch for an academic research project that involves crawling a particular kind of web page.

While crawling, I did not need to fetch bmp, jpeg, mp3, etc. files, so I updated my URL-filter property file to block them. However, this did not stop these URLs (to jpeg etc.) from showing up as outlinks from a valid HTML page. In fact, because I had limited the number of outgoing links per page to 100, these useless URLs occupied the available slots and kept a few valid HTML pages from being fetched (of course, this can be worked around by raising the threshold on outlinks per page).
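To make the problem concrete, here is a minimal, self-contained sketch in plain Java. It does not use any Nutch classes; the class name, the method names, and the suffix pattern are all hypothetical and only illustrate why truncating the outlink list before filtering loses valid pages, while filtering before truncating does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class OutlinkCapSketch {
    // Hypothetical suffix rule, similar in spirit to a URL-filter entry for media files.
    static final Pattern MEDIA = Pattern.compile("(?i)\\.(bmp|jpe?g|mp3)$");

    // Models the situation described above: the per-page cap is applied first,
    // so media links use up slots and are only rejected afterwards.
    static List<String> capThenFilter(List<String> outlinks, int cap) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks.subList(0, Math.min(cap, outlinks.size()))) {
            if (!MEDIA.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }

    // Models pruning at parse time: media links are dropped before the cap is counted.
    static List<String> filterThenCap(List<String> outlinks, int cap) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (kept.size() == cap) {
                break;
            }
            if (!MEDIA.matcher(url).find()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/a.jpg",
                "http://example.org/b.mp3",
                "http://example.org/page1.html",
                "http://example.org/page2.html");
        System.out.println("cap, then filter: " + capThenFilter(outlinks, 2));
        System.out.println("filter, then cap: " + filterThenCap(outlinks, 2));
    }
}
```

With a cap of 2 in this toy input, the first variant keeps no HTML pages at all, because both slots go to media links that are discarded later.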

I went ahead and created a filter for the HTMLParseFilter extension point that throws away these unwanted URLs at parse time itself. I also modified the HTMLParseFilter class to execute the filters in a particular order, according to a new property introduced in nutch-site.xml. I did this because I wanted the pruning to happen after all the other HTMLParseFilters have executed (e.g. in the case of JSParseFilter).
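The ordering idea can be sketched roughly as follows. This is not the actual patch and does not use the Nutch plugin API: the filter names ("js", "prune"), the whitespace-separated format of the order property, and all class and method names are assumptions made for illustration. Each filter is modeled as a plain function from a list of outlinks to a list of outlinks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class OrderedFilterSketch {
    // Resolves the execution order from a whitespace-separated property value
    // (hypothetical format). Filters not named in the property keep their
    // registration order and run after the named ones.
    static List<String> resolveOrder(String orderProperty, List<String> registered) {
        List<String> ordered = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            for (String name : orderProperty.trim().split("\\s+")) {
                if (registered.contains(name) && !ordered.contains(name)) {
                    ordered.add(name);
                }
            }
        }
        for (String name : registered) {
            if (!ordered.contains(name)) {
                ordered.add(name);
            }
        }
        return ordered;
    }

    // Runs the filters one after another, feeding each the previous result.
    static List<String> runChain(List<String> order,
                                 Map<String, UnaryOperator<List<String>>> filters,
                                 List<String> outlinks) {
        List<String> current = outlinks;
        for (String name : order) {
            current = filters.get(name).apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<List<String>>> filters = new LinkedHashMap<>();
        // Stand-in for a pruning filter: drops links ending in .jpg.
        filters.put("prune", links -> {
            List<String> out = new ArrayList<>();
            for (String url : links) {
                if (!url.endsWith(".jpg")) {
                    out.add(url);
                }
            }
            return out;
        });
        // Stand-in for a filter that discovers extra links, as a JS parser might.
        filters.put("js", links -> {
            List<String> out = new ArrayList<>(links);
            out.add("http://example.org/from-script.jpg");
            return out;
        });

        List<String> registered = new ArrayList<>(filters.keySet());
        List<String> order = resolveOrder("js prune", registered);
        List<String> start = Arrays.asList("http://example.org/a.html");
        System.out.println(order + " -> " + runChain(order, filters, start));
    }
}
```

The point of running the pruning step last is visible here: if "prune" ran before "js", the image link added by the second filter would survive into the final outlink list.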

Now, I am posting this mail to ask whether this feature is already present and I did all this redundantly, or, if it is not present in the core, whether it would make sense for anyone else to have it. I can send the code back to the developers to put in the core if they find it useful (it was trivially easy, thanks to the very simple Nutch plugin architecture, and anyone can implement it anyway).