We are using Nutch to crawl the HTML content of Wikipedia articles. We
use a static list of URLs as input. To do this, we injected our list
of URLs, set db.update.additions.allowed to false, and set the crawl
depth to 1.
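
For reference, the property override described above might look like the fragment below. This is only a sketch: it assumes the override goes inside the <configuration> element of conf/nutch-site.xml, and the description text is illustrative wording, not copied from Nutch.

```xml
<!-- Assumed location: inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <!-- Illustrative description: keep the crawl limited to the injected URLs -->
  <description>Do not add newly discovered URLs to the CrawlDb.</description>
</property>
```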
- We iterate over the output segment files using
'SequenceFile.Reader' and pull out both the 'string' and the 'binary'
form of the content.
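
The iteration described above can be sketched roughly as follows. This is a minimal illustration, not our exact code. It assumes Nutch 1.x with Hadoop 2 or later on the classpath, that the file being read is the `data` file of one part directory under `<segment>/content`, and that the 'string' form is obtained by decoding the raw bytes as UTF-8. The class name `SegmentContentDump` and the helper `toText` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class SegmentContentDump {

    // Hypothetical helper: decodes the raw bytes as UTF-8.
    // This is an assumption; the page's real encoding may differ.
    static String toText(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the "data" file of one part directory
        // under <segment>/content (exact directory name varies by version).
        Path data = new Path(args[0]);
        Configuration conf = NutchConfiguration.create();

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
            Text url = new Text();            // key: the page URL
            Content content = new Content();  // value: the fetched content record
            while (reader.next(url, content)) {
                byte[] binary = content.getContent(); // 'binary' form
                String text = toText(binary);         // 'string' form
                System.out.println(url + "\t" + content.getContentType()
                        + "\t" + binary.length + " bytes\t"
                        + text.length() + " chars");
            }
        }
    }
}
```

Printing the byte count next to the character count makes it easy to spot records where the two forms disagree, for example when the page is not actually UTF-8.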