You could use a URL suffix filter to exclude any document whose URL doesn't end in .pdf.
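To illustrate the idea only, here is a minimal sketch in Python of what filtering by suffix means. It is not Nutch's own filter code or configuration; the function name keep_by_suffix and the example URLs are hypothetical. It assumes the document type can be recognized from the URL path alone, which will miss PDFs served from URLs without a .pdf ending.

```python
from urllib.parse import urlsplit


def keep_by_suffix(urls, suffixes=(".pdf",)):
    """Return only the URLs whose path ends with one of the given suffixes.

    Hypothetical helper for illustration; not part of Nutch.
    """
    wanted = tuple(s.lower() for s in suffixes)
    kept = []
    for url in urls:
        # Compare against the path only, so query strings are ignored.
        path = urlsplit(url).path.lower()
        if path.endswith(wanted):
            kept.append(url)
    return kept


# Made-up candidate URLs, standing in for links found on crawled pages.
candidates = [
    "http://example.com/docs/report.pdf",
    "http://example.com/index.html",
    "http://example.com/paper.PDF?download=1",
    "http://example.com/pdf/",
]
pdf_only = keep_by_suffix(candidates)
```

Note that a filter like this drops the HTML pages too, so it only makes sense once the pages linking to the PDFs have already been crawled, as in the case described below.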
Marco Vanossi wrote:
> Do you think there is an easy way to make nutch generate a list of
> certain document types to fetch?
> For example:
> If one would like to crawl only PDF docs (after some pages were already
> crawled, which linked to PDF docs), the command:
> "bin/nutch generate db segments -topN 1000 -type:pdf" could do that.
> Thanks for any help and comment,