"generate db segments topN" with TYPE

ocramp
Hi,

 Do you think there is an easy way to make Nutch generate a list of only
certain document types to fetch?

 For example:
 If one would like to crawl only PDF docs (after some pages that link to
PDF docs were already crawled), a command like:
 "bin/nutch generate db segments -topN 1000 -type:pdf" could do that.

Thanks for any help and comments,
Marco

Re: "generate db segments topN" with TYPE

Dennis Kubes
You could use suffix filters to filter out any document that isn't a PDF.
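
To illustrate the idea only (this is not Nutch code or Nutch configuration), here is a minimal Python sketch of what a suffix filter does: it keeps a URL only if its path ends with an allowed suffix. The function name `keep_by_suffix`, its parameters, and the example URLs are all made up for this sketch.

```python
from urllib.parse import urlparse


def keep_by_suffix(urls, allowed_suffixes=(".pdf",)):
    """Return only the URLs whose path ends with an allowed suffix.

    Hypothetical helper for illustration; the comparison is
    case-insensitive and ignores any query string.
    """
    suffixes = tuple(s.lower() for s in allowed_suffixes)
    kept = []
    for url in urls:
        # Look at the path only, so "?download=1" does not hide the suffix.
        path = urlparse(url).path.lower()
        if path.endswith(suffixes):
            kept.append(url)
    return kept


# Example candidate URLs (invented for this sketch).
candidates = [
    "http://example.com/docs/report.pdf",
    "http://example.com/index.html",
    "http://example.com/papers/Paper.PDF?download=1",
]
pdf_only = keep_by_suffix(candidates)
print(pdf_only)
```

Note that a filter like this also rejects the HTML pages themselves, so it only makes sense once the pages linking to the PDFs have already been crawled, as in the scenario described above.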

Dennis

Marco Vanossi wrote:

> Hi,
>
> Do you think there is an easy way to make Nutch generate a list of only
> certain document types to fetch?
>
> For example:
> If one would like to crawl only PDF docs (after some pages that link to
> PDF docs were already crawled), a command like:
> "bin/nutch generate db segments -topN 1000 -type:pdf" could do that.
>
> Thanks for any help and comments,
> Marco
>