Filtering pages before indexing

roman.spitzbart
Hi,

Is there a way to filter pages before they're indexed in Nutch? I'm trying to crawl an intranet site, but only PDF documents should make it into the index (this will be extended later, but PDFs are the main focus). I've tried the regex and suffix URL filters, but they prevent those pages from being crawled as well. I want to crawl all pages (mainly HTML) and then index only the PDFs referenced by those pages.

Thanks and Regards,
Roman

Re: Filtering pages before indexing

Andrzej Białecki-2
[hidden email] wrote:
> Hi,
>
> Is there a way to filter pages before they're indexed in Nutch? I'm trying to crawl an intranet site, but only PDF documents should make it into the index (this will be extended later, but PDFs are the main focus). I've tried the regex and suffix URL filters, but they prevent those pages from being crawled as well. I want to crawl all pages (mainly HTML) and then index only the PDFs referenced by those pages.
>  

Nutch doesn't run URL filters at the indexing stage; it assumes that
correctly fetched and parsed documents are suitable for indexing.

However, you could make a minor modification in Indexer.reduce(), around
lines 213-214. Instead of this:

        ...
        Document doc = new Document();
        Metadata metadata = parseData.getContentMeta();
        ...

you could first try running filters on the URL:

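    public void configure(JobConf job) {
        ...
        // Load the URL filter plugins configured for this job.
        urlfilters = new URLFilters(job);
        ...
    }

    public void reduce( ...) {
        ...
        // Skip any URL the filters reject: returning here, before
        // output.collect() is called, keeps it out of the index.
        // Note: filter() declares URLFilterException, so catch or wrap it.
        String url = urlfilters.filter(key.toString());
        if (url == null) return;
        Document doc = new Document();
        Metadata metadata = parseData.getContentMeta();
        ...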

or run any other method of checking whether it's an acceptable type of
document. In short, if you return from the reduce() method without
calling output.collect(), then this document will be dropped from the
resulting index.
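
As an illustration of "any other method" of checking the document type, here
is a minimal sketch of a PDF-only test. It is not part of Nutch: the class
name PdfOnlyCheck and the method isPdf are hypothetical, and it assumes that
a content type of application/pdf or a URL path ending in .pdf is a good
enough signal for this intranet.

```java
import java.util.Locale;

// Hypothetical helper, not a Nutch class: decides whether a fetched
// document should be treated as a PDF for indexing purposes.
class PdfOnlyCheck {

    // Returns true if either the content type or the URL path suggests PDF.
    // Both arguments may be null.
    static boolean isPdf(String url, String contentType) {
        if (contentType != null) {
            // Drop parameters such as "; charset=utf-8" before comparing.
            String type = contentType.split(";", 2)[0]
                    .trim().toLowerCase(Locale.ROOT);
            if (type.equals("application/pdf")) {
                return true;
            }
        }
        if (url == null) {
            return false;
        }
        // Ignore the query string and fragment so that
        // "report.pdf?download=1" still counts as a PDF.
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

In reduce(), a line such as
"if (!PdfOnlyCheck.isPdf(key.toString(), contentType)) return;" placed ahead
of "new Document()" would drop everything else, in the same way as the
URL-filter check above. Where contentType comes from is left open here; it
might be read from the content metadata the snippet already obtains, but
that is an assumption to verify against your Nutch version.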

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

