Excluding individual pages?

Dave Beckstrom-2
Hi Everyone,

I searched and didn't find an answer.

Nutch is indexing the content of the page that has the seed URLs in it and
then that page shows up in the Solr search results. We don't want that to
happen.

Is there a way to have Nutch crawl the seed URL page but not push that page
into Solr? If not, is there a way to have a particular page excluded from
the Solr search results? Either way I'm trying to not have that page show
in search results.

Thank you!

Dave

--
*Fig Leaf Software is now Collective FLS, Inc.*

*Collective FLS, Inc.*
https://www.collectivefls.com/

RE: Excluding individual pages?

Markus Jelsma-2
Hello Dave,

If there is just one specific page you do not want Nutch to index, or Solr to show, you can either create a custom IndexingFilter that returns null for that URL (which rejects the document), or add a filter query to your Solr requests, fq=-id:"<SEED_URL>", which filters that URL out of the results. Quote the URL (or escape its colons), because an unquoted colon is query syntax in Solr.
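For the Solr side, here is a minimal sketch of building that filter query in Java. The class and method names (`SeedUrlFilterQuery`, `excludeById`, `asRequestParameter`) are made up for illustration and are not part of Solr or Nutch. It assumes, as the `fq` above does, that the `id` field holds the page URL.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds a Solr filter query that hides one document by id. */
final class SeedUrlFilterQuery {

    private SeedUrlFilterQuery() {}

    /** Returns the raw fq value, e.g. -id:"http://example.com/seeds.html". */
    static String excludeById(String url) {
        // Quote the URL as a phrase so ':' and '/' are not read as query syntax.
        // Inside the quotes, only backslash and double quote need escaping.
        String escaped = url.replace("\\", "\\\\").replace("\"", "\\\"");
        return "-id:\"" + escaped + "\"";
    }

    /** Returns the fq parameter, URL-encoded, ready to append to a select request. */
    static String asRequestParameter(String url) {
        return "fq=" + URLEncoder.encode(excludeById(url), StandardCharsets.UTF_8);
    }
}
```

Calling `SeedUrlFilterQuery.asRequestParameter("http://example.com/seeds.html")` (an example URL, not the real seed page) gives `fq=-id%3A%22http%3A%2F%2Fexample.com%2Fseeds.html%22`. The page stays in the index; it is only hidden from requests that carry this parameter.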

If there are more than a few URLs you want to exclude from indexing, and they follow a pattern, you can use regular expressions in the IndexingFilter or in the Solr filter query.
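The pattern-based decision could look roughly like the sketch below. It is plain Java with no Nutch dependency: `IndexExclusionRules` and `shouldIndex` are hypothetical names, and the plugin wiring is left out. In a custom IndexingFilter, the `filter` method would return null whenever `shouldIndex` is false and return the document unchanged otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical rule set deciding which URLs are kept out of the index. */
final class IndexExclusionRules {

    private final List<Pattern> excluded = new ArrayList<>();

    IndexExclusionRules(List<String> regexes) {
        // Compile once up front; the rules are consulted for every document.
        for (String regex : regexes) {
            excluded.add(Pattern.compile(regex));
        }
    }

    /** Returns true when the whole URL matches none of the exclusion patterns. */
    boolean shouldIndex(String url) {
        for (Pattern pattern : excluded) {
            if (pattern.matcher(url).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, rules built from `https?://example\.com/seeds\.html` and `https?://example\.com/overview/.*` (both example patterns) would reject `http://example.com/seeds.html` and anything under `/overview/`, while still letting `http://example.com/articles/1.html` through.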

This is manual intervention, and only practical if the set of URLs is small and does not change frequently. If that is not the case, you need more rigorous tools to detect and reject what we call hub pages or overview pages.

Regards,
Markus
 
-----Original message-----

> From: Dave Beckstrom <[hidden email]>
> Sent: Thursday 10th October 2019 22:34
> To: [hidden email]
> Subject: Excluding individual pages?