rejected by filters

rejected by filters

Robert Scavilla
Hello, and thank you for helping. For some reason Nutch is rejecting the domain
https://www.penn.museum/

The regex-urlfilter is: +.
seeding with https://www.penn.museum/

And on crawl it keeps giving:
Injector: Total urls rejected by filters: 1

This is the only time I've had this issue, and I was wondering whether the
.museum TLD is the problem.

Re: rejected by filters

BlackIce
I think you are correct in your assumption.
According to this:

https://issues.apache.org/jira/browse/NUTCH-2620?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

Nutch assumes that the TLD is no longer than 4 characters. This is in the
process of being fixed for the next release, which should be out shortly.
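
To illustrate how a length cap on the TLD could reject this seed, here is a
minimal Python sketch. It is not Nutch's code: the two patterns and the
host_accepted helper are hypothetical and only show the general effect of a
4-character limit compared with a relaxed one.

```python
import re
from urllib.parse import urlsplit

# Hypothetical host pattern (NOT taken from Nutch): the last label,
# i.e. the TLD, may be only 2 to 4 letters long.
CAPPED_TLD = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,4}$")

# Same hypothetical pattern with the cap raised to 63, the maximum
# length of a single DNS label.
RELAXED_TLD = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,63}$")


def host_accepted(url, pattern):
    """Return True if the host part of `url` matches `pattern`."""
    host = urlsplit(url).hostname or ""
    return pattern.match(host) is not None


seed = "https://www.penn.museum/"
print(host_accepted(seed, CAPPED_TLD))   # "museum" has 6 letters
print(host_accepted(seed, RELAXED_TLD))
```

With the capped pattern the .museum host fails to match even though the
regex-urlfilter rule "+." would accept any URL, which is consistent with the
"rejected by filters" count of 1 if a check of this kind runs during injection.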

Greetings
