Adding specific query parameters to nutch url filters


Sachin Mittal
Hi,
I have checked the regex-urlfilter and by default I see this line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

In my case, for a particular URL I want to crawl a specific query, so I
wanted to know which file would be the best place to make changes to
enable this.

Would it be regex-urlfilter? I also see the filter files suffix-urlfilter
and fast-urlfilter.

Would adding filters to either of the latter two files help?
Any idea why these filters were added, i.e. what the potential use case
would be?

Also, if I add multiple filter plugins backed by these files, how does
URL filtering work? Are only those URLs that pass all the plugins
selected to be fetched, or is passing any one plugin enough?

Thanks
Sachin

RE: Adding specific query parameters to nutch url filters

Markus Jelsma-2
Hello Sachin,

Once a URL is filtered out by any plugin, it is rejected entirely. In other words, a URL must pass every active filter plugin to be fetched.
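
The "rejected by any plugin" behaviour can be pictured with a small Python sketch. This is not Nutch code (Nutch itself is Java); the function names and the two toy filters are hypothetical and only illustrate the idea that each filter either returns the URL or rejects it, and the chain stops at the first rejection.

```python
from typing import Callable, Iterable, Optional

# A filter returns the URL to accept it, or None to reject it.
UrlFilter = Callable[[str], Optional[str]]


def suffix_filter(url: str) -> Optional[str]:
    """Toy filter (hypothetical rule): reject URLs ending in .jpg."""
    return None if url.lower().endswith(".jpg") else url


def query_filter(url: str) -> Optional[str]:
    """Toy filter (hypothetical rule): reject URLs containing '?'."""
    return None if "?" in url else url


def run_chain(url: str, filters: Iterable[UrlFilter]) -> Optional[str]:
    """Pass the URL through every filter; one rejection rejects it entirely."""
    current: Optional[str] = url
    for url_filter in filters:
        current = url_filter(current)
        if current is None:
            return None  # no later filter can bring the URL back
    return current
```

With both toy filters in the chain, a URL survives only if neither of them rejects it, so adding more filter plugins can only narrow the set of accepted URLs.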

If you want specific queries to pass the regex-urlfilter, you must let them pass explicitly with a rule placed above the -[?*!@=] line, e.g. +passThisQuery=
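
Why the position of the rule matters can be sketched in Python. This is only an illustration, under the assumption that rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is rejected; `parse_rules`, `regex_filter`, and the example URLs are hypothetical names, and `passThisQuery` is the placeholder from the advice above.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[bool, re.Pattern]  # (accept?, compiled pattern)


def parse_rules(text: str) -> List[Rule]:
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def regex_filter(url: str, rules: List[Rule]) -> Optional[str]:
    """Return the URL if the first matching rule is '+', else None."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match anywhere in the URL
            return url if accept else None
    return None  # assumption: no matching rule means reject


RULES = parse_rules("""
# hypothetical allow rule, placed above the default skip rule
+passThisQuery=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
""")

print(regex_filter("http://example.com/search?passThisQuery=1", RULES))
print(regex_filter("http://example.com/search?other=1", RULES))
```

The first call prints the URL because the `+passThisQuery=` rule matches before the skip rule is reached; the second prints `None`. If the allow rule were moved below `-[?*!@=]`, both URLs would be rejected by the skip rule first.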

Use bin/nutch filterchecker -stdIn for quick testing.

Regards,
Markus

-----Original message-----

> From: Sachin Mittal <[hidden email]>
> Sent: Monday 21st October 2019 14:22
> To: [hidden email]
> Subject: Adding specific query parameters to nutch url filters