RE: Adding specific query parameters to nutch url filters
Once a URL gets filtered, by any plugin, it is rejected entirely.
If you want specific queries to pass the regex-urlfilter, you must let them pass explicitly above the -[?*!@=] line, e.g. +passThisQuery=
Use bin/nutch filterchecker -stdIn for quick testing.
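
To make the two points above concrete, here is a minimal Python sketch of the behaviour as I understand it. It is not Nutch code. It assumes regex-urlfilter rules are tried top to bottom, that a pattern matches if it is found anywhere in the URL, that the first matching rule decides, and that a URL matching no rule is rejected. The "+." catch-all line, the example.com URLs, and the second filter in the chain are placeholders for illustration only.

```python
import re

# Hypothetical excerpt of a regex-urlfilter rules file. The allow rule sits
# ABOVE the skip rule; "passThisQuery" is the placeholder from the reply.
RULES_TEXT = """
# let this one query parameter through
+passThisQuery=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else (assumed catch-all)
+.
"""

def parse_rules(text):
    """Return (accept, compiled_pattern) pairs in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def regex_filter(url, rules):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

def passes_all(url, filters):
    """Model of a filter chain: the URL survives only if every filter accepts it."""
    return all(f(url) for f in filters)

rules = parse_rules(RULES_TEXT)
```

With these rules, a URL containing passThisQuery= is accepted, any other URL containing ?, *, !, @ or = is rejected, and swapping the first two rules would reject both. Passing the regex rules is still not enough if another filter in the chain says no.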
> From: Sachin Mittal <[hidden email]>
> Sent: Monday 21st October 2019 14:22
> To: [hidden email]
> Subject: Adding specific query parameters to nutch url filters
> I have checked the regex-urlfilter and by default I see this line:
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> In my case for a particular url I want to crawl a specific query, so wanted
> to know what file would be the best to make changes to enable this.
> Would it be regex-urlfilter? I also see filter files for suffix-urlfilter
> and fast-urlfilter.
> Would adding filters in either of the latter two files help?
> Any idea why these filters are added, like what would be the potential
> Also, if I add multiple filter plugins backed by these files, how does
> URL filtering work? Are only those URLs that pass all the plugins
> selected to be fetched, or is passing any one plugin enough?