# skip file: ftp: and mailto: urls
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# accept anything else
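These comments match the ones in the stock `conf/regex-urlfilter.txt` shipped with Nutch; for readers following along, the rules that sit under each comment in the default file look roughly like this (your local copy may differ):

```
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|zip|ZIP|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
```

Lines starting with `-` reject matching URLs, lines starting with `+` accept them, and the first matching rule wins.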
3. *nutch-site.xml* has the *plugin.includes* property with *urlfilter-regex*
included in it.
4. When I test with *bin/nutch plugin urlfilter-regex
org.apache.nutch.urlfilter.regex.RegexURLFilter*, I am getting the expected
results, but at crawl time all these URLs are getting included in the fetch list.
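For context, overriding *plugin.includes* in *nutch-site.xml* usually looks like the snippet below. The value shown is only illustrative (the exact plugin list depends on your crawl setup); the point is that `urlfilter-regex` must appear in it for the regex filter to be loaded at all:

```xml
<!-- Illustrative plugin.includes override; keep whatever other
     plugins your crawl needs, and make sure urlfilter-regex is listed. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```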
> From: Gajanan Watkar <[hidden email]>
> To: [hidden email]
> Cc:
> Date: Wed, 10 Oct 2018 17:19:24 +0530
> Subject: Re: Unable to get regex-urlfilter working
> I am using Nutch 2.x with HBase as backend storage.
It was a very basic mistake on my part. The default crawl script launches
GeneratorJob with the -noFilter switch, which I failed to notice. The rest of
the configuration and the job file were fine. Your reply was indeed helpful.
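To make the resolution concrete: the generate step only applies the configured URL filters when -noFilter is absent. A sketch of running generate directly with filtering enabled (the crawl id and -topN value here are illustrative):

```shell
# Run the generate step yourself, WITHOUT -noFilter, so that
# regex-urlfilter.txt is consulted when selecting URLs to fetch.
bin/nutch generate -topN 1000 -crawlId my_crawl

# For comparison, the stock crawl script effectively runs:
#   bin/nutch generate -topN 1000 -crawlId my_crawl -noFilter
# which skips all configured URL filters at generate time, so
# rejected URLs still end up in the fetch list.
```

Alternatively, edit the crawl script to drop -noFilter from its generate invocation.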
On Thu, Oct 11, 2018 at 9:39 PM lewis john mcgibbney <[hidden email]> wrote:
> Hi Gajanan,
> Seeing as you are using 2.x, are you making sure that the project has been
> built with the correct regex-urlfilter.txt being present on ClassPath and
> included in the job jar you are using?
> On Thu, Oct 11, 2018 at 12:19 AM <[hidden email]> wrote:
> > From: Gajanan Watkar <[hidden email]>
> > To: [hidden email]
> > Cc:
> > Bcc:
> > Date: Wed, 10 Oct 2018 17:19:24 +0530
> > Subject: Re: Unable to get regex-urlfilter working
> > I am using Nutch 2.x with HBase as backend storage.
> > -Gajanan