Unable to get regex-urlfilter working


Gajanan Watkar
Hi all,

*1. I want to filter out all URLs like:*

http://14538.diarynote.jp/items/music-jp/B00005FMG1/
http://12899diarynote.jp/amp/201503160602121325/
http://15131513marudiarynote.jp/amp/201603181431397340/
http://11621diarynote.jp/amp/200409061741310000/
http://14291.diarynote.jp/items/dvd-jp/B00016ZPCQ/
http://10695diarynote.jp/amp/200908112143487146/

*2. Contents of regex-urlfilter.txt file:*

# skip diarynote.jp
-.*diarynote.jp.*

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

*3.* *nutch-site.xml* has the *plugin.includes* property with the
*urlfilter-regex* plugin included in it.

*4.* When I test with *bin/nutch plugin urlfilter-regex
org.apache.nutch.urlfilter.regex.RegexURLFilter*, I get the expected
results, *but at crawl time all these URLs are still included in the fetch
list*.

18/10/10 12:35:23 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/home/user/hadoop/tmp/hadoop-unjar5013141110548091848/regex-urlfilter.txt
http://11848.diarynote.jp/home/diary/new/
-http://11848.diarynote.jp/home/diary/new/
http://23810diarynote.jp/amp/201210031421469096/
-http://23810diarynote.jp/amp/201210031421469096/
diarynote.jp/amp/201210031421469096/
-diarynote.jp/amp/201210031421469096/
23810diarynote.jp/amp/201210031421469096/
-23810diarynote.jp/amp/201210031421469096/
11848.diarynote.jp/home/diary/new/
-11848.diarynote.jp/home/diary/new/
http://20131110karadiarynote.jp/amp/201604260043253476/
-http://20131110karadiarynote.jp/amp/201604260043253476/
20131110karadiarynote.jp/amp/201604260043253476/
-20131110karadiarynote.jp/amp/201604260043253476/
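
For anyone who wants to reproduce the tester's +/- output outside Nutch, here is a minimal Python sketch of how a first-match-wins rule list like the one in item 2 decides on a URL. It is an approximation, not Nutch's own code: it assumes each rule is a `+` or `-` sign followed by a regex, that the regex may match anywhere in the URL, and that a URL matching no rule is rejected. The names `parse_rules` and `filter_url` are hypothetical, Python's `re` stands in for Java's regex engine, and the suffix list is shortened.

```python
import re

# Rules copied from the regex-urlfilter.txt in item 2 (suffix list shortened).
RULES_TEXT = r"""
# skip diarynote.jp
-.*diarynote.jp.*
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
"""


def parse_rules(text):
    """Return a list of (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url, rules):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in [
        "http://11848.diarynote.jp/home/diary/new/",
        "http://23810diarynote.jp/amp/201210031421469096/",
        "http://example.org/index.html",  # hypothetical URL, not from the thread
    ]:
        print(("+" if filter_url(url, rules) else "-") + url)
```

Under these assumptions the two diarynote.jp URLs are printed with a leading `-` and the example.org URL with a leading `+`, which agrees with the tester output above. So the rule itself is not the problem; the question is whether the filter is applied at crawl time.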

*What could be the problem? Any help would be appreciated.*


*-Gajanan*

Re: Unable to get regex-urlfilter working

Gajanan Watkar
I am using Nutch 2.x with HBase as the backend storage.

*-Gajanan*



Re: Unable to get regex-urlfilter working

lewis john mcgibbney-2
Hi Gajanan,
Since you are using 2.x, have you made sure that the project was built with
the correct regex-urlfilter.txt on the classpath, and that this file is
included in the job jar you are using?
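
One way to check this is to compare the filter file packed inside the job archive with the file edited on disk. The following is a rough Python sketch, assuming the job file is an ordinary zip/jar archive and that regex-urlfilter.txt sits at its root (both are assumptions; adjust `member` if your build lays it out differently). All function names are hypothetical.

```python
import zipfile


def active_rules(text):
    """Keep only the rule lines (drop blank lines and # comments)."""
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.strip().startswith("#")]


def rules_in_archive(archive_path, member="regex-urlfilter.txt"):
    """Read the filter file packed inside a jar/job archive (a zip file).

    Raises KeyError if the archive has no entry called `member`.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return active_rules(archive.read(member).decode("utf-8"))


def rules_on_disk(path):
    """Read the filter file that was edited on disk."""
    with open(path, encoding="utf-8") as handle:
        return active_rules(handle.read())


def missing_from_archive(archive_path, conf_path, member="regex-urlfilter.txt"):
    """Rules present in the edited file but absent from the packed copy."""
    packed = set(rules_in_archive(archive_path, member))
    return [rule for rule in rules_on_disk(conf_path) if rule not in packed]
```

Calling `missing_from_archive` with the path to your job file and the path to your edited regex-urlfilter.txt returns an empty list when every edited rule made it into the archive. A non-empty list suggests the job was built before the file was changed.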


Re: Unable to get regex-urlfilter working

Gajanan Watkar
Thanks Lewis,
It was a very basic mistake on my part. The default crawl script launches
the generate job with the -noFilter switch, which I failed to notice. The
rest of the configuration and the job file were fine. Your reply was indeed
helpful in bootstrapping the debugging.

-Gajanan
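
For readers hitting the same symptom, a quick way to spot this is to scan the crawl script for commands that pass the switch. Below is a small Python sketch; `lines_with_flag` is a hypothetical helper, and the sample script text is purely illustrative and not copied from the real crawl script.

```python
def lines_with_flag(script_text, flag="-noFilter"):
    """Return (line_number, line) pairs for non-comment lines passing `flag`."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore shell comments
        if flag in stripped.split():
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    # Illustrative snippet only; it is NOT copied from the real crawl script.
    sample = """\
# generate a fetch list
run_step generate -topN 50000 -noFilter
run_step fetch -all
"""
    for number, line in lines_with_flag(sample):
        print(f"line {number}: {line}")
```

To use it on a real script, read the file's contents and pass them in as `script_text`. Any reported line is a place where URL filtering is switched off for that step, regardless of what regex-urlfilter.txt contains.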
