How are the Regex URL Filters Supposed to Work?

Tkach
Okay, I think I may be missing something here.  I'm trying to use the
regex-urlfilter.txt and/or crawl-urlfilter.txt to make sure that only a
few url roots are accepted and several are rejected.

As a good example, I want to not fetch/index anything that starts with
http://www.stopandshop.com/UPR_SSWWeb/ .  When I run a crawl with
"bin/nutch crawl urls -dir snsgiant -depth 2" I find that it pulls back
at least one URL that ought to be blocked, SswLogout.  Am I just not
getting something about the syntax?  (See the clip at the bottom; I turned
up logging to DEBUG for everything.)

One other quick question, can you/how can you fetch some (but not all)
URLs which contain a ? such as the ...company_employapp.htm?posname in
the attached urlfilter texts?  I tried taking the ? out of the list of
symbols to skip, but that doesn't seem to work quite right.

INFO  fetcher.Fetcher - fetching
http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
2008-02-21 18:12:28,094 DEBUG api.RobotRulesParser - cache miss
http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
2008-02-21 18:12:29,457 DEBUG http.Http - fetching
http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
2008-02-21 18:12:29,548 DEBUG http.Http - fetched 0 bytes from
http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout

http://www.giantfood.com/
http://www.stopandshop.com/
http://www.stopandshop.com/payvantage/

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|cfm|sit|eps|wmf|js|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
-[*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

+^http://www.giantfood.com/
+^http://www.stopandshop.com/
+^http://www.stopandshop.com/payvantage/

# accept anything else
-^http://www.giantfood.com/cgi-bin/*
-^http://www.giantfood.com/locator/store_dsp_detail*
-^http://www.giantfood.com/aplus/aplus_school_directory.htm*
-^http://www.giantfood.com/careers/company_employapp.htm?posname=*
-^http://www.giantfood.com/pharmacy/header.htm
-^http://www.giantfood.com/pharmacy/sidebar.htm
-^http://www.giantfood.com/wine/header.htm
-^http://www.giantfood.com/wine/sidebar.htm
-^http://www.giantfood.com/foodguide/*
-^http://www.stopandshop.com/cgi-bin/*
-^http://www.stopandshop.com/rxrefill/ss-top.htm
-^http://www.stopandshop.com/rxrefill/ss-left.htm
-^http://www.stopandshop.com/rxrefill/ss-frame.htm
-^http://www.stopandshop.com/rxrefill/blank.htm
-^http://www.stopandshop.com/great_ideas/meal_solutions/top.htm
-^http://www.stopandshop.com/great_ideas/gift_cards/top.htm
-^http://www.stopandshop.com/UPR_SSWWeb/*
-.


Re: How are the Regex URL Filters Supposed to Work?

Mario Méndez Villegas
I think your filters could work if you put the URLs you don't want to
crawl first. The first matching pattern decides, so with the + lines on
top, the - lines below them are never reached. I.e.:

# accept anything else
-^http://www.giantfood.com/cgi-bin/*
-^http://www.giantfood.com/locator/store_dsp_detail*
-^http://www.giantfood.com/aplus/aplus_school_directory.htm*

+^http://www.giantfood.com/
+^http://www.stopandshop.com/
+^http://www.stopandshop.com/payvantage/
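To make the ordering point concrete, here is a minimal Python sketch of the "first matching pattern wins" rule described in the filter file's own comments. It is not Nutch code: parse_rules and accepts are hypothetical helpers, and it assumes each pattern is searched for anywhere in the URL (unanchored unless the pattern itself starts with ^). Python's re stands in for Java's regex engine, which should behave the same for these simple patterns.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no pattern matched: the URL is ignored

# Abbreviated version of the posted file: the '+' line comes first.
ORIGINAL = """
+^http://www.stopandshop.com/
-^http://www.stopandshop.com/UPR_SSWWeb/*
-.
"""

# Same rules with the exclusion moved above the inclusion.
REORDERED = """
-^http://www.stopandshop.com/UPR_SSWWeb/
+^http://www.stopandshop.com/
-.
"""

url = "http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout"
print(accepts(parse_rules(ORIGINAL), url))   # True: the '+' rule matches first
print(accepts(parse_rules(REORDERED), url))  # False: the '-' rule matches first
```

Note that the trailing /* in the posted patterns is regex syntax for "zero or more slashes", not a shell-style wildcard. Under the search assumption above, a plain prefix such as ^http://www.stopandshop.com/UPR_SSWWeb/ already covers everything beneath that path.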

2008/2/21 Nick Tkach <[hidden email]>:

> Okay, I think I may be missing something here.  I'm trying to use the
> regex-urlfilter.txt and/or crawl-urlfilter.txt to make sure that only a
> few url roots are accepted and several are rejected.
>
> As a good example, I want to not fetch/index anything that starts
> http://www.stopandshop.com/UPR_SSWWeb/ .  When I run a crawl with
> "bin/nutch crawl urls -dir snsgiant -depth 2' I find that it pulls back
> at least one URL that ought to be blocked, SswLogout.  Am I just not
> getting something about the syntax?  (See the clip at the bottom-turned
> up logging to DEBUG for all)
>
> One other quick question, can you/how can you fetch some (but not all)
> URLs which contain a ? such as the ...company_employapp.htm?posname in
> the attached urlfilter texts?  I tried taking the ? out of the list of
> symbols to skip, but that doesn't seem to work quite right.
>
> INFO  fetcher.Fetcher - fetching
> http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
> 2008-02-21 18:12:28,094 DEBUG api.RobotRulesParser - cache miss
> http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
> 2008-02-21 18:12:29,457 DEBUG http.Http - fetching
> http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
> 2008-02-21 18:12:29,548 DEBUG http.Http - fetched 0 bytes from
> http://www.stopandshop.com/UPR_SSWWeb/login/SswLogout
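On the second question in the original post (URLs containing a ?), two things in the posted file look like they could interfere. This is a hedged sketch in Python rather than a test against Nutch itself, and the query value "clerk" is made up for illustration. First, the character-class rule still contains =, so a URL such as ...htm?posname=... is caught by it even with ? removed. Second, an unescaped ? in a pattern is a regex quantifier ("the previous character is optional"), so the posted company_employapp pattern does not match a literal question mark.

```python
import re

# "clerk" is a hypothetical query value, used only for illustration.
url = "http://www.giantfood.com/careers/company_employapp.htm?posname=clerk"

# The character-class rule as posted: '?' was removed, but '=' is still there.
skip_chars = re.compile(r"[*!@=]")

# The posted pattern, with '.' and '?' left as regex metacharacters.
unescaped = re.compile(
    r"^http://www.giantfood.com/careers/company_employapp.htm?posname=*"
)

# The same prefix with the metacharacters escaped so they match literally.
escaped = re.compile(
    r"^http://www\.giantfood\.com/careers/company_employapp\.htm\?posname="
)

hits_skip_chars = bool(skip_chars.search(url))
hits_unescaped = bool(unescaped.search(url))
hits_escaped = bool(escaped.search(url))

print(hits_skip_chars)  # True: '=' alone is enough to trigger the skip rule
print(hits_unescaped)   # False: 'htm?' means 'ht' plus an optional 'm'
print(hits_escaped)     # True: '\?' matches the literal question mark
```

If the first-match rule works as the file's comments describe, any rule meant to single out particular query URLs (whether with + or -) would need the escaped form and would have to sit above the character-class line.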