Bad URLs causing SEVERE exception

Chirag Chaman-2

Over the weekend the fetcher crashed and kept crashing. The culprit was a
site that was pointing to bad links -- http://:80/, http://:0/, and the like.

These links were getting through, so we changed the URL filter to accept
only valid URLs.

Since someone else may face the same issue, here is the regular expression --
it should go toward the end of your regex-urlfilter.txt. It would be nice if
one of the committers could add it to the default file, commented out.

# accept http only - valid URLs only
+^http://[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+[\:0-9]*


NOTE: This is only suitable for whole-web crawling. Do not use it for
intranet crawling, as it will not let through any URL whose hostname lacks
at least one period.
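If you want to sanity-check the rule before dropping it into
regex-urlfilter.txt, here is a minimal standalone Java sketch (the class
name UrlFilterCheck and the sample intranet hostname are just for
illustration). It runs the same pattern against the bad links above, a
normal URL, and a dotless intranet host, using java.util.regex directly
rather than going through Nutch's RegexURLFilter -- so take it as a sketch
of the matching behavior, not of the plugin itself.

    import java.util.regex.Pattern;

    public class UrlFilterCheck {
        // Same pattern as the regex-urlfilter.txt rule, minus the leading '+'.
        private static final Pattern VALID_HTTP =
                Pattern.compile("^http://[a-zA-Z0-9\\-]+\\.[a-zA-Z0-9\\-.]+[:0-9]*");

        public static void main(String[] args) {
            String[] urls = {
                "http://:80/",              // bad link from the crash: empty host
                "http://:0/",               // bad link from the crash: empty host
                "http://www.filangy.com/",  // hostname with a dot: accepted
                "http://intranethost/"      // dotless intranet host: rejected
            };
            for (String url : urls) {
                boolean accepted = VALID_HTTP.matcher(url).find();
                System.out.println((accepted ? "ACCEPT " : "REJECT ") + url);
            }
        }
    }

Rules in regex-urlfilter.txt are evaluated top to bottom and the first
matching '+' or '-' line wins, which is why this catch-all accept rule
belongs near the end of the file, after your reject rules.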


CC-
--------------------------------------------
Filangy, Inc.
www.filangy.com