Bad URLs causing SEVERE exception


Chirag Chaman-2

Over the weekend the fetcher crashed and kept crashing. The culprit was a
site that pointed to malformed links -- http://:80/ and http://:0/ etc.

These links were getting through the filter -- so we changed the URL filter
to accept only valid URLs.

As someone else may face the same issue, here is the regex -- it should go
near the end of your regex-urlfilter.txt. It would be nice if one of the
committers could add this to the default file, commented out.

# accept http only - valid URLs only

NOTE: This is only suitable for Web crawling. If you need intranet crawling,
do not use it, as it will not let through any URL whose hostname lacks at
least one period.
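The exact expression from the original post did not survive in this archive, so here is a sketch of the kind of rule described above: require a scheme, a dotted hostname, and an optional port, which rejects host-less links like http://:80/. The pattern below is an assumption, not the poster's original regex.

```python
import re

# Hypothetical "valid URLs only" rule (assumption, not the original
# poster's regex): scheme, then a hostname containing at least one
# period, then an optional :port and path. This is what rejects
# malformed links such as http://:80/ and http://:0/.
VALID_URL = re.compile(r'^https?://([A-Za-z0-9-]+\.)+[A-Za-z0-9-]+(:\d+)?(/.*)?$')

def accept(url):
    """Return True if the URL has a dotted hostname, False otherwise."""
    return VALID_URL.match(url) is not None

# accept("http://:80/")        -> False (no hostname before the port)
# accept("http://example.com/") -> True
# accept("http://localhost/")   -> False (no period -- the intranet caveat)
```

In regex-urlfilter.txt itself, such a pattern would appear as a `+`-prefixed accept line, typically followed by `-.` to reject everything else.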

Filangy, Inc.