I need the decoded forms for my project. If any contributors want that change, I'll submit the one-file patch for the decoded URLs.
If any contributors want the URL completely encoded per RFC 1738 for use in fetching and searching, I can submit that patch as well. This last item is what I believe this bug was opened for in the first place, though after the research posted above, it doesn't look like it's required.
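For reference, here is a rough sketch of what the fully encoded variant could look like. It is not the patch itself: the class and method names (`UrlEscaper`, `escapeIllegal`) are made up for illustration, and it assumes the high-bit characters should be escaped as UTF-8 bytes, which may not match what a given IIS server expects (it could be using Latin-1 or a Windows code page instead). Existing `%` signs are left alone so already-escaped URLs are not double-encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Nutch.
public class UrlEscaper {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // ASCII characters escaped in addition to controls, space and DEL.
    // Assumption: this follows the "unsafe" list in RFC 1738, except for
    // '%' and '#', which are kept because they already carry meaning
    // (existing escapes and fragments) in a URL string.
    private static final String UNSAFE_ASCII = "\"<>\\^`{|}";

    /**
     * Percent-escapes every high-bit byte (assuming UTF-8), every control
     * character and space, and the unsafe ASCII characters listed above.
     * All other characters are copied through unchanged.
     */
    public static String escapeIllegal(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v <= 0x20 || v >= 0x7F || UNSAFE_ASCII.indexOf(v) >= 0) {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            } else {
                out.append((char) v);
            }
        }
        return out.toString();
    }
}
```

With these assumptions, a made-up URL such as `http://example.com/docs/Apresentação final.doc` would come out as `http://example.com/docs/Apresenta%C3%A7%C3%A3o%20final.doc`. Something like this could run right after the hostname lower-casing step mentioned in the original report below.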
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include
> illegal characters in URLs -- specifically, characters with
> the high bit set to produce non-English letters. In
> addition, both Firefox and IE will accept URLs with high-
> bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would
> help if high-bit characters (and other illegal characters)
> in URLs could be escaped (using percent-hex notation)
> as part of the URL fix-up process, probably right after
> the hostname lower-case conversion.
> Example document name in Portuguese(with high-bit
> characters) taken from a longer URL:
> and with percent-escaped characters: