[jira] [Commented] (NUTCH-2598) URLNormalizerChecker fails on invalid URLs in input

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748869#comment-16748869 ]

ASF GitHub Bot commented on NUTCH-2598:

sebastian-nagel commented on pull request #435: NUTCH-2598 URLNormalizerChecker fails on invalid URLs in input
URL: https://github.com/apache/nutch/pull/435
   Output empty string for invalid URLs and do not exit.

> URLNormalizerChecker fails on invalid URLs in input
> ---------------------------------------------------
>                 Key: NUTCH-2598
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2598
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
> I use the URLNormalizerChecker (with urlnormalizer-regex and urlnormalizer-basic) to normalize URLs before processing them further. If one of the configured normalizers throws a MalformedURLException when URLNormalizer.normalize(...) is called, the exception isn't caught and the checker exits:
> {noformat}
> Exception in thread "main" java.net.MalformedURLException: For input string: "???120810002"
>         at java.net.URL.<init>(URL.java:627)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:100)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
>         at org.apache.nutch.net.URLNormalizerChecker.process(URLNormalizerChecker.java:75)
>         at org.apache.nutch.util.AbstractChecker.processStdin(AbstractChecker.java:97)
>         at org.apache.nutch.util.AbstractChecker.run(AbstractChecker.java:77)
>         at org.apache.nutch.net.URLNormalizerChecker.run(URLNormalizerChecker.java:71)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:80)
> Caused by: java.lang.NumberFormatException: For input string: "???120810002"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:580)
>         at java.lang.Integer.parseInt(Integer.java:615)
>         at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:222)
>         at java.net.URL.<init>(URL.java:622)
>         ... 10 more
> {noformat}
> The URLNormalizer interface declares the MalformedURLException, so it should be caught in the normalizer checker, which should then:
> - log the error
> - return/output empty string
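
The proposed behaviour (log the error, output an empty string, keep going) can be illustrated with a minimal standalone sketch. This is not the code from the pull request: the class `NormalizerCheckSketch` and its methods `normalize` and `normalizeOrEmpty` are hypothetical names, and `java.net.URL` stands in for the Nutch normalizer chain so that the example runs without Nutch on the classpath.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NormalizerCheckSketch {

    /**
     * Hypothetical stand-in for URLNormalizer.normalize(...).
     * Like the real interface method, it declares MalformedURLException.
     */
    static String normalize(String url) throws MalformedURLException {
        // java.net.URL rejects input it cannot parse (e.g. a missing protocol)
        return new URL(url.trim()).toString();
    }

    /**
     * Wraps normalize(): on an invalid URL, logs the error and
     * returns an empty string instead of propagating the exception.
     */
    static String normalizeOrEmpty(String url) {
        try {
            return normalize(url);
        } catch (MalformedURLException e) {
            System.err.println("Invalid URL '" + url + "': " + e.getMessage());
            return "";
        }
    }

    public static void main(String[] args) {
        String[] inputs = { "http://example.com/a", "no protocol here" };
        for (String input : inputs) {
            // One output line per input line; invalid input yields an empty line
            System.out.println(normalizeOrEmpty(input));
        }
    }
}
```

Emitting an empty line rather than skipping the input keeps output lines aligned with input lines, which matters when the checker is used as a filter in a pipeline.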
