[jira] [Commented] (NUTCH-2868) urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file


Lewis John McGibbney (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17361224#comment-17361224 ]

ASF GitHub Bot commented on NUTCH-2868:
---------------------------------------

lewismc commented on a change in pull request #649:
URL: https://github.com/apache/nutch/pull/649#discussion_r649525019



##########
File path: src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
##########
@@ -82,15 +82,21 @@ private synchronized void readConfiguration(Reader configReader) throws IOExcept
     String line, host;
     String protocol;
     int delimiterIndex;
+    int lineNumber = 0;
 
     while ((line = reader.readLine()) != null) {
+      lineNumber++;
       if (StringUtils.isNotBlank(line) && !line.startsWith("#")) {
         line = line.trim();
         delimiterIndex = line.indexOf(" ");
         // try tabulator
         if (delimiterIndex == -1) {
           delimiterIndex = line.indexOf("\t");
         }
+        if (delimiterIndex == -1) {
+          LOG.warn("Invalid line {} without delimiter: {}", lineNumber, line);

Review comment:
       How exactly do I trigger this warning? Can you provide an example or a trivial unit test?
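
       For illustration, a minimal standalone sketch of one way the warning could be triggered. It is not the Nutch code or its test suite: the class `ProtocolRulesSketch`, the `parse` helper, the `warnings` list (standing in for `LOG.warn`), and the sample hosts are all hypothetical, and the `host<delimiter>protocol` line layout is assumed from the variable names in the diff. A line with neither a space nor a tab takes the new branch and is skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parsing loop in readConfiguration().
final class ProtocolRulesSketch {

  /** Parses "host delimiter protocol" lines; malformed lines are recorded and skipped. */
  static Map<String, String> parse(Reader in, List<String> warnings) throws IOException {
    Map<String, String> rules = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    int lineNumber = 0;
    while ((line = reader.readLine()) != null) {
      lineNumber++;
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue; // blank line or comment
      }
      line = line.trim();
      int delimiterIndex = line.indexOf(' ');
      if (delimiterIndex == -1) {
        delimiterIndex = line.indexOf('\t'); // try tabulator
      }
      if (delimiterIndex == -1) {
        // Without this check, substring(0, -1) below would throw
        // StringIndexOutOfBoundsException.
        warnings.add("Invalid line " + lineNumber + " without delimiter: " + line);
        continue;
      }
      String host = line.substring(0, delimiterIndex);
      String protocol = line.substring(delimiterIndex + 1).trim();
      rules.put(host, protocol);
    }
    return rules;
  }

  public static void main(String[] args) throws IOException {
    // The third line has no space or tab, so it is reported and skipped.
    String config = "# host protocol\n"
        + "example.org https\n"
        + "broken-line-without-delimiter\n"
        + "example.com\thttp\n";
    List<String> warnings = new ArrayList<>();
    Map<String, String> rules = parse(new StringReader(config), warnings);
    System.out.println(rules);
    System.out.println(warnings);
  }
}
```

       A unit test along these lines would feed such a rules file to the normalizer and check that the valid lines are still loaded.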

##########
File path: src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
##########
@@ -177,6 +183,7 @@ public void setConf(Configuration conf) {
       if (reader == null) {
         Path path = new Path(file);
         FileSystem fs = path.getFileSystem(conf);
+        LOG.info("Reading {} rules file {} from {}", pluginName, file, fs);

Review comment:
       Logging the `fs` object results in a log entry like this:
   ```
   2021-06-10 10:51:00,072 INFO  protocol.ProtocolURLNormalizer (ProtocolURLNormalizer.java:setConf(186)) - Reading urlnormalizer-protocol rules file /Users/lmcgibbn/Downloads/nutch/build/urlnormalizer-protocol/test/data/broken_protocols.txt from org.apache.hadoop.fs.LocalFileSystem@e54303
   ```
   Is the `LocalFileSystem@e54303` object reference useful?
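
   As a small illustration of where a value of that shape comes from: when a class does not override `toString()`, Java falls back to `Object.toString()`, which prints the class name followed by `@` and the hash code in hex. The sketch below uses two hypothetical stand-in classes (`OpaqueFileSystem`, `DescribedFileSystem`), not Hadoop types, and only assumes that the logged object relies on the default.

```java
// Hypothetical sketch; neither nested class is a Hadoop type.
final class ToStringSketch {

  /** Stand-in for a class that does not override toString(). */
  static final class OpaqueFileSystem { }

  /** Stand-in for a class that exposes a readable identity. */
  static final class DescribedFileSystem {
    private final String uri;

    DescribedFileSystem(String uri) {
      this.uri = uri;
    }

    @Override
    public String toString() {
      return "DescribedFileSystem[" + uri + "]";
    }
  }

  /** Builds a log-like message by string-converting the argument. */
  static String message(Object fs) {
    return "Reading rules file from " + fs;
  }

  public static void main(String[] args) {
    // Default Object.toString(): class name, '@', hex hash code.
    System.out.println(message(new OpaqueFileSystem()));
    // Overridden toString(): a value a reader of the log can interpret.
    System.out.println(message(new DescribedFileSystem("file:///")));
  }
}
```

   In the pull request, logging something like `fs.getUri()` instead of `fs` might give a more readable entry; that is a suggestion, not what the change currently does.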





> urlnormalizer-protocol fails with StringIndexOutOfBoundsException when reading invalid line in configuration file
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2868
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2868
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, urlnormalizer
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.19
>
>
> When reading an invalid line in the configuration file, the protocol urlnormalizer may fail with a StringIndexOutOfBoundsException:
> {noformat}
> 2021-06-10 05:10:41,877 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>         at java.lang.String.substring(String.java:1967)
>         at org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer.readConfiguration(ProtocolURLNormalizer.java:95)
>         at org.apache.nutch.net.urlnormalizer.protocol.ProtocolURLNormalizer.setConf(ProtocolURLNormalizer.java:182)
> {noformat}
> The invalid line should be logged and skipped without causing the job to fail.


