[jira] [Commented] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.

Nicholas DiPiazza (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170100#comment-17170100 ]

ASF GitHub Bot commented on NUTCH-1190:
---------------------------------------

sebastian-nagel commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464477144



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL res = conf.getResource("date-styles.txt");

Review comment:
       Just in case: I meant the file name, not the variable name. A typical Nutch user only edits the configuration file but does not look into the source code to figure out what the file is used for.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]


> MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, plugin
>    Affects Versions: 1.4
>         Environment: jdk6
>            Reporter: Zhang JinYan
>            Priority: Major
>             Fix For: 1.18
>
>         Attachments: MoreIndexingFilter.patch, NUTCH-1190-trunk.patch, date-styles.txt
>
>
> There are many issues about missing date formats:
> [NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
> [NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
> [NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]
> The date formats can be diverse, so why not move them to a separate config file?
> I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt" (placed in "conf"), which is loaded on startup.
> {code}
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>     MIME = new MimeUtil(conf);
>
>     // Look up the list of date formats on the configuration classpath.
>     URL res = conf.getResource("date-styles.txt");
>     if (res == null) {
>       LOG.error("Can't find resource: date-styles.txt");
>     } else {
>       try {
>         List lines = FileUtils.readLines(new File(res.getFile()));
>         for (int i = 0; i < lines.size(); i++) {
>           String dateStyle = (String) lines.get(i);
>           // Drop blank lines; step the index back after each removal.
>           if (StringUtils.isBlank(dateStyle)) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           dateStyle = StringUtils.trim(dateStyle);
>           // Drop comment lines starting with '#'.
>           if (dateStyle.startsWith("#")) {
>             lines.remove(i);
>             i--;
>             continue;
>           }
>           // Keep the trimmed pattern.
>           lines.set(i, dateStyle);
>         }
>         dateStyles = new String[lines.size()];
>         lines.toArray(dateStyles);
>       } catch (IOException e) {
>         LOG.error("Failed to load resource: date-styles.txt", e);
>       }
>     }
>   }
> {code}
> Then parse "lastModified" like this (sample):
> {code}
>   private long getTime(String date, String url) {
>     ......
>     Date parsedDate = DateUtils.parseDate(date, dateStyles);
>     time = parsedDate.getTime();
>     ......
>     return time;
>   }
> {code}
> This patch also contains the patch for [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
> Find more details in the patch file.
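
The approach described in the issue (read one date pattern per line, skip blank lines and "#" comments, then try each pattern in turn) can be sketched with the JDK alone. This is a minimal illustration, not the code from the patch: the class name DateStyles and the methods load and parse are hypothetical, and SimpleDateFormat stands in for the DateUtils.parseDate call quoted above. Reading from a Reader rather than a File is an assumption made here so the sketch does not depend on the resource being a plain file on disk; in Nutch such a Reader could presumably be obtained from Hadoop's Configuration.getConfResourceAsReader("date-styles.txt").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, for illustration only; not part of the Nutch patch.
final class DateStyles {

  /** Reads one date pattern per line, skipping blank lines and '#' comments. */
  static List<String> load(Reader in) throws IOException {
    List<String> styles = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) {
          continue;
        }
        styles.add(line);
      }
    }
    return styles;
  }

  /** Tries each pattern in order; returns epoch millis, or -1 if none matches. */
  static long parse(String date, List<String> styles) {
    for (String style : styles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.ENGLISH);
      format.setLenient(false);
      // Assumption: dates without an explicit zone are interpreted as UTC.
      format.setTimeZone(TimeZone.getTimeZone("UTC"));
      ParsePosition pos = new ParsePosition(0);
      Date parsed = format.parse(date, pos);
      // Accept only if the pattern consumed the whole input string.
      if (parsed != null && pos.getIndex() == date.length()) {
        return parsed.getTime();
      }
    }
    return -1L;
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical file content; the real date-styles.txt is attached to the issue.
    String sample = "# date formats, one per line\n\nyyyy-MM-dd HH:mm:ss\n";
    List<String> styles = load(new StringReader(sample));
    System.out.println(styles);
    System.out.println(parse("2020-08-03 12:00:00", styles));
  }
}
```

Building a fresh list of accepted lines avoids the remove-and-decrement bookkeeping of the index-based loop, and returning -1 for an unparseable date is just one possible convention for this sketch.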



--
This message was sent by Atlassian Jira
(v8.3.4#803005)