[jira] [Commented] (NUTCH-1541) Indexer plugin to write CSV

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538509#comment-16538509 ]

ASF GitHub Bot commented on NUTCH-1541:
---------------------------------------

r0ann3l closed pull request #294: NUTCH-1541 Indexer plugin to write CSV
URL: https://github.com/apache/nutch/pull/294
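
For readers skimming the diff: the quoting behaviour exercised by the tests in this pull request (quote a field that contains a field separator, a record separator, or a quote character, and double any embedded quote) can be sketched in a few lines of Java. This is a minimal standalone illustration, not the plugin's code. The class name CsvQuoteSketch and its quote method are hypothetical, and the sketch assumes the default settings in which the escape character equals the quote character.

```java
// Standalone sketch of the CSV quoting rule; NOT the plugin's implementation.
// CsvQuoteSketch and quote(...) are hypothetical names used for illustration.
public class CsvQuoteSketch {

  /**
   * Quote a single-valued field if it contains the field separator, the
   * quote character, or a line break. Embedded quote characters are escaped
   * by doubling them (assumes escape character == quote character).
   */
  public static String quote(String value, char separator, char quoteChar) {
    boolean needsQuotes = value.indexOf(separator) >= 0
        || value.indexOf(quoteChar) >= 0
        || value.indexOf('\n') >= 0
        || value.indexOf('\r') >= 0;
    if (!needsQuotes) {
      return value; // plain values are written unchanged
    }
    StringBuilder sb = new StringBuilder();
    sb.append(quoteChar);
    for (char c : value.toCharArray()) {
      if (c == quoteChar) {
        sb.append(quoteChar); // escape a quote by doubling it
      }
      sb.append(c);
    }
    sb.append(quoteChar);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(quote("a,b", ',', '"'));
    System.out.println(quote("a,b:\"quote\",c", ',', '"'));
    System.out.println(quote("plain", ',', '"'));
  }
}
```

The inputs in main mirror the values used in testCSVquoteFieldSeparators and testCSVescapeQuotes further down in the diff.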

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/build.xml b/build.xml
index b112c5027..03e43fc63 100644
--- a/build.xml
+++ b/build.xml
@@ -188,6 +188,7 @@
       <packageset dir="${plugins.dir}/index-replace/src/java"/>
       <packageset dir="${plugins.dir}/index-static/src/java"/>
       <packageset dir="${plugins.dir}/indexer-cloudsearch/src/java/" />
+      <packageset dir="${plugins.dir}/indexer-csv/src/java"/>
       <packageset dir="${plugins.dir}/indexer-dummy/src/java"/>
       <packageset dir="${plugins.dir}/indexer-elastic-rest/src/java/"/>
       <packageset dir="${plugins.dir}/indexer-elastic/src/java/" />
@@ -649,6 +650,7 @@
       <packageset dir="${plugins.dir}/index-replace/src/java"/>
       <packageset dir="${plugins.dir}/index-static/src/java"/>
       <packageset dir="${plugins.dir}/indexer-cloudsearch/src/java/" />
+      <packageset dir="${plugins.dir}/indexer-csv/src/java"/>
       <packageset dir="${plugins.dir}/indexer-dummy/src/java"/>
       <packageset dir="${plugins.dir}/indexer-elastic-rest/src/java/"/>
       <packageset dir="${plugins.dir}/indexer-elastic/src/java/" />
@@ -1071,6 +1073,8 @@
         <source path="${plugins.dir}/index-static/src/java/" />
         <source path="${plugins.dir}/index-static/src/test/" />
         <source path="${plugins.dir}/indexer-cloudsearch/src/java/" />
+        <source path="${plugins.dir}/indexer-csv/src/java"/>
+        <source path="${plugins.dir}/indexer-csv/src/test"/>
         <source path="${plugins.dir}/indexer-dummy/src/java/" />
         <source path="${plugins.dir}/indexer-elastic-rest/src/java/"/>
         <source path="${plugins.dir}/indexer-elastic/src/java/" />
diff --git a/conf/index-writers.xml.template b/conf/index-writers.xml.template
index 849f824a5..eaa5870a3 100644
--- a/conf/index-writers.xml.template
+++ b/conf/index-writers.xml.template
@@ -85,6 +85,25 @@
       <remove />
     </mapping>
   </writer>
+  <writer id="indexer_csv_1" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
+    <parameters>
+      <param name="fields" value="id,title,content"/>
+      <param name="charset" value="UTF-8"/>
+      <param name="separator" value=","/>
+      <param name="valuesep" value="|"/>
+      <param name="quotechar" value="&quot;"/>
+      <param name="escapechar" value="&quot;"/>
+      <param name="maxfieldlength" value="4096"/>
+      <param name="maxfieldvalues" value="12"/>
+      <param name="header" value="true"/>
+      <param name="outpath" value="csvindexwriter"/>
+    </parameters>
+    <mapping>
+      <copy />
+      <rename />
+      <remove />
+    </mapping>
+  </writer>
   <writer id="indexer_elastic_1" class="org.apache.nutch.indexwriter.elastic.ElasticIndexWriter">
     <parameters>
       <param name="host" value=""/>
diff --git a/default.properties b/default.properties
index 004a8c311..d818ab501 100644
--- a/default.properties
+++ b/default.properties
@@ -192,6 +192,7 @@ plugins.index=\
 #
 plugins.indexer=\
    org.apache.nutch.indexwriter.cloudsearch*:\
+   org.apache.nutch.indexwriter.csv*:\
    org.apache.nutch.indexwriter.dummy*:\
    org.apache.nutch.indexwriter.elastic*:\
    org.apache.nutch.indexwriter.elasticrest*:\
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index 0744167cc..d8e2ef523 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -50,6 +50,7 @@
     <ant dir="index-replace" target="deploy"/>
     <ant dir="index-static" target="deploy"/>
     <ant dir="indexer-cloudsearch" target="deploy"/>
+    <ant dir="indexer-csv" target="deploy"/>
     <ant dir="indexer-dummy" target="deploy"/>
     <ant dir="indexer-elastic" target="deploy"/>
     <ant dir="indexer-elastic-rest" target="deploy"/>
@@ -119,6 +120,7 @@
      <ant dir="index-more" target="test"/>
      <ant dir="index-replace" target="test"/>
      <ant dir="index-static" target="test"/>
+     <ant dir="indexer-csv" target="test"/>
      <ant dir="indexer-elastic" target="test"/>
      <ant dir="language-identifier" target="test"/>
      <ant dir="lib-http" target="test"/>
@@ -182,6 +184,7 @@
     <ant dir="index-replace" target="clean"/>
     <ant dir="index-static" target="clean"/>
     <ant dir="indexer-cloudsearch" target="clean"/>
+    <ant dir="indexer-csv" target="clean"/>
     <ant dir="indexer-dummy" target="clean"/>
     <ant dir="indexer-elastic" target="clean"/>
     <ant dir="indexer-elastic-rest" target="clean"/>
diff --git a/src/plugin/indexer-csv/build.xml b/src/plugin/indexer-csv/build.xml
new file mode 100644
index 000000000..98f998e1b
--- /dev/null
+++ b/src/plugin/indexer-csv/build.xml
@@ -0,0 +1,22 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="indexer-csv" default="jar-core">
+
+  <import file="../build-plugin.xml" />
+
+</project>
diff --git a/src/plugin/indexer-csv/ivy.xml b/src/plugin/indexer-csv/ivy.xml
new file mode 100644
index 000000000..2b59164b2
--- /dev/null
+++ b/src/plugin/indexer-csv/ivy.xml
@@ -0,0 +1,40 @@
+<?xml version="1.0" ?>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<ivy-module version="1.0">
+  <info organisation="org.apache.nutch" module="${ant.project.name}">
+    <license name="Apache 2.0"/>
+    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
+    <description>
+        Apache Nutch
+    </description>
+  </info>
+
+  <configurations>
+    <include file="../../../ivy/ivy-configurations.xml"/>
+  </configurations>
+
+  <publications>
+    <!--get the artifact from our module name-->
+    <artifact conf="master"/>
+  </publications>
+
+  <dependencies/>
+  
+</ivy-module>
diff --git a/src/plugin/indexer-csv/plugin.xml b/src/plugin/indexer-csv/plugin.xml
new file mode 100644
index 000000000..072c3ab1d
--- /dev/null
+++ b/src/plugin/indexer-csv/plugin.xml
@@ -0,0 +1,38 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  
+  http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<plugin id="indexer-csv" name="CSVIndexWriter" version="1.0.0"
+  provider-name="nutch.apache.org">
+
+  <runtime>
+    <library name="indexer-csv.jar">
+      <export name="*" />
+    </library>
+  </runtime>
+
+  <requires>
+    <import plugin="nutch-extensionpoints" />
+  </requires>
+
+  <extension id="org.apache.nutch.indexer.csv"
+    name="CSV Index Writer"
+    point="org.apache.nutch.indexer.IndexWriter">
+    <implementation id="CSVIndexWriter"
+      class="org.apache.nutch.indexwriter.csv.CSVIndexWriter" />
+  </extension>
+
+</plugin>
diff --git a/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVConstants.java b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVConstants.java
new file mode 100644
index 000000000..aa927ced5
--- /dev/null
+++ b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVConstants.java
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexwriter.csv;
+
+public interface CSVConstants {
+  
+  String CSV_FIELDS = "fields";
+
+  String CSV_CHARSET = "charset";
+
+  String CSV_FIELD_SEPARATOR = "separator";
+
+  String CSV_VALUESEPARATOR = "valuesep";
+
+  String CSV_QUOTECHARACTER = "quotechar";
+
+  String CSV_ESCAPECHARACTER = "escapechar";
+
+  String CSV_MAXFIELDLENGTH = "maxfieldlength";
+
+  String CSV_MAXFIELDVALUES = "maxfieldvalues";
+
+  String CSV_WITHHEADER = "header";
+
+  String CSV_OUTPATH = "outpath";
+
+}
diff --git a/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
new file mode 100644
index 000000000..c17467aa0
--- /dev/null
+++ b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
@@ -0,0 +1,419 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.indexwriter.csv;
+
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.util.Date;
+import java.util.List;
+import java.util.ListIterator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.indexer.*;
+import org.apache.nutch.util.NutchConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Write Nutch documents to a CSV file (comma separated values), i.e., dump
+ * index as CSV or tab-separated plain text table. Format (encoding, separators,
+ * etc.) is configurable by a couple of options, see output of
+ * {@link #describe()}.
+ *
+ * <p>
+ * Note: works only in local mode, to be used with index option
+ * <code>-noCommit</code>.
+ * </p>
+ */
+public class CSVIndexWriter implements IndexWriter {
+
+  public static final Logger LOG = LoggerFactory
+      .getLogger(CSVIndexWriter.class);
+
+  private Configuration config;
+
+  /** ordered list of fields (columns) in the CSV file */
+  private String[] fields;
+
+  /** encoding of CSV file */
+  protected Charset encoding = Charset.forName("UTF-8");
+
+  /**
+   * represent separators (also quote and escape characters) as char(s) and
+   * byte(s) in the output encoding for efficiency.
+   */
+  protected class Separator {
+    protected String sepStr;
+    protected char[] chars;
+    protected byte[] bytes;
+
+    protected Separator(String sep) {
+      set(sep);
+    }
+
+    protected void set(String str) {
+      if (str != null) {
+        sepStr = str;
+        if (str.length() == 0) {
+          // empty separator
+          chars = new char[0];
+        } else {
+          chars = str.toCharArray();
+        }
+      }
+      // always convert to bytes (encoding may have changed)
+      bytes = sepStr.getBytes(encoding);
+    }
+
+    public String toString() {
+      StringBuilder sb = new StringBuilder();
+      for (char c : chars) {
+        if (c == '\n') {
+          sb.append("\\n");
+        } else if (c == '\r') {
+          sb.append("\\r");
+        } else if (c == '\t') {
+          sb.append("\\t");
+        } else if (c >= 0x7f || c <= 0x20) {
+          sb.append(String.format("\\u%04x", (int) c));
+        } else {
+          sb.append(c);
+        }
+      }
+      return sb.toString();
+    }
+
+    protected void setFromConf(IndexWriterParams parameters, String property) {
+      setFromConf(parameters, property, false);
+    }
+
+    protected void setFromConf(IndexWriterParams parameters, String property,
+        boolean isChar) {
+      String str = parameters.get(property);
+      if (isChar && str != null && str.length() > 1) {
+        LOG.warn("Separator " + property
+            + " must be a char, only the first character '" + str.charAt(0)
+            + "' of \"" + str + "\" is used");
+        str = str.substring(0, 1);
+      }
+      set(str);
+      LOG.info(property + " = " + toString());
+    }
+
+    /**
+     * Get index of first occurrence of any separator characters.
+     *
+     * @param value
+     *          String to scan
+     * @param start
+     *          position/index to start scan from
+     * @return position of first occurrence or -1 (not found or empty separator)
+     */
+    protected int find(String value, int start) {
+      if (chars.length == 0)
+        return -1;
+      if (chars.length == 1)
+        return value.indexOf(chars[0], start);
+      int index;
+      for (char c : chars) {
+        if ((index = value.indexOf(c, start)) >= 0) {
+          return index;
+        }
+      }
+      return -1;
+    }
+  }
+
+  /** separator between records (rows), i.e., documents */
+  private Separator recordSeparator = new Separator("\r\n");
+
+  /** separator between fields (columns) */
+  private Separator fieldSeparator = new Separator(",");
+
+  /**
+   * separator between multiple values of one field ({@link NutchField} allows
+   * multiple values). Note: there is no escape for a valueSeparator, a character
+   * not present in field data should be chosen.
+   */
+  private Separator valueSeparator = new Separator("|");
+
+  /** quote character used to quote fields containing separators or quotes */
+  private Separator quoteCharacter = new Separator("\"");
+
+  /** escape character used to escape a quote character */
+  private Separator escapeCharacter = quoteCharacter;
+
+  /** max. length of a field value */
+  private int maxFieldLength = 4096;
+
+  /**
+   * max. number of values of one field, useful for fields with potentially many
+   * variant values, e.g., the "anchor" texts field
+   */
+  private int maxFieldValues = 12;
+
+  /** whether to write a header line with the column (field) names */
+  private boolean withHeader = true;
+
+  /** output path / directory */
+  private String outputPath = "csvindexwriter";
+
+
+  private static final String description =
+      " - write index as CSV file (comma separated values)"
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_FIELDS,
+          "ordered list of fields (columns) in the CSV file")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_FIELD_SEPARATOR,
+          "separator between fields (columns), default: , (U+002C, comma)")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_QUOTECHARACTER,
+          "quote character used to quote fields containing separators or quotes, "
+              + "default: \" (U+0022, quotation mark)")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_ESCAPECHARACTER,
+          "escape character used to escape a quote character, "
+              + "default: \" (U+0022, quotation mark)")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_VALUESEPARATOR,
+          "separator between multiple values of one field, "
+              + "default: | (U+007C)")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_MAXFIELDVALUES,
+          "max. number of values of one field, useful for, "
+              + "e.g., the anchor texts field, default: 12")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_MAXFIELDLENGTH,
+          "max. length of a single field value in characters, default: 4096.")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_CHARSET,
+          "encoding of CSV file, default: UTF-8")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_WITHHEADER,
+          "write CSV column headers, default: true")
+      + String.format("\n  %-24s : %s", CSVConstants.CSV_OUTPATH,
+          "output path / directory, default: csvindexwriter. "
+          + "\n    CAVEAT: existing output directories are removed!") + "\n";
+
+
+  private FileSystem fs;
+
+  protected FSDataOutputStream csvout;
+
+  private Path csvLocalOutFile;
+
+  @Override
+  public void open(Configuration conf, String name) throws IOException {
+
+  }
+
+  /**
+   * Initializes the internal variables from a given index writer configuration.
+   *
+   * @param parameters Params from the index writer configuration.
+   * @throws IOException Some exception thrown by writer.
+   */
+  @Override
+  public void open(IndexWriterParams parameters) throws IOException {
+    outputPath = parameters.get(CSVConstants.CSV_OUTPATH, outputPath);
+    String charset = parameters.get(CSVConstants.CSV_CHARSET);
+    if (charset != null) {
+      encoding = Charset.forName(charset);
+    }
+    fieldSeparator.setFromConf(parameters, CSVConstants.CSV_FIELD_SEPARATOR);
+    quoteCharacter.setFromConf(parameters, CSVConstants.CSV_QUOTECHARACTER, true);
+    escapeCharacter.setFromConf(parameters, CSVConstants.CSV_ESCAPECHARACTER, true);
+    valueSeparator.setFromConf(parameters, CSVConstants.CSV_VALUESEPARATOR);
+    withHeader = parameters.getBoolean(CSVConstants.CSV_WITHHEADER, true);
+    maxFieldLength = parameters.getInt(CSVConstants.CSV_MAXFIELDLENGTH, maxFieldLength);
+    LOG.info(CSVConstants.CSV_MAXFIELDLENGTH + " = " + maxFieldLength);
+    maxFieldValues = parameters.getInt(CSVConstants.CSV_MAXFIELDVALUES, maxFieldValues);
+    LOG.info(CSVConstants.CSV_MAXFIELDVALUES + " = " + maxFieldValues);
+    fields = parameters.getStrings(CSVConstants.CSV_FIELDS, "id", "title", "content");
+    LOG.info("fields =");
+    for (String f : fields) {
+      LOG.info("\t" + f);
+    }
+
+    fs = FileSystem.get(config);
+    LOG.info("Writing output to {}", outputPath);
+    Path outputDir = new Path(outputPath);
+    fs = outputDir.getFileSystem(config);
+    csvLocalOutFile = new Path(outputDir, "nutch.csv");
+    if (!fs.exists(outputDir)) {
+      fs.mkdirs(outputDir);
+    }
+    if (fs.exists(csvLocalOutFile)) {
+      // clean-up
+      LOG.warn("Removing existing output path {}", csvLocalOutFile);
+      fs.delete(csvLocalOutFile, true);
+    }
+    csvout = fs.create(csvLocalOutFile);
+    if (withHeader) {
+      for (int i = 0; i < fields.length; i++) {
+        if (i > 0)
+          csvout.write(fieldSeparator.bytes);
+        csvout.write(fields[i].getBytes(encoding));
+      }
+      csvout.write(recordSeparator.bytes);
+    }
+  }
+
+  @Override
+  public void write(NutchDocument doc) throws IOException {
+    for (int i = 0; i < fields.length; i++) {
+      if (i > 0) {
+        csvout.write(fieldSeparator.bytes);
+      }
+      NutchField field = doc.getField(fields[i]);
+      if (field != null) {
+        List<Object> values = field.getValues();
+        int nValues = values.size();
+        if (nValues > maxFieldValues) {
+          nValues = maxFieldValues;
+        }
+        if (nValues > 1) {
+          // always quote multi-value fields
+          csvout.write(quoteCharacter.bytes);
+        }
+        ListIterator<Object> it = values.listIterator();
+        int j = 0;
+        while (it.hasNext() && j++ < nValues) {
+          Object objval = it.next();
+          String value;
+          if (objval == null) {
+            continue;
+          } else if (objval instanceof Date) {
+            // date: format as "dow mon dd hh:mm:ss zzz yyyy"
+            value = objval.toString();
+          } else {
+            value = (String) objval;
+          }
+          if (nValues > 1) {
+            // multi-value field
+            writeEscaped(value);
+            if (it.hasNext() && j < nValues) {
+              csvout.write(valueSeparator.bytes);
+            }
+          } else {
+            writeQuoted(value);
+          }
+        }
+        if (nValues > 1) {
+          // closing quote of multi-value fields
+          csvout.write(quoteCharacter.bytes);
+        }
+      }
+    }
+    csvout.write(recordSeparator.bytes);
+  }
+
+  /** (deletion of documents is not supported) */
+  @Override
+  public void delete(String key) {
+  }
+
+  @Override
+  public void update(NutchDocument doc) throws IOException {
+    write(doc);
+  }
+
+  @Override
+  public void close() throws IOException {
+    csvout.close();
+    LOG.info("Finished CSV index in {}", csvLocalOutFile);
+  }
+
+  /** (nothing to commit) */
+  @Override
+  public void commit() {
+  }
+
+  @Override
+  public Configuration getConf() {
+    return config;
+  }
+
+  @Override
+  public String describe() {
+    return getClass().getSimpleName() + description;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    config = conf;
+  }
+
+  /** Write a value to output stream. If necessary use quote characters. */
+  private void writeQuoted (String value) throws IOException {
+    int nextQuoteChar;
+    if (quoteCharacter.chars.length > 0
+        && (((nextQuoteChar = quoteCharacter.find(value, 0)) >= 0)
+            || (fieldSeparator.find(value, 0) >= 0)
+            || (recordSeparator.find(value, 0) >= 0))) {
+      // need quotes
+      csvout.write(quoteCharacter.bytes);
+      writeEscaped(value, nextQuoteChar);
+      csvout.write(quoteCharacter.bytes);
+    } else {
+      if (value.length() > maxFieldLength) {
+        csvout.write(value.substring(0, maxFieldLength).getBytes(encoding));
+      } else {
+        csvout.write(value.getBytes(encoding));
+      }
+    }
+  }
+
+  /**
+   * Write a value to output stream. Escape quote characters.
+   * Clip value after <code>maxfieldlength</code> characters.
+   *
+   * @param value
+   *          String to write
+   * @param nextQuoteChar
+   *          (first) occurrence of the quote character
+   */
+  private void writeEscaped (String value, int nextQuoteChar) throws IOException {
+    int start = 0;
+    int max = value.length();
+    if (max > maxFieldLength) {
+      max = maxFieldLength;
+    }
+    while (nextQuoteChar >= 0 && nextQuoteChar < max) {
+      csvout.write(value.substring(start, nextQuoteChar).getBytes(encoding));
+      csvout.write(escapeCharacter.bytes);
+      csvout.write(quoteCharacter.bytes);
+      start = nextQuoteChar + 1;
+      nextQuoteChar = quoteCharacter.find(value, start);
+      if (nextQuoteChar > max) break;
+    }
+    csvout.write(value.substring(start, max).getBytes(encoding));
+  }
+
+  /**
+   * Write a value to output stream. Escape quote characters. Clip value after
+   * <code>maxfieldlength</code> characters.
+   */
+  private void writeEscaped (String value) throws IOException {
+    int nextQuoteChar = quoteCharacter.find(value, 0);
+    writeEscaped(value, nextQuoteChar);
+  }
+
+  public static void main(String[] args) throws Exception {
+    final int res = ToolRunner.run(NutchConfiguration.create(),
+            new IndexingJob(), args);
+    System.exit(res);
+  }
+
+}
diff --git a/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/package-info.java b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/package-info.java
new file mode 100644
index 000000000..1490567da
--- /dev/null
+++ b/src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/package-info.java
@@ -0,0 +1,21 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Index writer plugin to write a plain CSV file.
+ */
+package org.apache.nutch.indexwriter.csv;
\ No newline at end of file
diff --git a/src/plugin/indexer-csv/src/test/org/apache/nutch/indexwriter/csv/TestCSVIndexWriter.java b/src/plugin/indexer-csv/src/test/org/apache/nutch/indexwriter/csv/TestCSVIndexWriter.java
new file mode 100644
index 000000000..9110cd98b
--- /dev/null
+++ b/src/plugin/indexer-csv/src/test/org/apache/nutch/indexwriter/csv/TestCSVIndexWriter.java
@@ -0,0 +1,257 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.indexwriter.csv;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.TimeZone;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.nutch.indexer.IndexWriterParams;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Test CSVIndexWriter. Focus is on CSV-specific potential issues, mainly quoting and escaping.
+ */
+public class TestCSVIndexWriter {
+
+  protected static final Logger LOG = LoggerFactory
+      .getLogger(TestCSVIndexWriter.class);
+
+  /**
+   * Dummy IndexWriter which stores the indexed documents as CSV string in a
+   * {@link ByteArrayOutputStream} which can be easily accessed in test cases.
+   */
+  public class CSVByteArrayIndexWriter extends CSVIndexWriter {
+
+    ByteArrayOutputStream byteBuffer;
+    FileSystem.Statistics fsStats;
+
+    @Override
+    public void open(IndexWriterParams parameters) throws IOException {
+      super.open(parameters);
+      byteBuffer = new ByteArrayOutputStream();
+      fsStats = new FileSystem.Statistics("testCSVIndexWriter");
+      csvout = new FSDataOutputStream(byteBuffer, fsStats);
+    }
+
+    @Override
+    public void close() throws IOException {
+    }
+
+    /** get the indexed documents as CSV */
+    public String getData() {
+      try {
+        return byteBuffer.toString(encoding.name());
+      } catch (UnsupportedEncodingException e) {
+        return "";
+      }
+    }
+  }
+
+  /**
+   * Write NutchDocuments as CSV records.
+   *
+   * @param configParams configuration parameters as a flat array of alternating property names and values
+   * @param docs         array of NutchDocuments to write
+   * @return CSV string representing the documents
+   */
+  private String getCSV(final String[] configParams, NutchDocument[] docs)
+      throws IOException {
+    Configuration conf = NutchConfiguration.create();
+    IndexWriterParams params = new IndexWriterParams(new HashMap<>());
+    for (int i = 0; i < configParams.length; i += 2) {
+      params.put(configParams[i], configParams[i + 1]);
+    }
+    CSVByteArrayIndexWriter out = new CSVByteArrayIndexWriter();
+    out.setConf(conf);
+    out.open(params);
+    for (NutchDocument doc : docs) {
+      out.write(doc);
+    }
+    out.close();
+    String csv = out.getData();
+    LOG.info(csv);
+    return csv;
+  }
+
+  /**
+   * Write one document as CSV record.
+   *
+   * @param configParams configuration parameters as a flat array of alternating property names and values
+   * @param fieldContent flat array of alternating field names and values
+   * @return CSV string representing the document
+   */
+  private String getCSV(final String[] configParams, final String[] fieldContent)
+      throws IOException {
+    NutchDocument[] docs = new NutchDocument[1];
+    docs[0] = new NutchDocument();
+    for (int i = 0; i < fieldContent.length; i += 2) {
+      docs[0].add(fieldContent[i], fieldContent[i + 1]);
+    }
+    return getCSV(configParams, docs);
+  }
+
+  /** defaults, no quoting necessary */
+  @Test
+  public void testCSVdefault() throws IOException {
+    String[] fields = { "id", "http://nutch.apache.org/", "title",
+        "Welcome to Apache Nutch", "content",
+        "Apache Nutch is an open source web-search software project. ..." };
+    String csv = getCSV(new String[0], fields);
+    for (int i = 0; i < fields.length; i += 2) {
+      assertTrue("Testing field " + i + " (" + fields[i] + ")",
+          csv.contains(fields[i + 1]));
+    }
+  }
+
+  @Test
+  public void testCSVquoteFieldSeparators() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test,test2" };
+    String[] fields = { "test", "a,b", "test2", "c,d" };
+    String csv = getCSV(params, fields);
+    assertEquals("If field contains a field separator, it must be quoted",
+        "\"a,b\",\"c,d\"", csv.trim());
+  }
+
+  @Test
+  public void testCSVquoteRecordSeparators() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test" };
+    String[] fields = { "test", "a\nb" };
+    String csv = getCSV(params, fields);
+    assertEquals("If field contains a record separator, it must be quoted",
+        "\"a\nb\"", csv.trim());
+  }
+
+  @Test
+  public void testCSVescapeQuotes() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test" };
+    String[] fields = { "test", "a,b:\"quote\",c" };
+    String csv = getCSV(params, fields);
+    assertEquals("Quotes inside a quoted field must be escaped",
+        "\"a,b:\"\"quote\"\",c\"", csv.trim());
+  }
+
+  @Test
+  public void testCSVclipMaxLength() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test",
+        CSVConstants.CSV_MAXFIELDLENGTH, "8" };
+    String[] fields = { "test", "0123456789" };
+    String csv = getCSV(params, fields);
+    assertEquals("Field clipped to max. length = 8", "01234567", csv.trim());
+  }
+
+  @Test
+  public void testCSVclipMaxLengthQuote() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test",
+        CSVConstants.CSV_MAXFIELDLENGTH, "7" };
+    String[] fields = { "test", "1,\"2\",3,\"4\"" };
+    String csv = getCSV(params, fields);
+    assertEquals("Field clipped to max. length = 7", "\"1,\"\"2\"\",3\"",
+        csv.trim());
+  }
+
+  @Test
+  public void testCSVmultiValueFields() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test",
+        CSVConstants.CSV_VALUESEPARATOR, "|",
+        CSVConstants.CSV_QUOTECHARACTER, "" };
+    String[] fields = { "test", "abc", "test", "def" };
+    String csv = getCSV(params, fields);
+    assertEquals("Values of multi-value fields are concatenated by |",
+        "abc|def", csv.trim());
+  }
+
+  @Test
+  public void testCSVEncoding() throws IOException {
+    String[] charsets = { "iso-8859-1",
+        "\u00e4\u00f6\u00fc\u00df\u00e9\u00f4\u00ee", // äöüßéôî
+        "iso-8859-2", "\u0161\u010d\u0159\u016f", // ščřů
+        "iso-8859-5", "\u0430\u0441\u0434\u0444", // асдф
+    };
+    for (int i = 0; i < charsets.length; i += 2) {
+      String charset = charsets[i];
+      String test = charsets[i + 1];
+      String[] params = { CSVConstants.CSV_FIELDS, "test",
+          CSVConstants.CSV_CHARSET, charset };
+      String[] fields = { "test", test };
+      String csv = getCSV(params, fields);
+      assertEquals("wrong charset conversion", test, csv.trim());
+    }
+  }
+
+  /** test non-ASCII separator */
+  @Test
+  public void testCSVEncodingSeparator() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "test",
+        CSVConstants.CSV_CHARSET, "iso-8859-1",
+        CSVConstants.CSV_VALUESEPARATOR, "\u00a6", // ¦ (broken bar)
+        CSVConstants.CSV_QUOTECHARACTER, ""
+    };
+    String[] fields = { "test", "abc", "test", "def" };
+    String csv = getCSV(params, fields);
+    assertEquals("Values of multi-value fields are concatenated by ¦",
+        "abc\u00a6def", csv.trim());
+  }
+
+  @Test
+  public void testCSVtabSeparated() throws IOException {
+    String[] params = { CSVConstants.CSV_FIELDS, "1,2,3",
+        CSVConstants.CSV_FIELD_SEPARATOR, "\t",
+        CSVConstants.CSV_QUOTECHARACTER, ""
+    };
+    NutchDocument[] docs = new NutchDocument[2];
+    docs[0] = new NutchDocument();
+    docs[0].add("1", "a");
+    docs[0].add("1", "b");
+    docs[0].add("2", "a\"2\"b");
+    docs[0].add("3", "c,d");
+    docs[1] = new NutchDocument();
+    docs[1].add("1", "A");
+    docs[1].add("2", "B");
+    docs[1].add("3", "C");
+    String csv = getCSV(params, docs);
+    String[] records = csv.trim().split("\\r\\n");
+    assertEquals("tab-separated output", "a|b\ta\"2\"b\tc,d", records[0]);
+    assertEquals("tab-separated output", "A\tB\tC", records[1]);
+  }
+
+  @Test
+  public void testCSVdateField() throws IOException {
+    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
+    String[] params = { CSVConstants.CSV_FIELDS, "date" };
+    NutchDocument[] docs = new NutchDocument[1];
+    docs[0] = new NutchDocument();
+    docs[0].add("date", new Date(0)); // 1970-01-01
+    String csv = getCSV(params, docs);
+    assertTrue("date conversion", csv.contains("1970"));
+  }
+}
+
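The quoting tests in the diff above (testCSVquoteFieldSeparators, testCSVescapeQuotes, testCSVclipMaxLengthQuote) expect a field to be clipped to the maximum length first and quoted afterwards, with embedded quote characters doubled. The following is only a minimal sketch of that behaviour, not the plugin's actual code: the class CsvQuoteSketch and its quote method are hypothetical names introduced for illustration, and the real CSVIndexWriter may implement this differently.

```java
/** Hypothetical helper, NOT part of the patch: illustrates clip-then-quote. */
final class CsvQuoteSketch {

  private CsvQuoteSketch() {
  }

  /**
   * Clip the value to maxLength characters, then wrap it in quotes if it
   * contains the field separator, the quote character, or a line break.
   * Quote characters inside a quoted field are escaped by doubling them.
   */
  static String quote(String value, char separator, char quoteChar,
      int maxLength) {
    // Assumption: clipping happens before quoting, as the tests suggest.
    String clipped = value.length() > maxLength
        ? value.substring(0, maxLength)
        : value;
    boolean needsQuotes = clipped.indexOf(separator) >= 0
        || clipped.indexOf(quoteChar) >= 0
        || clipped.indexOf('\n') >= 0
        || clipped.indexOf('\r') >= 0;
    if (!needsQuotes) {
      return clipped;
    }
    StringBuilder sb = new StringBuilder(clipped.length() + 2);
    sb.append(quoteChar);
    for (int i = 0; i < clipped.length(); i++) {
      char c = clipped.charAt(i);
      if (c == quoteChar) {
        sb.append(quoteChar); // escape an embedded quote by doubling it
      }
      sb.append(c);
    }
    sb.append(quoteChar);
    return sb.toString();
  }

  public static void main(String[] args) {
    // Same inputs as the quoting and clipping tests in the diff.
    System.out.println(quote("a,b", ',', '"', 100));
    System.out.println(quote("a,b:\"quote\",c", ',', '"', 100));
    System.out.println(quote("0123456789", ',', '"', 8));
    System.out.println(quote("1,\"2\",3,\"4\"", ',', '"', 7));
  }
}
```

With a maximum length of 7, the value 1,"2",3,"4" is first cut to 1,"2",3 and only then quoted, which is why the expected string in testCSVclipMaxLengthQuote is longer than seven characters.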


 



> Indexer plugin to write CSV
> ---------------------------
>
>                 Key: NUTCH-1541
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1541
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.15
>
>         Attachments: NUTCH-1541-v1.patch, NUTCH-1541-v2.patch
>
>
> With the new pluggable indexer, a simple plugin that writes configurable fields to a CSV file would be handy, either for further analysis or just for export.


