[jira] [Commented] (NUTCH-2399) indexer-elastic does not index multi-value fields (only the first value is indexed)


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280073#comment-16280073 ]

ASF GitHub Bot commented on NUTCH-2399:
---------------------------------------

jorgelbg closed pull request #236: NUTCH-2399 Add support for multivalue fields on indexer-elastic
URL: https://github.com/apache/nutch/pull/236
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
index dfcc01cf4..5921ab9af 100644
--- a/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
+++ b/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
@@ -24,6 +24,7 @@
 import java.io.IOException;
 import java.net.InetAddress;
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
 import java.util.concurrent.TimeUnit;
 
@@ -32,6 +33,7 @@
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.nutch.indexer.IndexWriter;
 import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.indexer.NutchField;
 import org.elasticsearch.action.bulk.BulkResponse;
 import org.elasticsearch.action.bulk.BulkRequest;
 import org.elasticsearch.action.bulk.BackoffPolicy;
@@ -174,9 +176,13 @@ public void write(NutchDocument doc) throws IOException {
 
     // Add each field of this doc to the index source
     Map<String, Object> source = new HashMap<String, Object>();
-    for (String fieldName : doc.getFieldNames()) {
-      if (doc.getFieldValue(fieldName) != null) {
-        source.put(fieldName, doc.getFieldValue(fieldName));
+    for (final Map.Entry<String, NutchField> e : doc) {
+      final List<Object> values = e.getValue().getValues();
+
+      if (values.size() > 1) {
+        source.put(e.getKey(), values);
+      } else {
+        source.put(e.getKey(), values.get(0));
       }
     }
 


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> indexer-elastic does not index multi-value fields (only the first value is indexed)
> -----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2399
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2399
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Yossi Tamari
>             Fix For: 1.14
>
>
> Currently, when a NutchField has multiple values, only the first value is indexed, because doc.getFieldValue returns only the first value. Pull request #200 checks whether the NutchField has multiple values and, if so, adds them as an array (multi-value) field.
> [https://github.com/apache/nutch/pull/200]
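
The behaviour described in the issue can be illustrated with a minimal, standalone sketch. It does not use the Nutch or Elasticsearch APIs: a plain Map<String, List<Object>> stands in for the document's fields, and the class name MultiValueSourceSketch and the helper buildSource are hypothetical names chosen for illustration. Skipping fields that have no values is an assumption added here and is not part of the merged diff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueSourceSketch {

  // Hypothetical helper: a field with one value is stored as a scalar,
  // a field with several values is stored as a list (an array field).
  static Map<String, Object> buildSource(Map<String, List<Object>> fields) {
    Map<String, Object> source = new HashMap<>();
    for (Map.Entry<String, List<Object>> e : fields.entrySet()) {
      List<Object> values = e.getValue();
      if (values == null || values.isEmpty()) {
        continue; // assumption: ignore fields without any value
      }
      if (values.size() > 1) {
        source.put(e.getKey(), new ArrayList<>(values)); // keep all values
      } else {
        source.put(e.getKey(), values.get(0)); // single value stays scalar
      }
    }
    return source;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> fields = new HashMap<>();
    fields.put("title", Arrays.<Object>asList("Example page"));
    fields.put("anchor", Arrays.<Object>asList("home", "start", "index"));
    System.out.println(buildSource(fields));
  }
}
```

With this shape, a single-value field such as "title" keeps its scalar form in the index source, while "anchor" retains all three values instead of only the first.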



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)