Indexing files from HDFS


Indexing files from HDFS

István
Hi,

I have Solr 4.10.3 as part of a CDH5 installation, and I would like to index
a huge number of CSV files stored on HDFS. I was wondering what the best way
of doing that is.

Here is the current approach:

data.csv:

id, fruit
10, apple
20, orange

Indexing with the following command, using search-mr-1.0.0-cdh5.11.1-job.jar:

hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
  /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
  --morphline-file /home/user/readCSV.conf \
  --output-dir hdfs://name-node.server.com:8020/user/solr/output \
  --verbose \
  --go-live \
  --zk-host name-node.server.com:2181/solr \
  --collection collection0 \
  hdfs://name-node.server.com:8020/user/solr/input
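
For reference, a readCSV.conf for data like the above would typically look
something like the following (a sketch, not necessarily the exact file; the
SOLR_LOCATOR values are assumptions filled in from the command line above):

SOLR_LOCATOR : {
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          # column names are expected to match the Solr schema fields
          columns : [id, fruit]
          ignoreFirstLine : true
          trim : true
          charset : UTF-8
        }
      }
      # hand each record to Solr via the locator defined above
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]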

Running this command leads to the following exception:

2219 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 1 reducers
Error: java.io.IOException: Batch Write Failure
        at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
..
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)

It appears to me that the schema does not have a file_path field. The
collection was created through Hue, which properly identified the two
fields, id and fruit. I found that the search-mr tool has the following code
that references file_path:

https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
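
That interface hard-codes the HDFS metadata field names the tool attaches to
every document it creates. From what I can tell it is roughly this
(abbreviated; see the link above for the authoritative list):

public interface HdfsFileFieldNames {
  public static final String FILE_UPLOAD_URL = "file_upload_url";
  public static final String FILE_DOWNLOAD_URL = "file_download_url";
  public static final String FILE_SCHEME = "file_scheme";
  public static final String FILE_HOST = "file_host";
  public static final String FILE_PORT = "file_port";
  public static final String FILE_PATH = "file_path";
  public static final String FILE_NAME = "file_name";
  public static final String FILE_LENGTH = "file_length";
  public static final String FILE_LAST_MODIFIED = "file_last_modified";
  public static final String FILE_OWNER = "file_owner";
  public static final String FILE_GROUP = "file_group";
  // ... plus several file_permissions_* constants
}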

I am not sure what to do in order to be able to index files on HDFS. I have
two guesses (rough sketches of both are below):

- add the fields defined in the search tool to the schema when I create it
(not sure how that works through Hue)
- disable the HDFS metadata insertion when inserting data
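
For the first guess, the collection's schema.xml would need the file_*
fields; a single dynamicField should cover the whole family (a sketch; the
string type is an assumed safe default):

<dynamicField name="file_*" type="string" indexed="true" stored="true"/>

For the second guess, kite-morphlines has a sanitizeUnknownSolrFields
command that drops any record field not present in the Solr schema, so
placing it just before loadSolr should strip the HDFS metadata fields:

{ sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
{ loadSolr { solrLocator : ${SOLR_LOCATOR} } }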

Has anybody seen this before?

Thanks,
Istvan




--
the sun shines for all

Re: Indexing files from HDFS

Erick Erickson
You'd probably get much more informed responses from
the Cloudera folks, especially about Hue.

Best,
Erick

Re: Indexing files from HDFS

István
Hi Erick,

The question is not about Hue, but about why file_path has to be in the
schema for HDFS files when using search-mr. I am wondering what the standard
way of indexing files on HDFS is.

Thanks,
Istvan




--
the sun shines for all

Re: Indexing files from HDFS

Shawn Heisey
On 10/12/2017 2:04 AM, István wrote:
> The question is not about Hue, but about why file_path has to be in the
> schema for HDFS files when using search-mr. I am wondering what the
> standard way of indexing files on HDFS is.

The error in your original post indicates that at least one document in
the update request contains a "file_path" field, but the active schema
on the Solr index does NOT have that field, so Solr is not able to
handle the indexing request.

It appears that you are using Cloudera software to do the indexing.  If
you cannot tell why the indexing requests have that field, then you will
need to talk to Cloudera about how their software works.

One idea that might work is to add the file_path field to your schema
with a correct type so the indexing requests that are being sent will be
handled correctly.
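
If the collection is managed through Cloudera Search, one way to apply such
a schema change might be the solrctl instancedir workflow (a sketch; the
collection name is taken from your original post and the local path is a
placeholder):

solrctl instancedir --get collection0 /tmp/collection0
# edit /tmp/collection0/conf/schema.xml to add the file_* fields
solrctl instancedir --update collection0 /tmp/collection0
solrctl collection --reload collection0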

Thanks,
Shawn