Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

Jay Hill
I've got a very difficult project to tackle. I've been tasked with using
schemaless mode to index JSON files that we receive. The structure of the
JSON files varies widely because they come from different customers who are
totally unrelated to one another. We are attempting to build a "one size
fits all" approach to receiving documents from a wide variety of sources
and then indexing them into Solr.

We're running Solr 5.3. The schemaless approach works well enough -
until it doesn't. It seems to fail on type guessing and also gets confused
when indexing to different shards. If it were reliable it would be the
perfect solution for our task. But the larger the JSON file, the more likely
it is to fail. At a certain size it just doesn't work.
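
To make the type-guessing failure concrete, here is a toy Python sketch. It is
not Solr's actual field-guessing code; the function names and the sample
documents are made up for illustration. It only assumes that a field's type is
fixed by the first value seen, so a later document with a different kind of
value in the same field gets rejected.

```python
import json

def guess_type(value):
    """Guess a field type from one value (a toy stand-in for schemaless guessing)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def index_docs(docs):
    """Lock each field's type on first sight; collect later values that do not fit."""
    schema, rejected = {}, []
    for doc in docs:
        for field, value in doc.items():
            guessed = guess_type(value)
            locked = schema.setdefault(field, guessed)
            if guessed != locked:
                rejected.append((field, value, locked))
    return schema, rejected

# Hypothetical input: two customers use the "zip" field differently.
raw = ['{"id": 1, "zip": 94107}', '{"id": 2, "zip": "N1 9GU"}']
schema, rejected = index_docs([json.loads(s) for s in raw])
print(schema)
print(rejected)
```

With files from unrelated customers, the chance that two of them disagree on
a field like this grows with every document, which would fit the "larger
files fail more often" pattern.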

I've been advised by some experts and committers that schemaless is a good
tool for prototyping but risky to run in production. We thought we would
try it anyway by building indexes offline with the Cloudera
MapReduceIndexerTool, while still using managed schemas. This MapReduce
tool uses morphlines, a nifty ETL tool that pipes together a series of
commands to transform data. As a simple example, a JSON or CSV file can be
processed and loaded into a Solr index with a "readJson" command piped to
a "loadSolr" command.

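The command-chaining idea can be sketched in plain Python. This is a toy
model only, not the morphlines API or its config syntax: read_json, flatten,
load and run_pipeline are hypothetical names, and the "index" is just an
in-memory list standing in for Solr.

```python
import json

def read_json(records):
    """Parse each raw JSON string into a dict (stand-in for a JSON-reading command)."""
    for raw in records:
        yield json.loads(raw)

def flatten(records):
    """Flatten one level of nesting into dotted field names."""
    for rec in records:
        flat = {}
        for key, value in rec.items():
            if isinstance(value, dict):
                for sub, subval in value.items():
                    flat[f"{key}.{sub}"] = subval
            else:
                flat[key] = value
        yield flat

def load(records, index):
    """Append each record to an in-memory list (stand-in for loading into Solr)."""
    for rec in records:
        index.append(rec)
        yield rec

def run_pipeline(raw_records, commands):
    """Feed the output of each command into the next, like a chain of piped commands."""
    stream = raw_records
    for command in commands:
        stream = command(stream)
    return list(stream)

index = []
run_pipeline(
    ['{"id": "a1", "customer": {"name": "Acme"}}'],
    [read_json, flatten, lambda recs: load(recs, index)],
)
print(index)
```

Each step only sees the records handed to it by the previous step, which is
why a transform can be swapped without touching the loader.
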
But the kite-sdk that manages the morphlines only seems to offer, as its
latest version, Solr 4.10.3-cdh5.10.0 (their customized version of
4.10.3).

So I can't see any way to integrate schemaless (which has dependencies
after 4.10.3) with the morphlines.

But I thought I would ask here: Has anybody had ANY experience using
morphlines to index into Solr? Any info would help me make sense of this.

Cheers to all!

Re: Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

Erick Erickson
Hey Jay!

All I can say is "good luck with that". I do know Morphlines uses
EmbeddedSolrServer to do its work. So I don't really see a good way to
pluck just what you'd need for schemaless.

The MapReduceIndexerTool is carried right along with Solr though. IIRC
the Morphlines stuff is mostly the ETL process. Have you tried just
running an MRIT job with a current Solr? I have no idea whether it'd
work, but it seems like it "should"...

Erick

(In reply to Jay Hill's message of Fri, Mar 17, 2017 at 5:51 PM.)