external indexer for Solr Cloud

Lee Chunki
Hi,

Is there any way to run an external indexer for SolrCloud?


My situation is:

* running two indexers (for failover) and two searchers.
* only the two searchers serve queries.
* we plan to move to SolrCloud.

However, I am concerned that if I run the indexing job on one of the SolrCloud servers, that server's load will be higher than the other nodes'.
So I want to build the index outside of SolrCloud, but….

Please share your own setup or experience.

Thanks,
Chunki.

Re: external indexer for Solr Cloud

Jack Krupansky-2
What exactly are you referring to by the term "external indexer"?

-- Jack Krupansky


Re: external indexer for Solr Cloud

Shawn Heisey-4
In reply to this post by Lee Chunki
On 8/29/2014 5:21 AM, Lee Chunki wrote:
> Is there any way to run an external indexer for SolrCloud?

Jack asked an excellent question.  What do you mean by this?  Unless
you're using the dataimport handler, all indexing is external to Solr.

> My situation is:
>
> * running two indexers (for failover) and two searchers.
> * only the two searchers serve queries.
> * we plan to move to SolrCloud.
>
> However, I am concerned that if I run the indexing job on one of the SolrCloud servers, that server's load will be higher than the other nodes'.
> So I want to build the index outside of SolrCloud, but….

In SolrCloud, every shard replica will be indexing -- it's not like
old-style replication, where the master indexes everything and the
slaves copy the completed index.  The leader of each shard will be
working slightly harder than the other replicas, but you really don't
need to worry too much about sending all your updates to one server --
those requests get duplicated to the other servers and they all index
them, almost in parallel.

For my setup (non-cloud, but sharded), I use Pacemaker to ensure that
only one of my servers is running my indexing program and haproxy (plus
its shared IP address).

Thanks,
Shawn


Re: external indexer for Solr Cloud

Jack Krupansky-2
My other thought was that maybe he wants to do index updates outside of the
cluster that is handling queries, and then copy in the completed index.
Or... maybe take replicas out of the query rotation while they are updated.
Or... maybe this is yet another X-Y problem!

-- Jack Krupansky


Re: external indexer for Solr Cloud

Lee Chunki
Hi Shawn and Jack,

Thank you for your reply.

Yes, I want to run the data import handler independently and sync its output to SolrCloud,
because my current DIH node does not only the DB fetch and join but also a lot of preprocessing.

Thanks,
Chunki.



Re: external indexer for Solr Cloud

Jack Krupansky-2
Okay, but please clarify further: do you simply wish to run DIH externally
while still sending each document to SolrCloud for indexing, or are you
expecting to generate the index completely external to the cluster and then
somehow "merge" that DIH "index" into the SolrCloud index?

It would be great to have a "standalone DIH" that runs as a separate server
and then sends standard Solr update requests to a Solr cluster.

-- Jack Krupansky


Re: external indexer for Solr Cloud

Shawn Heisey-4
On 9/1/2014 7:19 AM, Jack Krupansky wrote:
> It would be great to have a "standalone DIH" that runs as a separate
> server and then sends standard Solr update requests to a Solr cluster.

This has been discussed, and I thought we had an issue in Jira, but I
can't find it.

A completely standalone DIH app would be REALLY nice.  I already know
that the JDBC ResultSet is not the bottleneck for indexing, at least for
me.  I once built a simple single-threaded SolrJ application that pulls
data from JDBC and indexes it in Solr.  It works in batches, typically
500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
line (so input object manipulation, casting, and building of the
SolrInputDocument objects is still happening), it can read and
manipulate our entire database (99.8 million documents) in about 20
minutes, but if I leave that in, it takes many hours.
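
The batching pattern described above can be sketched roughly as follows. This is not the actual program from this post: BatchIndexer, indexInBatches and the sink callback are hypothetical names, and the sink stands in for the "solr.add(docs)" call so that the example needs only the JDK.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: groups items into fixed-size batches before handing them to a sink. */
final class BatchIndexer {
    private BatchIndexer() {}

    /**
     * Reads every item from source and passes the items to sink in lists of
     * at most batchSize elements. Returns the number of batches sent.
     */
    static <T> int indexInBatches(Iterator<T> source, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);                 // in a real indexer: solr.add(docs)
                batch = new ArrayList<>(batchSize); // fresh list, so the sink may keep the old one
                batches++;
            }
        }
        if (!batch.isEmpty()) {                     // send the final partial batch
            sink.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

In a real indexer the source would wrap the JDBC ResultSet and build SolrInputDocument objects. Passing a sink that does nothing is the equivalent of commenting out the add call for a timing comparison.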

The bottleneck is that each DIH has only a single thread indexing to
Solr.  I've theorized that it should be *relatively* easy for me to
write an application that pulls records off the JDBC ResultSet with
multiple threads (say 10-20), have each thread figure out which shard
its document lands on, and send it there with SolrJ.  It might even be
possible for the threads to collect several documents for each shard
before indexing them in the same request.
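
One possible shape for the per-shard collection step is sketched below, under stated assumptions: ShardBatcher, router and sender are hypothetical names, the router is whatever function maps a document to a shard index, and the sender stands in for a SolrJ request to that shard. Several worker threads may call add() at the same time, so the sender must itself be thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.ToIntFunction;

/** Hypothetical helper: buffers documents per shard and sends each shard's buffer as one batch. */
final class ShardBatcher<D> {
    private final List<List<D>> buffers = new ArrayList<>();
    private final int batchSize;
    private final ToIntFunction<D> router;              // document -> shard index (assumption)
    private final BiConsumer<Integer, List<D>> sender;  // stands in for the per-shard SolrJ call

    ShardBatcher(int shardCount, int batchSize,
                 ToIntFunction<D> router, BiConsumer<Integer, List<D>> sender) {
        if (shardCount <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("shardCount and batchSize must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            buffers.add(new ArrayList<>());
        }
        this.batchSize = batchSize;
        this.router = router;
        this.sender = sender;
    }

    /** Safe to call from many threads; sends a batch once a shard's buffer is full. */
    void add(D doc) {
        int shard = router.applyAsInt(doc);
        List<D> buffer = buffers.get(shard);
        List<D> full = null;
        synchronized (buffer) {                  // one lock per shard buffer
            buffer.add(doc);
            if (buffer.size() >= batchSize) {
                full = new ArrayList<>(buffer);  // copy out, then reuse the buffer
                buffer.clear();
            }
        }
        if (full != null) {
            sender.accept(shard, full);          // slow call happens outside the lock
        }
    }

    /** Sends whatever is left in every buffer; call after all workers have finished. */
    void flush() {
        for (int shard = 0; shard < buffers.size(); shard++) {
            List<D> buffer = buffers.get(shard);
            List<D> rest;
            synchronized (buffer) {
                rest = new ArrayList<>(buffer);
                buffer.clear();
            }
            if (!rest.isEmpty()) {
                sender.accept(shard, rest);
            }
        }
    }
}
```

Keeping one lock per shard means threads only contend when they hit the same shard, and doing the send outside the lock keeps a slow request from blocking other threads that are only appending. Whether this removes enough of the synchronization difficulty mentioned below is untested.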

As with most multithreaded apps, the hard part is figuring out all the
thread synchronization, making absolutely certain that thread timing is
perfect without unnecessary delays.  If I can figure out a generic
approach (with a few configurable bells and whistles available), it
might be something suitable for inclusion in the project, followed with
improvements by all the smart people in our community.

Thanks,
Shawn


Re: external indexer for Solr Cloud

Jack Krupansky-2
Packaging SolrCell in the same manner, with parallel threads and able to
talk to multiple SolrCloud servers in parallel would have a lot of the same
benefits as well.

And maybe there could be some more generic Java framework for indexing as
well, that "external indexers" in general could use.

-- Jack Krupansky


Re: external indexer for Solr Cloud

Siegfried Goeschl-2
Hi folks,

We are using Apache Camel, but could also use Spring Integration, with the
option to upgrade to Apache BatchEE or Spring Batch later on. Tika
document extraction in particular can kill your server through CPU
consumption, memory usage and plain memory leaks.

AFAIK Doug Turnbull also improved the Camel Solr integration:

http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/99739

Cheers,

Siegfried Goeschl


Re: external indexer for Solr Cloud

Lee Chunki
Hi,

@Jack
The final goal is to generate the index outside of SolrCloud, but running DIH externally would not be bad either.

@Shawn
It sounds great to build a new application that works with multiple threads and sends documents to their shards.
Please let me know the logic for deciding which shard a document should go to (i.e. the matching rule between document and shard).

Thanks,
Chunki.
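
On the question of which shard a document belongs to: as far as I know, with SolrCloud's default compositeId router each shard owns a contiguous range of a 32-bit hash of the document id (a MurmurHash3 variant), and those ranges are recorded in the cluster state in ZooKeeper. SolrJ's cloud-aware client (CloudSolrServer in the 4.x line) can read that state and send updates to the right shard leader, so hand-written routing is often unnecessary. Purely to illustrate the lookup, here is a sketch in which HashRangeRouter, Range, evenRanges and the pluggable hash function are all hypothetical, and the ranges are split evenly instead of being read from a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical illustration of hash-range routing; not Solr's own implementation. */
final class HashRangeRouter {

    /** One shard's inclusive slice of the signed 32-bit hash space. */
    static final class Range {
        final String shard;
        final int min;
        final int max;

        Range(String shard, int min, int max) {
            this.shard = shard;
            this.min = min;
            this.max = max;
        }

        boolean includes(int hash) {
            return hash >= min && hash <= max;
        }
    }

    private final List<Range> ranges;
    private final ToIntFunction<String> hash; // assumption: supplied by the caller

    HashRangeRouter(List<Range> ranges, ToIntFunction<String> hash) {
        this.ranges = new ArrayList<>(ranges);
        this.hash = hash;
    }

    /** Returns the name of the shard whose range contains the hash of docId. */
    String shardFor(String docId) {
        int h = hash.applyAsInt(docId);
        for (Range r : ranges) {
            if (r.includes(h)) {
                return r.shard;
            }
        }
        throw new IllegalStateException("no shard covers hash " + h);
    }

    /** Splits the full int range into n contiguous, near-equal slices (illustrative only). */
    static List<Range> evenRanges(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<Range> out = new ArrayList<>();
        long span = 1L << 32;             // size of the 32-bit hash space
        long start = Integer.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            long min = start + span * i / n;
            long max = start + span * (i + 1) / n - 1;
            out.add(new Range("shard" + (i + 1), (int) min, (int) max));
        }
        return out;
    }
}
```

To match a real cluster, the ranges would have to come from the collection's cluster state and the hash function would have to be the one the router actually uses; any other hash sends documents to the wrong shard.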
