Data Import


vishal jain
Hi,


I am new to Solr and am trying to move data from my RDBMS to Solr. I know
the available options are:
1) Post Tool
2) DIH
3) SolrJ (as ours is a J2EE application).

I want to know the recommended way to import data in a production
environment.
Will sending data via SolrJ in batches be faster than posting a CSV using
the Post tool?


Thanks,
Vishal

Re: Data Import

Sujay Bawaskar-2
Hi Vishal,

In my experience, DIH is the best option for indexing RDBMS data into
Solr. DIH with caching gives the best performance, and DIH nested entities
let you keep each query simple.
Also, SolrJ is good when you want your RDBMS updates made immediately
available in Solr. A DIH full import can be used to index all the data the
first time, or to rebuild the index if it becomes corrupted.

Thanks,
Sujay

--
Thanks,
Sujay P Bawaskar
M:+91-77091 53669

Re: Data Import

Shawn Heisey-2
In reply to this post by vishal jain
On 3/17/2017 3:04 AM, vishal jain wrote:
> I am new to Solr and am trying to move data from my RDBMS to Solr. I know the available options are:
> 1) Post Tool
> 2) DIH
> 3) SolrJ (as ours is a J2EE application).
>
> I want to know what is the recommended way for Data import in production
> environment. Will sending data via SolrJ in batches be faster than posting a csv using POST tool?

I've heard that CSV import runs EXTREMELY fast, but I have never tested
it.  The same threading problem that I discuss below would apply to
indexing this way.

DIH is extremely powerful, but it has one glaring problem:  It's
single-threaded, which means that only one stream of data is going into
Solr, and each batch of documents to be inserted must wait for the
previous one to finish inserting before it can start.  I do not know if
DIH batches documents or sends them in one at a time.  If you have a
manually sharded index, you can run DIH on each shard in parallel, but
each one will be single-threaded.  That single thread is pretty
efficient, but it's still only one thread.

Sending multiple index updates to Solr in parallel (multi-threading) is
how you radically speed up the Solr part of indexing.  This is usually
done with a custom indexing program, which might be written with SolrJ
or even in a completely different language.
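
As a rough illustration of the multi-threaded approach described above, here is a minimal Java sketch that splits documents into batches and sends them from several worker threads. `BatchSink`, `ParallelIndexerSketch`, and `indexInParallel` are hypothetical names made up for this example; `BatchSink` only stands in for whatever client actually posts to Solr (a SolrJ client, for instance) and is not a SolrJ type. The batch size and thread count are arbitrary and would need tuning against a real system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the client that sends one batch to Solr.
interface BatchSink {
    void send(List<String> batch);
}

class ParallelIndexerSketch {

    // Splits docs into batches of batchSize and sends them from a pool of
    // `threads` workers. Returns the number of documents sent.
    static int indexInParallel(List<String> docs, int batchSize, int threads,
                               BatchSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                // Copy the slice so each task owns its own batch.
                List<String> batch = new ArrayList<>(docs.subList(start, end));
                pending.add(pool.submit(() -> {
                    sink.send(batch);
                    sent.addAndGet(batch.size());
                }));
            }
            // Wait for every batch and surface the first failure, if any.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while indexing", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("a batch failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        // A fake sink that only counts; a real one would post to Solr.
        AtomicInteger received = new AtomicInteger();
        int sent = indexInParallel(docs, 100, 4,
                batch -> received.addAndGet(batch.size()));
        System.out.println("sent=" + sent + " received=" + received.get());
    }
}
```

The sketch only covers the Solr side. If the source system is the bottleneck, adding threads here will not help until data retrieval is parallelized as well.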

One thing to keep in mind with ANY indexing method:  Once the situation
is examined closely, most people find that it's not Solr that makes
their indexing slow.  The bottleneck is usually the source system -- how
quickly the data can be retrieved.  It usually takes a lot longer to
obtain the data than it does for Solr to index it.

Thanks,
Shawn


Re: Data Import

Alexandre Rafalovitch
I feel DIH is much better for prototyping, even though people do use
it in production. If you do want to use DIH, you may benefit from
reviewing the DIH-DB example I am currently rewriting in
https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
luceneMatchVersion in solrconfig.xml first).

CSV, etc., could be useful if you want to keep a history of past imports,
which is again useful during development as you evolve the schema.

SolrJ may actually be easiest/best for production since you already
have a Java stack.

The choice is yours in the end.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced



RE: Data Import

Liu, Daphne
I just want to share my recent project. I have successfully sent all our EDI documents to Cassandra 3.7 clusters and index them with Solr 6.3 Data Import through a JDBC Cassandra connector.
Since Cassandra is so fast for writing, the compression rate is around 13%, and all my documents can be kept in my Cassandra clusters' memory, we are very happy with the result.


Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / [hidden email]




Re: Data Import

OTH
In reply to this post by Sujay Bawaskar-2
>
> Also, SolrJ is good when you want your RDBMS updates made immediately
> available in Solr.

How can SolrJ be used to make RDBMS updates immediately available?
Thanks


Re: Data Import

Alexandre Rafalovitch
One assumes by hooking into the same code that updates the RDBMS, as
opposed to reverse engineering the changes by looking at the DB
content. This would be especially the case for deletes.

Regards,
   Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced



Re: Data Import

vishal jain
In reply to this post by Liu, Daphne
Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal


Re: Data Import

vishal jain
In reply to this post by Alexandre Rafalovitch
Thanks to all of you for the valuable input.
Being on a J2EE platform, I also felt that using SolrJ in a multi-threaded
environment would be a better choice for indexing RDBMS data into SolrCloud.
I will try a scheduler-triggered microservice to do the job using
SolrJ.

Regards,
Vishal


Re: Data Import

Erick Erickson
In reply to this post by Alexandre Rafalovitch
Or set a trigger on your RDBMS's main table to put the relevant
information in a different table (call it EVENTS) and have your SolrJ
consult the EVENTS table periodically. Essentially you're using the
EVENTS table as a queue where the trigger is the producer and the
SolrJ program is the consumer.

It's a polling solution though, so not event-driven. There's no
mechanism that I know of to have, say, your RDBMS push an event to DIH.
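
The EVENTS-table idea above can be sketched in Java roughly as follows. Everything here is an assumption made for illustration: `ChangeEvent` and `EventsPollerSketch` are hypothetical names, and an in-memory list stands in for the EVENTS table so the example runs without a database. In a real setup `record` would be the database trigger inserting a row, and `pollOnce` would be a JDBC query for rows newer than the last id seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// One row of the (hypothetical) EVENTS table.
final class ChangeEvent {
    final long id;         // ordering key, e.g. an auto-increment column
    final String docId;    // key of the row that changed
    final boolean deleted; // true when the source row was deleted

    ChangeEvent(long id, String docId, boolean deleted) {
        this.id = id;
        this.docId = docId;
        this.deleted = deleted;
    }
}

class EventsPollerSketch {
    // In-memory stand-in for the EVENTS table.
    private final List<ChangeEvent> events = new ArrayList<>();
    private long lastSeenId = 0;

    // Producer side: what the trigger would do, one row per change.
    synchronized void record(String docId, boolean deleted) {
        events.add(new ChangeEvent(events.size() + 1, docId, deleted));
    }

    // Consumer side: one polling pass. Hands every event newer than
    // lastSeenId to the consumer, in order, and remembers how far it got.
    synchronized int pollOnce(Consumer<ChangeEvent> consumer) {
        int handled = 0;
        for (ChangeEvent e : events) {
            if (e.id > lastSeenId) {
                consumer.accept(e);
                lastSeenId = e.id;
                handled++;
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        EventsPollerSketch table = new EventsPollerSketch();
        table.record("order-17", false);
        table.record("order-9", true);
        int handled = table.pollOnce(e ->
                System.out.println((e.deleted ? "delete " : "index ") + e.docId));
        System.out.println("handled=" + handled);
        System.out.println("second pass handled=" + table.pollOnce(e -> { }));
    }
}
```

A `ScheduledExecutorService` could call `pollOnce` at a fixed delay to get the periodic behaviour. Recording deletes as events is what lets the consumer remove documents from the index, which is hard to infer from the current table contents alone.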

Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
for this kind of problem..

Best,
Erick


Re: Data Import

OTH
Could the database trigger not just post the change to solr?


Re: Data Import

vishal jain
In reply to this post by Erick Erickson
Streaming the data through Kafka would be a good option if near-real-time
indexing is the key requirement.
In our application the RDBMS data is populated periodically by an ETL job,
so we don't need real-time indexing for now.

Cheers,
Vishal


Re: Data Import

Walter Underwood
In reply to this post by OTH
That fails if Solr is not available.

To avoid dropping updates, you need some kind of persistent queue. We use Amazon SQS for our incremental updates.
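
The point about not dropping updates can be illustrated with a small Java sketch: an update leaves the queue only after the send is confirmed, so a failed send keeps it queued for the next attempt. `RetryingDrainSketch`, `drain`, and `trySend` are hypothetical names, and the in-memory `Deque` is only a stand-in so the example runs on its own. It is not persistent and does not use the SQS API; a real setup would put a durable queue in its place.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.function.Predicate;

class RetryingDrainSketch {

    // Sends queued updates in order. An update is removed only after
    // trySend reports success, so a failed send leaves it (and everything
    // behind it) queued for a later attempt. Returns how many were sent.
    static int drain(Deque<String> queue, Predicate<String> trySend) {
        int sent = 0;
        while (!queue.isEmpty()) {
            String next = queue.peekFirst();
            if (!trySend.test(next)) {
                break; // Solr unavailable: stop and keep the update queued
            }
            queue.removeFirst();
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(Arrays.asList("u1", "u2", "u3"));
        boolean[] solrUp = {false};
        // Fake sender: succeeds only while solrUp[0] is true.
        Predicate<String> trySend = update -> solrUp[0];

        System.out.println("sent while down: " + drain(queue, trySend)
                + ", still queued: " + queue.size());
        solrUp[0] = true;
        System.out.println("sent after recovery: " + drain(queue, trySend)
                + ", still queued: " + queue.size());
    }
}
```

Posting straight from a database trigger has no such holding area, which is why an outage on the Solr side would lose the change.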

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


> On Mar 17, 2017, at 10:09 AM, OTH <[hidden email]> wrote:
>
> Could the database trigger not just post the change to solr?

RE: Data Import

Liu, Daphne
In reply to this post by vishal jain
No, I use the free version. I got the driver from someone else, and I can share it if you want to use Cassandra.
They modified it for me because the free JDBC driver I found would time out when a document was larger than 16 MB.

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / [hidden email]



-----Original Message-----
From: vishal jain [mailto:[hidden email]]
Sent: Friday, March 17, 2017 12:42 PM
To: [hidden email]
Subject: Re: Data Import

Hi Daphne,

Are you using DSE?


Thanks & Regards,
Vishal

On Fri, Mar 17, 2017 at 7:40 PM, Liu, Daphne <[hidden email]>
wrote:

> I just want to share my recent project. I have successfully sent all
> our EDI documents to Cassandra 3.7 clusters using Solr 6.3 Data Import
> JDBC Cassandra connector indexing our documents.
> Since Cassandra is so fast for writing, compression rate is around 13%
> and all my documents can be keep in my Cassandra clusters' memory, we
> are very happy with the result.
>
>
> Kind regards,
>
> Daphne Liu
> BI Architect - Matrix SCM
>
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL
> 32256 USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 /
> [hidden email]
>
>
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:[hidden email]]
> Sent: Friday, March 17, 2017 9:54 AM
> To: solr-user <[hidden email]>
> Subject: Re: Data Import
>
> I feel DIH is much better for prototyping, even though people do use
> it in production. If you do want to use DIH, you may benefit from
> reviewing the DIH-DB example I am currently rewriting in
> https://issues.apache.org/jira/browse/SOLR-10312 (may need to change
> luceneMatchVersion in solrconfig.xml first).
>
> CSV, etc, could be useful if you want to keep history of past imports,
> again useful during development, as you evolve schema.
>
> SolrJ may actually be easiest/best for production since you already
> have Java stack.
>
> The choice is yours in the end.
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>
>
> On 17 March 2017 at 08:56, Shawn Heisey <[hidden email]> wrote:
> > On 3/17/2017 3:04 AM, vishal jain wrote:
> > >> I am new to Solr and am trying to move data from my RDBMS to Solr.
> > >> I know the available options are:
> > >> 1) Post Tool
> > >> 2) DIH
> > >> 3) SolrJ (as ours is a J2EE application).
> > >>
> > >> I want to know what is the recommended way for Data import in
> > >> production environment. Will sending data via SolrJ in batches be
> > >> faster than posting a csv using POST tool?
> >
> > I've heard that CSV import runs EXTREMELY fast, but I have never
> > tested it.  The same threading problem that I discuss below would
> > apply to indexing this way.
> >
> > DIH is extremely powerful, but it has one glaring problem:  It's
> > single-threaded, which means that only one stream of data is going
> > into Solr, and each batch of documents to be inserted must wait for
> > the previous one to finish inserting before it can start.  I do not
> > know if DIH batches documents or sends them in one at a time.  If
> > you have a manually sharded index, you can run DIH on each shard in
> > parallel, but each one will be single-threaded.  That single thread
> > is pretty efficient, but it's still only one thread.
> >
> > Sending multiple index updates to Solr in parallel (multi-threading)
> > is how you radically speed up the Solr part of indexing.  This is
> > usually done with a custom indexing program, which might be written
> > with SolrJ or even in a completely different language.
> >
> > One thing to keep in mind with ANY indexing method:  Once the
> > situation is examined closely, most people find that it's not Solr
> > that makes their indexing slow.  The bottleneck is usually the
> > source system -- how quickly the data can be retrieved.  It usually
> > takes a lot longer to obtain the data than it does for Solr to index it.
> >
> > Thanks,
> > Shawn
> >
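
To make the multi-threading point above concrete, here is a minimal sketch of sending batches in parallel. It is only an illustration under stated assumptions: it is written in Python rather than SolrJ, and send_batch is a hypothetical callable standing in for whatever client call actually posts a batch to Solr and reports how many documents it sent. The batch size and worker count are arbitrary placeholders, not tuned values.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield consecutive slices of `docs`, each holding at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_parallel(docs, send_batch, batch_size=1000, workers=4):
    """Send batches to the indexer from several threads at once.

    `send_batch` is a hypothetical stand-in for the real client call; it
    must be thread-safe and return the number of documents it sent.
    Returns the total number of documents sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands one batch to each free worker thread; an
        # exception raised by any batch is re-raised here.
        counts = list(pool.map(send_batch, batches(docs, batch_size)))
    return sum(counts)
```

As noted above, this only speeds up the Solr side. If fetching rows from the source system is the slow part, adding threads here will not help much.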
This e-mail message is intended for the above named recipient(s) only. It may contain confidential information that is privileged. If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this e-mail and any attachment(s) is strictly prohibited. If you have received this e-mail by error, please immediately notify the sender by replying to this e-mail and deleting the message including any attachment(s) from your system. Thank you in advance for your cooperation and assistance. Although the company has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.

Re: Data Import

OTH
In reply to this post by vishal jain
Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)

@Wunder:
I'm assuming that an update would fail if Solr is unavailable, not just
when posting via, say, a DB trigger, but also when posting through SolrJ
(which is what I'm using for now)?  So even with SolrJ, would it be a good
idea to use queuing software?

Thanks

On Fri, Mar 17, 2017 at 10:12 PM, vishal jain <[hidden email]> wrote:

> Streaming the data through kafka would be a good option if near real time
> data indexing is the key requirement.
> In our application the RDBMS data is populated by an ETL job periodically
> so we don't need real time data indexing for now.
>
> Cheers,
> Vishal
>
> On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson <[hidden email]>
> wrote:
>
> > Or set a trigger on your RDBMS's main table to put the relevant
> > information in a different table (call it EVENTS) and have your SolrJ
> > consult the EVENTS table periodically. Essentially you're using the
> > EVENTS table as a queue where the trigger is the producer and the
> > SolrJ program is the consumer.
> >
> > It's a polling solution though, so not event-driven. There's no
> > mechanism that I know of have, say, your RDBMS push an event to DIH
> > for instance.
> >
> > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > for this kind of problem..
> >
> > Best,
> > Erick
> >
> > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> > <[hidden email]> wrote:
> > > One assumes by hooking into the same code that updates RDBMS, as
> > > opposed to be reverse engineering the changes from looking at the DB
> > > content. This would be especially the case for Delete changes.
> > >
> > > Regards,
> > >    Alex.
> > > ----
> > > http://www.solr-start.com/ - Resources for Solr users, new and
> > > experienced
> > >
> > >
> > > On 17 March 2017 at 11:37, OTH <[hidden email]> wrote:
> > >>>
> > >>> Also, solrj is good when you want your RDBMS updates make immediately
> > >>> available in solr.
> > >>
> > >> How can SolrJ be used to make RDBMS updates immediately available?
> > >> Thanks
> > >>

Re: Data Import

Mike Thomsen
If Solr is down, then adding through SolrJ would fail as well. Kafka's new
API has some great features for this sort of thing. The new client API is
designed to be run in a long-running loop where you poll for new messages
with a defined timeout (e.g. consumer.poll(1000) for 1s).
So if Solr becomes unstable or goes down, it's easy to have the consumer
just stop and either wait until Solr comes back up or save the data to
disk/commit the Kafka offsets to ZK and stop running.
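
A rough sketch of that loop, with loud caveats: this is Python, not the Kafka client. The standard-library queue.Queue stands in for the topic, messages.get(timeout=...) plays the role of consumer.poll(...), and index_batch, SolrDown and backoff are hypothetical names invented for the example. The point is only the control flow: a message is not dropped, and its offset would not be committed, until Solr has accepted it.

```python
import queue


class SolrDown(Exception):
    """Raised by the (hypothetical) indexing function when Solr is unreachable."""


def run_consumer(messages, index_batch, poll_timeout=1.0,
                 max_idle_polls=3, backoff=lambda: None):
    """Poll `messages` in a loop and index each message, retrying on failure.

    Returns the number of messages indexed. The loop ends after
    `max_idle_polls` consecutive empty polls so that the sketch terminates;
    a real consumer would normally keep running.
    """
    indexed = 0
    idle = 0
    pending = None  # message fetched but not yet accepted by Solr
    while idle < max_idle_polls:
        if pending is None:
            try:
                # Stand-in for a poll with a timeout.
                pending = messages.get(timeout=poll_timeout)
                idle = 0
            except queue.Empty:
                idle += 1
                continue
        try:
            index_batch([pending])
        except SolrDown:
            backoff()  # e.g. sleep; the pending message is kept and retried
            continue
        indexed += 1
        pending = None  # only at this point would the offset be committed
    return indexed
```

As written, the loop retries forever while Solr stays down. The other option mentioned above, saving the data to disk and stopping, would replace the retry branch.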

On Fri, Mar 17, 2017 at 1:24 PM, OTH <[hidden email]> wrote:

> Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)
>
> @Wunder:
> I'm assuming, that updating to Solr would fail if Solr is unavailable not
> just if posting via say a DB trigger, but probably also if trying to post
> through SolrJ?  (Which is what I'm using for now.)  So, even if using
> SolrJ, it would be a good idea to use a queuing software?
>
> Thanks