Alternative for DIH

Alternative for DIH

Srinivas Kashyap-2
Hello,

As we all know, DIH is single-threaded and has its own issues while indexing.

I learned that we can write our own APIs to pull data from the DB and push it into Solr. One approach I heard of was using Apache Kafka for this purpose.

Can any of you send me links and guides on using Apache Kafka to pull data from a DB and push it into Solr?

If there are any other alternatives, please suggest them.

Thanks and Regards,
Srinivas Kashyap

Re: Alternative for DIH

Jörn Franke
I recommend looking at the underlying problem that you are trying to solve. Writing your own loader requires a thorough technical design (e.g. recoverability in case of errors, stopping when the user requests it, proper multithreading without overloading the cluster, etc.) - I have not seen many that were well written.
Furthermore, your performance issue might be due to how you configured Solr.

You can multithread in DIH by having multiple DIH instances each work on part of the data.

Where the data comes from (e.g. Kafka) does not matter. I really recommend looking into the problem you are trying to solve and then perhaps having your design reviewed here.
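
As a rough illustration of running several DIH instances over parts of the data, the sketch below splits a primary-key range into slices and builds one full-import URL per handler. It is only a sketch under assumptions: the handler names (/dataimport1, ...), the core name, and the startId/endId parameters are hypothetical, each handler would have to be registered separately in solrconfig.xml, and the DIH query would have to pick the bounds up through request-parameter substitution (for example ${dataimporter.request.startId}).

```python
from urllib.parse import urlencode


def partition_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into up to `parts` contiguous chunks."""
    total = max_id - min_id + 1
    size, extra = divmod(total, parts)
    ranges = []
    start = min_id
    for i in range(parts):
        # The first `extra` chunks take one additional ID each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            continue
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def dih_urls(base_url, core, handlers, min_id, max_id):
    """Build one full-import URL per DIH handler, each covering one ID slice."""
    slices = partition_ranges(min_id, max_id, len(handlers))
    urls = []
    for handler, (low, high) in zip(handlers, slices):
        params = urlencode({
            "command": "full-import",
            "clean": "false",  # do not wipe the index when each slice starts
            "startId": low,    # hypothetical custom request parameter
            "endId": high,     # hypothetical custom request parameter
        })
        urls.append(f"{base_url}/{core}{handler}?{params}")
    return urls
```

Each URL would then be requested once to start its slice. How many slices the database and the Solr cluster can handle in parallel depends on the setup and needs to be measured.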

> On 31.01.2019 at 11:55, Srinivas Kashyap <[hidden email]> wrote:
>
> Hello,
>
> As we all know, DIH is single-threaded and has its own issues while indexing.
>
> I learned that we can write our own APIs to pull data from the DB and push it into Solr. One approach I heard of was using Apache Kafka for this purpose.
>
> Can any of you send me links and guides on using Apache Kafka to pull data from a DB and push it into Solr?
>
> If there are any other alternatives, please suggest them.
>
> Thanks and Regards,
> Srinivas Kashyap

Re: Alternative for DIH

Mikhail Khludnev-2
In reply to this post by Srinivas Kashyap-2
Hello,

I put this deck together some time ago. It might be useful for choosing an approach.
https://docs.google.com/presentation/d/e/2PACX-1vQzi3QOZAwLh_t3zs1gH9EGCB2HKUgiN3WJRGHpULyA-GleCrQ41dIOINa18h_XG64BX5D_ZG6jKmXL/pub?start=false&loop=false&delayms=3000
Note that, as far as I understand, Lucidworks' answer to this is Spark.


--
Sincerely yours
Mikhail Khludnev

Re: Alternative for DIH

Alexandre Rafalovitch
Apache NiFi may also be something of interest: https://nifi.apache.org/

Regards,
   Alex.


Re: Alternative for DIH

Erick Erickson
Depending on how complicated you need this to be, you can just write
your own in SolrJ, see:

https://lucidworks.com/2012/02/14/indexing-with-solrj/
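
The linked post uses SolrJ (Java). The same idea can be sketched in Python: read rows from the database, group them into batches, and send each batch to Solr's JSON update handler. This is a minimal sketch, not a production loader: sqlite3 stands in for the real database, the function names are made up for illustration, and it has none of the error recovery, cancellation, or throttling that a real loader needs.

```python
import json
import sqlite3
import urllib.request


def batches(rows, size):
    """Yield lists of at most `size` rows so each request stays bounded."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_batch(update_url, docs):
    """POST one batch of documents as a JSON array to a Solr update URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_table(conn, query, send, batch_size=1000):
    """Run `query`, turn each row into a dict, and hand batches to `send`."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    docs = (dict(zip(columns, row)) for row in cursor)
    sent = 0
    for batch in batches(docs, batch_size):
        send(batch)
        sent += len(batch)
    return sent
```

A call might look like `index_table(conn, "SELECT id, name FROM items", lambda docs: post_batch("http://localhost:8983/solr/mycore/update?commitWithin=10000", docs))`, where the table, core name, and URL are placeholders. Passing `send` in as a function keeps the database side testable without a running Solr.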

You haven't said a lot about the characteristics of your situation.
Are you talking 1B rows from the DB? 1M? What is the pain point?
Because until one gets to massive amounts of data, 9 times out of 10
poor indexing performance is a result of the DB query being used
executing very slowly.
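
One way to check whether the DB query is the slow part is to run it on its own, drain the result set without sending anything to Solr, and compare that time with the total import time. The helper below is a hypothetical sketch against a Python DB-API connection (sqlite3 is used here only as a stand-in); the fetch size of 1000 is an arbitrary choice.

```python
import sqlite3
import time


def time_query(conn, query, fetch_size=1000):
    """Fetch every row of `query` without indexing; return (row_count, seconds)."""
    started = time.perf_counter()
    cursor = conn.cursor()
    cursor.execute(query)
    count = 0
    while True:
        rows = cursor.fetchmany(fetch_size)
        if not rows:
            break
        count += len(rows)
    return count, time.perf_counter() - started
```

If this alone takes most of the time the import takes, changing the indexing tool is unlikely to help much.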

Before jumping to a solution, it'd be good to know
1> why you're dissatisfied with DIH, i.e. what is the problem you're seeing
2> some information about your situation, size of DB, how fast DIH
works now etc.

This latter is important, 'cause it's a totally different question if,
say, your problem statement is
"it takes 8 hours to import 1,000,000,000 rows and the docs are 1M long"
vs.
"it takes 8 hours to import 100,000 rows that are 1K each".

Until there are answers to questions like that it's not clear at all
you even _have_ a problem that's solvable by any of the suggestions so
far.
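
To make the difference between those two problem statements concrete, the rates can be worked out directly. The sketch below assumes "1M" means 1,000,000 bytes and "1K" means 1,000 bytes per document, which the original does not spell out.

```python
def throughput(rows, bytes_per_row, hours):
    """Return (rows per second, bytes per second) for an import of the given size."""
    seconds = hours * 3600
    rows_per_second = rows / seconds
    return rows_per_second, rows_per_second * bytes_per_row


# "8 hours to import 1,000,000,000 rows and the docs are 1M long"
big = throughput(1_000_000_000, 1_000_000, 8)

# "8 hours to import 100,000 rows that are 1K each"
small = throughput(100_000, 1_000, 8)

for label, (rows_per_second, bytes_per_second) in (("big", big), ("small", small)):
    print(f"{label}: {rows_per_second:,.1f} rows/s, {bytes_per_second:,.0f} bytes/s")
```

The first case works out to roughly 34,700 rows per second, the second to about 3.5 rows per second. Under these assumptions the two differ by four orders of magnitude in row rate, so they call for very different kinds of investigation.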

Best,
Erick


Re: Alternative for DIH

Nish Karve
In reply to this post by Srinivas Kashyap-2
If you absolutely want to use Kafka after trying other mechanisms, I would
suggest Kafka Connect. Jeremy Custenborder has a good Kafka connector that
acts as a sink to Solr. You can define your own Avro schemas on the Kafka
topic that adhere to your Solr schema to give you that degree of control.
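
As an illustration of keeping an Avro schema aligned with a Solr schema, the sketch below derives an Avro record schema from a dictionary of Solr field names and types. The type mapping and the function are hypothetical and cover only single-valued fields; this says nothing about how the connector itself is configured.

```python
import json

# Hypothetical mapping from common Solr field types to Avro primitive types.
SOLR_TO_AVRO = {
    "string": "string",
    "text_general": "string",
    "pint": "int",
    "plong": "long",
    "pfloat": "float",
    "pdouble": "double",
    "boolean": "boolean",
}


def avro_schema(record_name, solr_fields, required=("id",)):
    """Build an Avro record schema whose fields mirror the given Solr fields."""
    fields = []
    for name, solr_type in solr_fields.items():
        avro_type = SOLR_TO_AVRO[solr_type]
        if name in required:
            fields.append({"name": name, "type": avro_type})
        else:
            # Optional fields become a union with null, defaulting to null.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
    return {"type": "record", "name": record_name, "fields": fields}


schema = avro_schema("Product", {"id": "string", "name": "text_general", "price": "pdouble"})
print(json.dumps(schema, indent=2))
```

Generating the Avro schema from one field list, rather than maintaining it by hand, is one way to keep messages on the topic from drifting away from what Solr expects.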

We have used the Lucidworks Spark Connector to index 500 million documents
into Solr within 4 hours. We had around 70 fields per document. This is a
very good choice if you want to sync data from a DB to Solr. Have an interim
step that uses an ETL tool like Ab Initio to perform the basic joins on your
tables and extract the data as CSV for the Spark Connector. All the hard
work of opening and managing the connections with Solr is done in the
connector. Please note that this connector indexed data into a live Solr
cluster, unlike offline indexing using MapReduce.

Thanks
Nishant
