Full import alternatives

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Full import alternatives

Jesus Olivan
Hi!

we're trying to launch a full import of 375 millions of docs aprox. from a
MySQL database to our solrcloud cluster. Until now, this full import
process takes around 24/27 hours to finish due to an huge import query
(several group bys, left joins, etc), but after another import query
modification (adding more complexity), we're unable to execute this full
import from MySQL.

We've done some research about migrating to PostgreSQL, but this option is
now a real option at this time, because it implies a big refatoring from
several dev teams.

Is there some alternative ways to perform successfully this full import
process?

Any ideas are welcome :)

Thanks in advance!
Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Shawn Heisey-2
On 4/13/2018 10:11 AM, Jesus Olivan wrote:

> we're trying to launch a full import of 375 millions of docs aprox. from a
> MySQL database to our solrcloud cluster. Until now, this full import
> process takes around 24/27 hours to finish due to an huge import query
> (several group bys, left joins, etc), but after another import query
> modification (adding more complexity), we're unable to execute this full
> import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this option is
> now a real option at this time, because it implies a big refatoring from
> several dev teams.
>
> Is there some alternative ways to perform successfully this full import
> process?

DIH is a capable tool, and for what it does, it's remarkably efficient.

It can't really be made any faster, because it's single threaded.  To
get increased index speed with Solr, you must index documents from
several sources/processes/threads at the same time.  Writing custom
software that can retrieve information from your source, build the
documents you require, and send several update requests simultaneously
will yield the best results.  The source itself may be a bottleneck
though -- this is frequently the case, and Solr is often MUCH faster
than the information source.

You said that you're unable to execute an updated import from MySQL. 
What exactly happens when you try?  Are there any errors in your solr
logfile?

I'm not going to debate whether MySQL or PostgreSQL is the better
solution.  For my indexes, my source data is in MySQL.  It works well,
but full rebuilds using DIH are slower than I would like -- because it's
single-threaded.  Our overall system architecture would probably be
improved by a switch to PostgreSQL, but it would be an extremely
time-consuming transition process.  We aren't having any real issues
with MySQL, so we have no incentive to spend the required effort.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Erick Erickson
In reply to this post by Jesus Olivan
_how_ are you importing? DIH? SolrJ?

Here's an article about using SolrJ
https://lucidworks.com/2012/02/14/indexing-with-solrj/

But without more details it's really impossible to say much. Things
I've done in the past:
1> use SolrJ and partition the job up amongst a bunch of clients each
of which works on a subset of docs. This requires, of course, that
there's a way to partition the import.
2> For joins and the like, I've sometimes been able to cache data in
local storage (SolrJ) and use that rather than using the joins. May
not be possible of course depending on the size of some of your
tables.
3> with DIH, there are some caching capabilities although I confess I
don't know the pros and cons.
4> Work with your DB administrator to tune your query. Sometimes this
means creating a view, sometimes adding indexes sometimes.....

Best,
Erick

On Fri, Apr 13, 2018 at 9:11 AM, Jesus Olivan <[hidden email]> wrote:

> Hi!
>
> we're trying to launch a full import of 375 millions of docs aprox. from a
> MySQL database to our solrcloud cluster. Until now, this full import
> process takes around 24/27 hours to finish due to an huge import query
> (several group bys, left joins, etc), but after another import query
> modification (adding more complexity), we're unable to execute this full
> import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this option is
> now a real option at this time, because it implies a big refatoring from
> several dev teams.
>
> Is there some alternative ways to perform successfully this full import
> process?
>
> Any ideas are welcome :)
>
> Thanks in advance!
Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Mikhail Khludnev-2
In reply to this post by Jesus Olivan
Jesus,
Usually zipper join (aka external merge in old ETL world) and explicit
partitioning is able to boost import.
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors

On Fri, Apr 13, 2018 at 7:11 PM, Jesus Olivan <[hidden email]>
wrote:

> Hi!
>
> we're trying to launch a full import of 375 millions of docs aprox. from a
> MySQL database to our solrcloud cluster. Until now, this full import
> process takes around 24/27 hours to finish due to an huge import query
> (several group bys, left joins, etc), but after another import query
> modification (adding more complexity), we're unable to execute this full
> import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this option is
> now a real option at this time, because it implies a big refatoring from
> several dev teams.
>
> Is there some alternative ways to perform successfully this full import
> process?
>
> Any ideas are welcome :)
>
> Thanks in advance!
>



--
Sincerely yours
Mikhail Khludnev
Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Jesus Olivan
In reply to this post by Shawn Heisey-2
Hi Shawn,

thanks for your answer. It happens that when we launch full import process
didn't finished (we wait for more than 60 hours last time, and we cancelled
it, because this is not an acceptable time for us) There weren't any errors
in solr logfile simply because it was working fine. The problem is that it
lasted eternally and didn't finish. We tried it on Aurora cluster under
AWS, and after 20 hours of work, it failed due to lack of space in Aurora
tmp folder.



2018-04-13 18:41 GMT+02:00 Shawn Heisey <[hidden email]>:

> On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> > we're trying to launch a full import of 375 millions of docs aprox. from
> a
> > MySQL database to our solrcloud cluster. Until now, this full import
> > process takes around 24/27 hours to finish due to an huge import query
> > (several group bys, left joins, etc), but after another import query
> > modification (adding more complexity), we're unable to execute this full
> > import from MySQL.
> >
> > We've done some research about migrating to PostgreSQL, but this option
> is
> > now a real option at this time, because it implies a big refatoring from
> > several dev teams.
> >
> > Is there some alternative ways to perform successfully this full import
> > process?
>
> DIH is a capable tool, and for what it does, it's remarkably efficient.
>
> It can't really be made any faster, because it's single threaded.  To
> get increased index speed with Solr, you must index documents from
> several sources/processes/threads at the same time.  Writing custom
> software that can retrieve information from your source, build the
> documents you require, and send several update requests simultaneously
> will yield the best results.  The source itself may be a bottleneck
> though -- this is frequently the case, and Solr is often MUCH faster
> than the information source.
>
> You said that you're unable to execute an updated import from MySQL.
> What exactly happens when you try?  Are there any errors in your solr
> logfile?
>
> I'm not going to debate whether MySQL or PostgreSQL is the better
> solution.  For my indexes, my source data is in MySQL.  It works well,
> but full rebuilds using DIH are slower than I would like -- because it's
> single-threaded.  Our overall system architecture would probably be
> improved by a switch to PostgreSQL, but it would be an extremely
> time-consuming transition process.  We aren't having any real issues
> with MySQL, so we have no incentive to spend the required effort.
>
> Thanks,
> Shawn
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Shawn Heisey-2
On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> thanks for your answer. It happens that when we launch full import process
> didn't finished (we wait for more than 60 hours last time, and we cancelled
> it, because this is not an acceptable time for us) There weren't any errors
> in solr logfile simply because it was working fine. The problem is that it
> lasted eternally and didn't finish. We tried it on Aurora cluster under
> AWS, and after 20 hours of work, it failed due to lack of space in Aurora
> tmp folder.

375 million documents importing from MySQL with one DIH import is going
to take quite a while.

The last full rebuild I did of my main index took 21.61 hours.  This is
an index where six large shards build simultaneously, using DIH, each
one having more than 30 million documents.  If I were to build it as a
single 180 million document import, it would probably take 5 days, maybe
longer.

We had another index (since retired) that had more than 400 million
total documents, built similarly with multiple shards at the same time. 
The last rebuild I can remember on that index took about two days.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Jesus Olivan
hi Shawn,

first of all, thanks for your answer.

How you import simultaneously these 6 shards?

2018-04-13 19:30 GMT+02:00 Shawn Heisey <[hidden email]>:

> On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> > thanks for your answer. It happens that when we launch full import
> process
> > didn't finished (we wait for more than 60 hours last time, and we
> cancelled
> > it, because this is not an acceptable time for us) There weren't any
> errors
> > in solr logfile simply because it was working fine. The problem is that
> it
> > lasted eternally and didn't finish. We tried it on Aurora cluster under
> > AWS, and after 20 hours of work, it failed due to lack of space in Aurora
> > tmp folder.
>
> 375 million documents importing from MySQL with one DIH import is going
> to take quite a while.
>
> The last full rebuild I did of my main index took 21.61 hours.  This is
> an index where six large shards build simultaneously, using DIH, each
> one having more than 30 million documents.  If I were to build it as a
> single 180 million document import, it would probably take 5 days, maybe
> longer.
>
> We had another index (since retired) that had more than 400 million
> total documents, built similarly with multiple shards at the same time.
> The last rebuild I can remember on that index took about two days.
>
> Thanks,
> Shawn
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Full import alternatives

Shawn Heisey-2
On 4/13/2018 11:34 AM, Jesus Olivan wrote:
> first of all, thanks for your answer.
>
> How you import simultaneously these 6 shards?

I'm not running in SolrCloud mode, so Solr doesn't know that each shard
is part of a larger index.  What I'm doing would probably not work in
SolrCloud mode without making some significant changes.

On each of the cores representing a shard, I have a DIH config.  When I
do a full rebuild, I send a full-import DIH command to all of the
shards, and each one makes an SQL query to MySQL, all of them running in
parallel.

Thanks,
Shawn