Out of memory error during full import

Srinivas Kashyap-2
Hello,

I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the child entities in data-config.xml. When I try to do a full import, I'm getting an OutOfMemoryError (Java heap space). I have increased the heap allocation to the maximum extent possible. Is there a workaround to do the initial data load without running into this error?

I found that the 'batchSize=-1' parameter needs to be specified in the dataSource for MySQL; is there a way to specify this for other databases as well?

Thanks and Regards,
Srinivas Kashyap
Re: Out of memory error during full import

Shawn Heisey-2
On 2/4/2016 12:18 AM, Srinivas Kashyap wrote:
> I have implemented 'SortedMapBackedCache' in my SqlEntityProcessor for the child entities in data-config.xml. When I try to do a full import, I'm getting an OutOfMemoryError (Java heap space). I have increased the heap allocation to the maximum extent possible. Is there a workaround to do the initial data load without running into this error?
>
> I found that the 'batchSize=-1' parameter needs to be specified in the dataSource for MySQL; is there a way to specify this for other databases as well?

Setting batchSize to -1 in the DIH config translates to a call to
'setFetchSize(Integer.MIN_VALUE)' on the JDBC Statement object.  This is
how result streaming is turned on in MySQL.
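As a sketch, a MySQL dataSource entry in data-config.xml with streaming enabled might look like this (the host, database, credentials, and driver class here are placeholders, not taken from the original config):

```xml
<dataConfig>
  <!-- batchSize="-1" causes DIH to call setFetchSize(Integer.MIN_VALUE)
       on the JDBC Statement, which the MySQL driver interprets as
       "stream rows one at a time instead of buffering the whole result" -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost:3306/mydb"
              user="solr"
              password="secret"
              batchSize="-1"/>
</dataConfig>
```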

The method for doing this with other JDBC implementations is likely to
be different.  The Microsoft driver for SQL Server uses a URL parameter,
and newer versions of that particular driver have the streaming behavior
enabled by default.  I have no idea how to do it for any other driver;
you would need to ask the driver's author.
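For the SQL Server case, the URL parameter in question is (to the best of my knowledge, per Microsoft's driver documentation) 'responseBuffering'; a sketch of a dataSource using it, with placeholder host and credentials:

```xml
<!-- Sketch: responseBuffering=adaptive asks the Microsoft JDBC driver to
     stream rows as they are consumed rather than buffer the entire
     result set.  Newer driver versions use adaptive buffering by default,
     so this parameter may be unnecessary there. -->
<dataSource type="JdbcDataSource"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://dbhost:1433;databaseName=mydb;responseBuffering=adaptive"
            user="solr"
            password="secret"/>
```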

When you turn on caching (SortedMapBackedCache), you are asking Solr to
hold all of the data received in memory -- very similar to what happens
when result streaming is not turned on.  When the SQL result is very
large, this can require a LOT of memory.  In situations like that,
you'll just have to remove the caching.  One alternative to child
entities is to do a query using JOIN in a single entity, so that all the
data you need is returned by a single SQL query and the heavy lifting
is done by the database server instead of Solr.
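A minimal sketch of that JOIN alternative: rather than a parent entity with a cached child entity, a single entity whose query joins both tables (all table, column, and field names here are hypothetical):

```xml
<!-- One entity whose SQL query performs the join itself, so the database
     server does the work and Solr never has to cache the child rows. -->
<entity name="item"
        query="SELECT p.id, p.name, c.detail
               FROM parent p
               JOIN child c ON c.parent_id = p.id">
  <field column="id"     name="id"/>
  <field column="name"   name="name"/>
  <field column="detail" name="detail"/>
</entity>
```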

The MySQL database that serves as the information source for *my* Solr
index is hundreds of gigabytes in size, so caching it is not possible
for me.  The batchSize=-1 option is the only way to get the import to work.

Thanks,
Shawn