Incrementally updating a VERY LARGE field - Is this possible?


Incrementally updating a VERY LARGE field - Is this possible?

vybe3142
Some days ago, I posted about an issue with SOLR running out of memory when attempting to index large text files (say 300 MB). Details at http://lucene.472066.n3.nabble.com/Solr-Tika-crashing-when-attempting-to-index-large-files-td3846939.html

Two things I need to point out:

1. I don't need Tika for content extraction as the files are already in plain text format.
2. The heap space error was caused by a futile Tika/SOLR attempt at creating the corresponding huge XML document in memory.

I've decided to develop a custom handler that
1. reads the file text directly, and
2. creates a SOLR document and adds the text data directly to the corresponding field.

One approach I've taken is to read manageable chunks of text data sequentially from the file and process them. We've used this approach successfully with Lucene in the past, and I'm attempting to make it work with SOLR too. I got most of the work done yesterday, but need a bit of guidance w.r.t. point 2.

How can I update the same field multiple times? Looking at the SOLR source, processor.addField() merely
a. adds to the in-memory field map, and
b. attempts to write EVERYTHING to the index later on.

In my situation, (a) eventually causes a heap space error.

Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at com....solr.handler.text.BasicDocLoader.load(BasicDocLoader.java:204)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)


Here's part of the handler code.
        String fileDataChunk = ....  // read a chunk of data from the file, as many times as needed
        doc.addField("text", fileDataChunk);  // ==> need to add to the multi-valued "text" field several times
        .....
        templateAdd.solrDoc = doc;
        processor.processAdd(templateAdd);
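
To make the intent clearer, here is a fuller sketch of the chunked loop I'm describing (a rough sketch only; 'reader' wrapping the plain-text file, and 'templateAdd' and 'processor' are assumed to be set up elsewhere in the handler, as in the snippet above; assumes org.apache.solr.common.SolrInputDocument is imported):

        // Read the file in manageable chunks and append each chunk as another
        // value of the multi-valued "text" field, instead of building one huge String.
        SolrInputDocument doc = new SolrInputDocument();
        char[] buffer = new char[1 << 20];  // ~1M chars per chunk
        int read;
        while ((read = reader.read(buffer)) != -1) {
            String fileDataChunk = new String(buffer, 0, read);
            doc.addField("text", fileDataChunk);
        }
        templateAdd.solrDoc = doc;
        processor.processAdd(templateAdd);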




Thanks much


Re: Incrementally updating a VERY LARGE field - Is this possible?

Ravish Bhagdev
Updating a single field is not possible in Solr. The whole record has to be rewritten.

300 MB is still not that big a file. Have you tried doing the indexing (if it's only a one-time thing) by giving it ~2 GB of -Xmx?
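
For example, assuming the stock Jetty-based example distribution (a rough sketch; adjust for however you actually start Solr):

    java -Xmx2g -jar start.jar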

A single file of that size is strange! May I ask what it is?

Rav


Re: Incrementally updating a VERY LARGE field - Is this possible?

Mikhail Khludnev
There is https://issues.apache.org/jira/browse/LUCENE-3837 but I suppose
it's too far from completion.

--
Sincerely yours
Mikhail Khludnev

Re: Incrementally updating a VERY LARGE field - Is this possible?

Ravish Bhagdev
Yes, I think there are good reasons why it works like that. The focus of a search system is to be efficient on the query side, at the cost of being less efficient on storage.

You must, however, also note that by default a field's length is limited to 10000 tokens in solrconfig.xml, which you may also need to modify. But I guess if it's going out of memory you might have already done this?
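
For reference, here is roughly what that setting looks like in a Solr 3.x-era solrconfig.xml (a sketch; the exact parent section can vary between versions):

    <!-- Maximum number of tokens indexed per field; anything beyond this is
         silently dropped. Raise it if large fields are being truncated. -->
    <maxFieldLength>10000</maxFieldLength>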

Ravish


Re: Incrementally updating a VERY LARGE field - Is this possible?

vybe3142
In reply to this post by Ravish Bhagdev
Thanks.

Increasing the max heap space is not a scalable option, as it reduces the system's ability to scale with multiple concurrent index requests.

The use case is indexing a set of text files that we have no control over, i.e., they could be small or large.

Re: Incrementally updating a VERY LARGE field - Is this possible?

vybe3142
In reply to this post by Mikhail Khludnev

> Updating a single field is not possible in solr.  The whole record has to
> be rewritten.

Unfortunate. Lucene allows it.

Re: Incrementally updating a VERY LARGE field - Is this possible?

Yonik Seeley-2-2
On Wed, Apr 4, 2012 at 3:14 PM, vybe3142 <[hidden email]> wrote:
>
>> Updating a single field is not possible in solr.  The whole record has to
>> be rewritten.
>
> Unfortunate. Lucene allows it.

I think you're mistaken - the same limitations apply to Lucene.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: Incrementally updating a VERY LARGE field - Is this possible?

Walter Underwood
I believe we are talking about two different things. The original question was about incrementally building up a field during indexing, right?

After a document is committed, a field cannot be separately updated, that is true in both Lucene and Solr.

wunder


Re: Incrementally updating a VERY LARGE field - Is this possible?

jmlucjav
In reply to this post by vybe3142
Depending on your JVM version, -XX:+UseCompressedStrings may help alleviate the problem. It helped me before.
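
For example (a rough sketch, assuming a JVM release that still supports the flag; it existed in some Java 6 updates and was removed in later JVMs):

    java -XX:+UseCompressedStrings -Xmx1g -jar start.jar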

xab

Re: Incrementally updating a VERY LARGE field - Is this possible?

vybe3142
In reply to this post by Yonik Seeley-2-2
 
Yonik Seeley-2-2 wrote
I think you're mistaken - the same limitations apply to Lucene.
You're correct (and I stand corrected).

I looked at our older codebase that used Lucene. I need to dig deeper to understand why it doesn't crash when invoking addField() multiple times, once per portion of the large text data, whereas SOLR does. According to the developer who wrote that code, we resorted to multiple addField() invocations precisely to address the heap space issue.
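
For anyone following along, here is a minimal method sketch of the pattern from that older Lucene code as I understand it (illustrative only; the method and variable names are assumptions, not the actual codebase, and it targets the Lucene 3.x Field API):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Populate the "text" field chunk by chunk so no single String holds the whole file.
    static void indexLargeTextFile(IndexWriter writer, String path) throws IOException {
        Document doc = new Document();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            char[] buffer = new char[1 << 20];  // ~1M chars per chunk
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Each chunk becomes another value of the same field.
                doc.add(new Field("text", new String(buffer, 0, read),
                                  Field.Store.NO, Field.Index.ANALYZED));
            }
        } finally {
            reader.close();
        }
        writer.addDocument(doc);
    }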

I'll post back.