Indexing longer documents using Solr...memory issue after index grows to about 800 MB...


Indexing longer documents using Solr...memory issue after index grows to about 800 MB...

Ravish Bhagdev
Hi,

The problem:

- I have about 11K html documents to index.
- I'm trying to index these documents (along with 3 more small string
fields) so that when I search within the "doc" field (the field with the
html file content), I can get results with snippets or highlights, as I
do when using Nutch.
- While going through the Wiki I noticed that if I need highlighting
on a particular field, I have to make sure it is both indexed and stored.
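
For reference, this is roughly what that requirement looks like in
schema.xml (a sketch only; the field name "doc" and type "text" come
from the setup described above, the rest is assumed):

```xml
<!-- schema.xml: a field must be both indexed and stored for Solr
     to be able to highlight it and return its content. -->
<field name="doc" type="text" indexed="true" stored="true"/>
```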

But when I try to do the above, after indexing about 3K files, which
creates an index of about 800 MB (which is fine, as the files are quite
lengthy), it keeps throwing out-of-heap-space errors.

Things I've tried without much help:

- Increasing the memory given to Tomcat
- Playing around with autoCommit settings (both document count and time)
- Reducing mergeFactor to 5
- Reducing maxBufferedDocs to 100
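
For anyone following along, the Solr-side settings above live in
solrconfig.xml; a sketch of the kind of values being tried (the exact
numbers here are illustrative, not recommendations, and Tomcat's heap is
raised separately, e.g. via JAVA_OPTS="-Xmx1024m"):

```xml
<!-- solrconfig.xml (Solr 1.x era): the indexing knobs tried above. -->
<mainIndex>
  <mergeFactor>5</mergeFactor>
  <maxBufferedDocs>100</maxBufferedDocs>
</mainIndex>
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>   <!-- commit every N documents -->
    <maxTime>60000</maxTime>  <!-- or every N milliseconds -->
  </autoCommit>
</updateHandler>
```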

My question is also: if it's required to store fields in the index to
be able to do highlighting/return field content, how does Nutch/Lucene
do it without that? (The index created for the same documents using
Nutch is much, much smaller.)

Also, when querying partially added documents with highlighting turned
on (for a particular field), it doesn't seem to have any effect.
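
For what it's worth, a highlighting request of the kind described would
look roughly like this (host, port, and field name are assumptions about
the setup):

```
http://localhost:8983/solr/select?q=doc:someterm&hl=true&hl.fl=doc
```

One thing worth checking: documents added since the last commit are not
visible to searches at all in Solr, so partially added (uncommitted)
documents can't be matched or highlighted until a commit happens.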

As you can see, I'm very confused about how to proceed.  I hope I'm
being clear though :-S

Thanks,
Ravi

Re: Indexing longer documents using Solr...memory issue after index grows to about 800 MB...

Mike Klaas
On 4-Sep-07, at 4:50 PM, Ravish Bhagdev wrote:

> - I have about 11K html documents to index.
> - I'm trying to index these documents (along with 3 more small string
> fields) so that when I search within the "doc" field (field with the
> html file content), I can get results with snippets or highlights as I
> get when using nutch.
> - While going through Wiki I noticed that if I need to do highlighting
> in a particular field, I have to make sure it is indexed and stored.
>
> But when I try to do the above, after indexing about 3K files which
> creates index of about 800MB (which is fine as files are quite
> lengthy) it keeps giving out of heap space errors.
>
> Things I've tried without much help:
>
> - Increase memory of tomcat
> - Play around with settings like autoCommit (documents and time)
> - Reducing mergefactor to 5
> - Reducing maxBufferedDocs to 100

Merge factor should not affect memory usage.  You say that you
increased the memory... but to what?  I've found that reducing
maxBufferedDocs decreases my peak memory usage significantly.

> My question is also, if its required to store fields in index to be
> able to do highlighting/returning field content, how does nutch/lucene
> do it without that (because index for same documents created using
> nutch is much much smaller)

Are you sure that it doesn't?  According to:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java?view=markup

Nutch does indeed take the stored text and re-analyze it when
generating a summary.  Does Nutch perhaps store less of each
document's content, or store it somewhere else?
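
The approach that summarizer takes can be sketched in plain Java (an
illustration of the idea only -- cutting a window of context out of the
stored text around a match -- not Nutch's actual code, which properly
re-analyzes the text into tokens):

```java
// Sketch of query-time snippet generation from a stored field:
// scan the stored text for the query term and return a window of
// surrounding context. Illustrative only, not Nutch's implementation.
public class SnippetSketch {

    // Return up to ~window chars of context around the first occurrence
    // of term in storedText, or an empty string if there is no match.
    public static String snippet(String storedText, String term, int window) {
        int hit = storedText.toLowerCase().indexOf(term.toLowerCase());
        if (hit < 0) return "";
        int start = Math.max(0, hit - window / 2);
        int end = Math.min(storedText.length(),
                           hit + term.length() + window / 2);
        return storedText.substring(start, end);
    }

    public static void main(String[] args) {
        String doc = "Solr keeps stored fields on disk; highlighting "
                   + "re-analyzes them at query time.";
        System.out.println(snippet(doc, "highlighting", 30));
    }
}
```

The point being: snippets require the text to be available at query
time, whether from the index's stored fields or some external store.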

> But also when trying to query partially added documents, when I set
> field highlight on (and a particular field) it doesn't seem to have
> any effect.

Does the field contain a match against one of the terms you are  
querying for?

-Mike



Re: Indexing longer documents using Solr...memory issue after index grows to about 800 MB...

Ravish Bhagdev
Thanks for your reply; my responses are below:

On 9/5/07, Mike Klaas <[hidden email]> wrote:

> On 4-Sep-07, at 4:50 PM, Ravish Bhagdev wrote:
>
> > - I have about 11K html documents to index.
> > - I'm trying to index these documents (along with 3 more small string
> > fields) so that when I search within the "doc" field (field with the
> > html file content), I can get results with snippets or highlights as I
> > get when using nutch.
> > - While going through Wiki I noticed that if I need to do highlighting
> > in a particular field, I have to make sure it is indexed and stored.
> >
> > But when I try to do the above, after indexing about 3K files which
> > creates index of about 800MB (which is fine as files are quite
> > lengthy) it keeps giving out of heap space errors.
> >
> > Things I've tried without much help:
> >
> > - Increase memory of tomcat
> > - Play around with settings like autoCommit (documents and time)
> > - Reducing mergefactor to 5
> > - Reducing maxBufferedDocs to 100
>
> Merge factor should not affect memory usage.  You say that you
> increased the memory... but to what?  I've found that reducing
> maxBufferedDocs decreases my peak memory usage significantly.
>

OK

> > My question is also, if its required to store fields in index to be
> > able to do highlighting/returning field content, how does nutch/lucene
> > do it without that (because index for same documents created using
> > nutch is much much smaller)
>
> Are you sure that it doesn't?  According to:
>
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java?view=markup
>
> nutch does indeed take the stored text and re-analyses it when
> generating a summary.  Does nutch perhaps store less content of a
> document, or in a different store?
>

I am not sure what it does internally, but my educated guess is that it
doesn't store entire documents in the index (going by index size).  The
index Nutch creates is way too small to be storing the entire documents
(I'm pretty sure of this part).

> > But also when trying to query partially added documents, when I set
> > field highlight on (and a particular field) it doesn't seem to have
> > any effect.
>
> Does the field contain a match against one of the terms you are
> querying for?
>

Yup

> -Mike
>
>
>
Cheers,
Ravi