configuring solr3.6 for a large intensive index only run

spredd1208
I am trying to do a very large insertion (about 68 million documents) into a
Solr instance.

Our schema is pretty simple. About 40 fields using these types:

   <types>
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
         <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
         <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
         </analyzer>
      </fieldType>
      <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
   </types>

We are running SolrJ clients from a Hadoop cluster, and are struggling with
the merge process as time progresses.
As the number of documents grows, merging eventually hogs everything.

What we would really like to do is turn merging off, do an index-only
run with a sparse solrconfig, and then
start things back up with our runtime config, which would kick off merging
when it starts.

Is there a way to do this?

I came close to finding an answer in this post, but did not find out how to
actually turn off merging.

Post by Mike McCandless:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Re: configuring solr3.6 for a large intensive index only run

Lance Norskog-2
If you want to suppress merging, set the 'mergeFactor' very high.
Perhaps 100. Note that Lucene opens many files (50? 100? 200?) for
each segment. You would have to set the 'ulimit' for file descriptors
to 'unlimited' or 'millions'.
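
As a rough sketch of the setting described above, the index section of
solrconfig.xml might look like the fragment below. The enclosing element is
an assumption: as far as I recall, Solr 3.6 accepts 'indexConfig', while
older 3.x configs put the same setting under 'indexDefaults' and
'mainIndex'. The value 100 is only the figure suggested here, not a tested
recommendation, and a high mergeFactor defers merging rather than disabling it.

```xml
<!-- Bulk-load sketch. Assumes the Solr 3.6 indexConfig section;
     older configs use indexDefaults / mainIndex instead. -->
<indexConfig>
  <!-- A high value lets many segments accumulate before a merge is triggered. -->
  <mergeFactor>100</mergeFactor>
</indexConfig>
```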

Later, you can call optimize with a 'maxSegments' value. Optimize will
stop at maxSegments instead of merging down to one. Lucene these days
does not need to have one segment, so merging down to 20 or 50 is
fine.
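
For illustration, the later optimize could be sent as an XML update message
to the update handler. The value 20 is just the example figure from the
paragraph above, and the handler path (commonly /update) depends on your
setup.

```xml
<!-- Sent to the update handler; merging stops once at most 20 segments remain. -->
<optimize maxSegments="20"/>
```

SolrJ offers the same option through an optimize call that takes a
maxSegments argument, if you would rather trigger it from the client side.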

--
Lance Norskog
[hidden email]

Re: configuring solr3.6 for a large intensive index only run

Otis Gospodnetic-2
In reply to this post by spredd1208
Scott,

In addition to what Lance said, make sure your ramBufferSizeMB in solrconfig.xml is high. Try 512 MB or 1024 MB. The Solr/Lucene index segment merging visualization is one of my favourite reports in SPM for Solr. It's kind of "amazing" how much the index size fluctuates!

Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 



>________________________________
> From: Scott Preddy <[hidden email]>
>To: [hidden email]
>Sent: Wednesday, May 23, 2012 2:19 PM
>Subject: configuring solr3.6 for a large intensive index only run

Re: configuring solr3.6 for a large intensive index only run

Shawn Heisey-4
In reply to this post by Lance Norskog-2
On 5/23/2012 12:27 PM, Lance Norskog wrote:
> If you want to suppress merging, set the 'mergeFactor' very high.
> Perhaps 100. Note that Lucene opens many files (50? 100? 200?) for
> each segment. You would have to set the 'ulimit' for file descriptors
> to 'unlimited' or 'millions'.

My installation (Solr 3.5.0) creates 11 files per segment, and there is
often a 12th file for deletes.  I have termvectors turned on for some of
my fields.  If you aren't using termvectors at all, the last three files
in my list are not created:

_26n_2.del  _26n.fdt  _26n.fdx  _26n.fnm  _26n.frq  _26n.nrm  _26n.prx  
_26n.tii  _26n.tis  _26n.tvd  _26n.tvf  _26n.tvx

I have yet to try 3.6, but I would imagine that it isn't a lot different
from 3.5.  I use a fairly high mergeFactor of 35, and I am considering
raising it even higher so that during normal operation there will never
be a merge that's not under my control.  When I do a full index rebuild,
so much data is added that Solr still does automatic merges.

Thanks,
Shawn


Re: configuring solr3.6 for a large intensive index only run

nanshi
1) In solrconfig.xml, find ramBufferSizeMB and change it to:
 <ramBufferSizeMB>1024</ramBufferSizeMB>

2) Also, try decreasing the mergeFactor to see if it gives you fewer segments. In my experiment, it did.